Querying Recursive Data Structures with SQL and Ecto
In one of the projects I've worked on in the past, we were tasked to implement an attribute based access control microservice for managing permissions for different types of entities across different resources shared between different organizations.
Problem Overview & Background
For the purposes of this post, we'll simplify greatly and use the following description of the system:
- One organization can have many subordinate organizations.
- One organization can own many resources.
- One organization can have many admins.
- Admins in an organization have access to resources belonging to that organization, or any subordinate organizations.
- Organizations can be arbitrarily deeply nested
We modelled this with the following database schema:
from_id | relation | to_id |
---|---|---|
org_1 | superior_of | org_2 |
org_1 | superior_of | org_3 |
admin_1 | manages | org_1 |
admin_2 | manages | org_3 |
Using this structure to store relations, we're able to store arbitrary relations between arbitrary entities, allowing us to do complex queries like getting all entities in all organizations that can be managed by an admin at a given tier of this hierarchy—in the example above, admin_1
would be able to manage all assets of all organizations and sub-organizations of org_1
.
During the MVP phase of the project, this microservice was thrown together pretty quickly and generally it worked well. It was built for simplicity with not much special focus on optimality however so after using this service in production for 14 months we started to notice that for particularly large organization hierarchies queries would end up taking quite awhile.
A common query we performed was to get a list of all organizations
, and recursively get all of the subordinate organizations
for those organizations
. The original implementation looked a little like this:
def traverse_hierarchy(organization_id) do
{:ok, children} = get_children(organization_id)
{organization_id, Enum.map(children, &traverse_hierarchy/1)}
end
defp get_children(organization_id) do
children =
App.Repo.all(
from r in Relation,
where: r.from_id == ^organization_id,
where: r.relation == @superior_org_relation,
select: r.to_id
)
{:ok, children}
end
This implementation of traverse_hierarchy/1
essentially finds children via a depth first search, which is exhaustive and correct, simple and small; however it also performs N+1 queries because we'd be making a call to get the children for the current node we were working on, and N calls to get the children of each child node. As our hierarchies grew large, the number of queries this would end up doing would increase exponentially.
Building a solution
After doing some research, we discovered that SQL has a mechanism which allows us to recursively query data called Common Table Expressions (CTEs). These expressions act somewhat like closures in that they allow us to run 'functions' on temporary results inside an existing SQL query, allowing you to further query based on the result of the previous. This essentially reduce our multiple SQL queries into a single one following the mantra of letting your database do the heavy lifting
Writing the query in SQL
was pretty simple since there are plenty of tutorials on recursive CTEs. I ended up with something like:
CTEs are detailed and documented quite thoroughly, so it was really easy to come up with an initial CTE that would do what we wanted:
WITH RECURSIVE tree(depth, from_id, relation, to_id) AS (
SELECT 0, from_id, relation, to_id FROM relations WHERE from_id = 'root_org_id' AND relation = 'is_superior_org'
UNION SELECT
depth + 1,
relations.from_id,
relations.relation,
relations.to_id
FROM
relations JOIN tree
ON relations.from_id = tree.to_id
)
SELECT depth, from_id as parent, to_id as child
FROM tree ORDER BY depth
When the above SQL query is executed, the output returned is pretty much exactly what we want:
depth | parent | child |
---|---|---|
0 | root_org_id | sub_org_a |
0 | root_org_id | sub_org_b |
1 | sub_org_a | sub_sub_org_a |
1 | sub_org_a | sub_sub_org_b |
1 | sub_org_b | sub_sub_org_c |
2 | sub_sub_org_c | sub_sub_sub_org_a |
We could do some light postprocessing on this returned dataset and we would quite easily build a response which matched the return of the function snippet above.
Unfortunately Ecto
doesn't provide any API over querying using CTEs. There are a few ways of calling out raw SQL queries and getting Ecto's database adapter to execute them but this practice is generally frowned upon for good reason. Thankfully we could hackily use a fragment to do what we wanted, like so:
defp traverse_hierarchy(organization_id) do
hierarchy =
App.Repo.all(
from(r in Relation,
left_join:
cte in fragment(
"""
(
WITH RECURSIVE tree(depth, from_id, relation, to_id) AS (
SELECT 0, from_id, relation, to_id FROM relations WHERE from_id = ? AND rel = 'is_superior_org'
UNION SELECT
depth + 1,
relations.from_id,
relations.relation,
relations.to_id
FROM
relations JOIN tree
ON relations.from_id = tree.to_id
)
SELECT * FROM tree ORDER BY depth
)
""",
^organization_id
),
where: r.from_id == cte.from_id and r.to_id == cte.to_id and r.rel == cte.rel
)
)
# For brevity I won't actually include code to postprocess these results
# but it would be quite simply to do so here.
process_data(hierarchy)
end
This implementation performed much faster than the original although admittedly it is quite ugly! The original implementation was significantly more readable, but at the end of the day this approach worked out quite well and solved the problem we were trying to solve.
I'm not sure if this is the best way of dealing with such recursive data structures, or if there is a more idiomatic way of doing it, but it was an interesting dive about leveraging SQL instead of programming logic whenever we can.
I'm not sure why Ecto doesn't provide any capability to execute CTEs but maybe one could hack together a PR or a library to do just that
Edit:
Looks like between when this was implemented on the project I was working on and the time I wrote this article, Ecto now has includes an API for using CTEs! You can check out the documentation here.