Neo4J 2 – Understanding Collections

In Cypher, to fetch a Person with their email addresses and phone numbers, you might write a straight forward query like this:

MATCH (person:Person)
MATCH person-[:HAS_EMAIL]->email
MATCH person-[:HAS_PHONE]->phone
RETURN person,
collect(distinct email),
collect(distinct phone)

In this article, we will try to get a better understanding of collections in Cypher. At first, we will look under the hood to see how this query is run, understand why it is an expensive job and finally, how we can optimise it and avoid this undesired behaviour in the future.

Please refer to Basics of MATCH clause and Joins in Cypher for a refresher on Neo4j 2 topics.

Setting the scene

We have four people in our database, Tom, John, Mary and Kate. The image below is of a query that returns all the people with their phone and email. Given Tom is the only one with phone and email, the result-set should only return Tom with his email addresses and phone number. Each coloured MATCH clause in the query is responsible for returning the nodes with the same colour in the result-set. When we look at the query and then the resulting graph, everything looks OK or is it?

The problem

On the surface, it looks all right but what is it actually doing under the hood? Best way to see that is by analysing the result-set before it is aggregated, that means not executing the COLLECT and DISTINCT functions. Ideally you want the query to return 1 row for Tom, but in the table below, the result-set includes 2 rows.

http://console.neo4j.org/r/8rvxrj

Let’s see why it returns two rows for tom. Tom has one email address and two phone numbers. Because of the two phone numbers he has, the result-set includes two rows. Please note that that even though there is only one email address, it is repeated in the second row. If we do not perform a DISTINCT inside our COLLECT function, it will include the same node twice.

The breaking point

If Tom were to have 8 email addresses and 10 phone numbers, with what you have seen earlier, how many rows do you expect the query to return?

Did you guess 80 rows? As you can see in the table below, the result-set represents a Cartesian product of email and phone.

http://console.neo4j.org/r/49zqt6

Under the hood

Extracted from Neo4j Docs:

Each query is turned into an execution plan by something called the execution planner. The execution plan tells Neo4j which operations to perform when executing the query. Two different execution planning strategies are included in Neo4j:

Rule This planner has rules that are used to produce execution plans. The planner considers available indexes, but does not use statistical information to guide the query compilation.

Cost This planner uses the statistics service in Neo4j to assign cost to alternative plans and picks the cheapest one. While this should lead to superior execution plans in most cases, it is still under development. By default, Neo4j 2.2 will use the cost planner for some queries, but not all.

In this section we will analyse the query execution plans generated by Rule based and Cost based planners to see if they can highlight why it returns this much data. The output of the Cost based planner is displayed here below.

Compiler CYPHER 2.2
Planner COST

Projection
|
+Expand(All)(0)
|
+Expand(All)(1)
|
+NodeByLabelScan

+—————–+—————+——+——–+——————————————+——————————-+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+—————–+—————+——+——–+——————————————+——————————-+
| Projection | 25 | 80 | 0 | anon[36], anon[69], email, person, phone | person; email; phone |
| Expand(All)(0) | 25 | 80 | 89 | anon[36], anon[69], email, person, phone | (person)-[:HAS_PHONE]-(phone) |
| Expand(All)(1) | 9 | 9 | 13 | anon[36], email, person | (person)-[:HAS_EMAIL]-(email) |
| NodeByLabelScan | 4 | 4 | 5 | person | :Person |
+—————–+—————+——+——–+——————————————+——————————-+

Understanding the execution plan

Step 1

Search for all persons in the database.

There are 4 people in the database so returns 4 rows.

Step 2

For each person, get all the email addresses.

There are 8 email addresses for Tom and 1 for Mary.

Those who don’t have an email address will be excluded from the result-set.

So the total number of rows returned is 9.

Step 3

For each person, returned in the previous result, get all of their phone numbers.

Mary doesn’t have any phone numbers so she will be excluded from the result-set.

The intermediate result-set for this step will become 8 rows.

Tom has 10 phone numbers. So for each row in the result-set it will add all the 10 phone numbers.

Tom | email 1 | phone 1
Tom | email 1 | phone 2
Tom | email 1 | phone 3
etc

Given that there are 8 rows. It will result in 80 rows returned from the query.

The solution

The idea that comes to mind is to perform COLLECT early on in the query so that there are fewer rows passed on to the next statement.

http://console.neo4j.org/r/jn7ofz

The console above shows the optimised Cypher query returning only 1 row, just like the query at the start of this page. To know whether we have made any actual improvements to the execution plan, we need to take a look under the hood again.

Compiler CYPHER 2.2
Planner COST

EagerAggregation(0)
|
+Expand(All)(0)
|
+EagerAggregation(1)
|
+Expand(All)(1)
|
+NodeByLabelScan

+———————+—————+——+——–+———————————-+——————————-+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+———————+—————+——+——–+———————————-+——————————-+
| EagerAggregation(0) | 3 | 1 | 0 | emails, person, phones | person, emails |
| Expand(All)(0) | 8 | 10 | 12 | anon[108], emails, person, phone | (person)-[:HAS_PHONE]-(phone) |
| EagerAggregation(1) | 3 | 2 | 0 | emails, person | person |
| Expand(All)(1) | 9 | 9 | 13 | anon[36], email, person | (person)-[:HAS_EMAIL]-(email) |
| NodeByLabelScan | 4 | 4 | 5 | person | :Person |
+———————+—————+——+——–+———————————-+——————————-+

As you can see in the query execution plan above, in Step 3 we are performing an EargerAggregation. This is the COLLECT that we are performing early on with the emails. Instead of passing all the 9 rows from Step 2, it now reduces it to only 2 rows. Now this smaller result-set is passed on to the phones MATCH clause. This in turn fetches the 10 phone numbers and then perfroms another EargerAggregation, which in turn reduces the result-set back to 1 row.

Rule of thumb: Aggregate collections early on in the query to avoid undesired performance gotchas.