cubicweb/schemas/base.py · 5f9ed9d600c78f0bd523a8abe12c02ffb97c81ca · cubicweb / cubicweb

fix(RQLExpression)!: performance issue on RQLExpressions using EXISTS() · 496aca08d600
Philippe Pepiot authored Jan 27, 2021
BREAKING CHANGE: explicitly use EXISTS in RQLExpression for permissions

This backout the changeset dfcc3f74b3c8 which introduced wrapping all
{E,R}RQLExpression where clause with EXISTS().

It appear to have very bad performance on PostgreSQL on queries already using
EXISTS(), in this case rql was generating a query with double EXISTS(), leading
to a very bad query plan:

=# create table t as select * from generate_series(1, 1000000) as id;
SELECT 1000000
=# create unique index on t(id);
CREATE INDEX

For the RQLExpression "EXISTS(X identity X)" the generated sql was: Any X WHERE EXISTS(EXISTS(X identity X)), X eid %(eid)s, which is equivalent to:

=# explain analyze select id from t where exists(select 1 where exists(select 1 from t as x where t.id = x.id) and t.id = 42);
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..8469425.00 rows=500000 width=4) (actual time=0.277..1426.342 rows=1 loops=1)
   Filter: (SubPlan 2)
   Rows Removed by Filter: 999999
   SubPlan 2
     ->  Result  (cost=8.45..8.46 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1000000)
           One-Time Filter: ($1 AND (t.id = 42))
           InitPlan 1 (returns $1)
             ->  Index Only Scan using t_id_idx on t x  (cost=0.42..8.44 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1000000)
                   Index Cond: (id = t.id)
                   Heap Fetches: 1000000
 Planning Time: 0.190 ms
 Execution Time: 1426.384 ms


The planner wasn't able to optimise this (bad written) query, it produce a full table read (Seq Scan).

With a single EXISTS, the query perform much better: Any X WHERE EXISTS(X identity X), X eid %(eid)s, which is equivalent to:

=# explain analyze select id from t where exists(select 1 from t as x where t.id = x.id) and t.id = 42;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=0.85..16.90 rows=1 width=4) (actual time=0.093..0.095 rows=1 loops=1)
   ->  Index Only Scan using t_id_idx on t  (cost=0.42..8.44 rows=1 width=4) (actual time=0.081..0.082 rows=1 loops=1)
         Index Cond: (id = 42)
         Heap Fetches: 1
   ->  Index Only Scan using t_id_idx on t x  (cost=0.42..8.44 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=1)
         Index Cond: (id = 42)
         Heap Fetches: 1
 Planning Time: 0.383 ms
 Execution Time: 0.136 ms


An alternative patch would be to insert the EXISTS only if the original query
doesn't already contains a EXISTS. But I think it's better to drop the magic
here and let user control what they really want in their expressions.

I added a note to the 3.27 changelog for this.

Also add some tests to RQLExpressionTC with EXISTS().
496aca08d600