This worklog has been replaced with mariadb.org/jira

This site is here for historical purposes only. Do not add or edit tasks here!

 
 
 

WorkLog Frontpage Log in / Register
High-Level Description | Task Dependencies | High-Level Specification | Low-Level Design | File Attachments | User Comments | Time Estimates | Funding and Votes | Progress Reports

 Subqueries: cost-based choice between Materialization and IN->EXISTS transformation
Title
Task ID89
Queue
Version
Status
Priority
Copies toIgor
Psergey
Timour

Created byPsergey28 Feb 2010Done
Supervisor   
Lead Architect    
Architecture Review  
Implementor  
Code Review  
QA  
Documentation  
 High-Level Description
For uncorrelated IN subqueries that can't be converted to semijoins it is 
necessary to make a cost-based choice between IN->EXISTS and Materialization
strategies. Subqueries that can't be converted to semijoins are called
'unflattened'.

The optimization of unflattened subqueries is performed as follows:

1. Semantic analysis

During the prepare phase of query processing, the optimizer analyzes
recursively all subqueries for the possible execution strategies.  Let
'outer_join' is a query that contains some subquery predicate
'subs_pred'. Let 'inner_join' is the query object for the subquery
under that predicate. The semantic analysis phase analyzes
non-flattened subqueries as follows:

outer_join.prepare()
{
  outer_join.where_clause.fix_fields() -> subs_pred->fix_fields() ->
    -> subs_engine->prepare() -> inner_join->prepare() ->
       check_and_do_in_subquery_rewrites()
       {
         if (subs_pred == IN && SEMIJOIN is NOT applicable)
         {
           if (MATERIALIZATION is applicable)
             subs_predicate.in_strategy.add(MATERIALIZATION)
           if (IN_TO_EXISTS is applicable)
             subs_predicate.in_strategy.add(IN_TO_EXISTS)
         }

         Perform MIN/MAX transformations(*) or wrap subs_predicate in an
         Item_in_optimizer
       }
}

(*) The MIN/MAX transformations has been moved out the prepare phase by
    MWL#148.

In the end of the JOIN::prepare phase for the outer-most query, the
optimizer has analyzed all subqueries, and if they could not be
flattened into semijoins, they have been prepare()-d and it has been
decided which execution methods are possible.


2. Cost-based choice of execution strategies

The optimization of all non-flattened subqueries is performed in two
steps. First JOIN::optimize for the outer query calls JOIN::optimize
for all its direct child subqueries. Once the join optimizer for each
subquery has generated a QEP based on the subquery as it is, it calls
JOIN::choose_subquery_plan() that chooses an optimal execution strategy
for the subquery predicate. Currently the choice of strategies is
between MATERIALIZATION and IN_TO_EXISTS. This is summarized in the
following pseudocode:

outer_join.optimize()
{
  ...
  optimize_unflattened_subqueries() -> inner_join.optimize()
  ...
}

inner_join.optimize()
{
  ...
  make_join_statistics()
  {
    ...
    choose_subquery_plan()
    {
      - reoptimize this JOIN with extra IN=>EXISTS predicates
      - compare the costs of the non-modified and the EXISTS plans
      - if the non-modified plan is cheaper
          revert the plan to the original plan, and create
          MATERIALIZATION related structures
        else
          inject the IN=>EXISTS predicates into the reoptimized
          plan
    }
    ...
  }
  ...
  rewrite_to_index_subquery_engine()
  {
    For single-table subqueries, if the only table access method is a
    JT_EQ_REF, or JT_REF, or JT_REF_OR_NULL, change the subquery engine
    to subselect_[unique | index]subquery_engine.
  }
  ...
}


At the end of the JOIN::optimize phase, all non-flattened IN/ALL/ANY
subqueries have a chosen strategy for execution, and are fully
optimized according to this strategy.


Notice:

Scalar subqueries are still optimized on demand at their first
evaluation. This may happen both during the JOIN::optimize phase, and
during the JOIN::exec() phase. Scalar subquery predicates are
represented by Item_singlerow_subselect and Item_maxmin_subselect.
 Task Dependencies
Others waiting for Task 89Task 89 is waiting forGraph
91 MariaDB 5.3
168 Remove lazy subquery optimization 
 
 High-Level Specification
Why need two optimizations
--------------------------
Consider a query with subquery:

  SELECT 
    oe IN (SELECT ie FROM inner_tbl WHERE inner_cond)
  FROM outer_tbl
  WHERE outer_cond

If we use Materialization strategy, the costs will be 

  cost of accessing outer_tbl + 
  materialization_cost + 
  #records(outer_tbl w/o outer_cond) * lookup_cost

where 

  materialization_cost= 
    cost of executing the (SELECT ie FROM inner_tbl WHERE inner_cond)

On the other hand, for IN->EXISTS strategy, the subquery will be rewritten into

  SELECT 
    EXISTS (SELECT 1 FROM inner_tbl WHERE inner_cond AND oe=ie)
  FROM outer_tbl
  WHERE outer_cond

and the costs will be 

  cost of accessing outer_tbl + 
  #records(outer_tbl w/o outer_cond) * exists_select_cost

where
  exists_select_cost= 
    cost of executing the (SELECT 1 FROM inner_tbl WHERE inner_cond AND oe=ie)

So, we'll need to compute both exists_select_cost and materialization_cost.

Difficulty with the need to run select optimization two times
-------------------------------------------------------------
The problem is in this scenario:
1. We compute materialization_cost by running optimization for the original
   subquery select.
2. We compute exists_select_cost by running optimization for the subquery's
   select with "oe=ie" injected into WHERE
3. Then we find that cost #1 is less and want to execute the materialization
   strategy.

The problem is that once one injects "oe=ie", it can trigger some optimization
steps that are not possible to undo.
- Example1:  outer->inner join conversion
- non-Example: according to Igor, "oe=ie" won't participate in equality propagation.
- ... what else ?

 Low-Level Design
Based on discussions with Timour:

<contents>
1. Substitution problem
1.1 Current solution
2. Approaches to doing two optimizations on the same join
2.1 Generic approach, undoable optimization
2.2 Specific-case approach, optimizing with extra equality
</contents>

1. Substitution problem
-----------------------
IN->EXISTS rewrite requires that the Item_in_subselect object is wrapped
with an Item_in_optimizer object.

This was very easy to do when the rewrites were performed at PREPARE stage: 
it is expected that item may substitute itself in Item::fix_fields() call, so
Item_in_subselect::fix_fields() could create and return its Item_in_optimizer
as substitution, and that would ensure that from that point everyone would use
Item_in_subselect only through Item_in_optimizer.

Doing such wrapping at execution stage is more challenging because there can
be multiple places that keep pointers to the same Item, for example, the same
Item could be referred from:
- Query's WHERE condition in JOIN::conds and in SELECT_LEX::where
- table's JOIN_TAB::select_cond (actually, the same item can be present in
  conditions attached to multiple tables).
- pushed index condition (the storage engine has Item* pointer, and if BKA is 
  used, JOIN_CACHE object will have an Item* pointer, too)
- ... etc (TODO what else?)

1.1 Current solution
~~~~~~~~~~~~~~~~~~~~
Create a wrapper object very early (either in fix_fields() or in the parser).
The wrapper object will have a field that will tell it whether it should work 
as Item_in_optimizer, or as call redirector, or whatever is needed.

Note: Sanja ought to have a similar problem with his subquery value cache
wrapper item. Check what approach he's using.


2. Approaches to doing two optimizations on the same join
---------------------------------------------------------

2.1 Generic approach, undoable optimization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is what Timour is trying to implement.

He's trying to reach a state where one could
- optimize a SELECT and a query plan for it
- undo the optimization
- change the WHERE clause to include the injected IN-equality
- optimize the SELECT again, get another query plan
- [possibly] undo the second optimization and switch back to the first one
  (Q: how is the switch-back part implemented? do we run the optimization
   again?)

The task of optimizing/re-optimizing a SELECT has been solved for prepared
statements, and here we try to reuse PS code's solutions:
- JOIN object is discarded when undoing the optimization, each "optimize the
  SELECT" step operates on its own JOIN object.
- Query_arena is used for its ability to undo Item tree changes (grep for
  thd->change_item_tree)

2.1.1 Questioning this approach
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There are optimizations that do rewrites that are not undoable (at all!) and
are kept for the entire lifetime of the statement. For example,
- outer->inner join conversion
- subquery-to-semijoin conversion
are done at first execution and their result is kept till the statement is
destroyed. This is possible because these rewrites do not depend on values of
PS parameters or contents of the table.

However, the difference between IN->EXISTS's select and Materialization's 
select is that IN->EXISTS has an extra equality injected into the WHERE 
clause.  Extra equality can be used to justify non-unodable conversions like
outer->inner join convertion and subquery-to-semi-join conversion. These
involve changing TABLE_LIST objects so Query_arena mechanism won't be able to
undo them.

2.2 Specific-case approach, optimizing with extra equality
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Igor's suggestion:

Run join optimization with extra parameter meaning that optimization should
be done for both with-equality and without-equality cases.

First, proceed with optimization without the equality. After we've came to the
point where we've picked a join plan, do the following:

- save join->join_tab and join->best_positions[]
- call add_key_fields() for the IN-equality so that KEYUSE array is updated
  with the IN-equality
- run choose_plan(), i.e. join optimization, again.

This approach is rather limited, but it is much easier to implement.

 File Attachments
 NameTypeSizeByDate
 User Comments
 Time Estimates
NameHours WorkedLast Updated
All Sub Tasks0 
Total0 
 Hrs WorkedProgressCurrentOriginal
Sub Tasks01616
Total01616
 
 Funding and Votes
Votes: 0: 0%
 Make vote: Useless    Nice to have    Important    Very important    

Funding: 0 offers, total 0 Euro
 Progress Reports
(Timour - Mon, 06 Dec 2010, 12:57
    
Dependency created: WL#89 now depends on WL#168

(Psergey - Sat, 09 Oct 2010, 10:53
    
Dependency created: WL#91 now depends on WL#89

(Timour - Wed, 06 Oct 2010, 09:30
    
High Level Description modified.
--- /tmp/wklog.89.old.32065	2010-10-06 09:30:28.000000000 +0000
+++ /tmp/wklog.89.new.32065	2010-10-06 09:30:28.000000000 +0000
@@ -1,11 +1,105 @@
-For uncorrelated IN subqueries that can't be converted to semi-joins it is 
+For uncorrelated IN subqueries that can't be converted to semijoins it is 
 necessary to make a cost-based choice between IN->EXISTS and Materialization
-strategies.
+strategies. Subqueries that can't be converted to semijoins are called
+'unflattened'.
 
-Both strategies handle two cases:
-1. A simple case w/o NULLs handling
-2. Handling NULLs.
+The optimization of unflattened subqueries is performed as follows:
 
-This WL is about making cost-based decision for #1.
+1. Semantic analysis
+
+During the prepare phase of query processing, the optimizer analyzes
+recursively all subqueries for the possible execution strategies.  Let
+'outer_join' is a query that contains some subquery predicate
+'subs_pred'. Let 'inner_join' is the query object for the subquery
+under that predicate. The semantic analysis phase analyzes
+non-flattened subqueries as follows:
+
+outer_join.prepare()
+{
+  outer_join.where_clause.fix_fields() -> subs_pred->fix_fields() ->
+    -> subs_engine->prepare() -> inner_join->prepare() ->
+       check_and_do_in_subquery_rewrites()
+       {
+         if (subs_pred == IN && SEMIJOIN is NOT applicable)
+         {
+           if (MATERIALIZATION is applicable)
+             subs_predicate.in_strategy.add(MATERIALIZATION)
+           if (IN_TO_EXISTS is applicable)
+             subs_predicate.in_strategy.add(IN_TO_EXISTS)
+         }
+
+         Perform MIN/MAX transformations(*) or wrap subs_predicate in an
+         Item_in_optimizer
+       }
+}
+
+(*) The MIN/MAX transformations has been moved out the prepare phase by
+    MWL#148.
+
+In the end of the JOIN::prepare phase for the outer-most query, the
+optimizer has analyzed all subqueries, and if they could not be
+flattened into semijoins, they have been prepare()-d and it has been
+decided which execution methods are possible.
+
+
+2. Cost-based choice of execution strategies
+
+The optimization of all non-flattened subqueries is performed in two
+steps. First JOIN::optimize for the outer query calls JOIN::optimize
+for all its direct child subqueries. Once the join optimizer for each
+subquery has generated a QEP based on the subquery as it is, it calls
+JOIN::choose_subquery_plan() that chooses an optimal execution strategy
+for the subquery predicate. Currently the choice of strategies is
+between MATERIALIZATION and IN_TO_EXISTS. This is summarized in the
+following pseudocode:
+
+outer_join.optimize()
+{
+  ...
+  optimize_unflattened_subqueries() -> inner_join.optimize()
+  ...
+}
+
+inner_join.optimize()
+{
+  ...
+  make_join_statistics()
+  {
+    ...
+    choose_subquery_plan()
+    {
+      - reoptimize this JOIN with extra IN=>EXISTS predicates
+      - compare the costs of the non-modified and the EXISTS plans
+      - if the non-modified plan is cheaper
+          revert the plan to the original plan, and create
+          MATERIALIZATION related structures
+        else
+          inject the IN=>EXISTS predicates into the reoptimized
+          plan
+    }
+    ...
+  }
+  ...
+  rewrite_to_index_subquery_engine()
+  {
+    For single-table subqueries, if the only table access method is a
+    JT_EQ_REF, or JT_REF, or JT_REF_OR_NULL, change the subquery engine
+    to subselect_[unique | index]subquery_engine.
+  }
+  ...
+}
+
+
+At the end of the JOIN::optimize phase, all non-flattened IN/ALL/ANY
+subqueries have a chosen strategy for execution, and are fully
+optimized according to this strategy.
+
+
+Notice:
+
+Scalar subqueries are still optimized on demand at their first
+evaluation. This may happen both during the JOIN::optimize phase, and
+during the JOIN::exec() phase. Scalar subquery predicates are
+represented by Item_singlerow_subselect and Item_maxmin_subselect.
 
 

(Psergey - Mon, 05 Jul 2010, 03:30
    
Low Level Design modified.
--- /tmp/wklog.89.old.13585	2010-07-05 03:30:15.000000000 +0000
+++ /tmp/wklog.89.new.13585	2010-07-05 03:30:15.000000000 +0000
@@ -1 +1,100 @@
 
+Based on discussions with Timour:
+
+<contents>
+1. Substitution problem
+1.1 Current solution
+2. Approaches to doing two optimizations on the same join
+2.1 Generic approach, undoable optimization
+2.2 Specific-case approach, optimizing with extra equality
+</contents>
+
+1. Substitution problem
+-----------------------
+IN->EXISTS rewrite requires that the Item_in_subselect object is wrapped
+with an Item_in_optimizer object.
+
+This was very easy to do when the rewrites were performed at PREPARE stage: 
+it is expected that item may substitute itself in Item::fix_fields() call, so
+Item_in_subselect::fix_fields() could create and return its Item_in_optimizer
+as substitution, and that would ensure that from that point everyone would use
+Item_in_subselect only through Item_in_optimizer.
+
+Doing such wrapping at execution stage is more challenging because there can
+be multiple places that keep pointers to the same Item, for example, the same
+Item could be referred from:
+- Query's WHERE condition in JOIN::conds and in SELECT_LEX::where
+- table's JOIN_TAB::select_cond (actually, the same item can be present in
+  conditions attached to multiple tables).
+- pushed index condition (the storage engine has Item* pointer, and if BKA is 
+  used, JOIN_CACHE object will have an Item* pointer, too)
+- ... etc (TODO what else?)
+
+1.1 Current solution
+~~~~~~~~~~~~~~~~~~~~
+Create a wrapper object very early (either in fix_fields() or in the parser).
+The wrapper object will have a field that will tell it whether it should work 
+as Item_in_optimizer, or as call redirector, or whatever is needed.
+
+Note: Sanja ought to have a similar problem with his subquery value cache
+wrapper item. Check what approach he's using.
+
+
+2. Approaches to doing two optimizations on the same join
+---------------------------------------------------------
+
+2.1 Generic approach, undoable optimization
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+This is what Timour is trying to implement.
+
+He's trying to reach a state where one could
+- optimize a SELECT and a query plan for it
+- undo the optimization
+- change the WHERE clause to include the injected IN-equality
+- optimize the SELECT again, get another query plan
+- [possibly] undo the second optimization and switch back to the first one
+  (Q: how is the switch-back part implemented? do we run the optimization
+   again?)
+
+The task of optimizing/re-optimizing a SELECT has been solved for prepared
+statements, and here we try to reuse PS code's solutions:
+- JOIN object is discarded when undoing the optimization, each "optimize the
+  SELECT" step operates on its own JOIN object.
+- Query_arena is used for its ability to undo Item tree changes (grep for
+  thd->change_item_tree)
+
+2.1.1 Questioning this approach
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+There are optimizations that do rewrites that are not undoable (at all!) and
+are kept for the entire lifetime of the statement. For example,
+- outer->inner join conversion
+- subquery-to-semijoin conversion
+are done at first execution and their result is kept till the statement is
+destroyed. This is possible because these rewrites do not depend on values of
+PS parameters or contents of the table.
+
+However, the difference between IN->EXISTS's select and Materialization's 
+select is that IN->EXISTS has an extra equality injected into the WHERE 
+clause.  Extra equality can be used to justify non-unodable conversions like
+outer->inner join convertion and subquery-to-semi-join conversion. These
+involve changing TABLE_LIST objects so Query_arena mechanism won't be able to
+undo them.
+
+2.2 Specific-case approach, optimizing with extra equality
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Igor's suggestion:
+
+Run join optimization with extra parameter meaning that optimization should
+be done for both with-equality and without-equality cases.
+
+First, proceed with optimization without the equality. After we've came to the
+point where we've picked a join plan, do the following:
+
+- save join->join_tab and join->best_positions[]
+- call add_key_fields() for the IN-equality so that KEYUSE array is updated
+  with the IN-equality
+- run choose_plan(), i.e. join optimization, again.
+
+This approach is rather limited, but it is much easier to implement.
+
+

(Guest - Sun, 13 Jun 2010, 16:58
    
Dependency deleted: WL#91 no longer depends on WL#89

(Timour - Fri, 12 Mar 2010, 09:17
    
Status updated.
--- /tmp/wklog.89.old.13018	2010-03-12 09:17:25.000000000 +0000
+++ /tmp/wklog.89.new.13018	2010-03-12 09:17:25.000000000 +0000
@@ -1 +1 @@
-Assigned
+In-Progress

(Igor - Wed, 10 Mar 2010, 21:48
    
Category updated.
--- /tmp/wklog.89.old.778	2010-03-10 21:48:08.000000000 +0000
+++ /tmp/wklog.89.new.778	2010-03-10 21:48:08.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint

(Igor - Wed, 10 Mar 2010, 21:48
    
Status updated.
--- /tmp/wklog.89.old.778	2010-03-10 21:48:08.000000000 +0000
+++ /tmp/wklog.89.new.778	2010-03-10 21:48:08.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned

(Psergey - Sun, 28 Feb 2010, 16:34
    
High-Level Specification modified.
--- /tmp/wklog.89.old.24497	2010-02-28 16:34:05.000000000 +0000
+++ /tmp/wklog.89.new.24497	2010-02-28 16:34:05.000000000 +0000
@@ -36,8 +36,8 @@
 
 So, we'll need to compute both exists_select_cost and materialization_cost.
 
-Difficulty with computing the two costs
----------------------------------------
+Difficulty with the need to run select optimization two times
+-------------------------------------------------------------
 The problem is in this scenario:
 1. We compute materialization_cost by running optimization for the original
    subquery select.
@@ -46,4 +46,10 @@
 3. Then we find that cost #1 is less and want to execute the materialization
    strategy.
 
+The problem is that once one injects "oe=ie", it can trigger some optimization
+steps that are not possible to undo.
+- Example1:  outer->inner join conversion
+- non-Example: according to Igor, "oe=ie" won't participate in equality propagation.
+- ... what else ?
+
 

(Psergey - Sun, 28 Feb 2010, 16:08
    
High-Level Specification modified.
--- /tmp/wklog.89.old.24098	2010-02-28 16:08:56.000000000 +0000
+++ /tmp/wklog.89.new.24098	2010-02-28 16:08:56.000000000 +0000
@@ -36,3 +36,14 @@
 
 So, we'll need to compute both exists_select_cost and  materialization_cost.
 
+Difficulty with computing the two costs
+---------------------------------------
+The problem is in this scenario:
+1. We compute materialization_cost by running optimization for the original
+   subquery select.
+2. We compute exists_select_cost by running optimization for the subquery's
+   select with "oe=ie" injected into WHERE
+3. Then we find that cost #1 is less and want to execute the materialization
+   strategy.
+
+

(Psergey - Sun, 28 Feb 2010, 15:57
    
High-Level Specification modified.
--- /tmp/wklog.89.old.24045	2010-02-28 15:57:49.000000000 +0000
+++ /tmp/wklog.89.new.24045	2010-02-28 15:57:49.000000000 +0000
@@ -1 +1,38 @@
+Why need two optimizations
+--------------------------
+Consider a query with subquery:
+
+  SELECT 
+    oe IN (SELECT ie FROM inner_tbl WHERE inner_cond)
+  FROM outer_tbl
+  WHERE outer_cond
+
+If we use Materialization strategy, the costs will be 
+
+  cost of accessing outer_tbl + 
+  materialization_cost + 
+  #records(outer_tbl w/o outer_cond) * lookup_cost
+
+where 
+
+  materialization_cost= 
+    cost of executing the (SELECT ie FROM inner_tbl WHERE inner_cond)
+
+On the other hand, for IN->EXISTS strategy, the subquery will be rewritten into
+
+  SELECT 
+    EXISTS (SELECT 1 FROM inner_tbl WHERE inner_cond AND oe=ie)
+  FROM outer_tbl
+  WHERE outer_cond
+
+and the costs will be 
+
+  cost of accessing outer_tbl + 
+  #records(outer_tbl w/o outer_cond) * exists_select_cost
+
+where
+  exists_select_cost= 
+    cost of executing the (SELECT 1 FROM inner_tbl WHERE inner_cond AND oe=ie)
+
+So, we'll need to compute both exists_select_cost and  materialization_cost.
 

(Psergey - Sun, 28 Feb 2010, 15:07
    
Dependency created: WL#91 now depends on WL#89


Report Generator:
 
Saved Reports:

WorkLog v4.0.0
  © 2010  Sergei Golubchik and Monty Program AB
  © 2004  Andrew Sweger <yDNA@perlocity.org> and Addnorya
  © 2003  Matt Wagner <matt@mysql.com> and MySQL AB