This worklog has been replaced with mariadb.org/jira

This site is here for historical purposes only. Do not add or edit tasks here!

 
 
 


 Transaction coordinator plugin
Task ID: 132
Created by: Knielsen, 24 Aug 2010 (Done)
 High-Level Description
A part of the replication API project, MWL#107.

This worklog describes an API that allows a replication plugin to hook deep
into the commit mechanism of transactions in mysqld and implement its own
transaction coordinator (TC).

The existing binary log already implements a TC. This is used to make writes
to the binary log happen in 2-phase commit with transactional engines. It is
also used during crash recovery to get the list of prepared-but-not-committed
transactions from the binary log and ensure that those transactions are
committed (and any others rolled back), to avoid inconsistencies between the
contents of binary log and engines.

Thus the TC needs to be part of a replication API that allows the existing
binary log implementation to be written as a plugin, or to allow the
implementation of a replacement binary log.

In addition, something like Galera synchronous replication needs more control
over the commit process. In particular, it needs to be able to control the
order in which transactions are committed. This worklog extends the TC
interface to also allow the TC to control commit order. This part of the worklog
builds on the work in MWL#116, in particular on the prepare_ordered() and
commit_ordered() extensions to the storage engine API.
 Task Dependencies
Others waiting for Task 132: 107 New replication APIs
Task 132 is waiting for: 116 Efficient group commit for binary log
 High-Level Specification
Transaction coordinators
------------------------

The transaction coordinator in MariaDB currently has a number of
responsibilities:

1. TC decides whether a transaction is committed or rolled back during COMMIT
(or autocommit). This happens in the log_xid() method of TC. Up until a
successfully completed call to log_xid(), the transaction may still be rolled
back for whatever reason, but after log_xid() returns success, the transaction
is considered durably committed.

2. TC implements the "middle engine" optimisation in two-phase commit: When N
engines participate in a commit, only (N-1) engines need to run the "prepare"
phase. The last engine, following (N-1) successful prepares, can run the
commit phase directly. In current MariaDB, this "middle engine" commit is
implemented in the TC, and in practice it is the binary log engine.

3. TC is responsible for deciding when a transaction is committed or rolled
back, in case of crash recovery. This is used when mysqld crashes after a
transaction goes through prepare() in all engines and log_xid(), but before all
engines have had time to run commit(). In this case, TC provides the list of
transactions that were durably committed prior to the crash, so that engines
can commit those transactions and roll back any other still-prepared
transactions that were never durably committed.
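
The recovery rule in item 3 can be illustrated with a small, self-contained
C++ sketch. The names used here (Xid, RecoveryPlan, plan_recovery) are
hypothetical and are not part of the server. The sketch only assumes that the
TC can supply the set of durably logged xids and that an engine can report the
xids it still holds in the prepared state.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical stand-in for the server's transaction id type.
using Xid = std::uint64_t;

// What recovery should do with the transactions one engine reports as prepared.
struct RecoveryPlan {
  std::vector<Xid> to_commit;    // found in the TC log: durably committed
  std::vector<Xid> to_rollback;  // never reached the TC log
};

// Decide the fate of every prepared transaction in one engine.
//   tc_committed:    xids the TC logged durably before the crash
//   engine_prepared: xids the engine still holds in the prepared state
RecoveryPlan plan_recovery(const std::set<Xid> &tc_committed,
                           const std::vector<Xid> &engine_prepared) {
  RecoveryPlan plan;
  for (Xid xid : engine_prepared) {
    if (tc_committed.count(xid) != 0)
      plan.to_commit.push_back(xid);
    else
      plan.to_rollback.push_back(xid);
  }
  // Every prepared transaction ends up in exactly one of the two lists.
  assert(plan.to_commit.size() + plan.to_rollback.size() ==
         engine_prepared.size());
  return plan;
}
```

The point of the sketch is that the engine never decides on its own: the TC
log is the single source of truth for which prepared transactions survive.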

In current MariaDB, we have two different TC implementations (as well as a
"dummy" empty implementation that is not actually used).

Binary log
----------

One implementation is the binary log. When a transaction involving a
transactional storage engine (which supports 2-phase commit) is committed, the
write to the binary log is done in the binary log TC implementation in the
log_xid() call. The 2-phase commit is used to ensure that engines and binary
log on a master will still be consistent after a crash (otherwise we could end
up with transactions in the binary log which were missing in the engine, or
vice versa). This only works if durable commits are enabled, meaning
--innodb_flush_log_at_trx_commit=1 and --sync-binlog=1.

Another reason that the binary log implements TC is that it does not support
prepare/rollback. Thus, the only way it can participate in 2-phase commit is
to implement the "middle engine" optimisation, where it never needs to roll
back (although all other engines may have to if the binlog write fails).

The binary log also implements a "fake" storage engine. This is used to
register the binlog as an XA-capable transaction participant. This way, when
changes are made to tables in another XA-capable engine (such as InnoDB or
PBXT), the number of participating XA-capable engines becomes greater than
one, causing MySQL/MariaDB to use internal 2-phase commit.

Mmap TC
-------

When the binary log is not enabled, a different implementation of TC is used,
TC_MMAP. This keeps a score file recording which transactions are prepared and
committed, and is only used to be able to do crash recovery for transactions
with multiple participating transactional engines (e.g. InnoDB+PBXT).

When there is only one transactional engine in a transaction (in particular no
binary log, as the binary log counts as a transactional engine), the TC is not
involved in a commit at all, as the "middle engine" optimisation reduces to
simply commit() without a leading prepare().
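
The commit flow described in the sections above can be sketched as follows.
This is a simplified model under stated assumptions, not server code: Engine,
Coordinator, rollback_all and commit_transaction are hypothetical names, and
the tc_is_participant flag stands in for the binary log registering itself as
an XA-capable participant.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical transactional engine that records the calls made to it.
struct Engine {
  std::string name;
  bool prepare_ok;
  std::vector<std::string> calls;
  explicit Engine(std::string n, bool ok = true)
      : name(std::move(n)), prepare_ok(ok) {}
  bool prepare() { calls.push_back("prepare"); return prepare_ok; }
  void commit() { calls.push_back("commit"); }
  void rollback() { calls.push_back("rollback"); }
};

// Hypothetical TC: it never prepares, it only logs the commit decision.
struct Coordinator {
  bool log_ok = true;
  std::vector<unsigned long> logged;
  bool log_xid(unsigned long xid) {
    if (!log_ok) return false;
    logged.push_back(xid);
    return true;
  }
};

void rollback_all(const std::vector<Engine *> &engines) {
  for (Engine *e : engines) e->rollback();
}

// Returns true if the transaction was committed, false if rolled back.
bool commit_transaction(const std::vector<Engine *> &engines, Coordinator &tc,
                        bool tc_is_participant, unsigned long xid) {
  assert(!engines.empty());
  std::size_t participants = engines.size() + (tc_is_participant ? 1 : 0);
  if (participants <= 1) {
    // Single participant: plain commit(), no prepare() and no TC involvement.
    for (Engine *e : engines) e->commit();
    return true;
  }
  // Phase 1: every real engine prepares; the TC itself never does.
  for (Engine *e : engines) {
    if (!e->prepare()) { rollback_all(engines); return false; }
  }
  // Decision point: once log_xid() succeeds the transaction is committed.
  if (!tc.log_xid(xid)) { rollback_all(engines); return false; }
  // Phase 2: engines make the commit visible.
  for (Engine *e : engines) e->commit();
  return true;
}
```

With one engine and tc_is_participant set to false, the engine sees only
commit(). With the flag set to true, as when the binary log is enabled, the
same engine sees prepare() followed by commit(), and the TC logs the xid in
between.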

Generalised TC that controls commit order
-----------------------------------------

In the current MariaDB, the order in which transactions commit is largely
arbitrary. The InnoDB storage engine takes a lock across the 2-phase commit
(taken in prepare() and released in commit()); this ensures the same commit
order in InnoDB and the binary log, but with the unfortunate consequence of
preventing the implementation of group commit. Other than that, commit order
is mostly determined by random thread scheduling.

MWL#116 lists some reasons why a controlled commit order is desirable,
and describes an extension of the storage engine API with prepare_ordered()
and commit_ordered() methods that provide such a controlled order without
breaking group commit. However, the design in MWL#116 only allows different
participating engines to see the same consistent ordering of commits. The
actual ordering between different transactions is still determined by random
thread scheduling.

But something like Galera replication needs to be able to dictate commit
order based on the replication process. Galera requires a consistent commit
order among participating nodes, and this order is only determined as part of
replicating transactions during the commit phase.

In MWL#116, commit order is determined by the sequence of calls to
prepare_ordered(), and a guarantee is made that log_xid() and commit_ordered()
will be called in the same sequence. The idea is to generalise the TC
interface so that TC has the responsibility of invoking prepare_ordered() and
commit_ordered() in the order it chooses.

Thus the generalised TC interface has a method log_and_order(). This method
works similarly to the existing log_xid(), but in addition has the
responsibility of invoking any prepare_ordered() and commit_ordered() handler
methods in the desired sequence.

Implementing this more general method thus allows Galera, or any other
plugin, to enforce whatever commit order it needs.
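
As an illustration of a TC dictating commit order, the sketch below makes
transactions run their ordered step strictly by an externally assigned
sequence number, regardless of which thread arrives first. It is a toy model:
SequencedCommitter, commit_in_sequence and run_reversed are hypothetical
names, and the callback merely stands in for the commit_ordered() handler
calls. How a real plugin such as Galera obtains its sequence numbers is not
covered here.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class SequencedCommitter {
public:
  // Blocks until every transaction with a lower sequence number has run
  // its ordered step, then runs commit_step for this one.
  void commit_in_sequence(std::uint64_t seqno,
                          const std::function<void()> &commit_step) {
    std::unique_lock<std::mutex> lock(mutex_);
    // A sequence number may be used only once; a stale one would wait forever.
    assert(seqno >= next_seqno_);
    turn_.wait(lock, [&] { return next_seqno_ == seqno; });
    commit_step();  // stands in for the commit_ordered() handler calls
    ++next_seqno_;
    turn_.notify_all();
  }

private:
  std::mutex mutex_;
  std::condition_variable turn_;
  std::uint64_t next_seqno_ = 1;
};

// Starts one thread per transaction in reverse sequence order and returns
// the order in which the ordered steps actually ran.
std::vector<std::uint64_t> run_reversed(std::uint64_t count) {
  SequencedCommitter tc;
  std::vector<std::uint64_t> order;
  std::vector<std::thread> threads;
  for (std::uint64_t seqno = count; seqno >= 1; --seqno) {
    threads.emplace_back([&tc, &order, seqno] {
      // order is only touched while the committer's mutex is held.
      tc.commit_in_sequence(seqno, [&order, seqno] { order.push_back(seqno); });
    });
  }
  for (std::thread &t : threads) t.join();
  return order;
}
```

Even though the threads are started in reverse, the ordered steps run in
ascending sequence order, which is the property a log_and_order()
implementation has to provide for its chosen ordering.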


Primary redundancy plugin
-------------------------

Note that there can be only one TC active at any one time. This is inherent in
what it does:

 - There can be only one entity deciding on commit order.

 - There can be only one entity deciding on which transactions to recover
   after a crash.

 - There can be only one engine benefitting from the "middle engine"
   optimisation.

So for example, if the user wants to install a new replication plugin that
implements a TC, then the user will not be able to simultaneously use the
existing binary log or any other replication plugin that implements its own
TC.

For two binary log-like plugins to co-exist, at least one of them must support
transactions using prepare+commit (which the current binary log does not do).
Otherwise there is no 2-phase commit, and no way to prevent inconsistencies in
case of a crash.

The term "primary redundancy plugin" has been used in discussions for this
kind of replication plugin that implements its own TC.
 Low-Level Design
This is the generalised interface for TC plugins:

    class TC_LOG
    {
    public:
      int using_heuristic_recover();
      TC_LOG() {}
      virtual ~TC_LOG() {}

      virtual int open(const char *opt_name)=0;
      virtual void close()=0;
      virtual int log_and_order(THD *thd, my_xid xid, bool all,
                                bool need_prepare_ordered,
                                bool need_commit_ordered) = 0;
      virtual void unlog(ulong cookie, my_xid xid)=0;

    protected:
      void run_prepare_ordered(THD *thd, bool all);
      void run_commit_ordered(THD *thd, bool all);
    };

The only change compared to the old interface is the replacement of log_xid()
with the more general log_and_order(). The run_prepare_ordered() and
run_commit_ordered() are provided for TC implementations to be able to invoke
any {prepare,commit}_ordered handler methods in the desired order.

A TC implementation should override the abstract methods:

open()
    Aside from the obvious task of initialising the TC, this method is also
    responsible for handling recovery. If the TC cannot determine that the
    last shutdown was clean, it must initiate recovery, collecting the list
    (in practice a hash) of xids identifying transactions that must be
    committed in all handlers (if not committed already). This list is passed
    to ha_recover() to do the actual engine recovery.

close()
    Clean up ...

log_and_order()
    Requests a decision to commit (non-zero return) or rollback (zero return)
    of the transaction. At this point, the transaction has been successfully
    prepared in all engines.

    The method must call run_prepare_ordered() in such a way that calls in
    different threads happen in the order in which the transactions are
    committed. This call must be protected by the global LOCK_prepare_ordered
    mutex.

    The method must then call run_commit_ordered(), protected by
    LOCK_commit_ordered, again so that calls in different threads happen in
    the order in which the transactions are committed.

    The idea with prepare_ordered() is to call it as early as possible after
    commit order has been decided, for example to release locks early. In
    particular, a transaction can still be rolled back after prepare_ordered()
    (for example in case of a crash). In contrast, commit_ordered() may only
    be called after the transaction is durably committed in the TC.

    If need_prepare_ordered or need_commit_ordered is passed as FALSE, then
    the corresponding call need not be done. Doing it anyway is safe, but
    omitting it avoids the need to take a global mutex.

unlog()
    This is called after all participating engines have finished commit() for
    a transaction. Thus the TC is free to forget about the transaction (in
    terms of crash recovery) after this call. The "cookie" parameter receives
    the value returned previously from log_and_order(), and can be used at the
    TC's discretion.
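
To make the contract above concrete, here is a minimal sketch of a TC built on
a cut-down mock of the interface. Everything in it is an assumption made so
that the example compiles on its own: THD and my_xid are stand-in types, the
two global mutexes are plain std::mutex objects rather than server mutexes,
the run_*_ordered() bodies only record that they were called, and
TC_LOG_example is a hypothetical class. It serialises the whole call under one
internal mutex, which keeps the ordering requirement trivially true at the
cost of any group commit.

```cpp
#include <cassert>
#include <mutex>
#include <set>
#include <string>
#include <vector>

// Stand-ins for server types, only so that the sketch compiles on its own.
typedef unsigned long long my_xid;
struct THD { std::vector<std::string> trace; };

// std::mutex stand-ins for the global mutexes named in the text.
static std::mutex LOCK_prepare_ordered;
static std::mutex LOCK_commit_ordered;

// Cut-down mock of the interface; the run_*_ordered() bodies are fakes.
class TC_LOG {
public:
  virtual ~TC_LOG() {}
  virtual int open(const char *opt_name) = 0;
  virtual void close() = 0;
  virtual int log_and_order(THD *thd, my_xid xid, bool all,
                            bool need_prepare_ordered,
                            bool need_commit_ordered) = 0;
  virtual void unlog(unsigned long cookie, my_xid xid) = 0;

protected:
  void run_prepare_ordered(THD *thd, bool) { thd->trace.push_back("prepare_ordered"); }
  void run_commit_ordered(THD *thd, bool) { thd->trace.push_back("commit_ordered"); }
};

// Hypothetical TC that keeps its "log" in memory.
class TC_LOG_example : public TC_LOG {
public:
  int open(const char *) override { return 0; }  // nothing to recover
  void close() override { logged_.clear(); }

  int log_and_order(THD *thd, my_xid xid, bool all,
                    bool need_prepare_ordered,
                    bool need_commit_ordered) override {
    // One lock around all three steps: they happen in the same order
    // for every transaction, namely the order of acquiring this lock.
    std::lock_guard<std::mutex> order(order_mutex_);
    if (need_prepare_ordered) {
      std::lock_guard<std::mutex> guard(LOCK_prepare_ordered);
      run_prepare_ordered(thd, all);
    }
    logged_.insert(xid);  // stands in for the durable write of the decision
    thd->trace.push_back("logged");
    if (need_commit_ordered) {
      std::lock_guard<std::mutex> guard(LOCK_commit_ordered);
      run_commit_ordered(thd, all);
    }
    return 1;  // non-zero: commit; the value is passed back to unlog()
  }

  void unlog(unsigned long, my_xid xid) override {
    std::lock_guard<std::mutex> order(order_mutex_);
    logged_.erase(xid);  // engines have committed; recovery can forget it
  }

  bool remembers(my_xid xid) {
    std::lock_guard<std::mutex> order(order_mutex_);
    return logged_.count(xid) != 0;
  }

private:
  std::mutex order_mutex_;
  std::set<my_xid> logged_;
};
```

A TC that wants group commit would instead decide the order in a queue and
hold the global mutexes only around the run_*_ordered() calls. The single
lock used here is chosen purely for brevity.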
 Time Estimates
Name            Hours Worked   Last Updated
All Sub Tasks   74
Total           74

                Hrs Worked   Progress   Current   Original
Sub Tasks       74                      0         0
Total           74                      0         0
 Funding and Votes
Votes: 0: 0%

Funding: 0 offers, total 0 Euro
 Progress Reports
(Knielsen - Mon, 01 Nov 2010, 15:56
    
Version updated.
--- /tmp/wklog.132.old.27829	2010-11-01 15:56:19.000000000 +0000
+++ /tmp/wklog.132.new.27829	2010-11-01 15:56:19.000000000 +0000
@@ -1,2 +1,2 @@
-Server-9.x
+Server-5.3
 

(Knielsen - Mon, 01 Nov 2010, 15:56
    
Status updated.
--- /tmp/wklog.132.old.27829	2010-11-01 15:56:19.000000000 +0000
+++ /tmp/wklog.132.new.27829	2010-11-01 15:56:19.000000000 +0000
@@ -1,2 +1,2 @@
-Assigned
+Code-Review
 

(Knielsen - Mon, 01 Nov 2010, 15:56
    
Lead Architect updated:  -> Knielsen
Code Review updated:  -> Serg

(Knielsen - Mon, 01 Nov 2010, 12:44
    
Low Level Design modified.
--- /tmp/wklog.132.old.20444	2010-11-01 12:44:08.000000000 +0000
+++ /tmp/wklog.132.new.20444	2010-11-01 12:44:08.000000000 +0000
@@ -69,90 +69,3 @@
     TC's discretion.
 
 
-Alternative interfaces
-----------------------
-
-The general interface above allows the TC plugin to decide commit order as it
-wants, but requires it to handle inter-thread synchronisation by itself. Two
-alternative (and simpler) interfaces are provided for TC plugins that do not
-need to decide on commit order. One TC_LOG_unordered for TCs that do not care
-about commit order at all, and one TC_LOG_group_commit for TCs that need to
-efficiently commit transactions in same order as engines, whatever that order
-is.
-
-TC_LOG_unordered is as follows:
-
-    class TC_LOG_unordered: public TC_LOG
-    {
-    public:
-      TC_LOG_unordered();
-      ~TC_LOG_unordered();
-      int log_and_order(THD *thd, my_xid xid, bool all,
-			bool need_prepare_ordered, bool need_commit_ordered);
-    protected:
-      virtual int log_xid(THD *thd, my_xid xid)=0;
-
-    private:
-      ...
-    }
-
-This interface is identical to the old/current one in MariaDB.
-
-A TC based on this simpler interface overrides log_xid() instead of
-log_and_order(). log_xid() does not have to deal with
-{prepare,commit}_ordered(), this is handled by the TC_LOG_unordered()
-implementation. There is no guarantee on the ordering of log_xid() calls among
-different threads as compared to transaction commit order.
-
-TC_LOG_unordered is used by TC_LOG_MMAP.
-
-TC_LOG_group_commit is as follows:
-
-    class TC_LOG_group_commit: public TC_LOG
-    {
-    public:
-      TC_LOG_group_commit();
-      ~TC_LOG_group_commit();
-
-      int log_and_order(THD *thd, my_xid xid, bool all,
-			bool need_prepare_ordered, bool need_commit_ordered);
-    protected:
-      struct TC_group_commit_entry
-      {
-	struct TC_group_commit_entry *next;
-	THD *thd;
-	int xid_error;
-	...
-      };
-      virtual void group_log_xid(TC_group_commit_entry *first) = 0;
-      virtual int xid_log_after(TC_group_commit_entry *entry) = 0;
-    private:
-      ...
-    }
-
-A TC based on this interface overrides group_log_xid() and xid_log_after()
-instead of log_and_order(), and again does not need to deal with any
-{prepare,commit}_ordered().
-
-The group_log_xid() call is guaranteed to be used only single-threadedly; no
-two calls can be active in different threads at the same time. The call will
-receive a linked list of all transactions that queued up while waiting for the
-previous call to finish. This allows to easily do a group commit of all of
-them at once, eg. using a single fsync() call to make all the transactions in
-the list durable. The call must set individual error code for each transaction
-in the xid_error field of list members; it is possible to set failure for one
-transaction (which will then rollback) and simultaneously set success for
-another (which will then be committed).
-
-After group_log_xid() completes for a list of transactions, the
-xid_log_after() method will be called for each transaction in the list; these
-calls run in parallel, each in the thread corresponding to the transaction,
-with no ordering guarantee among them. The xid_log_after() call must return
-the same value that log_xid() and log_and_order() does: a non-zero cookie in
-the non-error case (which will be passed back in unlog(), or zero in the error
-case. The error cases among group_log_xid() and xid_log_after() _must_ be
-consistent!
-
-TC_LOG_group_commit is used by the binary log TC.
-
-

(Knielsen - Mon, 01 Nov 2010, 12:43
    
High-Level Specification modified.
--- /tmp/wklog.132.old.20410	2010-11-01 12:43:40.000000000 +0000
+++ /tmp/wklog.132.new.20410	2010-11-01 12:43:40.000000000 +0000
@@ -25,7 +25,7 @@
 transactions that were never durably committed.
 
 In current MariaDB, we have two different TC implementations (as well as a
-"dummy" empty implementation that I do not know if is used).
+"dummy" empty implementation that is not actually used).
 
 Binary log
 ----------
@@ -44,10 +44,11 @@
 to implement the "middle engine" optimisation where it never needs to rollback
 (but all other engines may in case of binlog failure).
 
-The binary log implements also a "fake" storage engine, mainly to hook into
-the commit (and prepare) phase of transaction processing. This is mainly used
-for statements in non-transactional engines, which are "committed" and written
-to the binary log outside of the TC and log_xid() framework.
+The binary log implements also a "fake" storage engine. This is used to
+register the binlog as an XA-capable transaction participant. This way, when
+changes are made to tables in another XA-capable engine (such as InnoDB or
+PBXT), the number of participating XA-capable engines becomes greater than
+one, causing MySQL/MariaDB to use internal 2-phase commit.
 
 Mmap TC
 -------
@@ -99,41 +100,6 @@
 Thus implementing this more general method would allow Galera or other plugin
 to enforce any desired commit order.
 
-TC interface subclasses
------------------------
-
-The MWL#116 has two different algorithms for handling commit order and
-invoking prepare_ordered() and commit_ordered() handler methods:
-
- - One used with TC_MMAP, which needs no correspondance between engines and
-   TC. This uses the existing log_xid() interface.
-
- - One used with the binary log TC, which ensures same commit order in engines
-   and binary log, and which uses a new single-threaded group_log_xid() TC
-   interface to efficiently do group commit.
-
-In the prototype patch for MWL#116, these two methods are mixed with each
-other in the function ha_commit_trans(), and the logic is quite complex. Using
-the log_and_order() TC generalisation provides a nice cleanup of this.
-
-We implement two subclasses of the TC interface:
-
- - One class TC_LOG_unordered for the method used with TC_MMAP. This
-   implements the old log_xid() interface.
-
- - One class TC_LOG_group_commit for the method used for the binary log. This
-   implements the new group_log_xid() interface.
-
-Each subclass implements the corresponding algorithm for invoking
-prepare_ordered() and commit_ordered(), using the same mechanisms as in
-MWL#116, but implemented in a cleaner way. The ha_commit_trans() function then
-has no details about prepare_ordered() or commit_ordered(), it just calls into
-tc_log->log_and_order(), which handles the necessary details.
-
-Thus a simple TC plugin similar to the binary log or TC_MMAP can implement one
-of the simple interfaces log_xid() or group_log_xid(), without having to worry
-about prepare_ordered() and commit_ordered(). But a plugin like Galera that
-needs to do more can implement the more general interface.
 
 Primary redundancy plugin
 -------------------------

(Knielsen - Mon, 01 Nov 2010, 12:43
    
High Level Description modified.
No change.

(Sergei - Wed, 25 Aug 2010, 13:12
    
Dependency created: WL#132 now depends on WL#116

(Knielsen - Wed, 25 Aug 2010, 11:33
    
Low Level Design modified.
--- /tmp/wklog.132.old.13714	2010-08-25 11:33:24.000000000 +0000
+++ /tmp/wklog.132.new.13714	2010-08-25 11:33:24.000000000 +0000
@@ -1,2 +1,158 @@
+This is the generalised interface for TC plugins:
+
+    class TC_LOG
+    {
+    public:
+      int using_heuristic_recover();
+      TC_LOG() {}
+      virtual ~TC_LOG() {}
+
+      virtual int open(const char *opt_name)=0;
+      virtual void close()=0;
+      virtual int log_and_order(THD *thd, my_xid xid, bool all,
+				bool need_prepare_ordered,
+				bool need_commit_ordered) = 0;
+      virtual void unlog(ulong cookie, my_xid xid)=0;
+
+    protected:
+      void run_prepare_ordered(THD *thd, bool all);
+      void run_commit_ordered(THD *thd, bool all);
+    };
+
+The only change compared to the old interface is the replacement of log_xid()
+with the more general log_and_order(). The run_prepare_ordered() and
+run_commit_ordered() are provided for TC implementations to be able to invoke
+any {prepare,commit}_ordered handler methods in desired order.
+
+A TC implementation should override the abstract methods:
+
+open()
+    Aside from the obvious of initialising the TC, this method also has the
+    responsibility of handling recovery. If the TC cannot determine that the
+    last shutdown was clean, it must initiate recovery, collecting the list (a
+    hash really) of xids identifying transactions that must be committed in
+    all handlers (if not committed already). This list is passed to
+    ha_recover() to do the actual engine recovery.
+
+close()
+    Clean up ...
+
+log_and_order()
+    Requests a decision to commit (non-zero return) or rollback (zero return)
+    of the transaction. At this point, the transaction has been successfully
+    prepared in all engines.
+
+    The method must call run_prepare_ordered(), in a way so that calls in
+    different threads happen in the order that the transactions are
+    committed. This call must be protected by the global LOCK_prepare_ordered
+    mutex.
+
+    The method must then call run_commit_ordered(), protected by
+    LOCK_commit_ordered, again so that different threads are called in the
+    order that transactions are committed.
+
+    The idea with prepare_ordered() is to call it as early as possible after
+    commit order has been decided, for example to release locks early. In
+    particular, a transaction can still be rolled back after prepare_ordered()
+    (for example in case of a crash). In contrast, commit_ordered() may only
+    be called after the transaction is durably committed in the TC.
+
+    If need_prepare_ordered or need_commit_ordered is passed as FALSE, then
+    the corresponding call need not be done. It is safe to do it anyway,
+    however omitting it avoids the need to take a global mutex.
+
+unlog()
+    This is called after all participating engines have finished commit() for
+    a transaction. Thus the TC is free to forget about the transaction (in
+    terms of crash recovery) after this call. The "cookie" parameter receives
+    the value returned previously from log_and_order(), and can be used at the
+    TC's discretion.
+
+
+Alternative interfaces
+----------------------
+
+The general interface above allows the TC plugin to decide commit order as it
+wants, but requires it to handle inter-thread synchronisation by itself. Two
+alternative (and simpler) interfaces are provided for TC plugins that do not
+need to decide on commit order. One TC_LOG_unordered for TCs that do not care
+about commit order at all, and one TC_LOG_group_commit for TCs that need to
+efficiently commit transactions in same order as engines, whatever that order
+is.
+
+TC_LOG_unordered is as follows:
+
+    class TC_LOG_unordered: public TC_LOG
+    {
+    public:
+      TC_LOG_unordered();
+      ~TC_LOG_unordered();
+      int log_and_order(THD *thd, my_xid xid, bool all,
+			bool need_prepare_ordered, bool need_commit_ordered);
+    protected:
+      virtual int log_xid(THD *thd, my_xid xid)=0;
+
+    private:
+      ...
+    }
+
+This interface is identical to the old/current one in MariaDB.
+
+A TC based on this simpler interface overrides log_xid() instead of
+log_and_order(). log_xid() does not have to deal with
+{prepare,commit}_ordered(), this is handled by the TC_LOG_unordered()
+implementation. There is no guarantee on the ordering of log_xid() calls among
+different threads as compared to transaction commit order.
+
+TC_LOG_unordered is used by TC_LOG_MMAP.
+
+TC_LOG_group_commit is as follows:
+
+    class TC_LOG_group_commit: public TC_LOG
+    {
+    public:
+      TC_LOG_group_commit();
+      ~TC_LOG_group_commit();
+
+      int log_and_order(THD *thd, my_xid xid, bool all,
+			bool need_prepare_ordered, bool need_commit_ordered);
+    protected:
+      struct TC_group_commit_entry
+      {
+	struct TC_group_commit_entry *next;
+	THD *thd;
+	int xid_error;
+	...
+      };
+      virtual void group_log_xid(TC_group_commit_entry *first) = 0;
+      virtual int xid_log_after(TC_group_commit_entry *entry) = 0;
+    private:
+      ...
+    }
+
+A TC based on this interface overrides group_log_xid() and xid_log_after()
+instead of log_and_order(), and again does not need to deal with any
+{prepare,commit}_ordered().
+
+The group_log_xid() call is guaranteed to be used only single-threadedly; no
+two calls can be active in different threads at the same time. The call will
+receive a linked list of all transactions that queued up while waiting for the
+previous call to finish. This allows to easily do a group commit of all of
+them at once, eg. using a single fsync() call to make all the transactions in
+the list durable. The call must set individual error code for each transaction
+in the xid_error field of list members; it is possible to set failure for one
+transaction (which will then rollback) and simultaneously set success for
+another (which will then be committed).
+
+After group_log_xid() completes for a list of transactions, the
+xid_log_after() method will be called for each transaction in the list; these
+calls run in parallel, each in the thread corresponding to the transaction,
+with no ordering guarantee among them. The xid_log_after() call must return
+the same value that log_xid() and log_and_order() does: a non-zero cookie in
+the non-error case (which will be passed back in unlog(), or zero in the error
+case. The error cases among group_log_xid() and xid_log_after() _must_ be
+consistent!
+
+TC_LOG_group_commit is used by the binary log TC.
 
 

(Knielsen - Tue, 24 Aug 2010, 14:09
    
Dependency created: WL#107 now depends on WL#132

(Knielsen - Tue, 24 Aug 2010, 14:09
    
High Level Description modified.
--- /tmp/wklog.132.old.26281	2010-08-24 14:09:12.000000000 +0000
+++ /tmp/wklog.132.new.26281	2010-08-24 14:09:12.000000000 +0000
@@ -1,3 +1,5 @@
+A part of the replication API project, MWL#107.
+
 This worklog describes an API that allows a replication plugin to hook deep
 into the commit mechanism of transactions in mysqld and implement its own
 transaction coordinator (TC).



WorkLog v4.0.0
  © 2010  Sergei Golubchik and Monty Program AB
  © 2004  Andrew Sweger <yDNA@perlocity.org> and Addnorya
  © 2003  Matt Wagner <matt@mysql.com> and MySQL AB