This worklog has been replaced with mariadb.org/jira

This site is here for historical purposes only. Do not add or edit tasks here!

 
 
 

WorkLog Frontpage Log in / Register
High-Level Description | Task Dependencies | High-Level Specification | Low-Level Design | File Attachments | User Comments | Time Estimates | Funding and Votes | Progress Reports

 Efficient group commit for binary log
Title
Task ID116
Queue
Version
Status
Priority
Copies toSergei

Created byKnielsen26 Apr 2010Done
Supervisor   
Lead Architect    
Architecture Review  
Implementor  
Code Review  
QA 17 Feb 2011
Documentation  
 High-Level Description
Currently, in order to ensure that the server can recover after a crash to a
state in which storage engines and binary log are consistent with each other,
it is necessary to use XA with durable commits for both storage engines
(innodb_flush_log_at_trx_commit=1) and binary log (sync_binlog=1).

This is _very_ expensive, since the server needs to do three fsync() operations
for every commit, as there is no working group commit when the binary log is
enabled.

The idea is to

 - Implement/fix group commit to work properly with the binary log enabled.

 - provide the foundations for implementing MWL#164. The idea in MWL#164 is to
   avoid the need to fsync() in the engine, and instead rely on replaying any
   lost transactions from the binary log against the engine during crash
   recovery.

For background see these articles:

    http://kristiannielsen.livejournal.com/12254.html
    http://kristiannielsen.livejournal.com/12408.html
    http://kristiannielsen.livejournal.com/12553.html
    http://kristiannielsen.livejournal.com/12810.html

----

Implementing group commit in MySQL faces some challenges from the handler
plugin architecture:

1. Because storage engine handlers have separate transaction log from the
mysql binlog (and from each other), there are multiple fsync() calls per
commit that need the group commit optimisation (2 per participating storage
engine + 1 for binlog).

2. The code handling commit is split in several places, in main server code
and in storage engine code. With pluggable binlog it will be split even
more. This requires a good abstract yet powerful API to be able to implement
group commit simply and efficiently in plugins without the different parts
having to rely on iternals of the others.

3. We want the order of commits to be the same in all engines participating in
multiple transactions. This requirement is the reason that InnoDB currently
breaks group commit with the infamous prepare_commit_mutex.

While currently there is no server guarantee to get same commit order in
engines an binlog (except for the InnoDB prepare_commit_mutex hack), there are
several reasons why this could be desirable:

 - InnoDB hot backup needs to be able to extract a binlog position that is
   consistent with the hot backup to be able to provision a new slave, and
   this is impossible without imposing at least partial consistent ordering
   between InnoDB and binlog.

 - Other backup methods could have similar needs, eg. XtraBackup or
   `mysqldump --single-transaction`, to have consistent commit order between
   binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
   or similar expensive blocking operation. Without such consistent ordering,
   we may get a backup where transaction A is committed but B is not, while in
   the binlog transaction B is logged before A. This makes the backup
   unsuitable for provisioning a new slave. (Other backup methods, like LVM
   snapshot, don't need consistent commit order, as they can restore
   out-of-order commits from the binlog during crash recovery using XA).

 - If we have consistent commit order, we can think about optimising commit to
   need only one fsync (for binlog); lost commits in storage engines can then
   be recovered from the binlog at crash recovery by re-playing against the
   engine from a particular point in the binlog. (Consistent commit order may
   not be absolutely needed for this, but it does simplify the implementation
   a lot).

 - With consistent commit order, we can get better semantics for START
   TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
   could even get it to return also a matching binlog position). Currently,
   this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
   engines.

 - In InnoDB, the performance in the presense of hotspots can be improved if
   we can release row locks early in the commit phase, but this requires that
   we release them in the same order as commits in the binlog to ensure
   consistency between master and slaves. (The Facebook patch does this,
   however more research is needed to understand exactly what they do and in
   which circumstances it is safe).

 - There was some discussions around Galera [1] synchroneous replication and
   global transaction ID that it needed consistent commit order among
   participating engines.

 - I believe there could be other applications for guaranteed consistent
   commit order, and that the architecture described in this worklog can
   implement such guarantee with reasonable overhead.


References:

[1] Galera: http://www.codership.com/products/galera_replication
 Task Dependencies
Others waiting for Task 116Task 116 is waiting forGraph
132 Transaction coordinator plugin
136 Cross-engine consistency for START TRANSACTION WITH CONSISTENT SNAPSHOT
163 Implement release of row locks in InnoDB during prepare() phase
164 Extend crash recovery to recover non-prepared transactions from binlog
165 Extend group commit to also handle non-transactional events
 
 High-Level Specification
The basic idea in group commit is that multiple threads, each handling one
transaction, prepare for commit and then queue up together waiting to do an
fsync() on the transaction log. Then once the log is available, a single
thread does the fsync() + other necessary book-keeping for all of the threads
at once. After this, the single thread signals the other threads that it's
done and they can finish up and return success (or failure) from the commit
operation.

So group commit has a parallel part, and a sequential part. So we need a
facility for engines/binlog to participate in both the parallel and the
sequential part.

To do this, we add two new handlerton methods:

    void (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
    void (*commit_ordered)(handlerton *hton, THD *thd, bool all);

The idea is that the existing prepare() and commit() methods run in the
parallel part of group commit, and the new prepare_ordered() and
commit_ordered() run in the sequential part.

The prepare_ordered() method is called after prepare(). The order of
tranctions that call into prepare_ordered() is guaranteed to be the same among
all storage engines and binlog, and it is serialised so no two calls can be
running inside the same engine at the same time.

The commit_ordered() method is called before commit(), and similarly is
guaranteed to have same transaction order in all participants, and to be
serialised within one engine.

As the prepare_ordered() and commit_ordered() calls are serialised, the idea
is that handlers should do the minimum amount of work needed in these calls,
relaying most of the work (eg. fsync() ...) to prepare() and commit().  After
commit_ordered() returns, the transaction should be recorded (in memory) as
committed before any transactions that later call commit_ordered(), and in
MVCC engines the transaction should become visible to newly started
transactions. There is however no need to write anything to disk in
prepare_ordered() (in case of a crash the commit will be recovered during
crash recovery).

As a concrete example, for InnoDB the commit_ordered() method will do the
first part of commit that fixed the commit order in the transaction log
buffer, and the commit() method will write the log to disk and fsync()
it. This split already exists inside the InnoDB code, running before
respectively after releasing the prepare_commit_mutex.

Note that prepare_ordered() and commit_ordered() are only called when internal
2-phase commit is used, that is when at least two XA-capable storage engines
(binlog included) participate in a transaction. For eg. an InnoDB-only
transaction with binlog disabled, the two new methods will not be called (just
like for the existing prepare() handlerton method).

In addition, the XA transaction coordinator (TC_LOG) is special, since it is
the one responsible for deciding whether to commit or rollback the
transaction. MWL#132 introduces the log_and_order() call for TC_LOG to handle
this decision, as well as invoking prepare_ordered() and commit_ordered() in
correct sequence. We need to implement log_and_order() for the binlog and MMAP
TC implementations.

The new methods prepare_ordered() and commit_ordered() are optional. A storage
engine is free to not implement them if they are not needed. In this case
there will be no order guarantee for the corresponding stage of group commit
for that engine. For example, InnoDB currently needs no ordering of the
prepare phase, so can omit implementing prepare_ordered(). But the Facebook
InnoDB extension of releasing locks early needs ordering in the prepare step
to preserve statement-based replication correctness; so this might be
implemented by adding prepare_ordered().

This means that all existing storage engines will work unmodified with this
new feature. They will still be able to take advantage of binlog group commit
(at least as long as they do not actively take steps to prevent it, like the
InnoDB prepare_commit_mutex does). But they will of course not get the
guarantee of consistent commit ordering with binlog, just as they do not have
this guarantee with MySQL or MariaDB without this feature.


Binlog implementation
---------------------

For the binlog, we implement a queue of transactions waiting to commit to the
binlog (waiting because a previous transaction is active writing to the binlog
or fsync()'ing the data so written, while holding the LOCK_log mutex). The
first transaction to enter the queue is the group leader, the rest are group
participants.

Only the group leader locks the LOCK_log mutex. When it obtains the mutex, it
writes all of the participants in the queue into the binary log (as well as
itself), and does a single fsync() call for all of them, this implementing
group commit for the binlog. The leader then runs the commit_ordered() method
for all transactions in the queue in the right order. This ensures the same
commit order in binlog and engines that implement commit_ordered(). Once
commit_ordered() has been run, the leader signals the participants to wakeup
and finish the commit in each thread individually, returning from
log_and_order() and proceeding to invoke the commit() handlerton methods in
engines (or rollback() in case of error).

Doing things this way makes for efficient thread synchronisation and
scheduling. Each thread doing commit only has to do one wait and one wakeup
(the leader will wait for LOCK_log and wakeup when it obtains that mutex; the
participants will wait for a signal from the leader). If there is no
contention for committing to the binary log, there will be no waiting or
wakeup at all.

Also, by doing the serial part of commit in a single thread (the leader), we
avoid imposing any particular scheduling order between the threads
participating in group commit. Ie. it is not a requirement that if thread A
commits before thread B (in the binlog and inside engines), then the wakeup of
B must wait for A to wakeup and handle its own commit first. This makes the OS
kernel free to decide on the best scheduling order, which I believe will help
with issues like CPU affinity and NUMA (though I have not benchmarked this).

Note that with this algorithm, the write()/fsync() to binlog and the
commit_ordered() calls happen in the group leader thread, which may be
different from the participant thread normally handling a given
transaction. To support this usage, we specify the interface for
commit_ordered() to be that it can be called in a different thread than the
normal thread handling the transaction and associated with THD. We specify
similar out-of-thread behaviour for prepare_ordered(); such behaviour is not
done in the binlog group commit implementation, but may be useful for similar
reasons in something like Galera.

Since the binlog commit and commit_ordered() calls can happen in a different
thread, the corresponding code can not use thread local variables. In
particular, it can not use my_error() and similar functions. Therefore, any
reporting of errors that happen in this part of the code must be delayed to
the last part of the commit processing: after wakeup of participants for the
binlog part, and in the commit() handlerton method for storage engines. To
emphasise this, the prepare_ordered() and commit_ordered() calls are declared
as returning void, rather than returning an error code; any error code must be
saved and returned instead from the log_and_order() method (for errors during
binlog write/fsync()) or the commit() method (for errors during
commit_ordered). (Of course, normally the storage engine should never have to
return error during commit after successful prepare(), as that breaks 2-phase
commit guarantees).

The restriction that thread local storage is not available in
prepare_ordered() and commit_ordered() should otherwise not be a big problem,
as storage engines have the THD available for accessing any necessary
per-thread information.


TC_LOG_MMAP implementation
--------------------------

The MMAP transaction coordinator does not need any ordering of commits, as it
only needs to record the set of transactions to be recovered after crash,
nothing about the sequence in which they committed or anything. And it already
supports group commit. So as such it does not need to be modified. However, as
per MWL#132, it does need to implement the new log_and_order() rather than the
old log_xid() method.

Also, the new API introduced here guarantees to storage engines that the
sequence of calls to prepare_ordered() among transactions will be the same as
the sequence of calls to commit_ordered(). So if the storage engine(s) in a
transaction implement both prepare_ordered() and commit_ordered(), the
TC_LOG_MMAP log_and_order() needs to ensure this by proper thread
synchronisation. If prepare_ordered() or commit_ordered() (or both) are
unimplemented in participating storage engines, there is no ordering
requirements at all.

When consistent order is required, the algorithm in
TC_LOG_MMAP::log_and_order() runs as follows:

 - When prepare_ordered() is called at the start, transactions are added to a
   list in the same order that they called prepare_ordered() (with suitable
   mutex protection).

 - The threads independently execute in parallel the logging part of
   TC_LOG_MMAP, using the old TC_LOG_MMAP::log_xid() code.

 - Each thread in the list finally goes to sleep waiting for the previous
   thread to complete commit_ordered(). When it is signalled by the previous
   thread, it runs its own commit_ordered() methods, and subsequently signal
   the next thread in the list to wake up. (We also need to handle the cases
   of the first and last thread in the list, of course).


Overview of work needed for this worklog
----------------------------------------

 - In ha_commit_trans(), change the code to use log_and_order() instead of
   log_xid(). (This is arguably part of MWL#132; the two worklogs are closely
   related in this respect).

 - In XtraDB, use the new commit_ordered() call to remove the
   prepare_commit_mutex (and resurrect group commit) without loosing the
   consistency with binlog commit order.

 - In log.cc (binlog module), implement the queueing of transactions and
   leader/participant distinction in log_and_order() to do group commit of
   multiple transactions to the binlog with a single shared fsync() call.

 - Similarly, in log.cc, modify TC_LOG_MMAP to synchronise threads as
   described above in the case where both prepare_ordered() and
   commit_ordered() are used.

-----------------------------------------------------------------------
New server variables:

Two new status variables are introduced:

binlog_commits
    Total number of transactions committed to the binary log

binlog_group_commits
    Number of group commits (of one or more transactions) done. If
    sync_binlog=1, then this is the number of fsync()'s done for transaction
    commits to the binary log

The effectiveness of binlog group commit can thus be computed as the fraction 
(binlog_commits/binlog_group_commits); if this is eg. 10 then it means that on
the average 10 transactions were committed to the binlog at once as a group.

(Note that most non-transactional statements cause events to be written
directly into the binary log, without going through the group commit code, so
they do /not/ count as "transactions" for the purpose of these status
variables.)

In debug builds only, a new global server system variable is defined to help
testing:

binlog_dbug_fsync_sleep
    If this is set different from zero, then it is a number of micro-seconds
    to sleep after running fsync() on the binary log to flush transactions to
    disk. This can thus be used to artificially increase the perceived cost
    of such fsync().

-----------------------------------------------------------------------
A possible alternative for this worklog:

 - At the moment there is no plugin actually using prepare_ordered(), so, it
   could be removed from the design. But it fits in well, is efficient to
   implement, and could be useful later (eg. for the Facebook feature of
   releasing locks early in InnoDB).
 Low-Level Design
1. Changes for ha_commit_trans()

The code in ha_commit_trans() must be changed to use the new
TC_LOG::log_and_order() method of MWL#132 instead of the old log_xid().


2. Binlog changes (log.cc)


2.1. Misc. cleanup

I did some refactoring to make the binlog code clearer and easier to extend
with group commit:

 - Some renaming and simplifications of parameters.

 - Drop the table_map_version code, it is not used (similar cleanup was done
   in MySQL on my suggestion).

 - Split old binlog_end_trans() function into two new functions (same change
   is in MySQL 5.5).

 - Don't needlessly take LOCK_log in MYSQL_BIN_LOG::write(Event) (backport
   from MySQL 5.5).

 - Fix several bugs in error handling when binlog write fails:
    * Failure in write_cache() not logged due to bogus assignment of
      write_error.
    * Assertion in net_end_statement() due to not calling my_error() in error
      case.
    * Not resetting transaction cache after failed binlog write.
    * XtraDB/InnoDB not releasing prepare_commit_mutex in innobase_rollback()
      when binlogging fails.


2.2 Implementing the group commit algorithm for the binlog

Binlog is changed to implement log_and_order() instead of log_xid(). This
simply calls into binlog_flush_trx_cache() (previously binlog_end_trans()),
which is modified to use the group commit algorithm described in high-level
specification.

A transaction enters the queue when it does prepare_ordered(). This way, the
scheduling order for prepare_ordered() calls is what determines the sequence
in the queue and effectively the commit order. Entering the queue is done
before requesting LOCK_log, so transactions can queue up while the previous
fsync() of the log (done under LOCK_log) is running. The first transaction to
enter the queue becomes the leader, which will request the LOCK_log, do binlog
fsync() for all, run commit_ordered() in correct sequence, and wakeup the
others. The rest of the transactions in the queue just wait to be woken up.


2.3 Non-transactional binlog writes

Non-transactional statements (like DDL or pure MyISAM DML) are not binlogged
using binlog_flush_trx_cache(); instead they are written directly to the
binary log (protected by the LOCK_log mutex) in MYSQL_BIN_LOG::write().

As proposed here, the logging of these non-transactional events will not
participate in group commit. The reason is that using such events is not
crash-safe anyway, so if they consistute a major part of the workload there is
less motivation to set sync_binlog=1 in the first place. Hence it is not
considered worthwhile to implement this as part of this worklog.

Group commit for these non-transactional writes should be possible to add
later without major changes to the basics. The idea would be to change the
non-transactional case in MYSQL_BIN_LOG::write(Log_event *event_info) to work
similar to binlog_flush_trx_cache(). Instead of taking LOCK_log and writing
the event directly, MYSQL_BIN_LOG::write() would need to put the thread into
the group commit queue like write_transaction_to_binlog_events() does for
transactional commit events. It will then either run the group commit leader
part if it is the first in queue, or wait for wakeup if not. The
trx_group_commit_leader() function will then need to be extended to handle two
different kinds of queued events, transactional commits vs. non-transactional
direct event writes.

MWL#165 is filed for this extension.


3 Locks

3.1 Global LOCK_prepare_ordered

This lock is taken to serialise calls to prepare_ordered(), and to protect the
queue of transactions participating in group commit. Note that effectively,
the commit order is decided by the order in which threads obtain this lock.


3.2 Global LOCK_commit_ordered

This lock is taken around calls to commit_ordered(). This way, we can
implement truly consistent START TRANSACTION WITH CONSISTENT SNAPSHOT simply
by taking this mutex while starting the snapshot (MWL#136). We also use this
lock around reading and updating the binlog_commits and binlog_group_commits
status variables to get a consistent view of both values for a single point in
the commit history.


3.3 Lock contention

For the binlog group commit, most of the locks taken should be uncontended
most of the time. The LOCK_log of course will be contended, as new threads
queue up waiting for the previous group commit (and binlog fsync()) to finish
so they can do the next group commit. This is the whole point of implementing
group commit.

The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should be not much
contended as long as handlers follow the intension of having the corresponding
handler calls execute quickly.

The per-thread THD::LOCK_wakeup_ready should not be contended; they are only
used to wake up a sleeping thread.


4. TC_LOG_MMAP changes (log.cc)

Changed to implement log_and_order() rather than log_xid(). Since TC_LOG_MMAP
has no concept of internal ordering of commits, this is just maintaining
correct ordering of prepare_ordered() calls with respect to commit_ordered()
calls. Thus unless both of these are implemented in at least one participating
storage engine, there is little changes needed.

But if both methods need to be called, the ordering is handled using another
queue. Each transaction atomically enter the queue and call prepare_ordered()
protected by the LOCK_prepare_ordered mutex. Each transaction is then logged
inside TC_LOG_MMAP, running in parallel.

Once the transaction is logged inside TC_LOG_MMAP, each transaction waits for
the previous transaction in the queue to call commit_ordered() (the first
transaction in the queue proceeds immediately). Then the transaction calls its
own commit_ordered() method, and finally signals the next transaction in
queue.

To avoid one group of transactions overtaking the previous group (which would
mess up the ordering), we set a flag to reserve the queue when the first
transaction in the queue finishes logging inside TC_LOG_MMAP. This flag is
cleared when the last transaction in the queue is done. We protect the flag
with the LOCK_prepare_ordered mutex, and introduce a pthread condition
variable COND_queue_busy for the second batch of transactions to be able to
wait until the first batch is finished going through the queue.

[Note that an alternative algorithm to this would be to modify the logging
inside TC_LOG_MMAP to run through the queue in a single thread, similar to
what is described for binlog above. This may or may not be faster. However I
think this is not a priority to work on now; TC_LOG_MMAP is only used for
multi-engine transactions (eg. PBXT+InnoDB) when binlog is /not/ enabled,
which I think is quite a corner case not worth it to spend much time
optimising.]


5. Changes to class THD (sql_class.cc)

To facilitate one thread waking up the next, we add a generic facility for
this to THD:

void THD::wait_for_wakeup_ready()
    Put a thread to sleep, waiting for another thread to wake us up. If
    signal_wakeup_ready() was already called by another thread, returns
    immediately without sleeping.

void THD::signal_wakeup_ready();
    Signal the thread to wake up. It will wake up if it is already sleeping in
    wait_for_wakeup_ready(), and will also wakeup immediately if it calls into
    wait_for_wakeup_ready() later.

void THD::clear_wakeup_ready()
    Reset the wakeup signal, so wait_for_wakeup_ready() will again go to sleep
    until next signal_wakeup_ready() call.

These are used in TC_LOG_MMAP and TC_LOG_BINLOG, but are quite generic and
could also be used for other purposes. They are intended only for short
waiting (eg. not something that would be interruptible by SQL kill).

A new condition variable COND_wakeup_ready and associated mutex
LOCK_wakeup_ready and boolean flag wakeup_ready are added to THD as private
members to facilitate this.


6. XtraDB changes (ha_innodb.cc)

The changes needed in XtraDB are comparatively simple, as XtraDB already
implements group commit, it just needs to be enabled with the new
commit_ordered() call.

The existing commit() method already is logically in two parts. The first part
runs under the prepare_commit_mutex() and must be run in same order as binlog
commit. This part needs to be moved to commit_ordered(). The second part runs
after releasing prepare_commit_mutex and does transaction log write+fsync; it
can remain.

Then the prepare_commit_mutex is removed (and the enable_unsafe_group_commit
XtraDB option to disable it).

There are two asserts that check that the thread running the first part of
XtraDB commit is the same as the thread running the other operations for the
transaction. These have to be removed (as commit_ordered() can run in a
different thread). Also an error reporting with sql_print_error() has to be
delayed until commit() time.


7. PBXT changes

I am working with Paul McCullagh to implement the commit_ordered() method for
PBXT. This work is discussed in this thread on maria-developers@:

    https://lists.launchpad.net/maria-developers/msg03619.html


8. Launchpad tree

The implementation of this worklog is maintained in this bzr branch on
Launchpad:

    lp:~maria-captains/maria/mariadb-5.1-mwl116

A quick benchmark was done, with sync_binlog=1 and
innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
transactions against one table.

Without the patch, we get only 25 queries per second.

With the patch, we get 650 queries per second.


9. Open issues/tasks

9.1 XA / other prepare() and commit() call sites.

Check that user-level XA is handled correctly and working. And covered
sufficiently with tests. Also check that any other calls of ha->prepare() and
ha->commit() outside of ha_commit_trans() are handled correctly.

9.2 Testing

This worklog needs additions to the test suite, including error inserts to
check error handling, and synchronisation points to check thread parallelism
correctness.
 File Attachments
 NameTypeSizeByDate
 User Comments
 Time Estimates
NameHours WorkedLast Updated
Knielsen7407 Jun 2010
Total74 
 Hrs WorkedProgressCurrentOriginal
This Task7400
Total7400
 
 Funding and Votes
Votes: 1: 100%
 Change vote: Useless    Nice to have    Important    Very important    

Funding: 0 offers, total 0 Euro
 Progress Reports
(Knielsen - Tue, 05 Jul 2011, 12:47
    
Status updated.
--- /tmp/wklog.116.old.1927	2011-07-05 12:47:48.000000000 +0000
+++ /tmp/wklog.116.new.1927	2011-07-05 12:47:48.000000000 +0000
@@ -1,2 +1,2 @@
-Code-Review
+Complete
 

(Pstoev - Thu, 17 Feb 2011, 10:32
    
QA signoff

(Knielsen - Thu, 06 Jan 2011, 08:14
    
High Level Description modified.
--- /tmp/wklog.116.old.11249	2011-01-06 08:14:26.000000000 +0000
+++ /tmp/wklog.116.new.11249	2011-01-06 08:14:26.000000000 +0000
@@ -11,9 +11,10 @@
 
  - Implement/fix group commit to work properly with the binary log enabled.
 
- - (Optionally) avoid the need to fsync() in the engine, and instead rely on
-   replaying any lost transactions from the binary log against the engine
-   during crash recovery.
+ - provide the foundations for implementing MWL#164. The idea in MWL#164 is to
+   avoid the need to fsync() in the engine, and instead rely on replaying any
+   lost transactions from the binary log against the engine during crash
+   recovery.
 
 For background see these articles:
 

(Knielsen - Mon, 01 Nov 2010, 15:55
    
Category updated.
--- /tmp/wklog.116.old.27735	2010-11-01 15:55:47.000000000 +0000
+++ /tmp/wklog.116.new.27735	2010-11-01 15:55:47.000000000 +0000
@@ -1,2 +1,2 @@
-Server-BackLog
+Server-Sprint
 

(Knielsen - Mon, 01 Nov 2010, 15:55
    
Version updated.
--- /tmp/wklog.116.old.27735	2010-11-01 15:55:47.000000000 +0000
+++ /tmp/wklog.116.new.27735	2010-11-01 15:55:47.000000000 +0000
@@ -1,2 +1,2 @@
-Server-9.x
+Server-5.3
 

(Knielsen - Mon, 01 Nov 2010, 15:55
    
Status updated.
--- /tmp/wklog.116.old.27735	2010-11-01 15:55:47.000000000 +0000
+++ /tmp/wklog.116.new.27735	2010-11-01 15:55:47.000000000 +0000
@@ -1,2 +1,2 @@
-Assigned
+Code-Review
 

(Knielsen - Mon, 01 Nov 2010, 15:55
    
Lead Architect updated:  -> Knielsen
Code Review updated:  -> Serg
QA updated:  -> Pstoev

(Knielsen - Mon, 01 Nov 2010, 12:42
    
Low Level Design modified.
--- /tmp/wklog.116.old.20398	2010-11-01 12:42:41.000000000 +0000
+++ /tmp/wklog.116.new.20398	2010-11-01 12:42:41.000000000 +0000
@@ -1,286 +1,182 @@
 1. Changes for ha_commit_trans()
 
-The gut of the code for commit is in the function ha_commit_trans() (and in
-commit_one_phase() which is called from it). This must be extended to use the
-new prepare_ordered(), group_log_xid(), and commit_ordered() calls.
+The code in ha_commit_trans() must be changed to use the new
+TC_LOG::log_and_order() method of MWL#132 instead of the old log_xid().
 
-1.1 Atomic queue of committing transactions
 
-To keep the right commit order among participants, we put transactions into a
-queue. The operations on the queue are non-locking:
+2. Binlog changes (log.cc)
 
- - Insert THD at the head of the queue, and return old queue.
 
-    THD *enqueue_atomic(THD *thd)
+2.1. Misc. cleanup
 
- - Fetch (and delete) the whole queue.
+I did some refactoring to make the binlog code clearer and easier to extend
+with group commit:
 
-    THD *atomic_grab_reverse_queue()
+ - Some renaming and simplifications of parameters.
 
-These are simple to implement with atomic compare-and-set. Note that there is
-no ABA problem [2], as we do not delete individual elements from the queue, we
-grab the whole queue and replace it with NULL.
+ - Drop the table_map_version code, it is not used (similar cleanup was done
+   in MySQL on my suggestion).
 
-A transaction enters the queue when it does prepare_ordered(). This way, the
-scheduling order for prepare_ordered() calls is what determines the sequence
-in the queue and effectively the commit order.
-
-The queue is grabbed by the code doing group_log_xid() and commit_ordered()
-calls. The queue is passed directly to group_log_xid(), and afterwards
-iterated to do individual commit_ordered() calls.
-
-Using a lock-free queue allows prepare_ordered() (for one transaction) to run
-in parallel with commit_ordered (in another transaction), increasing potential
-parallelism.
-
-The queue is simply a linked list of THD objects, linked through a
-THD::next_commit_ordered field. Since we add at the head of the queue, the
-list is actually in reverse order, so must be reversed when we grab and delete
-it.
-
-The reason that enqueue_atomic() returns the old queue is so that we can check
-if an insert goes to the head of the queue. The thread at the head of the
-queue will do the sequential part of group commit for everyone.
-
-
-1.2 Locks
-
-1.2.1 Global LOCK_prepare_ordered
-
-This lock is taken to serialise calls to prepare_ordered(). Note that
-effectively, the commit order is decided by the order in which threads obtain
-this lock.
-
-
-1.2.2 Global LOCK_group_commit and COND_group_commit
-
-This lock is used to protect the serial part of group commit. It is taken
-around the code where we grab the queue, call group_log_xid() on the queue,
-and call commit_ordered() on each element of the queue, to make sure they
-happen serialised and in consistent order. It also protects the variable
-group_commit_queue_busy, which is used when not using group_log_xid() to delay
-running over a new queue until the first queue is completely done.
-
-
-1.2.3 Global LOCK_commit_ordered
-
-This lock is taken around calls to commit_ordered(), to ensure they happen
-serialised.
+ - Split old binlog_end_trans() function into two new functions (same change
+   is in MySQL 5.5).
 
+ - Don't needlessly take LOCK_log in MYSQL_BIN_LOG::write(Event) (backport
+   from MySQL 5.5).
 
-1.2.4 Per-thread thd->LOCK_commit_ordered and thd->COND_commit_ordered
+ - Fix several bugs in error handling when binlog write fails:
+    * Failure in write_cache() not logged due to bogus assignment of
+      write_error.
+    * Assertion in net_end_statement() due to not calling my_error() in error
+      case.
+    * Not resetting transaction cache after failed binlog write.
+    * XtraDB/InnoDB not releasing prepare_commit_mutex in innobase_rollback()
+      when binlogging fails.
 
-This lock protects the thd->group_commit_ready variable, as well as the
-condition variable used to wake up threads after log_xid() and
-commit_ordered() finishes.
 
+2.2 Implementing the group commit algorithm for the binlog
 
-1.2.5 Global LOCK_group_commit_queue
+Binlog is changed to implement log_and_order() instead of log_xid(). This
+simply calls into binlog_flush_trx_cache() (previously binlog_end_trans()),
+which is modified to use the group commit algorithm described in high-level
+specification.
 
-This is only used on platforms with no native compare-and-set operations, to
-make the queue operations atomic.
-
-
-1.3 Commit algorithm.
-
-This is the basic algorithm, simplified by
-
- - omitting some error handling
-
- - omitting looping over all handlers when invoking handler methods
-
- - omitting some possible optimisations when not all calls needed (see next
-   section).
-
- - Omitting the case where no group_log_xid() is used, see below.
-
----- BEGIN ALGORITHM ----
-    ht->prepare()
-
-    // Call prepare_ordered() and enqueue in correct commit order
-    lock(LOCK_prepare_ordered)
-    ht->prepare_ordered()
-    old_queue= enqueue_atomic(thd)
-    thd->group_commit_ready= FALSE
-    is_group_commit_leader= (old_queue == NULL)
-    unlock(LOCK_prepare_ordered)
-
-    if (is_group_commit_leader)
-
-        // The first in queue handles group commit for everyone
-
-        lock(LOCK_group_commit)
-        // Wait while queue is busy, see below for when this occurs
-        while (group_commit_queue_busy)
-            cond_wait(COND_group_commit)
+A transaction enters the queue when it does prepare_ordered(). This way, the
+scheduling order for prepare_ordered() calls is what determines the sequence
+in the queue and effectively the commit order. Entering the queue is done
+before requesting LOCK_log, so transactions can queue up while the previous
+fsync() of the log (done under LOCK_log) is running. The first transaction to
+enter the queue becomes the leader, which will request the LOCK_log, do binlog
+fsync() for all, run commit_ordered() in correct sequence, and wakeup the
+others. The rest of the transactions in the queue just wait to be woken up.
 
-        // Grab and reverse the queue to get correct order of transactions
-        queue= atomic_grab_reverse_queue()
 
-        // This call will set individual error codes in thd->xid_error
-        // It also sets the cookie for unlog() in thd->xid_cookie
-        group_log_xid(queue)
+2.3 Non-transactional binlog writes
 
-        lock(LOCK_commit_ordered)
-        for (other IN queue)
-            if (!other->xid_error)
-                ht->commit_ordered()
-        unlock(LOCK_commit_ordered)
+Non-transactional statements (like DDL or pure MyISAM DML) are not binlogged
+using binlog_flush_trx_cache(); instead they are written directly to the
+binary log (protected by the LOCK_log mutex) in MYSQL_BIN_LOG::write().
 
-        unlock(LOCK_group_commit)
+As proposed here, the logging of these non-transactional events will not
+participate in group commit. The reason is that using such events is not
+crash-safe anyway, so if they consistute a major part of the workload there is
+less motivation to set sync_binlog=1 in the first place. Hence it is not
+considered worthwhile to implement this as part of this worklog.
 
-        // Now we are done, so wake up all the others.
-        for (other IN TAIL(queue))
-            lock(other->LOCK_commit_ordered)
-            other->group_commit_ready= TRUE
-            cond_signal(other->COND_commit_ordered)
-            unlock(other->LOCK_commit_ordered)
-    else
-        // If not the leader, just wait until leader did the work for us.
-        lock(thd->LOCK_commit_ordered)
-        while (!thd->group_commit_ready)
-            cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
-        unlock(other->LOCK_commit_ordered)
+Group commit for these non-transactional writes should be possible to add
+later without major changes to the basics. The idea would be to change the
+non-transactional case in MYSQL_BIN_LOG::write(Log_event *event_info) to work
+similar to binlog_flush_trx_cache(). Instead of taking LOCK_log and writing
+the event directly, MYSQL_BIN_LOG::write() would need to put the thread into
+the group commit queue like write_transaction_to_binlog_events() does for
+transactional commit events. It will then either run the group commit leader
+part if it is the first in queue, or wait for wakeup if not. The
+trx_group_commit_leader() function will then need to be extended to handle two
+different kinds of queued events, transactional commits vs. non-transactional
+direct event writes.
 
-    // Finally do any error reporting now that we're back in own thread.
-    if (thd->xid_error)
-        xid_delayed_error(thd)
-    else
-        ht->commit(thd)
-        unlog(thd->xid_cookie, thd->xid)
----- END ALGORITHM ----
+MWL#165 is filed for this extension.
 
-If the transaction coordinator does not support group_log_xid(), we have to do
-things differently. In this case after the serialisation point at
-prepare_ordered(), we have to parallelise again when running log_xid()
-(otherwise we would loose group commit). But then when log_xid() is done, we
-have to serialise again to check for any error and call commit_ordered() in
-correct sequence for any transaction where log_xid() did not return error.
 
-The central part of the algorithm in this case (when using log_xid()) is:
+3 Locks
 
----- BEGIN ALGORITHM ----
-    cookie= log_xid(thd)
-    error= (cookie == 0)
+3.1 Global LOCK_prepare_ordered
 
-    if (is_group_commit_leader)
+This lock is taken to serialise calls to prepare_ordered(), and to protect the
+queue of transactions participating in group commit. Note that effectively,
+the commit order is decided by the order in which threads obtain this lock.
 
-        // The first to enqueue grabs the queue and runs first.
-        // But we must wait until a previous queue run is fully done.
 
-        lock(LOCK_group_commit)
-        while (group_commit_queue_busy)
-            cond_wait(COND_group_commit)
-        queue= atomic_grab_reverse_queue()
-        // The queue will be busy until last thread in it is done.
-        group_commit_queue_busy= TRUE
-        unlock(LOCK_group_commit)
-    else
-        // Not first in queue -> wait for previous one to wake us up.
-        lock(thd->LOCK_commit_ordered)
-        while (!thd->group_commit_ready)
-            cond_wait(thd->LOCK_commit_ordered, thd->COND_commit_ordered)
-        unlock(other->LOCK_commit_ordered)
+3.2 Global LOCK_commit_ordered
 
-    if (!error)      // Only if log_xid() was successful
-        lock(LOCK_commit_ordered)
-        ht->commit_ordered()
-        unlock(LOCK_commit_ordered)
+This lock is taken around calls to commit_ordered(). This way, we can
+implement truly consistent START TRANSACTION WITH CONSISTENT SNAPSHOT simply
+by taking this mutex while starting the snapshot (MWL#136). We also use this
+lock around reading and updating the binlog_commits and binlog_group_commits
+status variables to get a consistent view of both values for a single point in
+the commit history.
 
-    // Wake up the next thread, and release queue in last.
-    next= thd->next_commit_ordered
 
-    if (next)
-        lock(next->LOCK_commit_ordered)
-        next->group_commit_ready= TRUE
-        cond_signal(next->COND_commit_ordered)
-        unlock(next->LOCK_commit_ordered)
-    else
-        lock(LOCK_group_commit)
-        group_commit_queue_busy= FALSE
-        unlock(LOCK_group_commit)
----- END ALGORITHM ----
+3.3 Lock contention
 
-There are a number of locks taken in the algorithm, but in the group_log_xid()
-case most of them should be uncontended most of the time. The
-LOCK_group_commit of course will be contended, as new threads queue up waiting
-for the previous group commit (and binlog fsync()) to finish so they can do
-the next group commit. This is the whole point of implementing group commit.
+For the binlog group commit, most of the locks taken should be uncontended
+most of the time. The LOCK_log of course will be contended, as new threads
+queue up waiting for the previous group commit (and binlog fsync()) to finish
+so they can do the next group commit. This is the whole point of implementing
+group commit.
 
 The LOCK_prepare_ordered and LOCK_commit_ordered mutexes should be not much
 contended as long as handlers follow the intension of having the corresponding
 handler calls execute quickly.
 
-The per-thread LOCK_commit_ordered mutexes should not be contended; they are
-only used to wake up a sleeping thread.
-
-
-1.4 Optimisations when not using all three new calls
+The per-thread THD::LOCK_wakeup_ready should not be contended; they are only
+used to wake up a sleeping thread.
 
 
-The prepare_ordered(), group_log_xid(), and commit_ordered() methods are
-optional, and if not implemented by a particular handler/transaction
-coordinator, we can optimise the algorithm to take advantage of not having to
-keep ordering for the missing parts.
+4. TC_LOG_MMAP changes (log.cc)
 
-If there is no prepare_ordered(), then we need not take the
-LOCK_prepare_ordered mutex.
+Changed to implement log_and_order() rather than log_xid(). Since TC_LOG_MMAP
+has no concept of internal ordering of commits, this is just maintaining
+correct ordering of prepare_ordered() calls with respect to commit_ordered()
+calls. Thus unless both of these are implemented in at least one participating
+storage engine, there is little changes needed.
+
+But if both methods need to be called, the ordering is handled using another
+queue. Each transaction atomically enter the queue and call prepare_ordered()
+protected by the LOCK_prepare_ordered mutex. Each transaction is then logged
+inside TC_LOG_MMAP, running in parallel.
+
+Once the transaction is logged inside TC_LOG_MMAP, each transaction waits for
+the previous transaction in the queue to call commit_ordered() (the first
+transaction in the queue proceeds immediately). Then the transaction calls its
+own commit_ordered() method, and finally signals the next transaction in
+queue.
+
+To avoid one group of transactions overtaking the previous group (which would
+mess up the ordering), we set a flag to reserve the queue when the first
+transaction in the queue finishes logging inside TC_LOG_MMAP. This flag is
+cleared when the last transaction in the queue is done. We protect the flag
+with the LOCK_prepare_ordered mutex, and introduce a pthread condition
+variable COND_queue_busy for the second batch of transactions to be able to
+wait until the first batch is finished going through the queue.
+
+[Note that an alternative algorithm to this would be to modify the logging
+inside TC_LOG_MMAP to run through the queue in a single thread, similar to
+what is described for binlog above. This may or may not be faster. However I
+think this is not a priority to work on now; TC_LOG_MMAP is only used for
+multi-engine transactions (eg. PBXT+InnoDB) when binlog is /not/ enabled,
+which I think is quite a corner case not worth it to spend much time
+optimising.]
+
+
+5. Changes to class THD (sql_class.cc)
+
+To facilitate one thread waking up the next, we add a generic facility for
+this to THD:
+
+void THD::wait_for_wakeup_ready()
+    Put a thread to sleep, waiting for another thread to wake us up. If
+    signal_wakeup_ready() was already called by another thread, returns
+    immediately without sleeping.
+
+void THD::signal_wakeup_ready();
+    Signal the thread to wake up. It will wake up if it is already sleeping in
+    wait_for_wakeup_ready(), and will also wakeup immediately if it calls into
+    wait_for_wakeup_ready() later.
+
+void THD::clear_wakeup_ready()
+    Reset the wakeup signal, so wait_for_wakeup_ready() will again go to sleep
+    until next signal_wakeup_ready() call.
+
+These are used in TC_LOG_MMAP and TC_LOG_BINLOG, but are quite generic and
+could also be used for other purposes. They are intended only for short
+waiting (eg. not something that would be interruptible by SQL kill).
+
+A new condition variable COND_wakeup_ready and associated mutex
+LOCK_wakeup_ready and boolean flag wakeup_ready are added to THD as private
+members to facilitate this.
 
-If there is no commit_ordered(), then we need not take the LOCK_commit_ordered
-mutex.
 
-If there is no group_log_xid(), then we only need the queue to ensure same
-ordering of transactions for commit_ordered() as for prepare_ordered(). Thus,
-if either of these (or both) are also not present, we do not need to use the
-queue at all.
-
-
-2. Binlog code changes (log.cc)
-
-
-The bulk of the work needed for the binary log is to extend the code to allow
-group commit to the log. Unlike InnoDB/XtraDB, there is no existing support
-inside the binlog code for group commit.
-
-The existing code runs most of the write + fsync to the binary lock under the
-global LOCK_log mutex, preventing any group commit.
-
-To enable group commit, this code must be split into two parts:
-
- - one part that runs per transaction, re-writing the embedded event positions
-   for the correct offset, and writing this into the in-memory log cache.
-
- - another part that writes a set of transactions to the disk, and runs
-   fsync().
-
-Then in group_log_xid(), we can run the first part in a loop over all the
-transactions in the passed-in queue, and run the second part only once.
-
-The binlog code also has other code paths that write into the binlog,
-eg. non-transactional statements. These have to be adapted also to work with
-the new code.
-
-In order to get some group commit facility for these also, we change that part
-of the code in a similar way to ha_commit_trans. We keep another,
-binlog-internal queue of such non-transactional binlog writes, and such writes
-queue up here before sleeping on the LOCK_log mutex. Once a thread obtains the
-LOCK_log, it loops over the queue for the fast part, and does the slow part
-once, then finally wakes up the others in the queue.
-
-In the transactional case in group_log_xid(), before we run the passed-in
-queue, we add any members found in the binlog-internal queue. This allows
-these non-transactional writes to share the group commit.
-
-However, in the case where it is a non-transactional write that gets the
-LOCK_log, the transactional transactions from the ha_commit_trans() queue will
-not be able to take part (they will have to wait for their turn to do another
-fsync). It seems difficult to cleanly let the binlog code grab the queue from
-out of the ha_commit_trans() algorithm. I think the group commit is mostly
-useful in transactional workloads anyway (non-transactional engines will loose
-data anyway in case of crash, so why fsync() after each transaction?)
-
-
-3. XtraDB changes (ha_innodb.cc)
+6. XtraDB changes (ha_innodb.cc)
 
 The changes needed in XtraDB are comparatively simple, as XtraDB already
 implements group commit, it just needs to be enabled with the new
@@ -302,10 +198,20 @@
 delayed until commit() time.
 
 
-4. Proof-of-concept implementation
+7. PBXT changes
+
+I am working with Paul McCullagh to implement the commit_ordered() method for
+PBXT. This work is discussed in this thread on maria-developers@:
 
-There is a proof-of-concept implementation of this architecture, in the form
-of a quilt patch series [3].
+    https://lists.launchpad.net/maria-developers/msg03619.html
+
+
+8. Launchpad tree
+
+The implementation of this worklog is maintained in this bzr branch on
+Launchpad:
+
+    lp:~maria-captains/maria/mariadb-5.1-mwl116
 
 A quick benchmark was done, with sync_binlog=1 and
 innodb_flush_log_at_trx_commit=1. 64 parallel threads doing single-row
@@ -316,49 +222,18 @@
 With the patch, we get 650 queries per second.
 
 
-5. Open issues/tasks
+9. Open issues/tasks
 
-5.1 XA / other prepare() and commit() call sites.
+9.1 XA / other prepare() and commit() call sites.
 
 Check that user-level XA is handled correctly and working. And covered
 sufficiently with tests. Also check that any other calls of ha->prepare() and
 ha->commit() outside of ha_commit_trans() are handled correctly.
 
-5.2 Testing
+9.2 Testing
 
 This worklog needs additions to the test suite, including error inserts to
 check error handling, and synchronisation points to check thread parallelism
 correctness.
 
 
-6. Alternative implementations
-
- - The binlog code maintains its own extra atomic transaction queue to handle
-   non-transactional commits in a good way together with transactional (with
-   respect to group commit). Alternatively, we could ignore this issue and
-   just give up on group commit for non-transactional statements, for some
-   code simplifications.
-
- - The binlog code has two ways to prepare end_event and similar, one that
-   uses stack-allocation, and another for when stack allocation is not
-   possible that uses thd->mem_root. Probably the overhead of thd->mem_root is
-   so small that it would make sense to use the same code for both cases.
-
- - Instead of adding extra fields to THD, we could allocate a separate
-   structure on the thd->mem_root() with the required extra fields (including
-   the THD pointer). Would seem to require initialising mutexes at every
-   commit though.
-
- - It would probably be a good idea to implement TC_LOG_MMAP::group_log_xid()
-   (should not be hard).
-
-
------------------------------------------------------------------------
-
-References:
-
-[2] https://secure.wikimedia.org/wikipedia/en/wiki/ABA_problem
-
-[3] https://knielsen-hq.org/maria/patches.mwl116/
-
-

(Knielsen - Mon, 01 Nov 2010, 12:42
    
High-Level Specification modified.
--- /tmp/wklog.116.old.20378	2010-11-01 12:42:13.000000000 +0000
+++ /tmp/wklog.116.new.20378	2010-11-01 12:42:13.000000000 +0000
@@ -12,7 +12,7 @@
 
 To do this, we add two new handlerton methods:
 
-    int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
+    void (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
     void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
 
 The idea is that the existing prepare() and commit() methods run in the
@@ -30,7 +30,13 @@
 
 As the prepare_ordered() and commit_ordered() calls are serialised, the idea
 is that handlers should do the minimum amount of work needed in these calls,
-relaying most of the work (eg. fsync() ...) to prepare() and commit().
+relaying most of the work (eg. fsync() ...) to prepare() and commit().  After
+commit_ordered() returns, the transaction should be recorded (in memory) as
+committed before any transactions that later call commit_ordered(), and in
+MVCC engines the transaction should become visible to newly started
+transactions. There is however no need to write anything to disk in
+prepare_ordered() (in case of a crash the commit will be recovered during
+crash recovery).
 
 As a concrete example, for InnoDB the commit_ordered() method will do the
 first part of commit that fixed the commit order in the transaction log
@@ -38,81 +44,156 @@
 it. This split already exists inside the InnoDB code, running before
 respectively after releasing the prepare_commit_mutex.
 
+Note that prepare_ordered() and commit_ordered() are only called when internal
+2-phase commit is used, that is when at least two XA-capable storage engines
+(binlog included) participate in a transaction. For eg. an InnoDB-only
+transaction with binlog disabled, the two new methods will not be called (just
+like for the existing prepare() handlerton method).
+
 In addition, the XA transaction coordinator (TC_LOG) is special, since it is
 the one responsible for deciding whether to commit or rollback the
-transaction. For this we need an extra method, since this decision can be done
-only after we know that all prepare() and prepare_ordered() calls succeed, and
-must be done to know whether to call commit_ordered()/commit(), or do rollback.
-
-The existing method for this is TC_LOG::log_xid(). To make implementing group
-commit simpler to implement in a transaction coordinator and more efficient,
-we introduce a new method:
-
-    void group_log_xid(THD *first_thd);
-
-This method runs in the sequential part of group commit. It receives a list of
-transactions to perform log_xid() on, in the correct commit order. (Note that
-TC_LOG can do parallel parts of group commit in its own prepare() and commit()
-methods).
-
-This method can make it easier to implement the group commit in TC_LOG, as it
-gets directly the list of transactions in the right order. Without it, it
-might need to compute such order anyway in a prepare_ordered() method, and the
-server has to create this ordered list anyway to implement the order guarantee
-for prepare_ordered() and commit_ordered().
-
-This group_log_xid() method also is more efficient, as it avoids some
-inter-thread synchronisation. Since group_log_xid() is serialised, we can run
-it together with all the commit_ordered() method calls and need only a single
-sequential code section. With the log_xid() methods, we would need first a
-sequential part for the prepare_ordered() calls, then a parallel part with
-log_xid() calls (to not loose group commit ability for log_xid()), then again
-a sequential part for the commit_ordered() method calls.
-
-The extra synchronisation is needed, as each commit_ordered() call will have
-to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
-should not be called), and also wait for commit_ordered() to finish in all
-threads handling earlier commits. In effect we will need to bounce the
-execution from one thread to the other among all participants in the group
-commit.
-
-As a consequence of the group_log_xid() optimisation, handlers must be aware
-that the commit_ordered() call can happen in another thread than the one
-running commit() (so thread local storage is not available). This should not
-be a big issue as the THD is available for storing any needed information.
-
-Since group_log_xid() runs for multiple transactions in a single thread, it
-can not do error reporting (my_error()) as that relies on thread local
-storage. Instead it sets an error code in THD::xid_error, and if there is an
-error then later another method will be called (in correct thread context) to
-actually report the error:
-
-    int xid_delayed_error(THD *thd)
-
-The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
-are optional (as is xid_delayed_error). A storage engine or transaction
-coordinator is free to not implement them if they are not needed. In this case
-there will be no order guarantee for the corresponding stage of group commit
-for that engine. For example, InnoDB needs no ordering of the prepare phase,
-so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
-all, so does not need to implement any of them.
-
-Note in particular that all existing engines (/binlog implementations if they
-exist) will work unmodified (and also without any change in group commit
-facilities or commit order guaranteed).
-
-Using these new APIs, the work will be to
+transaction. MWL#132 introduces the log_and_order() call for TC_LOG to handle
+this decision, as well as invoking prepare_ordered() and commit_ordered() in
+correct sequence. We need to implement log_and_order() for the binlog and MMAP
+TC implementations.
 
- - In ha_commit_trans(), implement the correct semantics for the three new
-   calls.
+The new methods prepare_ordered() and commit_ordered() are optional. A storage
+engine is free to not implement them if they are not needed. In this case
+there will be no order guarantee for the corresponding stage of group commit
+for that engine. For example, InnoDB currently needs no ordering of the
+prepare phase, so can omit implementing prepare_ordered(). But the Facebook
+InnoDB extension of releasing locks early needs ordering in the prepare step
+to preserve statement-based replication correctness; so this might be
+implemented by adding prepare_ordered().
+
+This means that all existing storage engines will work unmodified with this
+new feature. They will still be able to take advantage of binlog group commit
+(at least as long as they do not actively take steps to prevent it, like the
+InnoDB prepare_commit_mutex does). But they will of course not get the
+guarantee of consistent commit ordering with binlog, just as they do not have
+this guarantee with MySQL or MariaDB without this feature.
+
+
+Binlog implementation
+---------------------
+
+For the binlog, we implement a queue of transactions waiting to commit to the
+binlog (waiting because a previous transaction is active writing to the binlog
+or fsync()'ing the data so written, while holding the LOCK_log mutex). The
+first transaction to enter the queue is the group leader, the rest are group
+participants.
+
+Only the group leader locks the LOCK_log mutex. When it obtains the mutex, it
+writes all of the participants in the queue into the binary log (as well as
+itself), and does a single fsync() call for all of them, this implementing
+group commit for the binlog. The leader then runs the commit_ordered() method
+for all transactions in the queue in the right order. This ensures the same
+commit order in binlog and engines that implement commit_ordered(). Once
+commit_ordered() has been run, the leader signals the participants to wakeup
+and finish the commit in each thread individually, returning from
+log_and_order() and proceeding to invoke the commit() handlerton methods in
+engines (or rollback() in case of error).
+
+Doing things this way makes for efficient thread synchronisation and
+scheduling. Each thread doing commit only has to do one wait and one wakeup
+(the leader will wait for LOCK_log and wakeup when it obtains that mutex; the
+participants will wait for a signal from the leader). If there is no
+contention for committing to the binary log, there will be no waiting or
+wakeup at all.
+
+Also, by doing the serial part of commit in a single thread (the leader), we
+avoid imposing any particular scheduling order between the threads
+participating in group commit. Ie. it is not a requirement that if thread A
+commits before thread B (in the binlog and inside engines), then the wakeup of
+B must wait for A to wakeup and handle its own commit first. This makes the OS
+kernel free to decide on the best scheduling order, which I believe will help
+with issues like CPU affinity and NUMA (though I have not benchmarked this).
+
+Note that with this algorithm, the write()/fsync() to binlog and the
+commit_ordered() calls happen in the group leader thread, which may be
+different from the participant thread normally handling a given
+transaction. To support this usage, we specify the interface for
+commit_ordered() to be that it can be called in a different thread than the
+normal thread handling the transaction and associated with THD. We specify
+similar out-of-thread behaviour for prepare_ordered(); such behaviour is not
+done in the binlog group commit implementation, but may be useful for similar
+reasons in something like Galera.
+
+Since the binlog commit and commit_ordered() calls can happen in a different
+thread, the corresponding code can not use thread local variables. In
+particular, it can not use my_error() and similar functions. Therefore, any
+reporting of errors that happen in this part of the code must be delayed to
+the last part of the commit processing: after wakeup of participants for the
+binlog part, and in the commit() handlerton method for storage engines. To
+emphasise this, the prepare_ordered() and commit_ordered() calls are declared
+as returning void, rather than returning an error code; any error code must be
+saved and returned instead from the log_and_order() method (for errors during
+binlog write/fsync()) or the commit() method (for errors during
+commit_ordered). (Of course, normally the storage engine should never have to
+return error during commit after successful prepare(), as that breaks 2-phase
+commit guarantees).
+
+The restriction that thread local storage is not available in
+prepare_ordered() and commit_ordered() should otherwise not be a big problem,
+as storage engines have the THD available for accessing any necessary
+per-thread information.
+
+
+TC_LOG_MMAP implementation
+--------------------------
+
+The MMAP transaction coordinator does not need any ordering of commits, as it
+only needs to record the set of transactions to be recovered after crash,
+nothing about the sequence in which they committed or anything. And it already
+supports group commit. So as such it does not need to be modified. However, as
+per MWL#132, it does need to implement the new log_and_order() rather than the
+old log_xid() method.
+
+Also, the new API introduced here guarantees to storage engines that the
+sequence of calls to prepare_ordered() among transactions will be the same as
+the sequence of calls to commit_ordered(). So if the storage engine(s) in a
+transaction implement both prepare_ordered() and commit_ordered(), the
+TC_LOG_MMAP log_and_order() needs to ensure this by proper thread
+synchronisation. If prepare_ordered() or commit_ordered() (or both) are
+unimplemented in participating storage engines, there is no ordering
+requirements at all.
+
+When consistent order is required, the algorithm in
+TC_LOG_MMAP::log_and_order() runs as follows:
+
+ - When prepare_ordered() is called at the start, transactions are added to a
+   list in the same order that they called prepare_ordered() (with suitable
+   mutex protection).
+
+ - The threads independently execute in parallel the logging part of
+   TC_LOG_MMAP, using the old TC_LOG_MMAP::log_xid() code.
+
+ - Each thread in the list finally goes to sleep waiting for the previous
+   thread to complete commit_ordered(). When it is signalled by the previous
+   thread, it runs its own commit_ordered() methods, and subsequently signal
+   the next thread in the list to wake up. (We also need to handle the cases
+   of the first and last thread in the list, of course).
+
+
+Overview of work needed for this worklog
+----------------------------------------
+
+ - In ha_commit_trans(), change the code to use log_and_order() instead of
+   log_xid(). (This is arguably part of MWL#132; the two worklogs are closely
+   related in this respect).
 
  - In XtraDB, use the new commit_ordered() call to remove the
    prepare_commit_mutex (and resurrect group commit) without loosing the
    consistency with binlog commit order.
 
- - In log.cc (binlog module), implement group_log_xid() to do group commit of
+ - In log.cc (binlog module), implement the queueing of transactions and
+   leader/participant distinction in log_and_order() to do group commit of
    multiple transactions to the binlog with a single shared fsync() call.
 
+ - Similarly, in log.cc, modify TC_LOG_MMAP to synchronise threads as
+   described above in the case where both prepare_ordered() and
+   commit_ordered() are used.
+
 -----------------------------------------------------------------------
 New server variables:
 
@@ -130,10 +211,10 @@
 (binlog_commits/binlog_group_commits); if this is eg. 10 then it means that on
 the average 10 transactions were committed to the binlog at once as a group.
 
-Note that non-transactional statements (including statements that modify
-eg. MyISAM tables, and also DDL statements) count as a "transaction" for these
-counters, as such statements are written directly into the binary log without
-waiting for a later COMMIT.
+(Note that most non-transactional statements cause events to be written
+directly into the binary log, without going through the group commit code, so
+they do /not/ count as "transactions" for the purpose of these status
+variables.)
 
 In debug builds only, a new global server system variable is defined to help
 testing:
@@ -145,42 +226,11 @@
     of such fsync().
 
 -----------------------------------------------------------------------
-Some possible alternative for this worklog:
-
- - We could eliminate the group_log_xid() method for a simpler API, at the
-   cost of extra synchronisation between threads to do in-order
-   commit_ordered() method calls. This would also allow to call
-   commit_ordered() in the correct thread context.
-
- - Alternatively, we could eliminate log_xid() and require that all
-   transaction coordinators implement group_log_xid() instead, again for some
-   moderate simplification.
+A possible alternative for this worklog:
 
  - At the moment there is no plugin actually using prepare_ordered(), so, it
    could be removed from the design. But it fits in well, is efficient to
-   implement, and could be useful later (eg. for the requested feature of
+   implement, and could be useful later (eg. for the Facebook feature of
    releasing locks early in InnoDB).
 
------------------------------------------------------------------------
-Some possible follow-up projects after this is implemented:
-
- - Implement an XtraDB prepare_ordered() methods that can release row locks
-   early (Mark Callaghan from Facebook advocates this, but need to determine
-   exactly how to do this safely).
-
- - Implement a new crash recovery algorithm that uses the consistent commit
-   ordering to need only fsync() for the binlog. At crash recovery, any
-   missing transactions in an engine is replayed from the correct point in the
-   binlog (this point must be stored transactionally inside the engine, as
-   XtraDB already does today).
-
- - Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
-   consistent snapshow, with same set of committed and not committed
-   transactions in all engines, 2) returns a corresponding consistent binlog
-   position. This should be easy by piggybacking on the synchronisation
-   implemented for ha_commit_trans().
-
- - Use this in XtraBackup to get consistent binlog position without having to
-   block all updates with FLUSH TABLES WITH READ LOCK.
-
 

(Knielsen - Mon, 01 Nov 2010, 12:41
    
High Level Description modified.
--- /tmp/wklog.116.old.20364	2010-11-01 12:41:40.000000000 +0000
+++ /tmp/wklog.116.new.20364	2010-11-01 12:41:40.000000000 +0000
@@ -54,14 +54,19 @@
  - Other backup methods could have similar needs, eg. XtraBackup or
    `mysqldump --single-transaction`, to have consistent commit order between
    binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
-   or similar expensive blocking operation. (other backup methods, like LVM
+   or similar expensive blocking operation. Without such consistent ordering,
+   we may get a backup where transaction A is committed but B is not, while in
+   the binlog transaction B is logged before A. This makes the backup
+   unsuitable for provisioning a new slave. (Other backup methods, like LVM
    snapshot, don't need consistent commit order, as they can restore
-   out-of-order commits during crash recovery using XA).
+   out-of-order commits from the binlog during crash recovery using XA).
 
  - If we have consistent commit order, we can think about optimising commit to
    need only one fsync (for binlog); lost commits in storage engines can then
    be recovered from the binlog at crash recovery by re-playing against the
-   engine from a particular point in the binlog.
+   engine from a particular point in the binlog. (Consistent commit order may
+   not be absolutely needed for this, but it does simplify the implementation
+   a lot).
 
  - With consistent commit order, we can get better semantics for START
    TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
@@ -70,10 +75,11 @@
    engines.
 
  - In InnoDB, the performance in the presense of hotspots can be improved if
-   we can release row locks early in the commit phase, but this requires that we
-release them in
-   the same order as commits in the binlog to ensure consistency between
-   master and slaves.
+   we can release row locks early in the commit phase, but this requires that
+   we release them in the same order as commits in the binlog to ensure
+   consistency between master and slaves. (The Facebook patch does this,
+   however more research is needed to understand exactly what they do and in
+   which circumstances it is safe).
 
  - There was some discussions around Galera [1] synchroneous replication and
    global transaction ID that it needed consistent commit order among
-- View All Progress Notes (29 total) --


Report Generator:
 
Saved Reports:

WorkLog v4.0.0
  © 2010  Sergei Golubchik and Monty Program AB
  © 2004  Andrew Sweger <yDNA@perlocity.org> and Addnorya
  © 2003  Matt Wagner <matt@mysql.com> and MySQL AB