This worklog has been replaced with mariadb.org/jira

This site is here for historical purposes only. Do not add or edit tasks here!

 
 
 

WorkLog Frontpage Log in / Register
High-Level Description | Task Dependencies | High-Level Specification | Low-Level Design | File Attachments | User Comments | Time Estimates | Funding and Votes | Progress Reports

 Parallel replication of group-committed transactions
Title
Task ID184
Queue
Version N/A
Status
PriorityN/A
Copies toSergei

Created byKnielsen18 Mar 2011Done
Supervisor N/A  
Lead Architect    
Architecture Review  
Implementor  
Code Review  
QA  
Documentation  
 High-Level Description
Overview
--------

Parallel replication is the idea of executing multiple transactions (or more
precisely event groups from the binlog) in parallel in multiple threads on the
slave, in order to make the slave applier more scalable to multiple CPU cores
and disk spindles.

The main problem with parallel replication is identifying transactions that
are *independent*, meaning that they can be safely executed in parallel on the
slave without dead-locking each other or producing different results than on
the master.

This worklog describes one way of identifying such independence. MWL#169
describes another. Apart from discovering independent transactions, it is also
necessary to implement a mechanism for actually distributing events to
different execution threads, coordinating commit order, handling errors and
crash recovery, etc. It is expected that different ways of determining
independent transactions, such as this worklog and MWL#169, will share the
same mechanism for actual parallel execution. It should even be possible to
combine different ways to discover independence at the same time.


Idea
----

MWL#116 introduces group commit, where multiple transactions are committed
together and also written to the binlog together. This works by collecting
parallel transactions that are waiting inside COMMIT (or autocommit) for
access to the binary log.

The idea in this worklog is that all transactions participating in a single
group commit must necessarily be independent, and can be safely applied in
parallel on a slave.

Suppose we are about to write N transactions T_1, ..., T_N into the binlog in
group commit. Then all statements for each transaction have completed, but
none of them are committed yet. This means that (1) there can be no
conflicting row locks between any two T_i and T_j, and (2) no transaction T_i
was able to see any changes from T_j (for i<>j and assuming isolation level
greater than read uncommitted).

Another way to see this is to note that on the master, the order of T_1, ...,
T_N is essentially random and depends only on how the kernel decided to
schedule the threads during the COMMIT. Any ordering would have been possible
just by having different thread scheduling delays when the transactions are
put in the ordered list used for group commit; there is no inter-thread
dependencies at that point in the code. So any order of the transactions in
the binlog would have been possible, and thus any order of slave
execution. Therefore, as long as COMMIT cannot change what modifications the
transaction already did, any execution order of the transactions on the slave
is permissible.


Implementation
--------------

I suggest that when this feature is enabled, we introduce two new binlog
events, group_commit_begin and group_commit_end. We write a group_commit_begin
event just before all the events for the transactions in a group commit, and a
group_commit_end event just after.

We can use the "ignorable events" from MySQL 5.5 for these events; this will
allow a slave that does not implement these events to safely ignore them.

If needed, we can further increase compatibility by

 - Having a server option to enable/disable the writing of these two events.

 - Add a flag that the slave sets to request the sending of these events, and
   only send them if requested (similar to MWL#47).

It still needs to be decided how much of this to do.

The slave can then collect all the event groups between the group_commit_begin
and group_commit_end pair, and schedule them in parallel.


Limitations
-----------

The main limitation of this approach is that on the slave, it can only run
transactions in parallel if they happened to commit at roughly the same time
on the master.

This means that there will be little opportunity for parallelism on the slave
if

1. The master has low write load

2. The master is mainly running few but long transactions

3. The master has fast fsync() performance for the commit

In the case (1), there is in any case less need for scaling the event
execution.

In the case (2), another method (for example MWL#169) will be needed to
achieve efficient parallel replication.

For (3), we can improve parallelism by deliberately delaying commits on the
master. By inserting a short sleep just before group commit, we allow more
independent transactions time to participate, giving more opportunity for
parallelism on the slave. At the same time, we reduce fsync() load on the
master, potentially freeing more I/O capacity for other work. For example, we
might add an option to rate-limit COMMIT fsync() to 25 per seconds or
whatever. Care will be needed when using this option, as if the master load
has insufficient parallelism, such rate-limitation will reduce throughput on
the master.


The MWL#163 innodb_release_locks_early option causes transactions to
effectively be committed early in COMMIT during the prepare phase. This allows
another transaction to see and/or modify rows also modified by the first
transaction, and still participate in the same group commit. This means that
when innodb_release_locks_early is set, the transactions cannot be safely used
for parallel replication (eg. in this case we must disable the writing of
group_commit_begin / group_commit_end).


Events that modify non-transactional tables or do DDL causes more
complications, as they are written directly to the binlog and make changes
visible immediately to other transactions (basically they are not
transactional). To simplify things, we will not declare any such events
independent with respect to any other transactions according to this
worklog. There are many useful applications of parallel replications where
such events are rare.
 Task Dependencies
Others waiting for Task 184Task 184 is waiting forGraph
206 Parallel replication
 
 High-Level Specification
 Low-Level Design
 File Attachments
 NameTypeSizeByDate
 User Comments
 Time Estimates
NameHours WorkedLast Updated
Knielsen121 Mar 2011
Total1 
 Hrs WorkedProgressCurrentOriginal
This Task100
Total100
 
 Funding and Votes
Votes: 0: 0%
 Make vote: Useless    Nice to have    Important    Very important    

Funding: 0 offers, total 0 Euro
 Progress Reports
(Monty - Thu, 30 Jun 2011, 14:32
    
Dependency created: WL#206 now depends on WL#184

(Knielsen - Mon, 21 Mar 2011, 08:12
    
Write up initial spec.
Worked 1 hour and estimate 0 hours remain (original estimate increased by 1 hour).

(Sergei - Fri, 18 Mar 2011, 17:51
    
Observers changed: Sergei


Report Generator:
 
Saved Reports:

WorkLog v4.0.0
  © 2010  Sergei Golubchik and Monty Program AB
  © 2004  Andrew Sweger <yDNA@perlocity.org> and Addnorya
  © 2003  Matt Wagner <matt@mysql.com> and MySQL AB