This worklog has been replaced with mariadb.org/jira

This site is here for historical purposes only. Do not add or edit tasks here!

 
 
 


Title: Extend crash recovery to recover non-prepared transactions from binlog
Task ID: 164
Version: N/A
Priority: N/A
Copies to: Sergei
Created by: Knielsen, 27 Oct 2010
Supervisor: N/A
 High-Level Description
Currently, when an XA-capable storage engine commits a transaction that is
also written to the binary log, we do three fsync() calls, which is
expensive: one in prepare(), one for the binlog write(), and one for commit().

These multiple fsync()'s are needed to ensure that crash recovery can recover
into a consistent state after a crash that happens during commit, so that any
transaction that becomes committed in the storage engine is also committed in
the binlog, and vice versa. This is essential for replication, as otherwise a
crashed master may not be able to recover into a usable state without a full
restore from backup.

The fsync()s are also needed to ensure durability, that is, any transaction
that is reported to the client as having committed, will also remain committed
after a crash.

This worklog is about optimising this so that we can reliably recover into a
consistent state (and preserve durability) with just a single fsync(), the one
for the binary log. In effect, this will make it possible to run with
--innodb-flush-log-at-trx-commit=2, or even =0, and still preserve both
durability of InnoDB commits and consistent crash recovery.

The idea is that along with every commit, InnoDB will also store an
identification of the transactions associated with that commit (in a
transaction-safe way, e.g. using the InnoDB transaction log). In fact, InnoDB
already does this, since it records the binlog position of the last commit
done at any point in time.

Then during crash recovery, we can do as follows:

 - First let each engine recover into an engine-consistent state. This may
   be missing some transactions due to --innodb-flush-log-at-trx-commit=0.

 - Ask each transactional engine for the ID of the last transaction committed.

 - By virtue of MWL#116, we know that the order of commits in the storage
   engine and in the binlog is the same. Thus, the transactions missing in the
   engine are exactly those that are written in the binlog after the last
   transaction committed in the engine.

 - Recover the missing transactions in the engine by re-playing the binary log
   from that point to the end (similar to how it would be applied on a slave.)
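
As a rough illustration of the steps above, here is a minimal sketch. It
assumes, purely for illustration, that the binlog can be viewed as an ordered
list of (position, XID) commit records and that positions are plain integers;
the names are hypothetical and not taken from the server source.

```python
from typing import List, Tuple

# Hypothetical simplified model: each committed transaction in the binlog is
# an (end_position, xid) pair, listed in binlog (= commit) order.
BinlogEntry = Tuple[int, int]

def missing_in_engine(binlog: List[BinlogEntry],
                      engine_last_pos: int) -> List[int]:
    """Return XIDs written to the binlog after the engine's last commit."""
    return [xid for pos, xid in binlog if pos > engine_last_pos]

binlog = [(100, 1), (220, 2), (340, 3), (470, 4)]
print(missing_in_engine(binlog, 220))  # [3, 4]
```

Because commit order is the same in the engine and in the binlog, the missing
transactions always form a contiguous tail of the list.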

There are some restrictions/requirements:

 - This will only work for engines that implement the MWL#116 commit_ordered()
   handlerton method.

 - After a crash, different engines (e.g. PBXT vs. InnoDB) may be in different
   states (different last transaction committed). It will be necessary to
   recover each engine separately, at least up to a point where they are
   consistent. This should be simple enough for transactions that only involve
   tables for one engine. For cross-engine transactions, it can probably work
   for row-based replication, simply by ignoring updates to tables in other
   engines when applying the binlog events. But for statement-based
   replication it will not work easily [*].

This way, it should be possible to greatly reduce the fsync() overhead when
using the binlog with a transactional engine.

If some kind of global transaction ID is implemented, it might be better to
have storage engines store that rather than the current binlog position.

An alternative is to store a simple transaction ID, for example the XID
already used to recover prepare()d but not yet commit()ted transactions. This
has the advantage that the storage engine would be able to record this ID
already during prepare() (the binlog position for a transaction is only
available in commit() after successful binlog write, not in prepare()). But it
requires that the binlog recovery is able to locate an arbitrary XID and
convert it into a corresponding binlog position (/ global transaction ID).

-----------------------------------------------------------------------

[*] One could imagine trying to make it work even for statement-based binlog
events, at least for MVCC engines. When running the updates against one
engine, any select against the other engine could try to use an MVCC snapshot
corresponding to the point back in time of the transaction being re-applied.
But I think this is not realistic / worth it to implement.
 Task Dependencies
Task 164 is waiting for: 116 Efficient group commit for binary log

 High-Level Specification
Getting binlog position of last commit from engine
--------------------------------------------------

InnoDB already records transactionally the binlog position of every commit
(this is used during InnoDB crash recovery and XtraBackup restore to print out
the position to help provision a new replication slave).

However, we need to add an API for the upper server layer to request this
information. (MWL#116 already provides an API for the engine to get this
information during commit).

The simplest is just to add a new handler method for this. The default
implementation will just return NULL. We will then extend InnoDB/XtraDB to
return the appropriate information. (It would be good if PBXT was also
extended to implement this).
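
A minimal sketch of the shape such an interface could take, written in Python
for brevity. The class and method names are hypothetical; the real change
would be a new method in the server's C++ handler interface, and None stands
in for the NULL default.

```python
from typing import Optional, Tuple

BinlogPos = Tuple[str, int]  # (binlog file name, offset); hypothetical shape

class Handlerton:
    """Stand-in for the server's per-engine interface (illustrative only)."""

    def last_commit_binlog_pos(self) -> Optional[BinlogPos]:
        # Default: the engine does not track this, mirroring "return NULL".
        return None

class InnoDBLike(Handlerton):
    """An engine that records the binlog position of its last commit."""

    def __init__(self, pos: BinlogPos):
        self._pos = pos

    def last_commit_binlog_pos(self) -> Optional[BinlogPos]:
        return self._pos
```

Engines that keep the default are simply ignored when the server collects
positions.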

It is somewhat undesirable to continue the bad tradition of using binlog
internal file names and file offsets for this. An alternative would be to
instead use the transaction XID for this (the XID is already used for XA
between binlog and engines). But I think this is better left for doing
properly when we implement global transaction ID, since this will provide the
needed mechanisms for mapping XID to binlog position, etc.

Order of crash recovery
-----------------------

When we run with --sync-binlog=1 and --innodb-flush-log-at-trx-commit={0,2},
after a crash, and initial crash recovery inside the engine, a transaction may
be in one of the following states:

1. Committed in both engine and binlog.

2. Committed in binlog, prepared in the engine.

3. Committed in binlog, missing from the engine.

4. Missing in binlog, XA prepared in engine.

Note that as long as we run with --sync-binlog=1, it is not possible to find
a transaction committed in an engine, but missing in the binlog. This is
because the commit record in the engine is not written until the binlog has
been fsync()'ed durably to disk. (This is true for transactional engines that
support XA / two-phase commit).
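
The possible states and what recovery does with each can be summarised in a
small sketch. The state names and the function are made up for this
illustration; the actions are the ones this specification describes.

```python
def recovery_action(in_binlog: bool, engine_state: str) -> str:
    """Map a transaction's post-crash state to the action recovery takes.

    engine_state is 'committed', 'prepared' or 'missing' (names chosen here).
    """
    if in_binlog:
        actions = {
            "committed": "nothing to do",
            "prepared": "XA commit",
            "missing": "re-apply from binlog",
        }
        return actions[engine_state]
    if engine_state == "prepared":
        return "XA rollback"
    if engine_state == "committed":
        # Ruled out by --sync-binlog=1 for XA-capable engines.
        raise ValueError("committed in engine but missing in binlog")
    return "nothing to do"  # reached neither the binlog nor the engine
```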

The crash recovery that must be done in the server for this worklog is to
recover the transactions in state (2) and (3). This must be done carefully so
that

 - We do not get a conflict between transactions modifying the same rows or
   otherwise conflicting.

 - We ensure that if we crash at any point during crash recovery, we will in a
   second crash recovery come up in a well-defined state that allows us to
   reliably continue with crash recovery.

This basically boils down to having to recover transactions in the right
order.

Since transactions are committed in the same order in binlog and engines, the
transactions missing in the engine will be exactly all transactions in the
binlog after the binlog position of the last transaction committed in the
engine.

However, since prepares are done out-of-order with respect to commits, any
subset of those transactions can be in the XA prepared state, in particular
there can be "holes". So we need to deal with these XA prepared transactions
properly.

One approach is to just roll back all of the XA prepared transactions
unconditionally, and recover everything by re-playing from the
binlog. However, I do not like this approach. We need to keep the ability to
recover using XA anyway (I do not want to remove the XA recovery without
binlog replay using --innodb-flush-log-at-trx-commit=1). And it just does not
feel right.

(Note that we need the prepare step in any case, even if it is not a full XA
prepare. Without prepare, we could get an error during commit, and the binlog
does not support rollback.)

Instead what we can do is to first get the list of all XA prepared
transactions. Then we start walking the binlog from the point of last
commit. For each transaction, we check if it was found in XA prepared state;
if so we XA commit it, else we re-apply it from the binlog. Finally we XA
rollback any remaining transactions in XA prepared state.

This ensures that we commit all transactions in binlog order. This is
important in case we crash again during crash recovery, as we then have a
well-defined starting point for second crash recovery attempt. If we for
example were to XA commit transactions first, we would after a crash not know
which transactions had been committed and which not.
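
The walk described above can be sketched as follows. This is only a model
under simplifying assumptions: XIDs are plain integers, and the function
returns the ordered list of actions instead of calling into an engine. The
action names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

def plan_recovery(binlog_xids: Iterable[int],
                  prepared: Set[int]) -> List[Tuple[str, int]]:
    """Return the ordered list of recovery actions.

    binlog_xids: XIDs in the binlog after the engine's last commit, in order.
    prepared:    XIDs the engine reports as XA prepared.
    """
    remaining = set(prepared)
    actions: List[Tuple[str, int]] = []
    for xid in binlog_xids:
        if xid in remaining:
            actions.append(("xa_commit", xid))
            remaining.discard(xid)
        else:
            actions.append(("replay", xid))
    # Whatever is still prepared never reached the binlog.
    actions.extend(("xa_rollback", xid) for xid in sorted(remaining))
    return actions

print(plan_recovery([5, 6, 7], {6, 9}))
# [('replay', 5), ('xa_commit', 6), ('replay', 7), ('xa_rollback', 9)]
```

Since every commit or replay happens in binlog order, the engine's last
commit position only ever moves forward, which is what gives a second
recovery attempt a well-defined starting point.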

We also avoid problems with conflicting transactions in this way. We need to
be certain that when we re-apply transaction A from the binlog, and
transaction B is still in the XA prepared state, that we will not get a row
lock conflict during A from locks held by B.

If we re-apply A with B still in XA prepared state, it means that A was
originally committed before B. Since B is in the XA prepared state inside the
engine, while A is missing in the engine, it means that B prepared before A
committed. This in turn means that there can be no conflicts between A and B;
A was able to commit before B released locks, and B was able to prepare before
A released locks. So it is safe to re-apply A without risk of blocking on
locks held by B.

Dealing with multi-engine transactions
--------------------------------------

For recovering lost transactions from the binlog, we want to basically request
the binlog position of last committed transaction from the storage engine,
then re-play any transactions after that from the binlog against the engine.

However, there could be multiple storage engines involved, each with a
different set of missing transactions. Or even single transactions with
multiple engines participating. I think multi-engine transactions are not a
high priority to handle perfectly (they are unlikely to be used much in
practice). But we still need to do something reasonable when multiple
engines are involved. In particular, a two-minute experiment with PBXT should
not cause crash recovery to permanently break for InnoDB.

The main problem is that two engines may need recovery from different points
in the binlog. When scanning the early part of the binlog, we would need to
identify which engine was used in each transaction, and skip those
transactions that involve the engine that was further ahead in terms of
binlog position. And if a transaction involves multiple engines, it becomes
even more difficult to handle, especially for statement-based binlog events.

I think the right solution is to decide that MWL#164 recovery is not supported
when using multiple storage engines together. In this case,
--innodb-flush-log-at-trx-commit=1 and similar for other engines will still be
a requirement. I believe this is a fully acceptable solution; the use of
multiple transactional storage engines together is very uncommon. In fact,
it is quite possible that InnoDB will be the only engine supporting the
reporting of last binlog position for some time at least, making multi-engine
support a moot point.

(Note that the requirement for an engine to utilise MWL#164 recovery is to
support XA transactions and reporting of last binlog position. I think the
only other XA-capable engine is PBXT, and I am not sure PBXT has last binlog
position functionality. PBXT seems to be little maintained these days.)

And the implementation of this is nice and simple. When we request binlog
positions from each supporting engine, we will take the last of all the
reported positions and start recovery from there. This ensures that a replayed
transaction will be missing in all engines, avoiding double apply
problems. And if, say, PBXT was used at some point earlier, it will not affect
correct recovery of InnoDB transactions in use during a crash. We will just
document that MWL#164 recovery is only supported when using one engine at a
time.
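
A small sketch of that selection, assuming each reported position has already
been turned into a comparable (file index, offset) pair and that engines
without support report None. Both conventions are assumptions made for this
example.

```python
from typing import Iterable, Optional, Tuple

# (index of the file in the binlog index, offset): hypothetical comparable form
Pos = Tuple[int, int]

def recovery_start(reported: Iterable[Optional[Pos]]) -> Optional[Pos]:
    """Pick the latest position reported by any supporting engine."""
    known = [p for p in reported if p is not None]
    return max(known) if known else None

print(recovery_start([(3, 500), None, (4, 120)]))  # (4, 120)
```

Starting from the latest position means nothing is replayed that some engine
already has, at the cost of not recovering an engine that is further behind.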

Dealing with errors during recovery
-----------------------------------

Unfortunately, re-applying events from the binlog is not 100% reliable, for
the same reason that replication can diverge under various circumstances. Some
statements in statement-based binlog events can be non-deterministic, DDL and
non-transactional DML is not crash safe, and so on. A lot of users already
deal with this due to using replication, so MWL#164 recovery can still be
useful in most installations. But it does mean that the implementation must be
prepared to deal with errors in re-applying events.

The goal for MWL#164 recovery should be: If we can recover from a crash
into a state where both engines and replication are consistent using
--innodb-flush-log-at-trx-commit=1, then we should also be able to recover
when using --innodb-flush-log-at-trx-commit=0, by re-playing transactions.
This should be achievable; we can fail to recover for example due to crashing
in the middle of DDL or MyISAM DML, but such crashes also cannot be recovered
reliably even with --innodb-flush-log-at-trx-commit=1.

One issue to deal with is when the last event in the binlog is not an InnoDB
DML transaction. For example if we execute a GRANT statement, then shut down
the server. In this case, we do not want to re-apply the GRANT at startup.

 - We should only do MWL#164 crash recovery if we detect that the server has
   crashed! We already check this for normal XA recovery, by checking if the
   binlog file was closed properly. We should ensure that the binlog is closed
   after shutting down all (transactional XA) engines, so we know that if
   binlog was closed properly then also all engines were.

 - It would be good if DDL involving InnoDB also recorded the last binlog
   position (since DDL is not crash-safe this will not ensure safe recovery,
   but it will reduce the likelihood of duplicating the DDL during recovery).

 - There can still be problems if mixing MyISAM and InnoDB. However this falls
   under the "multiple engines not supported"; in this case the user should
   stick with --innodb-flush-log-at-trx-commit=1 and XA recovery, which will
   work as well or as poorly as always.

To control recovery and error handling we introduce a new startup option
--binlog-heuristic-recover, with the following possible values:

OFF: Do not attempt any MWL#164 recovery from binlog (default).

ERROR: Attempt to recover from the binlog any transactions found missing in
the engine. If an error occurs during this, abort, write an error message in
the error log, and stop the server. The message will ask the user to check the
error and re-start the server with the appropriate --binlog-heuristic-recover
option.

WARN: Attempt to recover from the binlog any transactions found missing in
the engine. If an error occurs during this, do not abort, just write a message
explaining the error into the error log as a warning and press on.
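
The three modes could be modelled along these lines. Everything here apart
from the option values OFF, ERROR and WARN is a hypothetical name, and the
apply callback stands in for re-executing a transaction's binlog events.

```python
import logging
from enum import Enum
from typing import Callable, Iterable

class BinlogHeuristicRecover(Enum):
    OFF = "OFF"
    ERROR = "ERROR"
    WARN = "WARN"

class RecoveryAborted(Exception):
    """Raised in ERROR mode so that startup can stop the server."""

def replay_missing(mode: BinlogHeuristicRecover,
                   transactions: Iterable[int],
                   apply: Callable[[int], None]) -> int:
    """Re-apply transactions according to mode; return how many succeeded."""
    if mode is BinlogHeuristicRecover.OFF:
        return 0
    applied = 0
    for trx in transactions:
        try:
            apply(trx)
            applied += 1
        except Exception as exc:
            if mode is BinlogHeuristicRecover.ERROR:
                logging.error("binlog recovery failed on %s: %s", trx, exc)
                raise RecoveryAborted(str(exc)) from exc
            logging.warning("binlog recovery failed on %s: %s", trx, exc)
    return applied
```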

It would be nice to have MWL#164 recovery enabled by default, the idea being
that if we detect a problem after crash, it is better to do our best to fix it
by default. However, without support for multi-engine, this is not possible.
In this case we might wrongly do double re-apply of a lot of MyISAM events if
the last InnoDB event is some way back in the binlog. If we later implement
skipping of events not using XA-capable engines, we could change this so WARN
becomes default.

Re-applying events
------------------

To do the MWL#164 recovery, we first need to ask each storage engine for (a)
binlog position of its last commit and (b) list of transactions in XA prepared
state.

Note that we need the full list of XA prepared transactions. The current API
for this requests at most N transactions from the engine. And the InnoDB
implementation will return the same N transactions in each call unless some of
them are committed or rolled back in-between.

I suggest handling this by simply looping, repeatedly calling the API with a
bigger and bigger N until all are returned (detected by getting fewer than
N). This should not be an issue, even a very busy server is unlikely to have
more than a few hundred such XA prepared transactions.
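
A sketch of that loop, assuming a callback that behaves like the existing
API: given N, it returns at most N prepared XIDs, always starting from the
beginning. The function name, the starting size and the doubling policy are
choices made for this example.

```python
from typing import Callable, List

def all_prepared_xids(recover: Callable[[int], List[int]],
                      start_n: int = 128) -> List[int]:
    """Call recover(n) with growing n until the full list has been returned."""
    n = start_n
    while True:
        xids = recover(n)
        if len(xids) < n:
            # Fewer than requested: the engine has nothing more to report.
            return xids
        n *= 2

prepared_in_engine = list(range(300))
result = all_prepared_xids(lambda n: prepared_in_engine[:n])
print(len(result))  # 300
```

Note that a result of exactly N entries triggers one more call, since it
cannot be told apart from a truncated list.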

We then call into every engine to ask for last binlog position (only
transactional XA engines supporting MWL#164 recovery will return anything, at
first probably only InnoDB/XtraDB). We will compute the largest such position.

We then start scanning the binlog from the computed position. Note that such a
scan is already performed for the existing XA recovery, so we merely modify
this scan to do additional processing, and also to start scanning from an
earlier binlog file if the computed position requires it. The existing scan
always scans the full last binlog file, and we should keep this behaviour (for
XA engines with no support for MWL#164 recovery), but we may thus scan more
than one binlog file, especially if the last one is very short.

In addition to recording a set of XIDs found during the scan, we also record
the binlog position corresponding to each XID. And we check each XID found
after the computed last commit binlog position to see if it is in XA
prepared state in any engine. If all of them are, then there is no need for
MWL#164 recovery by re-applying events, we can just do the normal XA recovery,
XA committing all XIDs found in the binlog and rolling back the rest.

But if transactions are missing in the engine, and --binlog-heuristic-recover
is not OFF, then we do another scan of the binlog from the last commit
position.
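
The decision between plain XA recovery and a second, replaying scan can be
sketched as below. Positions are plain integers and the function name is
hypothetical; the map is the XID-to-position record built by the first scan.

```python
from typing import Dict, Set

def replay_needed(xid_positions: Dict[int, int],
                  last_commit_pos: int,
                  prepared: Set[int]) -> bool:
    """True if some binlogged transaction is neither committed nor prepared.

    xid_positions maps each XID found in the first scan to its binlog
    position (a plain integer here for simplicity).
    """
    return any(pos > last_commit_pos and xid not in prepared
               for xid, pos in xid_positions.items())

scan = {1: 100, 2: 200, 3: 300}
print(replay_needed(scan, 100, {2, 3}))  # False
print(replay_needed(scan, 100, {3}))     # True
```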

For each event group, we look up the XID (if any) from the binlog position,
which was found in the first scan (this avoids having to buffer all events for
each transaction before applying). If the XID is in XA prepared state in an
engine, we XA commit it, else we execute the events to re-apply and recover
the transaction in this way.

[ToDo: What if the event group does not have an XID, eg. it is DDL or MyISAM
DML or similar? I think in this case we can just skip the event; such events
are in any case not crash recoverable (with or without MWL#164), and it avoids
possible double-apply of DDL or MyISAM. This would probably make it safe to
enable --binlog-heuristic-recover=WARN by default, since we will only do
anything different in cases where transactions are really missing in the
engine, so trying to re-apply them can only make things better. We should
probably also have a limit --binlog-recover-max-files-to-scan, maybe
defaulting to 5 or so, so that we do not try to scan a year of binlogs if
nobody used InnoDB during the last year.]

At the end, we XA rollback any transactions left in XA prepared state. (We
could also roll back such transactions initially; it does not really matter,
as XA prepared transactions must in any case be non-conflicting, but maybe
rolling back at the end seems more natural, after all they would appear last
in the binlog had they had time to commit).

If the binlog position reported from the engine points into a binlog file that
no longer exists, we cannot recover. We will treat this as an error or warning
according to --binlog-heuristic-recover and we cannot do any recovery except
to XA commit/rollback as appropriate. It is not a good idea to partially apply
whatever transactions we do have! Also note that even with
--innodb-flush-log-at-trx-commit=0, InnoDB will still flush transactions to
disk every second, so all we need is about one second of binlogs
available. This should hardly be a problem in practice.

How do we compare binlog positions when the file name differs? We should be
able to do this by checking the sequence of file names in master-bin.index.
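
One way this comparison could look, assuming the index file has been read
into a list of file names in order and that the engine reports names in the
same form as the index. The helper name is made up for this sketch.

```python
from typing import List, Tuple

def position_key(index_files: List[str], file_name: str,
                 offset: int) -> Tuple[int, int]:
    """Turn (file, offset) into a tuple that compares in binlog order."""
    # list.index raises ValueError if the file is no longer in the index.
    return (index_files.index(file_name), offset)

index_files = ["master-bin.000001", "master-bin.000002", "master-bin.000003"]
a = position_key(index_files, "master-bin.000002", 9000)
b = position_key(index_files, "master-bin.000003", 120)
print(a < b)  # True
```

The ValueError case corresponds to a position that points into a binlog file
that no longer exists.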

And how do we scan from one binlog file to the next? We can use the final
rotate event in one binlog, it has the name of the next file. Alternatively,
we can use master-bin.index (but rotate event is better, as it is what the
replication slaves are also using).
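
A sketch of a scan that follows rotate events from file to file. Events are
modelled as plain dictionaries and read_events is a hypothetical callback
returning the events of one file; neither reflects the server's real event
classes.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple

Event = Dict[str, str]  # hypothetical event shape, e.g. {"type": "xid"}

def scan_binlogs(first_file: str,
                 read_events: Callable[[str], Iterable[Event]]
                 ) -> Iterator[Tuple[str, Event]]:
    """Yield (file, event) pairs, moving to the next file on a rotate event."""
    current: Optional[str] = first_file
    while current is not None:
        next_file = None
        for event in read_events(current):
            yield current, event
            if event["type"] == "rotate":
                next_file = event["next_file"]
        # The file being written at crash time has no rotate event,
        # so the scan ends there.
        current = next_file
```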
 Low-Level Design
 File Attachments
 User Comments
 Time Estimates
Name           Hours Worked  Last Updated
Knielsen       10            01 Aug 2011
All Sub Tasks  74
Total          84

           Hrs Worked  Progress  Current  Original
This Task  10                    190      197
Sub Tasks  74                    0        0
Total      84                    190      197
 
 Funding and Votes
Votes: 1: 67%

Funding: 0 offers, total 0 Euro
 Progress Reports
(Knielsen - Mon, 01 Aug 2011, 06:56)
Write up detailed specification on how this could be implemented.
Worked 7 hours and estimate 190 hours remain (original estimate unchanged).

(Knielsen - Tue, 05 Jul 2011, 11:39)
Thinking/working on high-level description.
Worked 3 hours and estimate 197 hours remain (original estimate increased by 200 hours).

(Knielsen - Tue, 05 Jul 2011, 10:59)
High-Level Specification modified.
--- /tmp/wklog.164.old.31147	2011-07-05 10:59:25.000000000 +0000
+++ /tmp/wklog.164.new.31147	2011-07-05 10:59:25.000000000 +0000
@@ -1,2 +1,290 @@
+Getting binlog position of last commit from engine
+--------------------------------------------------
+
+InnoDB already records transactionally the binlog position of every commit
+(this is used during InnoDB crash recovery and XtraBackup restore to print out
+the position to help provision a new replication slave).
+
+However, we need to add an API for the upper server layer to request this
+information. (The MWL#116 already provides an API for the engine to get this
+information during commit).
+
+The simplest is just to add a new handler method for this. The default
+implementation will just return NULL. We will then extend InnoDB/XtraDB to
+return the appropriate information. (It would be good if PBXT was also
+extended to implement this).
+
+It is somewhat undesirable to continue the bad tradition of using binlog
+internal file names and file offsets for this. An alternative would be to
+instead use the transaction XID for this (the XID is already used for XA
+between binlog and engines). But I think this is better left for doing
+properly when we implement global transaction ID. Since this will provide the
+needed mechanisms for mapping XID to binlog position, etc.
+
+Order of crash recovery
+-----------------------
+
+When we run with --sync-binlog=1 and --innodb-flush-log-at-trx-commit={0,2},
+after a crash, and initial crash recovery inside the engine, a transaction may
+be in one of the following states:
+
+1. Committed in both engine and binlog.
+
+2. Committed in binlog, prepared in the engine.
+
+3. Committed in binlog, missing from the engine.
+
+3. Missing in binlog, XA prepared in engine.
+
+Note that as long as we run with --sync-binlog=1, it is not possible to find
+a transaction committed in an engine, but missing in the binlog. This is
+because the commit record in the engine is not written until the binlog has
+been fsync()'ed durably to disk. (This is true for transactional engines that
+support XA / two-phase commit).
+
+The crash recovery that must be done in the server for this worklog is to
+recover the transactions in state (2) and (3). This must be done carefully so
+that
+
+ - We do not get a conflict between transactions modifying the same rows or
+   otherwise conflicting.
+
+ - We ensure that if we crash at any point during crash recovery, we will in a
+   second crash recovery come up in a well-defined state that allows to
+   reliably continue with crash recovery.
+
+This basically boils down to having to recover transactions in the right
+order.
+
+Since transactions are committed in the same order in binlog and engines, the
+transactions missing in the engine will be exactly all transactions in the
+binlog after the binlog position of the last transaction committed in the
+engine.
+
+However, since prepares are done out-of-order with respect to commits, any
+subset of those transactions can be in the XA prepared state, in particular
+there can be "holes". So we need to deal with these XA prepared transactions
+properly.
+
+One approach is to just roll back all of the XA prepared transactions
+unconditionally, and recover everything by re-playing from the
+binlog. However, I do not like this approach. We need to keep the ability to
+recover using XA anyway (I do not want to remove the XA recovery without
+binlog replay using --innodb-flush-log-at-trx-commit=1). And it just does not
+feel right.
+
+(Note that we need in any case the prepare step (if not XA prepare). Without
+prepare, we could get an error during commit, and binlog does not support
+rollback).
+
+Instead what we can do is to first get the list of all XA prepared
+transactions. Then we start walking the binlog from the point of last
+commit. For each transaction, we check if it was found in XA prepared state;
+if so we XA commit it, else we re-apply it from the binlog. Finally we XA
+rollback any remaining transactions in XA prepared state.
+
+This ensures that we commit all transactions in binlog order. This is
+important in case we crash again during crash recovery, as we then have a
+well-defined starting point for second crash recovery attempt. If we for
+example were to XA commit transactions first, we would after a crash not know
+which transactions had been committed and which not.
+
+We also avoid problems with conflicting transactions in this way. We need to
+be certain that when we re-apply transaction A from the binlog, and
+transaction B is still in the XA prepared state, that we will not get a row
+lock conflict during A from locks held by B.
+
+If we re-apply A with B still in XA prepared state, it means that A was
+originally committed before B. Since is in the XA prepared state inside the
+engine, while A is missing in the engine, it means that B prepared before A
+committed. This in turn means that there can be no conflicts between A and B;
+A was able to commit before B released locks, and B was able to prepare before
+A released locks. So it is safe to re-apply A without risk of blocking on
+locks held by B.
+
+Dealing with multi-engine transactions
+--------------------------------------
+
+For recovering lost transactions from the binlog, we want to basically request
+the binlog position of last committed transaction from the storage engine,
+then re-play any transactions after that from the binlog against the engine.
+
+However, there could be multiple storage engines involved, each with a
+different set of missing transactions. Or even single transactions with
+multiple engines participating. I think multi-engine transactions are not a
+high priority to handle perfectly (they are unlikely to be used much in
+practise I think). But we still need to do something reasonable when multiple
+engines are involved. In particular, a two-minute experiment with PBXT should
+not cause crash recovery to permanently break for InnoDB.
+
+The main problem is that two engines may need recovery from different points
+in the binlog. When scanning the early part of the binlog, we would need to
+identify which engine was used in each transaction, and skip those
+transactions that use involve the engine that was further ahead in terms of
+binlog position. And if a transaction involves multiple engines, it becomes
+even more difficult to handle, especially for statement-based binlog events.
+
+I think the right solution is to decide that MWL#164 recovery is not supported
+when using multiple storage engines together. In this case,
+--innodb-flush-log-at-trx-commit=1 and similar for other engines will still be
+a requirement. I believe this is a fully acceptable solution; the use of
+multiple transactional storage engines together is very uncommon. In fact,
+it is quite possible that InnoDB will be the only engine supporting the
+reporting of last binlog position for some time at least, making multi-engine
+support a moot point.
+
+(Note that the requirement for an engine to utilise MWL#164 recovery is to
+support XA transactions and reporting of last binlog position. I think the
+only other XA-capable engine is PBXT, and I am not sure PBXT has last binlog
+position functionality. PBXT seems to be little maintained these days.)
+
+And the implementation of this is nice and simple. When we request binlog
+positions from each supporting engine, we will take the last of all the
+reported positions and start recovery from there. This ensures that a replayed
+transaction will be missing in all engines, avoiding double apply
+problems. And if say PBXT was used sometimes in the past, it will not affect
+correct recovery of InnoDB transactions in use during a crash. We will just
+document that MWL#164 recovery is only supported when using one engine at a
+time.
+
+Dealing with errors during recovery
+-----------------------------------
+
+Unfortunately, re-applying events from the binlog is not 100% reliable, for
+the same reason that replication can diverge under various circumstances. Some
+statements in statement-based binlog events can be non-deterministic, DDL and
+non-transactional DML is not crash safe, and so on. A lot of users already
+deal with this due to using replication, so MWL#164 recovery can still be
+useful in most installations. But it does mean that the implementation must be
+prepared to deal with errors in re-applying events.
+
+The goal for MWL#164 recovery should be: If we can recover from a crash
+into a state where both engines and replication are consistent using
+--innodb-flush-log-at-trx-commit=1, then we should also be able to recover
+when using --innodb-flush-log-at-trx-commit=0, by re-playing transactions.
+This should be achivable; we can fail to recover for example due to crasing in
+the middle of DDL or MyISAM DML, but such crashes also cannot be recoved
+reliably even with --innodb-flush-log-at-trx-commit=1.
+
+One issue to deal with is when the last event in the binlog is not an InnoDB
+DML transaction. For example if we execute a GRANT statement, then shut down
+the server. In this case, we do not want to re-apply the GRANT at startup.
+
+ - We should only do MWL#164 crash recovery if we detect that the server has
+   crashed! We already check this for normal XA recovery, by checking if the
+   binlog file was closed properly. We should ensure that the binlog is closed
+   after shutting down all (transactional XA) engines, so we know that if
+   binlog was closed properly then also all engines were.
+
+ - If would be good if also DDL involving InnoDB recorded the last binlog
+   position (since DDL is not crash-safe this will not ensure safe recovery,
+   but it will reduce the likelyness of duplicating the DDL during recovery).
+
+ - There can still be problems if mixing MyISAM and InnoDB. However this falls
+   under the "multiple engines not supported"; in this case the user should
+   stick with --innodb-flush-log-at-trx-commit=1 and XA recovery, which will
+   work as well or as poorly as always.
+
+To control recovery and error handling we introduce a new startup option
+--binlog-heuristic-recover, with the following possible values:
+
+OFF: Do not attempt any MWL#164 recovery from binlog (default).
+
+ERROR: Attempt to recover any transactions found missing from the binlog. If
+an error occurs during this, abort, write an error message in the error log,
+and stop the server. The message will ask the user to check the error and
+re-start the server with the appropriate --binlog-heuristic-recover option.
+
+WARN: Attempt to recover any transactions found missing from the binlog. If
+an error occurs during this, do not abort, just write a message explaining the
+error into the error log as a warning and press on.
+
+It would be nice to have MWL#164 recovery enabled by default, the idea being
+that if we detect a problem after crash, it is better to do our best to fix it
+by default. However, without multi-engine support, this is not possible: we
+might wrongly re-apply a lot of MyISAM events a second time if the last InnoDB
+event is some way back in the binlog. If we later implement skipping of events
+not using XA-capable engines, we could change this so WARN becomes default.
+
+Re-applying events
+------------------
+
+To do the MWL#164 recovery, we first need to ask each storage engine for (a)
+binlog position of its last commit and (b) list of transactions in XA prepared
+state.
+
+Note that we need the full list of XA prepared transactions. The current API
+for this requests at most N transactions from the engine. And the InnoDB
+implementation will return the same N transactions in each call unless some of
+them are committed or rolled back in-between.
+
+I suggest handling this by simply looping, repeatedly calling the API with a
+bigger and bigger N until all are returned (detected by getting fewer than
+N). This should not be an issue, even a very busy server is unlikely to have
+more than a few hundred such XA prepared transactions.
+
+We then call into every engine to ask for last binlog position (only
+transactional XA engines supporting MWL#164 recovery will return anything, at
+first probably only InnoDB/XtraDB). We will compute the largest such position.
+
+We then start scanning the binlog from the computed position. Note that such a
+scan is already performed for the existing XA recovery, so we merely modify
+this scan to do additional processing, and also to start scanning from an
+earlier binlog file if the computed position requires it. The existing scan
+always scans the full last binlog file, and we should keep this behaviour (for
+XA engines with no support for MWL#164 recovery), but we may thus scan more
+than one binlog file, especially if the last one is very short.
+
+In addition to recording a set of XIDs found during the scan, we also record
+the binlog position corresponding to each XID. And we check each XID found
+after the computed last commit binlog position to see if they are in XA
+prepared state in any engine. If so, then there is no need for MWL#164
+recovery by re-applying events, we can just do the normal XA recovery, XA
+committing all XIDs found in the binlog and rolling back the rest.
+
+But if transactions are missing in the engine, and --binlog-heuristic-recover
+is not OFF, then we do another scan of the binlog from the last commit
+position.
+
+For each event group, we look up the XID (if any) from the binlog position,
+which was found in the first scan (this avoids having to buffer all events for
+each transaction before applying). If the XID is in XA prepared state in an
+engine, we XA commit it, else we execute the events to re-apply and recover
+the transaction in this way.
+
+[ToDo: What if the event group does not have an XID, eg. it is DDL or MyISAM
+DML or similar? I think in this case we can just skip the event; such events
+are in any case not crash recoverable (with or without MWL#164), and it avoids
+possible double-apply of DDL or MyISAM. This would probably make it safe to
+enable --binlog-heuristic-recover=WARN by default, since we will only do
+anything different in cases where transactions are really missing in the
+engine, so trying to re-apply them can only make things better. We should
+probably also have a limit --binlog-recover-max-files-to-scan, maybe
+defaulting to 5 or so, so that we do not try to scan a year of binlogs if
+nobody used InnoDB during the last year.]
+
+At the end, we XA rollback any transactions left in XA prepared state. (We
+could also roll back such transactions initially; it does not really matter,
+as XA prepared transactions must in any case be non-conflicting, but maybe
+rolling back at the end seems more natural, after all they would appear last
+in the binlog had they had time to commit).
+
+If the binlog position reported from the engine points into a binlog file that
+no longer exists, we cannot recover. We will treat this as an error or warning
+according to --binlog-heuristic-recover and we cannot do any recovery except
+to XA commit/rollback as appropriate. It is not a good idea to partially apply
+whatever transactions we do have! Also note that even with
+--innodb-flush-log-at-trx-commit=0, InnoDB will still flush transactions to
+disk every second, so all we need is about one second of binlogs
+available. This should hardly be a problem in practice.
+
+How do we compare binlog positions when the file name differs? We should be
+able to do this by checking the sequence of file names in master-bin.index.
+
+And how do we scan from one binlog file to the next? We can use the final
+rotate event in one binlog, it has the name of the next file. Alternatively,
+we can use master-bin.index (but rotate event is better, as it is what the
+replication slaves are also using).
 
 
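Illustration (not part of the original spec): the "Re-applying events" text
above suggests obtaining the full list of XA prepared transactions by calling
the at-most-N recover API with a bigger and bigger N until fewer than N are
returned. A minimal sketch of that loop follows. The names fetch_all_prepared
and ToyEngine, the starting N and the doubling step are all assumptions made
for the example; the real engine API is only modelled here as a callable that
takes N and returns up to N XIDs.

```python
from typing import Callable, List


def fetch_all_prepared(recover: Callable[[int], List[str]],
                       start_n: int = 128) -> List[str]:
    """Call recover(n) with a growing n until fewer than n XIDs come back.

    A result of exactly n entries may be truncated, so it triggers another
    call with a larger n. Only a short result is known to be complete.
    """
    n = start_n
    while True:
        xids = recover(n)
        if len(xids) < n:
            return xids
        n *= 2  # doubling is an arbitrary choice for this sketch


class ToyEngine:
    """Hypothetical stand-in for an engine: always returns the first n XIDs."""

    def __init__(self, prepared: List[str]) -> None:
        self.prepared = list(prepared)
        self.calls: List[int] = []

    def recover(self, n: int) -> List[str]:
        self.calls.append(n)
        return self.prepared[:n]


engine = ToyEngine([f"xid-{i}" for i in range(300)])
all_xids = fetch_all_prepared(engine.recover, start_n=128)
```

With 300 prepared transactions and a starting N of 128, the toy engine is
asked three times (N = 128, 256, 512) before the short result shows that the
list is complete.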

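Illustration (not part of the original spec): the text above proposes
comparing binlog positions across files by checking the sequence of file names
in master-bin.index. A minimal sketch, assuming the index has already been
read into a list of file names in rotation order and that a position is a
(file name, byte offset) pair; position_key and the sample names are
hypothetical. A file that is not listed raises an error, matching the case
above where the engine reports a binlog file that no longer exists.

```python
from typing import List, Tuple

BinlogPos = Tuple[str, int]  # (binlog file name, byte offset in that file)


def position_key(index_files: List[str], pos: BinlogPos) -> Tuple[int, int]:
    """Map a binlog position to a sortable (file sequence, offset) pair."""
    file_name, offset = pos
    try:
        return (index_files.index(file_name), offset)
    except ValueError:
        # The file was purged or never existed: ordering is undefined.
        raise LookupError(
            f"binlog file {file_name!r} is not listed in the index") from None


# Assumed contents of the index file, oldest binlog first.
index_files = ["master-bin.000009", "master-bin.000010", "master-bin.000011"]

# Hypothetical last-commit positions reported by three engines.
positions = [
    ("master-bin.000010", 4096),
    ("master-bin.000009", 900000),
    ("master-bin.000011", 120),
]

# The largest position is decided by file order first, then by offset.
latest = max(positions, key=lambda p: position_key(index_files, p))
```

Note that the offset only matters between positions in the same file: offset
120 in master-bin.000011 is later than offset 900000 in master-bin.000009.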
(Sergei - Wed, 27 Oct 2010, 13:23
    
Observers changed: Sergei

(Knielsen - Wed, 27 Oct 2010, 13:13
    
Dependency created: WL#164 now depends on WL#116

(Knielsen - Wed, 27 Oct 2010, 13:12
    
High Level Description modified.
--- /tmp/wklog.164.old.22979	2010-10-27 13:12:54.000000000 +0000
+++ /tmp/wklog.164.new.22979	2010-10-27 13:12:54.000000000 +0000
@@ -57,6 +57,19 @@
 This way, it should be possible to greatly reduce the fsync() overhead when
 using the binlog with a transactional engine.
 
+If some kind of global transaction ID is implemented, it might be better to
+have storage engines store that rather than the current binlog position.
+
+An alternative is to store a simple transaction ID, for example the XID
+already used to recover prepare()d but not yet commit()ted transactions. This
+has the advantage that the storage engine would be able to record this ID
+already during prepare() (the binlog position for a transaction is only
+available in commit() after successful binlog write, not in prepare()). But it
+requires that the binlog recovery is able to locate an arbitrary XID and
+convert it into a corresponding binlog position (/ global transaction ID).
+
+-----------------------------------------------------------------------
+
 [*] One could imagine trying to make it work even for statement-based binlog
 events, at least for MVCC engines. When running the updates against one
 engine, any select against the other engine could try to use an MVCC snapshot


WorkLog v4.0.0
  © 2010  Sergei Golubchik and Monty Program AB
  © 2004  Andrew Sweger <yDNA@perlocity.org> and Addnorya
  © 2003  Matt Wagner <matt@mysql.com> and MySQL AB