Let’s start with a discussion of what causes errors in databases. The following is at least a partial list:
1) Application errors. The application performed one or more incorrect updates. Generally, this is not discovered until minutes to hours later. The database must be rolled back to a point before the offending transaction(s), and subsequent activity redone.
2) Repeatable DBMS errors. The DBMS crashed at a processing node. Executing the same transaction on a processing node with a replica will cause the backup to crash. These errors have been termed Bohr bugs. [2]
3) Unrepeatable DBMS errors. The database crashed, but a replica is likely to be OK. These are often caused by weird corner cases dealing with asynchronous operations, and have been termed Heisenbugs. [2]
4) Operating system errors. The OS crashed at a node, generating the “blue screen of death.”
5) A hardware failure in a local cluster. These include memory failures, disk failures, etc. Generally, these cause a “panic stop” by the OS or the DBMS. However, sometimes these failures appear as Heisenbugs.
6) A network partition in a local cluster. The LAN failed and the nodes can no longer all communicate with each other.
7) A disaster. The local cluster is wiped out by a flood, earthquake, etc. The cluster no longer exists.
8) A network failure in the WAN connecting clusters together. The WAN failed and clusters can no longer all communicate with each other.
A classic eight-way classification, and it even covers earthquakes and floods.
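The recovery described for application errors (item 1) can be sketched as follows. This is a minimal illustration, not how any particular DBMS does it: the `Txn` record, the `recover` function, and the in-memory snapshot and log are all hypothetical, and the sketch assumes every transaction is a blind write, so later transactions never depend on values written by the offending one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    """Hypothetical log record: a transaction id and the keys it overwrote."""
    txn_id: int
    writes: dict


def recover(snapshot, log, bad_txn_ids):
    """Start from a snapshot taken before the first offending transaction,
    then redo the logged transactions in order, skipping the offending ones."""
    state = dict(snapshot)  # copy so the snapshot itself is left untouched
    for txn in log:
        if txn.txn_id in bad_txn_ids:
            continue  # do not replay the incorrect update
        state.update(txn.writes)
    return state


snapshot = {"alice": 100, "bob": 50}
log = [
    Txn(1, {"alice": 90}),
    Txn(2, {"bob": 0}),       # the incorrect update, discovered later
    Txn(3, {"carol": 20}),
]

recovered = recover(snapshot, log, bad_txn_ids={2})
print(recovered)  # {'alice': 90, 'bob': 50, 'carol': 20}
```

In practice the hard part is the one this sketch assumes away: if a later transaction read data written by the offending one, simply skipping the bad transaction is not enough, and that later work has to be re-examined as well.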


