DiemBFT Implementation
In my Distributed Systems graduate class, we worked on a group project implementing the Byzantine Fault Tolerant distributed consensus algorithm described in the paper DiemBFT v4: State Machine Replication in the Diem Blockchain. There were two main phases to the project. First, we implemented the core DiemBFT algorithm, a leader-based consensus algorithm with a two-phase commit rule and a quadratic-cost view change when a leader fails. This phase included the implementation of a simple testing framework that lets the user specify message drops or delays for specific messages in specific rounds. In the second phase of the project, we implemented the Twins approach to testing BFT systems and ran more thorough tests of our DiemBFT implementation. Everything was done in Python with DistAlgo extensions.
DiemBFT Implementation
The main components of the DiemBFT algorithm are described in detail with pseudocode in the DiemBFT paper. However, several design decisions and components were omitted from the paper, leaving it up to us to determine how to complete the remaining implementation. For example, I was in charge of implementing how clients send transactions and receive confirmation of committed transactions; how replicas choose and verify transactions being added to the blockchain to avoid duplicates; and how replicas sync up to their peers when they fall behind. This last component proved to be the most challenging, requiring a solid understanding of how and why the consensus algorithm actually works. In terms of syncing up replicas, the paper simply mentions:
When a message is handled in the pseudo code, we assume that its format and signature has been validated, and that the receiver has synced up with all ancestors and any other data referenced by meta-information in the message.
Without this sync up process, a problematic scenario can arise if a replica \(A\) misses an honest proposal in round \(r\). The leader of round \(r+1\) will collect \(2f+1\) votes and send out a proposal with a Quorum Certificate (QC) for round \(r\). When replica \(A\) receives the QC for round \(r\), it won't have the corresponding block that is being certified, so it must retrieve this block from one of its peers.
The problems I needed to solve were: When do replicas check whether they need to sync up? What information must an outdated replica send to its peer and what information must the peer send back? What data structures need updating once a replica receives a sync up response?
I decided to put all sync-related code into a separate syncmanager module. Calls to the syncmanager are made every time a vote, proposal, or timeout message is received, before the message is processed by the rest of the code. This serves two purposes: first, it triggers a sync up if a QC in the message references an unknown block ID; second, it prevents the message from being processed if a sync up is already in progress. While a sync up is in progress, all vote, proposal, and timeout messages are held in a message buffer, and they are replayed once the sync up finishes.
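A minimal sketch of this hook is shown below; the class and method names (SyncManager, intercept, request_sync) are hypothetical stand-ins for illustration, not the project's actual interface.

```python
# Minimal sketch of the syncmanager hook; names and structure are illustrative,
# not the actual project code.
class SyncManager:
    def __init__(self, block_tree):
        self.block_tree = block_tree
        self.sync_in_progress = False
        self.message_buffer = []  # votes/proposals/timeouts held while syncing

    def intercept(self, msg):
        """Return True if the caller may process msg now, False if it was buffered."""
        if self.sync_in_progress:
            self.message_buffer.append(msg)
            return False
        qc = msg.high_qc()  # hypothetical accessor for the highest QC carried by the message
        if qc is not None and not self.block_tree.has_block(qc.block_id):
            self.sync_in_progress = True
            self.message_buffer.append(msg)
            self.request_sync(qc)  # ask a signer of qc for the missing chain
            return False
        return True

    def request_sync(self, qc):
        # Placeholder: send a sync request to one replica in qc's signer set
        # (see the request/response sketch further below).
        pass

    def on_sync_done(self, replay):
        """Once a sync response has been applied, replay the buffered messages."""
        self.sync_in_progress = False
        buffered, self.message_buffer = self.message_buffer, []
        for msg in buffered:
            replay(msg)
```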
If a replica receives a QC referencing a block that hasn't been committed yet, every replica that signed the QC should have that block stored in its pending block tree within the BlockTree module. I decided to have the outdated replica send a sync request to a single replica in the QC's signer set to maintain message linearity; if this sync up fails, we move on to the next replica in the signer set. Upon receiving a sync up request, the up to date replica sends a list of blocks corresponding to a path in its pending block tree, from the block with the QC for the latest committed block to the block with the QC from the highest round it has seen. If the outdated replica only fell one or two rounds behind, it can process the QCs in this chain and call the execute_and_insert function in the BlockTree module to add these blocks to its own pending block tree.
However, this is not sufficient to sync up replicas that have fallen arbitrarily far behind. Replicas store pending blocks in the BlockTree module, but blocks are removed and sent to the Ledger module once they are committed. If an outdated replica missed blocks that have already been committed by the time it tries to sync up, these blocks must be retrieved from the Ledger module. This meant modifying the Ledger module so that every committed block remains in memory in case it is needed for a sync up message. A sync request therefore contains the QC of the requester's highest committed block so the up to date replica knows how many committed blocks it needs to send. A sync response then contains a chain of committed blocks and a path of pending blocks, which is sufficient to catch up any replica even if it has missed every message since the genesis block.
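For concreteness, the request and response payloads can be sketched roughly as follows; the dataclass fields and the committed_blocks_after / path_to_highest_qc helpers are assumptions made for illustration, not the project's actual interface.

```python
# Illustrative sketch of the sync request/response payloads; field and helper
# names are assumptions, not the project's actual interface.
from dataclasses import dataclass, field

@dataclass
class SyncRequest:
    high_commit_qc: object  # QC of the highest block the requester has committed

@dataclass
class SyncResponse:
    committed_chain: list = field(default_factory=list)  # committed blocks the requester is missing
    pending_path: list = field(default_factory=list)     # path through the pending block tree

def build_sync_response(ledger, block_tree, request):
    """Assemble everything the requester needs to catch up."""
    # Committed blocks newer than the requester's highest committed block
    # (requires the Ledger to keep committed blocks in memory, as described above).
    committed = ledger.committed_blocks_after(request.high_commit_qc.block_id)
    # Path in the pending block tree from the latest committed block's QC
    # up to the QC from the highest round this replica has seen.
    pending = block_tree.path_to_highest_qc()
    return SyncResponse(committed_chain=committed, pending_path=pending)
```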
After some testing, I noticed there was still an annoying issue that delayed the sync up process by an extra round. The issue arose when an outdated replica received a proposal containing a QC for the previous round's block that it missed, causing it to send a sync request to a replica that voted in that round. Sometimes the up to date replica received the sync request before it received that same QC from the leader of the next round, meaning it was being asked to produce a sync response up to a QC it hadn't yet seen. To solve this, the QC that triggered the sync in the outdated replica is included in the sync request so that the up to date replica can process that QC and update its pending block tree accordingly.
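Assuming the SyncRequest from the earlier sketch is extended with a hypothetical trigger_qc field, the responder-side handling looks roughly like this, with process_qc standing in for whatever routine updates the pending block tree from a QC:

```python
# Sketch of the fix: process the QC that triggered the sync before building the
# response. The trigger_qc field and process_qc call are illustrative.
def handle_sync_request(block_tree, ledger, request, respond):
    # The requester includes the QC that triggered its sync; processing it here
    # updates our pending block tree even if we have not yet received that QC
    # from the next round's leader.
    block_tree.process_qc(request.trigger_qc)
    respond(build_sync_response(ledger, block_tree, request))
```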
Twins Testing
The Twins approach to testing Byzantine Fault Tolerant systems simulates faulty replicas by creating “twin” replicas: two replicas are created with all of the same cryptographic credentials, but they act as one replica when interacting with other replicas. This allows for interesting behavior such as proposal equivocation, i.e. the twin replicas are elected as leader but each twin sends a different, valid proposal. Twins also creates network partitions for each round; only replicas in the same cell of a partition can send and receive messages from each other in that round, and two twins can be placed in separate cells in the same round. We also implemented an intra-partition message dropping feature to allow for more interesting test cases. The Twins implementation includes a test generator and a test executor. The generator enumerates all possible partitions across all possible rounds with all possible assignments of leaders, and a range of parameters controls the size of the search space. I was in charge of implementing the test executor, which reads in the test cases produced by the test generator, executes them, and checks whether the safety and liveness properties are satisfied.
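For concreteness, a generated test case can be thought of as something like the following; this is a hypothetical schema used only for illustration, not the exact format our generator produced.

```python
# Hypothetical shape of a generated Twins test case; the real format differed in
# details but carried the same information.
example_test_case = {
    "num_replicas": 4,  # replicas "0".."3", where replica "3" has a twin "3_twin"
    "twins": ["3"],
    "rounds": {
        1: {
            "leader": "3",                                    # both "3" and "3_twin" act as leader
            "partition": [["0", "1", "3"], ["2", "3_twin"]],  # cells that can communicate this round
            "drops": [("0", "1")],                            # intra-partition drop: "0" -> "1"
        },
        2: {
            "leader": "0",
            "partition": [["0", "1", "2", "3", "3_twin"]],
            "drops": [],
        },
    },
}
```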
Test Executor Implementation
I implemented the Test Executor as a TestExec class that acts as a central message hub through which all messages must pass. TestExec inspects the contents of each message to determine its round number, then decides whether to forward the message to the intended recipient based on the network partition for that round and any intra-partition message drop rules. Messages are redirected to this hub by subclassing the Replica class and overriding the send and receive methods, which keeps the Twins code as separate from the DiemBFT code as possible. I also overrode the LeaderElection module so that the leaders specified by the test executor configuration are elected. The more interesting design decisions came up while implementing the property checking code.
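The routing decision reduces to something like the following plain-Python sketch, assuming the hypothetical test-case shape sketched above; the real code plugs into the overridden DistAlgo send/receive machinery.

```python
# Simplified sketch of the TestExec routing decision; assumes the hypothetical
# test-case dictionary sketched earlier and a msg.round attribute.
class TestExec:
    def __init__(self, test_case):
        self.test_case = test_case

    def allowed(self, sender, recipient, msg):
        """Decide whether a message may be delivered in its round."""
        round_cfg = self.test_case["rounds"].get(msg.round)
        if round_cfg is None:
            return True  # rounds beyond the scripted scenario are unrestricted
        # Sender and recipient must share a partition cell this round...
        same_cell = any(sender in cell and recipient in cell
                        for cell in round_cfg["partition"])
        # ...and must not match an intra-partition drop rule.
        dropped = (sender, recipient) in round_cfg["drops"]
        return same_cell and not dropped

    def route(self, sender, recipient, msg, deliver):
        if self.allowed(sender, recipient, msg):
            deliver(recipient, msg)  # forward to the intended recipient
```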
Property Checking
We chose to check the properties offline rather than online: checking them online adds a slight overhead during execution that can interfere with timeouts, so it is less intrusive to perform the analysis after execution has finished. We verify the four safety and liveness properties specified in the DiemBFT paper.
Property 1
If a block is certified in a round, no other block can gather \(f + 1\) non-Byzantine votes in the same round. Hence, at most one block is certified in each round.
I used DistAlgo message history queries, aggregators, and quantifiers to check this property. We first iterate through rounds and check that each round satisfies the property. To check if a round satisfies the property, we check that for each certified block from this round, there does not exist another block that gathered \(f + 1\) votes from non-twin replicas. We get certified blocks by iterating through all blocks which received \(2f + 1\) votes.
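In plain Python (rather than the DistAlgo queries we actually used), the check amounts to something like the following, where votes is a hypothetical list of (replica, round, block_id) records recovered from the message history and is_twin marks the faulty replicas:

```python
# Plain-Python sketch of the Property 1 check; the real check uses DistAlgo
# message-history queries and quantifiers. `votes` and `is_twin` are assumed
# inputs reconstructed from the message history.
from collections import defaultdict

def check_property_1(votes, is_twin, f):
    for rnd in {r for (_, r, _) in votes}:
        all_votes = defaultdict(set)     # block_id -> replicas that voted for it this round
        honest_votes = defaultdict(set)  # block_id -> non-twin replicas that voted for it
        for replica, r, block in votes:
            if r != rnd:
                continue
            all_votes[block].add(replica)
            if not is_twin(replica):
                honest_votes[block].add(replica)
        certified = {b for b, voters in all_votes.items() if len(voters) >= 2 * f + 1}
        for b in certified:
            # No other block in this round may gather f + 1 votes from non-twin replicas.
            for other, voters in honest_votes.items():
                if other != b and len(voters) >= f + 1:
                    return False
    return True
```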
Property 2
For every two globally direct-committed blocks \(B\), \(B'\), either \(B \leftarrow^{*} B'\) or \(B' \leftarrow^{*} B\).
We check this property by aggregating all globally direct-committed blocks. We also aggregate all proposed blocks along with their parent blocks, which should form a tree of blocks. We then iterate through every pair of globally direct-committed blocks and ensure that one of the two blocks extends the other. We determine whether a block is globally direct-committed by iterating through all proposed blocks and checking that there were \(f + 1\) votes for the QC of this block in the round after it was proposed.
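Sketched in plain Python, with parents a hypothetical map from each block id to its parent's id and direct_committed the set of globally direct-committed block ids:

```python
# Sketch of the Property 2 check over the proposed-block tree; the `parents`
# map and `direct_committed` set are assumed inputs built from the aggregated
# proposals and votes.
from itertools import combinations

def extends(parents, child, ancestor):
    """True if `ancestor` is reachable from `child` by following parent links."""
    node = child
    while node is not None:
        if node == ancestor:
            return True
        node = parents.get(node)
    return False

def check_property_2(parents, direct_committed):
    for b1, b2 in combinations(direct_committed, 2):
        if not (extends(parents, b1, b2) or extends(parents, b2, b1)):
            return False
    return True
```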
Property 3
Let \(r\) be a round after GST. Every honest validator eventually locally commits some block \(B\) by calling Ledger.commit(B.id) for \(B.round > r\).
This property is only semi-decidable: it says that a block will eventually be committed after GST, with no bound on when, while we run the algorithm for only a finite number of rounds after GST. Nevertheless, we query the history of committed messages sent by the Replica class whenever a replica locally commits a block. The committed message includes the id of the block, the round it was proposed in, and the time it was committed. We use a DistAlgo quantifier to ensure that each honest replica has committed some block in a round after GST.
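The equivalent check in plain Python (the real version is a DistAlgo quantifier over the committed-message history), where commits is a hypothetical list of (replica, block_round) pairs and gst_round is the round at which GST occurs:

```python
# Plain-Python equivalent of the Property 3 check; `commits`, `honest_replicas`,
# and `gst_round` are assumed inputs reconstructed from the committed messages.
def check_property_3(commits, honest_replicas, gst_round):
    return all(
        any(rep == replica and rnd > gst_round for (rep, rnd) in commits)
        for replica in honest_replicas
    )
```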
Property 4
Suppose all honest validators agree on honest leaders of rounds \(r\), \(r + 1\), \(r + 2\) and \(r\) occurs after GST. Then, every honest validator locally direct-commits the honest block proposed in round \(r\) within \(7\Delta\) time of when the first honest validator entered round \(r\).
To check this property, we collect some data online while the algorithm is executing. TestExec maintains a dictionary mapping round numbers to datetime objects representing the time when an honest replica first entered that round; this is computed by checking the round numbers of all received messages and saving the time whenever a message for a new round is received. Once the algorithm terminates, we iterate through the rounds with three consecutive honest leaders and, for each such round, through the honest replicas. We query the message history of committed messages to ensure that each honest replica locally committed the block for that round within \(7\Delta\) of when the first replica entered the round.
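The offline part of the check boils down to something like this sketch, where round_entry_time is the dictionary kept by TestExec, commits is a hypothetical map from (replica, round) to commit time, and delta is the assumed post-GST message delay bound:

```python
# Sketch of the Property 4 check; `rounds_with_honest_leaders`, `commits`, and
# `delta` are assumed inputs, and `round_entry_time` is the dictionary described
# above (round number -> datetime of first entry by an honest replica).
def check_property_4(rounds_with_honest_leaders, honest_replicas,
                     round_entry_time, commits, delta):
    for r in rounds_with_honest_leaders:  # rounds r after GST with honest leaders in r, r+1, r+2
        deadline = round_entry_time[r] + 7 * delta  # datetime + timedelta
        for replica in honest_replicas:
            commit_time = commits.get((replica, r))
            if commit_time is None or commit_time > deadline:
                return False
    return True
```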