To trace the history of Parallel computing we have to go back to the late 1950s, with developments emerging in the form of supercomputers during the 1960s and 70’s. These have been shared-memory multiprocessors, with more than one processor working side-by-side on shared information.
In the mid-1980s, a new form of parallel computing was released when the Caltech Concurrent Computation project developed a supercomputer for scientific purposes from 64 Intel 8086/8087 processors.
In recent years, parallel computing is turning mainstream and depends on multi-core processors.
Parallel Execution on EVM (for all the EVM compatible networks) will be implemented in several phases:
- 1.0 is the foundation, architecture, and workflow will be set up.
- 2.0 will be the performance-enhanced version.
- 3.0 will apply parallel to miner mode.
- 4.0+: has not been determined yet.
Today, I will not repeat the 1.0’s architecture design or pipeline workflow here, instead, I will do some retrospection on the project progress and share some personal thoughts.
For the Parallel 1.0 architecture design and final report, please refer to:
- PR in NodeReal repo: https://github.com/node-real/bsc/pull/12
Parallel 1.0 Update
BSC faced tremendous transaction traffic during Nov 2021, and a 2-man team kicked off this parallel execution project on 15th Dec 2021 to speed up the block process. The code was merged to the Dev branch in the middle of March 2022.
The efforts we made is briefly shared with a timeline:
We set up 4 nodes with the same hardware configurations(16 CPU cores) to do the performance test. The chain data is based on BSC-snapshot-2022.01.10, we did full sync for ~60hours and we compared the performance result by mgas ps(million gas consumed per second) and block ps (block processed per second).
Note: 8x means 8 concurrency and 15x means 15 concurrency.
BSC v1.1.8 is a performance release, it uses prefetch to accelerate transaction execution already. Parallel execution disabled the prefetch since parallel already has a pipeline, which does prefetch at some level. It may not be a very satisfying result, but it sets a good foundation for Parallel 2.0.
As the Parallel 1.0 project moves forward, there are some areas that are under our consideration.
It is not friendly for chains like Ethereum/BSC to use parallel execution, since transactions are sorted in order within a block, they could have dependency to other transactions, we call the dependency as conflict.
There is an article that said Ethereum transactions in 2017 with ~35% conflict rate, although I have no solid data about the recent Ethereum transaction conflict rate, it could not be better in my view since big apps(opensea, uniswap, metamask...) are dominating the Ethereum network. BSC has similar issues, big apps like pancake, WBNB wallet are used widely in the BSC network.
If the conflict rate is higher than 30%, there will be lots of transaction redo, which will make the pipeline almost broken.
There can be several methods to reduce the conflict rate:
- A good dispatcher to put the potential conflict transaction in the same routine
- More precious conflict detection algorithm to figure out the state change in very detail, not at the address level, but it should know which key is changed, whose balance is changed...
- A conflict makeup, even if transactions are conflicted, can we repair it without redoing? It is an open topic and can be discussed.
The initial conflict rate was ~40%, it was based on the address and simple dispatcher(dispatch on idle). We improved the dispatcher to dispatch transactions according to their from & to address, then the conflict rate dropped to ~20% ; We did a reverted transaction optimization, all its state changes will be discarded except its own balance change when the transaction is reverted, it can reduce conflict rate for these reverted transactions. Then we improved the conflict detector, no balance or void state change will not be recorded to do conflict check; And we fixed a SystemAddress conflict issue, finally we reduced the conflict rate to ~10% with parallel num 15. And if we use a smaller parallel num, the conflict rate can even drop to ~5% with parallel.num = 6, it is an acceptable rate.
Parallel execution is not free, there are additional costs to schedule the pipeline, and the accumulated cost can not be ignored. I did several performance test to collect these costs and I got a rough cost estimate as blow:
- Dispatch IPC: ~20us(3 IPC no conflict, 5 IPC on conflict)
- New SlotDB: ~50us
- Conflict Detect: ~20us
- Result Merge: ~100us
- StateObject Copy: ~10us for each StateObject (~3 DeepCopy on Average)
That is ~250us for each transaction, the average transaction execution time is ~1500us in my environment, that is ~15% cost.
Best Concurrency Number
It is important to determine the best concurrency number to best utilize CPU resources, but I am not quite interested in this topic, since there won't be many choices.
The number of CPU cores is limited, most pc or cloud servers have CPU cores of 4, 8, 12, or 16.
Obviously, it won't be a linear performance gain with more concurrency, more concurrency will cause more conflict rate and more CPU context switch costs.
I did a performance test for different concurrency num settings, I used a cloud server with 16 cores, and it seems parallel.num 8 is slightly better than parallel.num 15.
Memory & GC
BSC is written by Golang, it is a very popular language, and easy to get started. But we will have the GC issue since the transaction execution will use temporary memories.
Parallel execution would use more temporary memory to keep the pending execution states and results.
We tried to save the memory usage in Parallel 1.0 with SlotDB Reuse and StateObject Copy-On-Write, but GC is still more frequent in parallel mode. It could allocate additional ~50KB memory for each transaction, mainly to keep the temporary state changes and unconfirmed results; For a block with ~200 transactions, that is almost 10MB. And for a full sync rate of 4 blocks per second, that would be ~40MB/s.
GC will have a big impact on program execution, I captured a golang trace when GC occurs. As the below trace shows, when GC happened, transaction execution increased ~70%(507us -> 878us), block validate increased ~700%(20ms -> 150ms), block commit increased ~100% (45ms -> 95ms). Block validate will try to calculate root hash of the trie tree, it is more cpu consuming, so it has more impact than block commit.
We also record the GC frequency, it is ~6/min, that is GC for every 10 seconds.
GC would be more challenging in Parallel 2.0, since the faster the block processed, the more temporary memory will be generated. We have to eliminate the GC impact by reuse most of these temporary memories, detail will be released later in Parallel 2.0 design.
All the state changes made in the parallel execution are kept within the execution slot first, once the parallel execution result is confirmed, we have to merge these state changes back into the main StateDB, such as balance, nonce, code, KV storage. The difficulty of merge is that we can not miss any of the state changes, StateObject could be created/suicided/recreated, nonce advanced, balance changed, KV storage updated…
Generally, the merge operation should be done in sequential order, since it will update the shared main StateDB. The accumulated cost of merge is considerable, we should make it more efficient too. Some optimization methodologies can be introduced if needed, such as: parallel merge with rwlock or cache and do final merge on commit.
Before I started my blockchain career, I did lots of performance optimizations: network, mobile application startup, UI response…
Performance is key to most of the products, but there is an impossible-triangle too.
There will be stability challenges for architecture design and hardware capability, it will take great effort to fix these stability issues, such as concurrent object access. And it will make the product more complicated, it is a challenge for architecture design too.
There are lots of methodologies to improve performance: levelized cache, snapshot or warm startup, adaptive window size, prioritized task, preallocated memory, prefetch…
The pipeline is very important and complicated, chromium teams put lots of effort into making the 60 FPS pipeline run smoothly. It is a long-term job to make the pipeline more efficient.
Parallel 2.0 Outlook
Parallel 2.0 was first introduced in late January 2022, it is for performance.
There will be some methodologies to address these known performance bottlenecks of Parallel 1.0.
The details of Parallel 2.0 are still under design, I will briefly introduce key components that could be added into Parallel 2.0, but it is not final determined yet.
Streaming Pipeline ⭐️⭐️⭐️⭐️
Execute out-of-order & commit in-order.
If a transaction's execution stage (EVM) is completed, it doesn't need to wait for its previous transaction's merge result. The transaction can queue its result, skip stages of Conflict Detect, Finalize Stage, Result Merge.
The pending transaction in the same slot will have the opportunity to be executed. It will make the pipeline run without waiting, I call it a streaming pipeline
Two new concepts will be introduced: Shadow Slot & PendingConfirm Queue.
Adaptive Dispatcher ⭐️⭐️⭐️
In Parallel 1.0, we use a simple dispatcher, dispatch to idle slot or dispatch to slot with the same From/To address.
We need a smarter dispatcher to: further reduce conflict rate and make slot work balance.
It can even learn from the historical data, make decisions based on hardware capability, and generate best parallel runtime parameters.
More Efficient Conflict Detector ⭐️⭐️⭐️
Currently the conflict detector is a simple "double for loop" to check if two SlotDB has overlapped state change.
We mark transaction results as conflict with previous transaction's result, if it reads a state which has been changed by another transaction within the conflict window.
The conflict window has cfWindowStart and cfWindowEnd, cfWindowStart is the TxIndex the SlotDB was based on when SlotDB was created, cfWindowEnd is the transaction's own TxIndex.
Real Time Conflict Detect
RT (real time), no wait for its brother transaction.
- pros: can schedule earlier redo.
- cons: more CD(conflict detect) cost; Redo after redo, since redo is not based on the desired world state.
RT conflict detect is complicated, and it is better to reduce conflict rate with smarter dispatcher
Void Change Optimize
Transactions could do some void operation, for example, set a key with a new value but set origin value back, the state did not change at all.
Copy-On-Write + ⭐️⭐️
No StateObject Copy at all
With COW(Copy-On-Write), we read StateObject from the main StateDB and will do a StateObject deep copy when we try to write the StateObject, the copy operation is to avoid poisoning the main StateDB.
We can remove all the deep copy by DirtyRecord, details will be released later.
Recycle SlotDB's Memory ⭐️⭐️
According to the memory analysis for the parallel node, CopyForSlot will allocate ~50K memory every time. Since most of the memory is consumed by the maps, we can use sync.Pool to manage all the maps. We can recycle the maps used by the SlotDB asynchronously when the block is committing.
Last but not least, We will keep working on improving the performance, both layer 1 or layer 2. And update our work with the community, keep following us and tracking the progress!
It is a new module to avoid overloaded task on dispatcher routine
One issue of Parallel Design 1.0 is the routine chan communication cost, especially the dispatcher routine.
MessageHub is a separate routine to do this chan communication work for dispatcher routine. So the dispatcher can focus on: dispatch decision making, SlotDB or transaction context preparation, result merge...
Join Our Community
Join our community to learn more about NodeReal and stay up to date with us!