We optimized libsnark and reduced Loopring’s settlement cost to $0.000124 per trade or $124 per 1,000,000 trades.
A couple of months ago we shared the costs of running the WeDEX beta on the Loopring protocol. Both the onchain and offchain costs were still pretty high. And while there was a clear path to reducing the onchain costs by a factor of 10x (creating bigger blocks and using batch verification), there was no comparably clear path for the offchain costs. This meant the offchain costs would remain high (or even increase, because larger blocks are generally more expensive to prove) and take up a very large part of the total cost. Efficiency improvements to the prover were necessary to lower the costs.
As we depend on ethsnarks, the most obvious groth16 prover for us (groth16 being the proving system we currently use) was, and still is, libsnark. Libsnark is less efficient than bellman (which quite a few other projects use), so switching to bellman would be an option. But our circuits are most accessible in libsnark, and some of our witnesses still need to be generated when a block is proven. Our block format contains all the state necessary to prove a block, but does not yet contain all the witness values needed by the actual prover (i.e. intermediate values of the constraints), so sticking with libsnark was preferable. There is also no reason why libsnark needs to perform any worse than bellman (even bellman has room for improvement).
The prover uses a lot of memory, around 5GB per million constraints. That’s a lot of memory for big circuits. Our biggest circuit, for now, is 64M constraints which means we need to run the prover on expensive hardware for decent performance. That’s far from ideal.
We decided to try and improve the libsnark prover. We have done numerous optimizations, large and small, with great results. We’d like to share some of the more interesting optimizations we’ve done. But first, let’s get to the results!
We can now generate proofs at 40,000 constraints/second per CPU core using less than 1GB of memory per 1M constraints.
A zkRollup transfer is 30,000 constraints while a zkRollup trade is 60,000 constraints. A 16-core CPU can prove 20 transfers/second or 10 trades/second.
For example, a circuit with 64M constraints (1024 trades, the biggest circuit we did a trusted setup for) now only takes 104 seconds to prove using just 55GB of memory on a 16-core CPU. This is 635,000 constraints/second using just consumer-level hardware! A 16-core CPU + 64GB of RAM costs around $1000 which is quite cheap. This makes buying the necessary hardware for proof generation very much a possibility, which is great for exchanges but also for the potential decentralization of zkRollup operators.
Using cloud computing is of course also a popular solution. On AWS it’s possible to rent a c5.9xlarge “compute optimized” instance (18 CPU cores/72GB of RAM) on-demand at $1.5/hour (we have found m5 instances to have almost equal performance). This instance (with only 16 CPU cores being used) can prove 35,000 trades/hour which brings the prover cost to $0.000042/trade or $42 per 1,000,000 trades. As explained in our protocol design, the onchain cost per trade can be as low as 375 gas with onchain data-availability. At current ETH prices ($220) and gas prices (1 Gwei, generally we don’t need very fast transactions) the onchain cost is just $0.000082/trade or $82 per 1,000,000 trades. This brings the total cost to $0.000124/trade or $124 per 1,000,000 trades.
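As a sanity check, the figures above follow from straightforward arithmetic. The sketch below just redoes the quoted calculation (AWS price, trades/hour, gas and ETH prices as given above); note that the per-component dollar amounts in the text are truncated, which is why $42 + $82 is quoted as $124 while the untruncated sum is closer to $125.

```cpp
// Prover: a $1.5/hour instance proving 35,000 trades/hour.
double prover_cost_per_trade() { return 1.5 / 35000.0; }          // ~$0.0000429

// Onchain: 375 gas/trade at 1 Gwei (1e-9 ETH/gas), ETH at $220.
double onchain_cost_per_trade() { return 375.0 * 1e-9 * 220.0; }  // $0.0000825

// Total cost per 1,000,000 trades.
double total_cost_per_million_trades() {
    return (prover_cost_per_trade() + onchain_cost_per_trade()) * 1e6;
}
```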
The prover scales nicely in the number of cores up to 16 CPU cores. Using more than 16 CPU cores does not work well yet and is certainly something we’d like to improve.
The prover now also scales linearly in the number of constraints (even slightly better than linearly, though this is likely because we didn't tweak the prover for smaller circuits). This is somewhat surprising given that the theoretical cost of an FFT is O(n*log(n)), but in practice it behaves linearly with the new FFT implementation, at least up to the 64M constraints we tested.
The multi-exponentiations are 75% of the total work. The pie chart below is for a circuit with 64M constraints, but smaller circuits have almost the exact same work distribution.
We did not do any measurements comparing our current prover to the old libsnark prover or bellman. The normal libsnark prover has a lot of issues that simply make a direct comparison uninteresting. But roughly speaking, on a 16-core CPU these optimizations are 15x faster than our current libsnark prover (roughly 8x faster in pure prover performance) and 4x faster than bellman using numbers made available by other projects (very much just an estimate, especially because we're only using 16 cores, while the numbers shared by other projects mainly come from the beefiest machines available on AWS, which are much more expensive!). All while using 5x less memory.
As a fun little exercise, let's see what it would take to bring VISA-scale throughput (2000 TPS) to Ethereum today on Loopring (everyone seems to like to bring this up). A simple transfer costs around 200 gas onchain. At 30,000 constraints per transfer, 2000 TPS comes to roughly 60M constraints/second, so using the numbers given above we'd need just 100 16-core CPUs and around 50% of Ethereum's total capacity. The total cost (prover plus onchain) while using AWS would be $0.00006/transfer or $60 per 1,000,000 transfers. However, buying all the hardware needed for the prover would cost just around $100,000. This is actually very reasonable, so there's very little need to depend on AWS. If this one-time upfront cost is paid (and disregarding costs to keep the hardware running, like electricity), the cost would be just $0.00004/transfer or $40 per 1,000,000 transfers.
Basically, the prover needs to do two kinds of operations: Fast Fourier Transforms (FFTs) and multi-exponentiations. A large part of optimizing the prover thus focuses on these two operations, but things like memory use (which is very high for large circuits) are also important. And then there are, of course, the typical things like memory allocations and simply not doing work that isn't needed.
The FFT is perhaps one of the most studied algorithms ever, so we can draw on the enormous amount of information that's available to build a well-performing implementation. And that's what we did. We wrote a new FFT implementation that more closely follows what well-known fast FFT libraries do. It's not perfect and no doubt a lot can still be improved, but it already drastically outperforms the libsnark implementation (and almost certainly the bellman implementation, as it's extremely similar).
The FFT is done over field elements that have a size of 32 bytes. That’s quite a bit more data than normal FFTs over floats or integers. And the operations themselves are also more expensive. So we weren’t quite sure if the common wisdom of fast FFTs could be applied to the FFTs the prover needs to do. Luckily for us, it could indeed be applied.
The implementation is based on the recursive nature of the FFT algorithm: it recurses depth-first and works out-of-place. The reason for this is cache friendliness, which is of the utmost importance for well-performing CPU code. The recursive depth-first approach is "cache optimal" in that it makes optimal use of CPU caches of any size.
The twiddle factors used in the FFT are precomputed once and reused for all FFTs. A multiplication of two field elements is quite expensive, so it makes sense to reuse these twiddle factors when possible. This puts extra pressure on memory/cache use, but the precomputed twiddles are laid out per FFT stage so they are accessed consecutively in memory, which makes the accesses easy for the CPU to predict.
For the stages, we currently support radix-2 and radix-4. Normally bigger radixes are better, but in this particular case, radix-4 needs an extra field multiplication compared to radix-2. So while radix-4 performs a bit better in most cases, it doesn’t perform that much better and it’s a trade-off between the number of recursions and the number of field multiplications. It would be interesting to see how larger radixes perform, but radix-8 is already quite complicated to work out.
Multiple cores are supported by first splitting the tree into sub-trees, one per physical core, each handled depth-first on its own core. The work at the top of the tree is split horizontally over the different cores for each stage.
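To make the structure concrete, here is a minimal sketch of the recursive, depth-first, out-of-place radix-2 scheme with per-stage precomputed twiddles. It uses complex doubles purely for illustration; the actual prover applies the same structure to 32-byte prime-field elements, and all names here are our own, not libsnark's.

```cpp
#include <complex>
#include <vector>
#include <cmath>
#include <cstddef>

using cd = std::complex<double>;

// Twiddle factors precomputed once, grouped per stage so each stage
// reads its factors consecutively from memory.
std::vector<std::vector<cd>> precompute_twiddles(std::size_t n) {
    const double pi = std::acos(-1.0);
    std::vector<std::vector<cd>> tw;
    for (std::size_t m = 2; m <= n; m *= 2) {
        std::vector<cd> stage(m / 2);
        for (std::size_t k = 0; k < m / 2; ++k)
            stage[k] = std::polar(1.0, -2.0 * pi * double(k) / double(m));
        tw.push_back(std::move(stage));
    }
    return tw;
}

// Recursive depth-first radix-2 Cooley-Tukey, out-of-place: both
// half-size sub-transforms finish before the combine step, so the
// working set shrinks until it fits in cache, whatever the cache size.
void fft_rec(const cd* in, cd* out, std::size_t n, std::size_t stride,
             const std::vector<std::vector<cd>>& tw, int stage) {
    if (n == 1) { out[0] = in[0]; return; }
    fft_rec(in, out, n / 2, stride * 2, tw, stage - 1);                  // even half
    fft_rec(in + stride, out + n / 2, n / 2, stride * 2, tw, stage - 1); // odd half
    for (std::size_t k = 0; k < n / 2; ++k) {                            // combine
        cd t = tw[stage][k] * out[k + n / 2];
        cd u = out[k];
        out[k] = u + t;
        out[k + n / 2] = u - t;
    }
}

// n must be a power of two.
std::vector<cd> fft(const std::vector<cd>& in) {
    auto tw = precompute_twiddles(in.size());
    std::vector<cd> out(in.size());
    fft_rec(in.data(), out.data(), in.size(), 1, tw, int(tw.size()) - 1);
    return out;
}
```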
Libsnark and bellman both use a version of Pippenger’s algorithm for the multi-exponentiations, though implemented slightly differently.
The basic building block of this algorithm is curve addition, so being able to do fast curve additions is very important. And this is the first and maybe most straightforward optimization we did here: we switched to the mcl library for these operations, which greatly improved performance (thanks to SECBIT for bringing this to our attention).
The next step was to make the implementation more like the bellman implementation, which was simply better. Basically this meant running over the data differently. But it also had an interesting consequence. The exponents (field elements of 254 bits) are split into c-bit pieces, and multi-threading is achieved by processing each piece in a different thread. So c not only impacts the efficiency of the algorithm directly, it also determines how well the algorithm scales over multiple cores. In our tests we found c = 16 to be the best-performing value, which limits us to just 16 cores that have work (c * 16 = 256 > 254). 16 is such a good value that lowering it to make use of more cores currently hardly helps. A future improvement could be to parallelize inside each piece instead of modifying the c parameter when more cores are available, though it's hard to predict what would work best.
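The windowed bucket method behind this can be illustrated with a small self-contained sketch. It works over plain integers, with int64_t "points" and integer addition standing in for elliptic-curve points and curve addition; the windowing and bucket accumulation are the same idea, and in the setup described above each c-bit window would additionally run on its own thread.

```cpp
#include <cstdint>
#include <vector>
#include <cstddef>

// Pippenger-style multi-exponentiation sketch: computes
// sum_i scalars[i] * points[i]. Integer addition stands in for the
// group operation (curve addition in the real prover).
int64_t multi_exp(const std::vector<uint64_t>& scalars,
                  const std::vector<int64_t>& points,
                  unsigned c /* window size in bits (< 64), e.g. 16 */) {
    const unsigned bits = 64;
    const uint64_t mask = (1ULL << c) - 1;
    int64_t result = 0;
    // Walk the c-bit windows from most to least significant.
    for (int w = int((bits + c - 1) / c) - 1; w >= 0; --w) {
        // Shift the accumulator up by c bits ("double" c times).
        for (unsigned i = 0; i < c; ++i) result += result;
        // Bucket j collects the points whose current window equals j.
        std::vector<int64_t> bucket(std::size_t(1) << c, 0);
        for (std::size_t i = 0; i < scalars.size(); ++i) {
            uint64_t window = (scalars[i] >> (unsigned(w) * c)) & mask;
            if (window) bucket[window] += points[i];
        }
        // Running-sum trick: computes sum_j j * bucket[j] with only
        // ~2 * 2^c additions, no multiplications.
        int64_t running = 0, acc = 0;
        for (std::size_t j = bucket.size() - 1; j >= 1; --j) {
            running += bucket[j];
            acc += running;
        }
        result += acc;
    }
    return result;
}
```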
Another nice thing the bellman implementation does is prefetch certain data that the CPU cannot predict because the access pattern depends on the data itself. This prefetching is very important. It's also important to prefetch the data correctly for the CPU that's being used. The amount of data that is prefetched varies by CPU, and the optimal value can only be found by testing on the actual hardware, so this value needs to be configurable to achieve the best performance on different systems. Some other small improvements to the prefetching are to allow configuring the cache locality hint of the prefetch (a value of 1, meaning low temporal locality, works well) and to prefetch the data further in advance (2 loop iterations ahead seems to work best).
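A minimal sketch of this kind of data-dependent prefetch, using the GCC/Clang `__builtin_prefetch` intrinsic. The prefetch distance and locality hint are the tunables mentioned above; the gather loop itself is an illustrative stand-in, not the prover's actual loop.

```cpp
#include <cstdint>
#include <vector>
#include <cstddef>

// Data-dependent gather: table[idx[i]] cannot be predicted by the
// hardware prefetcher, so we issue a software prefetch a couple of
// iterations ahead. __builtin_prefetch arguments: address, rw
// (0 = read), locality (1 = low temporal locality, data is used once).
uint64_t gather_sum(const std::vector<uint64_t>& table,
                    const std::vector<uint32_t>& idx) {
    const std::size_t ahead = 2;  // prefetch distance; tune per CPU
    uint64_t sum = 0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        if (i + ahead < idx.size())
            __builtin_prefetch(&table[idx[i + ahead]], /*rw=*/0, /*locality=*/1);
        sum += table[idx[i]];
    }
    return sum;
}
```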
The two main memory consumers are the proving key and the circuits. Surprisingly (though in retrospect not surprising at all) it was actually the circuits that used the most memory. At less than 512MB per million constraints, the proving key actually doesn’t use that much memory.
Keeping a flat list of all constraints uses a lot of memory, especially because we use Poseidon, which is very efficient in the number of constraints (which is great!) but has long linear combinations that use a lot of memory.
One of the easiest changes that greatly saves on memory is to not store each coefficient independently. The prover now uses a pool for the coefficients so they can be reused, and the constraints just store an index to the actual coefficient. Storing an index (4 bytes) instead of the coefficient itself (32 bytes) greatly lowers the memory used by the constraints. For example, our trading circuit with 64M constraints stored over one billion (yes, billion!) coefficients in memory while there are only 20,569 unique coefficient values!
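The idea can be sketched as a small deduplicating pool. All names here are ours, and the 32-byte array is just a stand-in for a field element:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>
#include <array>
#include <cstddef>

// Stand-in for a 32-byte field element.
using Coeff = std::array<uint8_t, 32>;

// FNV-1a hash so Coeff can be used as an unordered_map key.
struct CoeffHash {
    std::size_t operator()(const Coeff& c) const {
        std::size_t h = 1469598103934665603ULL;
        for (uint8_t b : c) { h ^= b; h *= 1099511628211ULL; }
        return h;
    }
};

// Pool that deduplicates coefficients: constraints store a 4-byte
// index instead of the 32-byte value itself.
class CoeffPool {
public:
    uint32_t add(const Coeff& c) {
        auto it = index_.find(c);
        if (it != index_.end()) return it->second;  // reuse existing entry
        uint32_t idx = static_cast<uint32_t>(values_.size());
        values_.push_back(c);
        index_.emplace(c, idx);
        return idx;
    }
    const Coeff& get(uint32_t idx) const { return values_[idx]; }
    std::size_t unique_count() const { return values_.size(); }
private:
    std::vector<Coeff> values_;
    std::unordered_map<Coeff, uint32_t, CoeffHash> index_;
};
```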
But even with this change, the flat list of constraints contains a lot of duplicated data. Most constraints come from hashing data, in our case with Poseidon and SHA256. While the data (i.e. the variables) these hashes work on differs, hashing still produces the same constraints many times, just on different variables. To exploit this we partly moved away from a simple flat list of constraints. There's still a simple list of constraints, but each constraint can now implement custom logic so it can be stored more efficiently in memory. And that's what we currently do for the hash functions: we store their constraints a single time (in a separate protoboard) and point to those constraints from the main circuit. On the fly, the variables are translated to the variables of the instance currently being evaluated in the main circuit.
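A rough sketch of this sharing scheme, heavily simplified: the hash gadget's constraints are stored once with local variable indices, and each instance only keeps a translation table to its own global variables. All types here are illustrative, and the real prover translates variables during evaluation rather than materializing translated constraints as this sketch does.

```cpp
#include <cstdint>
#include <vector>

// A constraint term refers to a variable index plus a coefficient-pool index.
struct Term { uint32_t var; uint32_t coeff; };
struct Constraint { std::vector<Term> a, b, c; };  // <a,x> * <b,x> = <c,x>

// One hash instance in the main circuit: just a translation table from
// the shared gadget's local variable indices to global variable indices.
struct HashInstance {
    std::vector<uint32_t> var_map;  // local index -> global variable index
};

// Produce the constraints of one instance by translating variable
// indices on the fly from the single shared copy.
std::vector<Constraint> instantiate(const std::vector<Constraint>& shared,
                                    const HashInstance& inst) {
    std::vector<Constraint> out = shared;
    for (Constraint& cs : out)
        for (auto terms : {&cs.a, &cs.b, &cs.c})
            for (Term& t : *terms)
                t.var = inst.var_map[t.var];
    return out;
}
```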
With 64M constraints, storing just a pointer (8 bytes) for each constraint already uses 512MB of memory. So stepping away from the flat list of constraint pointers completely, to something with even more structure, could be worth it.
There were also some other things that used up quite a bit of memory for no good reason. For example, the proving key contained a copy of the circuit that could easily be stripped out, as we rebuild the circuit anyway. The libsnark prover also duplicated data a lot, even when the data differed only slightly from what was expected between operations.
Note that the proving key is still stored completely in memory and all these changes have a negligible impact on performance.
- We now have a server mode for generating proofs. This mode prepares all the data necessary to generate a proof a single time, so we don't have to repeat the preparation for each proof. This means we don't have to depend on super-fast performance for e.g. building the circuit and loading the proving key, though the performance of both has been significantly improved indirectly by other optimizations.
- Removed some calculations that are only needed for the zero-knowledge part of groth16. As we only use SNARKs for scalability, we don't need zero knowledge and can skip those calculations (thanks to Kobi Gurkan for bringing this to our attention).
- Removed almost all memory allocations in the main prover function. Memory allocations are pretty slow, and while we could look into faster memory allocators, you can’t beat the performance of not doing anything at all.
- More parallelization in libsnark. Some quite expensive operations in libsnark could not make use of multiple cores even though enabling this was trivial.
- We now support a config file that allows tweaking the prover for different hardware. While the default values work well, it’s impossible to find the best parameters for a certain system without actually trying them out on the actual hardware. If you’re going to have systems continuously generating proofs it makes sense to spend a bit of time to find the best possible parameters. And to find these optimal parameters we support a benchmark mode. Different parameter combinations can easily be tested just by creating a benchmark file and the prover will try out all given parameter combinations. The result is a list of proving times and corresponding parameters ordered best to worst. This information can be used to create the best possible config file for the hardware being used.
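The benchmark mode described above boils down to timing a proof run for each parameter combination and reporting the results best-first. A simplified sketch of that loop; the `Params` fields here are made-up examples, not the prover's actual config keys, and `run_proof` is a placeholder for a real proving call:

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <utility>
#include <vector>

// Example tunables (illustrative names only).
struct Params { int prefetch_distance; int fft_radix; };

// Time run_proof once per combination and return (seconds, params)
// pairs sorted from best to worst.
std::vector<std::pair<double, Params>>
benchmark(const std::vector<Params>& combos,
          const std::function<void(const Params&)>& run_proof) {
    std::vector<std::pair<double, Params>> results;
    for (const Params& p : combos) {
        auto t0 = std::chrono::steady_clock::now();
        run_proof(p);
        auto t1 = std::chrono::steady_clock::now();
        results.push_back({std::chrono::duration<double>(t1 - t0).count(), p});
    }
    std::sort(results.begin(), results.end(),
              [](const std::pair<double, Params>& a,
                 const std::pair<double, Params>& b) {
                  return a.first < b.first;
              });
    return results;
}
```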
We’re happy with the current results, but we’re certain the performance can be further improved. These prover optimizations greatly reduce the offchain costs, to just 1/3 of the total optimal protocol cost. Faster proving times also mean the onchain state is finalized faster, which reduces costs for things like composability with layer 1. And it’s good to see the battle-tested groth16 keep up in prover performance with newer SNARKs.
The prover is still 100% CPU based. GPUs would also be an interesting way to speed up the prover; we have some code for this, and other projects even have fully working implementations. GPUs are, however, a bit harder to work with, and cost-wise, using cloud computing, they also make less sense as these systems are more expensive to rent. CPUs with many cores are finally getting a lot cheaper, which makes them quite attractive for moderately parallelizable work.
We hope these optimizations can benefit other projects using ZKPs. While the optimizations were done in libsnark for groth16, most of the optimizations should be transferable to other implementations like bellman and even completely different SNARKs. If you have optimization ideas or have questions about certain optimizations don’t hesitate to contact us! All optimizations are available on our GitHub:
We are testing a new prover based on these optimizations for the newly launched Loopring Exchange (https://loopring.io). If things go well, we will deploy the new prover for production in two months.