Subsquid's core developer, who researched and implemented the new generation of Archive services, wrote an article describing the process and its technical details. That article's content heavily contributed to the writing of this post.
This series of articles details Subsquid's approach to the creation and evolution of the Archive service, which is responsible for ingesting on-chain data and serving it to ETL processes on request.
The first article of this series covered the problems observed with similar products and the technical design we pursued to solve them. The service described at the end of that post can scale via replication and already allows for ETLs that are orders of magnitude faster than previously possible.
This article explains the continued research efforts and how, when faced with the challenge of a much bigger, data-rich blockchain such as Ethereum, Subsquid responded with a new generation of Archives based on a fresh, completely rewritten technological approach.
A better storage solution
The previous installment in this series already discussed the necessity of having an indexing middleware, and the initial issues, as well as the solutions proposed by Subsquid's team.
The performance improvements are already evident and appreciated by many developer teams in the Polkadot ecosystem, but the core team knew that FireSquid was not meant to be the be-all and end-all for Subsquid.
Blockchain technology was created as a disruptive answer to the 2008 financial crisis and to provide a means for trustless transactions. Web3 was born as a consequence, in the hope of giving internet users a way to re-appropriate the data they generate and stop being the product.
But none of these propositions said that we should throw away decades of consolidated, proven, performant and efficient technologies. The boom in internet adoption has generated unprecedented amounts of data, and unprecedented needs for storing, accessing, and processing that data. As a result, the IT industry has evolved, researching and developing new technologies to serve these needs.
For these reasons, we at Subsquid think we should learn from these lessons and leverage what has already been done to address the similar issues we are facing, instead of turning the other way in the name of a crusade to disrupt the industry.
It's time to stand on the shoulders of the giants that came before us.
Even if some of those giants' brilliant ideas have been put to nefarious uses.
Using a distributed database such as CockroachDB was already a major step up from the previous generation, but it only highlighted the shortcomings of common RDBMSs and confirmed that the database remained one of the main bottlenecks of the entire operation. Relational databases tend to need a lot of storage space, and reading and writing remain very CPU- and memory-intensive tasks.
Furthermore, the need for a relational database is diminished by the relative simplicity of the relations typically encountered in a blockchain. For these reasons, research into alternative storage solutions had been going on for quite a while.
Indexing Ethereum with Parquet files
These efforts became a crucial investment when the strategic decision to start indexing the Ethereum blockchain was made.
At the time of writing, the Ethereum blockchain is made of ~16 million blocks, preserving the history of ~1.8178 billion transactions. In comparison, Polkadot has ~13 million blocks, with 11.4 million signed Extrinsics, and parachains such as Astar and Moonbeam have 2.5 million blocks each.
Columnar databases have been used in data warehouse solutions to successfully tackle datasets with characteristics similar to blockchain data, so this is where Subsquid's focus shifted. We were looking for a solution with these characteristics:
Efficient data compression, to save storage space
Indexing and partial data loading, to save memory and CPU
Easy integration with existing querying tools and solutions
Easy distribution and, most importantly, easy decentralization
The final choice fell on Parquet. To quote its website:
Parquet is an open-source column-oriented data file format designed for efficient data storage and retrieval
It ticked all the boxes above, especially:
The columnar structure allows for high compression rates
Row-group-based indexing, as well as column subsetting, allow for partial data loading
Plenty of existing tools to query it (DataFusion, Databricks, Polars, DuckDB, …)
It's a file, so thanks to IPFS we get decentralization out of the box
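As a concrete illustration of the partial-loading point, here is a minimal Rust sketch using the `parquet` crate (the Apache Arrow Rust implementation). The file name is a placeholder and this is not Subsquid's actual code; it only demonstrates how row-group metadata can be inspected without reading the data itself:

```rust
use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder file name; any parquet file produced by an ingester would do.
    let file = File::open("log.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    let metadata = reader.metadata();

    // Row-group metadata (row counts, byte sizes, per-column statistics) lives
    // in the file footer, so it can be read without touching the data pages.
    for i in 0..metadata.num_row_groups() {
        let rg = metadata.row_group(i);
        println!("row group {i}: {} rows, {} bytes", rg.num_rows(), rg.total_byte_size());
    }

    // A query engine can then load only the row groups (and only the columns)
    // it actually needs, which is what keeps memory and CPU usage low.
    Ok(())
}
```

Column subsetting works the same way: because each column is stored contiguously within a row group, a reader can skip the columns a query does not touch.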
Architecture
The new Archive has been developed in Rust and is made of two main components:
Ingester
Worker
The ingester is responsible for downloading data from the Ethereum RPC endpoint and storing it in parquet files. We designed it to optionally sync these files to decentralized storage (e.g. Filebase, because it is S3-compatible and easy to use) for better distribution to workers.
The worker is responsible for building auxiliary indexes on the data (e.g., bloom filters on log files, using the log address field) and serving the data over a REST API. It also fetches from the Ethereum RPC API any data it does not yet have in parquet files and stores it locally.
One of the upsides of parquet files is that they can optionally be synced from decentralized storage, providing a mechanism for bootstrapping new workers without having to start ingesting from scratch.
Ingester
The ingester downloads data from Ethereum RPC into memory and writes it into parquet files when it reaches the target number of logs, transactions or blocks. It groups the parquet files into folders which are named after the block range they contain. Each folder contains log.parquet, tx.parquet and block.parquet.
This structure allows for very easy processing of the data. For example, if we need to process block data that corresponds to a log.parquet, we just need the block.parquet located in the same folder.
We can easily construct indexes per folder and treat each folder as an atomic unit of data, which helps when writing or transferring it.
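As a rough sketch of the threshold-based flush and folder layout described above, the following Rust snippet shows the general shape of the logic. The types, the target value and the zero-padded folder naming are assumptions made for illustration, not Subsquid's actual ingester code:

```rust
// Illustrative sketch only: types, thresholds and the folder naming scheme
// are assumptions, not the actual Subsquid ingester implementation.
struct Block;
struct Transaction;
struct Log;

struct Batch {
    first_block: u64,
    last_block: u64,
    blocks: Vec<Block>,
    transactions: Vec<Transaction>,
    logs: Vec<Log>,
}

// Assumed per-folder target; the real ingester flushes when logs, transactions
// or blocks reach their respective targets.
const TARGET_LOGS: usize = 1_000_000;

fn maybe_flush(batch: &mut Batch) -> std::io::Result<()> {
    if batch.logs.len() >= TARGET_LOGS {
        // Each folder is named after the block range it covers and holds
        // block.parquet, tx.parquet and log.parquet.
        let folder = format!("{:010}-{:010}", batch.first_block, batch.last_block);
        write_parquet_folder(&folder, batch)?;
        batch.blocks.clear();
        batch.transactions.clear();
        batch.logs.clear();
        batch.first_block = batch.last_block + 1;
    }
    Ok(())
}

// Stub: in the real service this would serialize the in-memory batch into the
// three parquet files inside `folder`.
fn write_parquet_folder(_folder: &str, _batch: &Batch) -> std::io::Result<()> {
    Ok(())
}
```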
Worker
The worker stores bloom filter indexes for each parquet folder in RocksDB, an embedded key-value store developed by Facebook.
This way, when executing a query, the worker can iterate over the indexes, prune the query using the bloom filters, and only open the actual parquet files when it needs to. This provides a huge performance boost if the block range per parquet folder is sufficiently small and the contract/address being queried is not especially busy.
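To make the pruning idea concrete, here is a minimal, self-contained Rust sketch: one small bit-array filter per parquet folder, persisted in RocksDB under the folder's block range and consulted before any file is opened. The key format, filter size and hashing scheme are illustrative assumptions, not the worker's actual index layout:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Assumed filter size: 8 KiB of bits per parquet folder.
const FILTER_BITS: usize = 8 * 1024 * 8;

fn bit_positions(address: &[u8; 20]) -> [usize; 3] {
    // Derive a few bit positions from the log address with seeded hashes.
    let mut out = [0usize; 3];
    for (i, slot) in out.iter_mut().enumerate() {
        let mut h = DefaultHasher::new();
        (i as u64).hash(&mut h);
        address.hash(&mut h);
        *slot = (h.finish() as usize) % FILTER_BITS;
    }
    out
}

fn insert(filter: &mut [u8], address: &[u8; 20]) {
    for pos in bit_positions(address) {
        filter[pos / 8] |= 1 << (pos % 8);
    }
}

fn might_contain(filter: &[u8], address: &[u8; 20]) -> bool {
    bit_positions(address)
        .iter()
        .all(|&pos| (filter[pos / 8] & (1 << (pos % 8))) != 0)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let db = rocksdb::DB::open_default("./bloom-index")?;

    // Indexing side: add every log address in the folder to its filter,
    // then persist the filter under the folder's (assumed) block-range key.
    let mut filter = vec![0u8; FILTER_BITS / 8];
    let address = [0x11u8; 20]; // hypothetical contract address
    insert(&mut filter, &address);
    db.put(b"0000000000-0000001000", &filter)?;

    // Query side: consult the filter first and only open the folder's
    // parquet files if the queried address might be present.
    if let Some(stored) = db.get(b"0000000000-0000001000")? {
        if might_contain(&stored, &address) {
            println!("open the parquet files in this folder");
        } else {
            println!("prune: skip this folder entirely");
        }
    }
    Ok(())
}
```

Since a bloom filter can only produce false positives, a negative answer lets the worker skip a folder safely, while a positive answer simply means the folder's parquet files have to be opened and checked.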
The worker also stores hot (new) block data that has not yet made it into parquet files, which allows it to serve queries for the tip of the blockchain.
Finally, the worker implements a REST API for querying logs and transactions. Queries on parquet files are performed using Polars, a data frame library implemented in Rust that can query parquet files. Queries on hot data are performed by custom code that uses the RocksDB API directly.
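To give a sense of what such a parquet query can look like, here is a minimal sketch using Polars' lazy Rust API (with the `lazy` and `parquet` features enabled). The file path, contract address and column names are assumptions about the schema, not the Archive's actual layout or endpoint logic:

```rust
use polars::prelude::*;

fn main() -> Result<(), PolarsError> {
    // Lazily scan one folder's log file: with predicate and projection
    // pushdown, only the needed columns and row groups are actually read.
    let df = LazyFrame::scan_parquet("log.parquet", ScanArgsParquet::default())?
        // Hypothetical contract address and column names.
        .filter(col("address").eq(lit("0x2e645469f354bb4f5c8a05b3b30a929361cf77ec")))
        .select([col("block_number"), col("transaction_index"), col("data")])
        .collect()?;

    println!("{df}");
    Ok(())
}
```

The lazy API pushes the filter and the column selection down to the parquet reader, which pairs naturally with the row-group pruning described above.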
Performance
Syncing data for the entire Ethereum blockchain history takes about 16 hours on our current setup, which uses resource-limited Kubernetes pods. By comparison, bootstrapping a worker from parquet files fetched from decentralized storage takes about 5 hours.
As for query performance, we migrated one of TheGraph's examples, which indexes the Gravatar contract, to an ETL developed with the Subsquid SDK.
TheGraph's implementation takes 20 minutes to fully sync. We didn't test syncing this data directly from the Ethereum RPC, but we would guess it would take considerably more time.
In comparison, Subsquid's ETL takes under a second to fully sync. Yes: the totality of the Gravatar contract's events on Ethereum, indexed in less than a second.
This result is only anecdotal, of course, so it's hard to draw any conclusions, but it's still indicative.
I strongly encourage everyone to read Ozgur's article; once again, he is the core developer who built this Archive service implementation. The appendices are especially interesting, as they go into much greater detail about the inner workings, the choices made, and the performance improvements and customizations put in place.
Conclusions
Just as this series is made of multiple articles, the story of Subsquid's Archive service is made of multiple steps, and the one described here is only one of them.
The team has a technical roadmap laid down to turn this service into a fully decentralized network, and create a true "blockchain-data-as-a-service".
Parquet files and decentralized storage are great steps in this direction, allowing workers to bootstrap and making sure that a new worker can join the "network" at any time. What's next is making sure that there is a network, and that it can be coordinated. The end goal is to have a blockchain (or a smart contract on a blockchain) that coordinates a swarm of workers and to which clients can submit transactions to request data.
But in the meantime, we are hard at work consolidating and improving what we have built so far, as well as providing any developer in need of such a service with a fully functional, state-of-the-art blockchain data service and framework that has already proven itself capable of unparalleled performance.
If you are curious to know how this story continues, stay tuned and keep your eyes peeled for the next installment in this series.