A Deep Dive into How to Index Blockchain Data

July 9, 2025
Blockchain

Getting Started with Blockchain Indexing

When working with a blockchain, you typically have a node — or a set of nodes — that communicate over a peer-to-peer network. If you want to analyze what’s happening on-chain, you need a way to extract and process blockchain data from these nodes.

There are many kinds of services that depend on blockchain data indexing:

  • Analytics systems that scan everything happening on-chain and run machine learning or statistical pipelines on top of that data.
  • Targeted indexers that track specific smart contracts, collect event data, and aggregate it for reporting or visualization.
  • Reactive services that monitor for particular on-chain events and trigger some action when those events occur.

In all of these cases, the task is the same: retrieve data from the blockchain and make it accessible for further processing.

In this article, we’ll walk through how blockchain indexers evolved — starting with the simplest approach using Ethereum as an example.

The Most Basic Blockchain Data Indexing Solution

At the core of any blockchain indexing tool is the node. The node exposes an RPC interface, which allows external services to extract data from the blockchain.

On Ethereum, the most basic method is eth_getLogs. Logs are records that smart contracts emit (via events) specifically for off-chain consumers — precisely the kind of data blockchain indexers are built around.


Basic Blockchain Data Indexing Solution with eth_getLogs

eth_getLogs works well for simple use cases:

  • It allows filtering by event signature (topic), contract address, and block range.
  • It is a stable, widely supported way to query historical event logs over any block range (a minimal request sketch follows this list).
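
To make this concrete, here is a minimal eth_getLogs sketch over raw JSON-RPC in Python. The endpoint, contract address, and block range are placeholders you would swap for your own; the topic hash is the standard ERC-20 Transfer signature.

```python
import requests

RPC_URL = "https://example-rpc.invalid"  # placeholder endpoint, use your own provider

# keccak256("Transfer(address,address,uint256)") — the canonical ERC-20 Transfer topic
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"

def get_logs(address: str, from_block: int, to_block: int) -> list[dict]:
    """Fetch logs for one contract and one event signature over a block range."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "eth_getLogs",
        "params": [{
            "fromBlock": hex(from_block),
            "toBlock": hex(to_block),
            "address": address,          # filter by contract
            "topics": [TRANSFER_TOPIC],  # filter by event signature
        }],
    }
    response = requests.post(RPC_URL, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()["result"]

# Illustrative call: Transfer logs for one token over a 1,000-block window
logs = get_logs("0xA0b86991c6218b36c1d19D4a2e9eB0cE3606eB48", 20_000_000, 20_001_000)
print(len(logs), "Transfer logs")
```

Most providers cap the block range a single eth_getLogs call may cover, so production indexers typically page through the chain in fixed-size windows.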

However, eth_getLogs often isn’t enough for more complex scenarios.

When eth_getLogs Falls Short

There are cases where logs alone don’t provide the full context:

  • Event logs don’t include all metadata — for example, they don’t carry the block timestamp or detailed execution data. You’d need additional RPC calls (such as eth_getBlockByNumber for the timestamp, or eth_getTransactionByHash for the transaction itself) to retrieve that, which slows down the data pipeline.
  • The indexing flow becomes inefficient: first you fetch logs, then make separate calls for each transaction’s metadata — a classic N+1 pattern that creates performance bottlenecks.

To address this, Ethereum provides another method: eth_getBlockReceipts.


Blockchain Data Indexing via eth_getBlockReceipts Flow Chart

eth_getBlockReceipts returns the receipts for every transaction in a block in a single call. Receipts carry the logs and execution results; paired with the block itself (fetched with full transaction objects), this gives you both:

  • Calldata — what the user intended to do.
  • Logs — what actually happened as a result.

This gives us a complete picture of everything that occurred within a block — a better model for indexing blockchain data.
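
As a rough sketch of this flow, the snippet below joins the block's full transaction objects (for calldata) with its receipts (for logs and status). Note that not every client or provider exposes eth_getBlockReceipts; the endpoint and block number are placeholders.

```python
import requests

RPC_URL = "https://example-rpc.invalid"  # placeholder endpoint

def rpc(method: str, params: list):
    """Minimal JSON-RPC helper."""
    resp = requests.post(RPC_URL, json={"jsonrpc": "2.0", "id": 1,
                                        "method": method, "params": params}, timeout=30)
    resp.raise_for_status()
    body = resp.json()
    if "error" in body:
        raise RuntimeError(body["error"])
    return body["result"]

def fetch_block_with_receipts(block_number: int) -> list[dict]:
    """Join full transaction objects (calldata) with their receipts (logs, status)."""
    tag = hex(block_number)
    block = rpc("eth_getBlockByNumber", [tag, True])   # True = include full tx objects
    receipts = rpc("eth_getBlockReceipts", [tag])      # one receipt per transaction
    receipts_by_hash = {r["transactionHash"]: r for r in receipts}
    return [
        {
            "hash": tx["hash"],
            "to": tx["to"],
            "calldata": tx["input"],                       # what the user intended
            "logs": receipts_by_hash[tx["hash"]]["logs"],  # what actually happened
            "status": receipts_by_hash[tx["hash"]]["status"],
        }
        for tx in block["transactions"]
    ]
```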

However, eth_getBlockReceipts has its own trade-offs:

  • It leads to over-fetching — pulling in more data than necessary, since it doesn’t support filters (unlike eth_getLogs). You can’t ask for receipts only involving certain addresses or event types.

While this is an improvement over fetching individual logs and metadata separately, even retrieving all data from a block often isn’t enough for more demanding blockchain indexing solutions — especially in real-time or high-volume environments.

When Block Receipts Aren’t Enough

Retrieving full block receipts is often better than fetching individual logs — it gives us all transaction data in one call. But even this approach can fall short in advanced blockchain data indexing scenarios.

Take Uniswap V3 position tracking as an example:

  • When a swap event happens, we need to recalculate fees for all positions within the active price range at that moment.
  • The swap event itself doesn’t include all necessary data — ideally, the smart contract would emit storage changes (like updates to FeeGrowth variables), but including this in events isn’t always gas-efficient.
  • To compute fees, we need the updated FeeGrowth values for both tokens in the pool. These aren’t in the event or block receipts. We’d need to query the node separately for each swap to get current storage values.

On Ethereum, where a new block arrives roughly every 12 seconds, this might be tolerable — you can query the node between blocks. But on faster chains, where a single block might contain hundreds of swaps, this approach doesn’t scale.
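
For illustration, the per-swap follow-up queries look roughly like this with web3.py (v6-style API). The endpoint and pool address are placeholders; feeGrowthGlobal0X128 and feeGrowthGlobal1X128 are the standard accumulators exposed by Uniswap V3 pools.

```python
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://example-rpc.invalid"))  # placeholder endpoint

# Minimal ABI fragment for the two fee-growth accumulators on a Uniswap V3 pool
FEE_GROWTH_ABI = [
    {"name": "feeGrowthGlobal0X128", "inputs": [], "outputs": [{"name": "", "type": "uint256"}],
     "stateMutability": "view", "type": "function"},
    {"name": "feeGrowthGlobal1X128", "inputs": [], "outputs": [{"name": "", "type": "uint256"}],
     "stateMutability": "view", "type": "function"},
]

def fee_growth_after_swap(pool_address: str, block_number: int) -> tuple[int, int]:
    """One extra round-trip per swap: read both accumulators at a specific block."""
    pool = w3.eth.contract(address=Web3.to_checksum_address(pool_address), abi=FEE_GROWTH_ABI)
    fg0 = pool.functions.feeGrowthGlobal0X128().call(block_identifier=block_number)
    fg1 = pool.functions.feeGrowthGlobal1X128().call(block_identifier=block_number)
    return fg0, fg1
```

Doing this once per swap is exactly the kind of chattiness that stops scaling once a single block can contain hundreds of swaps.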

Blockchain Indexing with Debug Trace API

Ethereum nodes support debug API extensions — but before relying on them, you must check if your RPC provider enables this extension on the network you’re indexing.


How to Index Blockchain Data with the Debug Trace API

One key method is debug_traceBlock, which provides detailed execution traces for every transaction in a block. This trace includes:

  • Calldata (what the user intended)
  • Logs (what the contract emitted)
  • Storage changes — crucial for cases like Uniswap V3, where we need to see how FeeGrowth changed during a swap

With debug_traceBlock, we can:

  • Extract storage changes for specific variables like FeeGrowth directly from the trace, avoiding extra queries.
  • Access the full transaction execution tree — see how a transaction interacted with multiple contracts (e.g., router → pool), all in one structure.

Unlike standard eth_getTransactionReceipt, traces let us inspect the state at each step in the call tree — not just the final state at the end of a block. This solves a major gap in blockchain indexing solutions where precise intermediate state data is required.
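
Here is a hedged sketch against a geth-style node: the callTracer returns the call tree, while the prestateTracer in diff mode surfaces the storage slots each transaction changed. The exact shape of the trace output varies between clients, and the endpoint and block number are placeholders.

```python
import requests

RPC_URL = "https://example-rpc.invalid"  # placeholder; the provider must expose the debug API

def rpc(method, params):
    resp = requests.post(RPC_URL, json={"jsonrpc": "2.0", "id": 1,
                                        "method": method, "params": params}, timeout=120)
    resp.raise_for_status()
    return resp.json()["result"]

block_tag = hex(20_000_000)  # illustrative block number

# Call tree for every transaction in the block (router -> pool -> ... as nested "calls")
call_traces = rpc("debug_traceBlockByNumber", [block_tag, {"tracer": "callTracer"}])

# Per-transaction state diffs: which accounts and storage slots changed, and how
state_diffs = rpc("debug_traceBlockByNumber",
                  [block_tag, {"tracer": "prestateTracer", "tracerConfig": {"diffMode": True}}])

for trace in state_diffs:
    post = trace.get("result", {}).get("post", {})
    for account, changes in post.items():
        for slot, new_value in changes.get("storage", {}).items():
            # e.g. a Uniswap V3 pool's fee-growth slots updated by a swap
            print(account, slot, "->", new_value)
```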

Wrapping Up the Node-Polling Model in Blockchain Indexing

At this point, we’ve covered the standard approaches to blockchain indexing — working directly with the node’s native interfaces (RPC, debug APIs) to extract data and build indexers. This is the foundation of most early blockchain indexing tools.

What’s Missing in the Node-Polling Model

While the standard approach of working directly with the node gives us access to blockchain data, it comes with key limitations for building a reliable blockchain indexing solution:

1.  No true push model

What we really want is a push model — where we don’t have to constantly poll the node asking “is there anything new?” but instead can subscribe to a stream of exactly the data we need, starting from a specific block or moment. Nodes don’t provide this — and that’s by design (we’ll get to why later).

Nodes do offer WebSocket interfaces, which at first glance seem similar to streaming. But they don’t solve the problem:

  • You can only subscribe to events from the moment you connect.
  • You can’t ask the node to stream data starting from a particular block height.

For most blockchain data indexing use cases — payment tracking, for example — this isn’t workable: if you reconnect, you lose events, and missing data in these systems is unacceptable.

So while streaming exists in theory, it’s limited. To work around it, developers are forced to implement their own logic on the client — typically polling the node at regular intervals and stitching responses together to avoid gaps.

2.  Handling chain reorganizations

The second problem you have to solve at the client level is chain reorgs — when a blockchain fork occurs. In a polling model:

  • There’s no way for the node to notify you that a fork happened.
  • You query the node for a specific block, and it just gives you the data for that block — whether it’s still part of the canonical chain or not.

This leaves two options:
1.  You work only with finalized blocks — but that introduces a lag, which isn’t ideal for many blockchain indexer use cases.
2.  You build complex logic on the client to stay close to the head of the chain and clean up when a reorg happens. Essentially, you’d be re-implementing part of what the blockchain node itself does. In practice, this is too much overhead, which is why most teams working on blockchain indexing tools simply stick to finalized data.

Solving Polling Limitations with Firehose by The Graph

The two main pain points with traditional polling-based blockchain indexers led to a new approach that actually solves them: Firehose from The Graph. Let’s break down how this service works.

The First Piece - A Modified Node

Running a regular blockchain node like the ones we talked about earlier doesn’t make much sense—it just moves the slow polling model onto our side.

So instead, we fork the node and add a streaming patch that our service can read from. Here’s how it works:

  • When a new block lands on the node, it’s immediately pushed into a pipe.
  • Our indexing service reads from this pipe in real time.

For Ethereum, this requires a custom fork since there’s no official way to patch nodes for streaming. On Solana, it’s simpler—there’s a Geyser plugin that lets you hook into the node’s events. When a new block shows up, it gets pushed into a pipe that our service then reads.

Tackling Historical Streaming

Standard nodes aren’t built to stream historical blockchain data from any point in the past, and here’s why:

  • Nodes rely on efficient storage, usually on disk, optimized for quick lookups.
  • Streaming historical data means constant heavy reads from storage, which can overload the system.
  • Streaming live data in memory is one thing, but hitting storage nonstop for older blocks creates unpredictable load.

Because of this, nodes don’t support historical streaming out of the box. Firehose changes that by providing a service that can stream blockchain data from any block height, letting indexers replay the chain as needed.

The Second Piece - Cloud Storage (S3)

Firehose stores data as flat files (one block per file), similar in spirit to the node's own storage, with the single block as the smallest unit.

It uses S3-compatible cloud storage, which brings some big benefits:

  • Cloud-native and serverless, so developers don’t have to worry about managing infrastructure or scaling.
  • You pay for exactly what you use—no more, no less.
  • No vendor lock-in, since almost every cloud provider offers S3-compatible storage with similar APIs. Switching providers is straightforward if you find a better deal.

The Final Piece - A Better API

Regular nodes communicate over JSON-RPC via HTTP, returning responses as plain-text JSON, which isn’t very efficient for modern indexing tools.

Firehose uses gRPC, a binary protocol that:

  • Packs data efficiently before streaming.
  • Works across languages—define your schema once, then generate client code in whatever language you want.
  • Removes the need to write and maintain separate client libraries for every language, making integration much easier.

Firehose Indexing Service Workflow Explained

Firehose Indexing Service Workflow Explained. Flow Chart

Here’s the basic flow of the Firehose service:

  • We run a blockchain node and modify it to enable real-time streaming.
  • The streamed data is pushed into cloud storage buckets (e.g., S3).
  • We build a streaming interface that users connect to for blockchain data indexing.

A key part of this interface is the Joined Block Source — a mechanism that automatically switches between data sources depending on what the user needs.

For example, if a user wants to stream blocks starting from an hour ago (historical data), the service initially fetches data from the historical storage (the buckets). Once the user catches up to the latest block (the current block head), the stream switches automatically to real-time data delivered directly from our modified node.
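
Conceptually, the Joined Block Source behaves like the generator below. The historical_store and live_stream interfaces are illustrative stand-ins rather than Firehose's actual API; the point is the automatic hand-off from the buckets to the live stream from the patched node.

```python
def joined_block_source(start_block: int, historical_store, live_stream):
    """Conceptual sketch of a Joined Block Source: replay history, then switch to live.

    `historical_store` is assumed to expose latest_block() and read_range(start, end)
    over the block buckets, and `live_stream` to yield freshly streamed blocks from the
    patched node. Both are illustrative interfaces, not Firehose's real API.
    """
    cursor = start_block

    # Phase 1: catch up from the historical buckets.
    while True:
        head = historical_store.latest_block()
        if cursor > head:
            break                      # caught up with everything the buckets hold
        for block in historical_store.read_range(cursor, head):
            yield block
            cursor = block.number + 1

    # Phase 2: caught up, so switch to the real-time stream from the modified node.
    for block in live_stream:
        if block.number < cursor:
            continue                   # already delivered from storage; skip boundary duplicates
        yield block
        cursor = block.number + 1
```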

User Benefits of Firehose Streaming

  • Cursor-based streaming: Users can specify the exact block from which to start indexing blockchain data, supporting precise blockchain indexing workflows.
  • Chain-agnostic design: Firehose works across different blockchain networks. The only difference is in the node modification for real-time streaming; the historical storage and API layers remain standardized.
  • Reorg notifications: Firehose notifies users immediately when a chain reorganization (reorg) happens, enabling accurate, reliable blockchain data indexing without data inconsistencies.
  • Unified data sources: Users don’t need to manage or think about switching between historical and real-time data streams—the service handles it seamlessly.
  • Built-in reorg handling: Complex reorg logic, which previously had to be implemented client-side when indexing blockchain data with classic nodes, is fully handled inside Firehose. The client only needs to react to reorg events sent by the service.

This architecture removes major pain points in blockchain indexing and delivers a scalable, reliable solution that simplifies how developers and applications consume blockchain data.

How Firehose Keeps Your Blockchain Indexing Always Up and Running

When we talk about 100% availability, the architecture has to be built to avoid any single point of failure. Here’s how Firehose approaches this:

What we do to ensure high availability:

  • Run at least two nodes streaming blocks in parallel.
  • Keep one node as the primary source, and the second as a backup.
  • Add an RPC provider working in polling mode as an additional fallback. It’s slower but ensures data flow if both nodes go down.


Blockchain Data Streaming via Firehose Flow Chart

To handle these data streams efficiently, we split the reader component into at least two instances. These readers independently fetch blocks from different sources and write them into a centralized bucket storage.

Each reader exposes a gRPC interface to stream binary block data.

The Firehose component does the following for end users:

  • Subscribes to multiple live sources (the readers) to get the freshest data as fast as possible.
  • Merges the incoming data streams, performing deduplication (a minimal sketch follows this list).
  • Whichever reader delivers a block first, that block is sent to the user.
  • If the primary node fails, the backup node continues streaming, albeit slower, so users keep receiving data without interruption.
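
Here is that merge-and-dedup step as a minimal sketch, assuming each reader exposes an iterable stream of blocks carrying a number and a hash (an assumption made for illustration):

```python
import queue
import threading

def merge_reader_streams(readers) -> "queue.Queue":
    """Merge several live block streams, deduplicating by (number, hash).

    `readers` is an illustrative list of iterables that yield blocks with
    `.number` and `.hash` attributes (e.g. the gRPC streams from each reader).
    Whichever reader delivers a block first wins; later copies are dropped.
    """
    out: queue.Queue = queue.Queue()
    seen: set[tuple[int, str]] = set()
    lock = threading.Lock()

    def pump(stream):
        for block in stream:
            key = (block.number, block.hash)
            with lock:
                if key in seen:
                    continue          # another reader already delivered this block
                seen.add(key)
            out.put(block)            # forward the first copy to the user-facing stream

    for stream in readers:
        threading.Thread(target=pump, args=(stream,), daemon=True).start()
    # A real service would also prune `seen` and handle reader reconnects.
    return out
```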

Handling Data Duplication in Storage

Since all readers write blocks to the same bucket, deduplication at the storage level becomes essential.

To solve this, we introduce a dedicated merger service that:

  1. Pulls all blocks from the primary bucket (One blocks bucket).
  2. Optimizes storage of finalized blocks by deduplicating and bundling them into groups of 100.
  3. Writes these optimized bundles into a separate storage — the Merged blocks bucket.
  4. Stores all forked blocks separately in the Forked blocks bucket.

Now, Firehose works with three buckets:

  • One blocks bucket (raw blocks from readers)
  • Merged blocks bucket (deduplicated, optimized bundles)
  • Forked blocks bucket (fork data)

This means when a user requests a large historical range, the service delivers blocks in bundles of 100 instead of individual blocks, making data retrieval faster and more efficient.
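
A simplified sketch of that merger pass is shown below; the bucket objects and the canonical-hash lookup are illustrative stand-ins for the real storage layer and fork-awareness logic.

```python
BUNDLE_SIZE = 100  # finalized blocks are bundled into groups of 100

def merge_one_blocks(one_blocks_bucket, merged_bucket, forked_bucket, canonical_hashes):
    """Illustrative merger pass over the 'One blocks' bucket.

    The bucket objects are assumed to expose list()/read()/write(), and
    `canonical_hashes` maps block number -> hash of the finalized canonical block.
    All of these are stand-ins, not the real Firehose storage layer.
    """
    by_number = {}
    for name in sorted(one_blocks_bucket.list()):
        block = one_blocks_bucket.read(name)
        if canonical_hashes.get(block["number"]) != block["hash"]:
            forked_bucket.write(name, block)   # forked blocks go to their own bucket
            continue
        by_number[block["number"]] = block     # deduplicate: keep one copy per height

    if not by_number:
        return
    lowest, highest = min(by_number), max(by_number)
    for start in range(lowest - lowest % BUNDLE_SIZE, highest + 1, BUNDLE_SIZE):
        bundle = [by_number[n] for n in range(start, start + BUNDLE_SIZE) if n in by_number]
        if len(bundle) == BUNDLE_SIZE:          # only write complete 100-block bundles
            merged_bucket.write(f"{start:010d}.bundle", bundle)
```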

Remaining Challenges

Firehose solves key problems related to fetching data directly from nodes and greatly improves service reliability.

However, over-fetching remains an issue. Firehose currently streams all data without filtering, which isn’t optimal since different applications require different data subsets.

Standard filter presets can’t cover every use case because each app’s needs are unique and often complex.

The simplest and most flexible solution is to allow developers to write custom filters themselves, streaming only the filtered data their applications actually need — making Firehose more efficient and adaptable. This is where Substreams steps in.

Custom Data Filtering with Substreams

Substreams is an engine that lets developers upload their own code — essentially a function that takes some input, processes it, and returns a result — compiled to WebAssembly.

In practice, the developer writes a function that takes input (for example, a block) and outputs something specific — like Raydium events. How the developer extracts those Raydium events from the block is entirely up to their logic. It’s not that complicated.

The developer writes the code, compiles it, uploads it to the server — and from there, the engine runs that function on every block. This means the stream delivers exactly the custom data the application needs, defined by the developer’s logic.
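
In real deployments these modules are compiled to WebAssembly, typically from Rust, and operate on decoded protobuf blocks. The Python sketch below only illustrates the shape of such a map function; the program id and block fields are placeholders.

```python
def map_raydium_events(block: dict) -> list[dict]:
    """Shape of a Substreams 'map' module: block in, application-specific data out.

    Real modules are compiled to WebAssembly (usually from Rust) and receive decoded
    protobuf blocks; this sketch only illustrates the contract of the function.
    RAYDIUM_PROGRAM_ID is a placeholder for whatever program or contract you track.
    """
    RAYDIUM_PROGRAM_ID = "<program id you care about>"
    events = []
    for tx in block.get("transactions", []):
        for instruction in tx.get("instructions", []):
            if instruction.get("program_id") == RAYDIUM_PROGRAM_ID:
                events.append({
                    "tx": tx.get("signature"),
                    "slot": block.get("slot"),
                    "data": instruction.get("data"),
                })
    return events  # the engine runs this on every block and streams only this output
```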

How Blockchain Data Streaming Service Architecture Evolves with Substreams

How Blockchain Data Streaming Service Architecture Evolves with Substreams: Architecture Infographics


When Substreams enters the picture, the architecture shifts to the following:

  • Substreams becomes its own service alongside Firehose.
    It runs developer-supplied WebAssembly (Wasm) modules, processes incoming block data, and streams back only the filtered, application-specific data.
  • Developers define exactly what they need.
    They specify the contracts, events, or on-chain data relevant to their app — no more unnecessary data flooding the client.

To support this, we introduce a Relayer component:

  • In the original Firehose setup, Firehose was the sole consumer of reader streams and handled deduplication itself. Now that both Firehose and Substreams consume block data, deduplication logic is moved into the Relayer.
  • The Relayer ensures that whichever node delivers the block first is the one whose data gets streamed to clients.

How Substreams Blockchain Data Streaming Service Scales

The Substreams service is built around two core components: the Front Tier and the Worker Pool.


How Substreams Blockchain Data Streaming Service Works - Flow chart

When a user requests to process a block range — for example, from block 10,000 to 14,999 (5,000 blocks) — the request is sent to the Front Tier.

The Front Tier manages a group of workers (Substreams Tier 2). Each worker can handle up to 16 concurrent tasks. The Front Tier splits the requested range into smaller segments of about 1,000 blocks each and distributes these segments across the workers.

Each worker processes its assigned block segment and writes the resulting data into a dedicated Substreams store bucket. This bucket serves as a cache layer that stores processed data for quick access and efficient retrieval — we’ll cover its importance in more detail when we discuss data bundling.

Instead of streaming data directly back to Front Tier, the workers stream progress updates. These updates indicate when a segment finishes processing or if an error occurs (e.g., a function revert), since user-defined logic might occasionally fail.

The Front Tier ensures strict ordering by waiting for the first segment to finish before streaming its data to the user. It then moves sequentially through each segment, waiting for each to complete before sending its data. This guarantees a reliable, ordered data stream from the start to the end of the requested block range.
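
A rough sketch of that scheduling logic is shown below; dispatch and read_segment_output stand in for the worker RPC and the Substreams store bucket, which are not spelled out here.

```python
SEGMENT_SIZE = 1_000       # blocks per worker segment
MAX_TASKS_PER_WORKER = 16  # concurrent segments a single Tier-2 worker can take

def split_into_segments(start_block: int, stop_block: int, size: int = SEGMENT_SIZE):
    """Split an inclusive block range into fixed-size segments for the worker pool."""
    segments = []
    cursor = start_block
    while cursor <= stop_block:
        end = min(cursor + size - 1, stop_block)
        segments.append((cursor, end))
        cursor = end + 1
    return segments

def stream_in_order(segments, dispatch, read_segment_output):
    """Conceptual Front Tier loop: dispatch all segments, then emit results in order.

    `dispatch` schedules a segment on a worker and returns a handle whose .wait()
    blocks until the segment is done; `read_segment_output` pulls the processed data
    from the Substreams store bucket. Both are illustrative stand-ins.
    """
    handles = [dispatch(seg) for seg in segments]      # workers run in parallel
    for seg, handle in zip(segments, handles):
        handle.wait()                                   # strict ordering: wait for segment N...
        yield from read_segment_output(seg)             # ...before streaming its data to the user

# Example: 5,000 blocks -> five segments of 1,000
print(split_into_segments(10_000, 14_999))
```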

How Modules Work in Substreams

Let’s break down the structure of the functions you can load into Substreams and how they help with scaling.



Module Outputs Caching

The simplest optimization that comes to mind is caching. When you write your own module, you can configure it to accept the output of another, already cached module instead of raw blocks. By referencing that cached module in your request, you let the server reuse its stored output rather than reprocess entire blocks.

For example, there’s an existing module—built by others before us—that takes blocks from the merged blocks bucket as input. Its job is to extract all Uniswap V3 events within each block. It doesn’t modify the data, just filters it down, so the output is smaller than the original block data. Essentially, it contains only the Uniswap V3 events, not the entire block content.


Substreams Blockchain Data Streaming Service Module Outputs Caching flow chart

Our service then stores this filtered data in the Substreams Store Bucket. When writing your own module, you can specify that it should take another module’s output (the Uniswap V3 events) as input instead of raw blocks. By including this module in your query, the server recognizes it can pull pre-filtered data directly from the cache, saving compute resources.

Since billing is based on the amount of data retrieved, accessing already filtered data from the cache not only streamlines the workflow for the developer but also reduces costs.

Index Modules

Index modules are different from regular ones because they produce a standard kind of output. For every block, they give you a list of keys — kind of like markers — that help quickly check if the block holds the data you need.

What this means is the index module takes raw blocks, scans them, and builds an index that shows things like which contracts were touched or what log topics showed up in that block.

How Filters Use Indexes to Cut Down Data

Say you have a module called Filtered Transactions. It uses the index output to narrow down blocks. In your module’s manifest, you say: “I want to use this index,” and you add a filter, for example, “Show me Raydium transactions.”



The server then pulls the cached indexes, figures out which blocks actually contain Raydium transactions, and only sends those blocks to your Filtered Transactions module. So you’re not wasting time checking every block.
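
A simplified sketch of the idea: an index module emits a set of keys per block, and the filter consults those cached keys to decide which blocks even reach your module. The key scheme and block fields here are invented for illustration.

```python
def build_block_index(block: dict) -> set[str]:
    """Shape of an index module: for every block, emit a set of keys (markers).

    Here the keys are the program/contract ids touched in the block; real index
    modules are Wasm modules, and the key scheme is up to the developer.
    """
    keys = set()
    for tx in block.get("transactions", []):
        for instruction in tx.get("instructions", []):
            keys.add(f"program:{instruction.get('program_id')}")
    return keys

def blocks_matching(index_by_block: dict[int, set[str]], wanted_key: str) -> list[int]:
    """Server-side filter step: consult cached indexes, return only matching heights."""
    return [height for height, keys in index_by_block.items() if wanted_key in keys]

# A 'Filtered Transactions' module then only ever receives the blocks whose index
# contains, say, "program:<raydium program id>"; every other block is skipped.
```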

Reusing Cached Data to Save Time and Power

What’s worth noting is that if someone already filtered Raydium transactions before, that data is probably cached. So instead of running through the index again, you can just grab that filtered result and start right away.

Loading Blockchain Indexed Data into a Database

At this stage, the goal is to move all data processed by Substreams into a database. This is typically done using a sink service such as SQL Sink.

Connecting Substreams to a Database via SQL Sink

SQL Sink is an open-source tool developed by The Graph that connects to the Substreams server and consumes data streams.

A key distinction of SQL Sink is that data modules must emit data in a specific format. This format defines how the data should be structured and mapped to database operations.

Specifically, it outlines database commands like insert, upsert, update, and delete along with the relevant primary keys and associated data.

This design enables high processing speed while delegating all data transformation logic to the Substreams modules. The output is a clear set of instructions for database operations, which SQL Sink executes. The user’s role is to implement modules that produce data conforming to this format.
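
The real interchange format is a protobuf schema of database changes; the sketch below only captures the idea of modules emitting operations (kind, table, primary key, fields) that the sink applies and logs. The db and history helpers are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class DatabaseOperation:
    """Illustrative stand-in for the sink's change format (the real one is a protobuf)."""
    kind: str                  # "insert" | "upsert" | "update" | "delete"
    table: str
    primary_key: dict          # e.g. {"position_id": "0xabc...:12"}
    fields: dict = field(default_factory=dict)
    block_number: int = 0      # recorded so reorgs can be rolled back later

def apply(op: DatabaseOperation, db, history):
    """Sketch of the sink loop: map each emitted operation onto the database and log it."""
    history.record(op)                         # every operation lands in the History table first
    if op.kind == "insert":
        db.insert(op.table, {**op.primary_key, **op.fields})
    elif op.kind == "upsert":
        db.upsert(op.table, op.primary_key, op.fields)
    elif op.kind == "update":
        db.update(op.table, op.primary_key, op.fields)
    elif op.kind == "delete":
        db.delete(op.table, op.primary_key)
```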


How to Load Blockchain Indexed Data into a Database

What’s happening next:

SQL Sink processes the insert, upsert, update, and delete commands by distributing the data across database tables as defined by the modules.

To handle chain reorganizations, every database operation is logged in a dedicated History table. When a chain reorg event is detected, along with information on the latest valid block, the system can roll back all operations associated with invalid blocks by referring to the History table — ensuring the database remains consistent and accurate.

Chain Reorg Handling

When a chain reorg happens, the system looks up all operations in the History table that fall within the range of invalid blocks and undoes those changes. This ensures the database stays consistent after a reorg.
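
A hedged sketch of that rollback, assuming a History table and db helpers shaped like the ones in the previous sketch (both are illustrative, not SQL Sink's actual schema):

```python
def rollback_reorg(db, last_valid_block: int):
    """Undo every logged operation above the last valid block after a reorg.

    Assumes a History table like (id, block_number, op_kind, table_name, primary_key,
    previous_row) populated by the sink; the schema and db helpers are illustrative.
    """
    invalid_ops = db.query(
        "SELECT op_kind, table_name, primary_key, previous_row FROM history "
        "WHERE block_number > %s ORDER BY id DESC",           # undo newest changes first
        [last_valid_block],
    )
    for op_kind, table_name, primary_key, previous_row in invalid_ops:
        if op_kind == "insert":
            db.delete(table_name, primary_key)                 # the inserted row never existed
        elif op_kind in ("update", "upsert"):
            db.restore(table_name, primary_key, previous_row)  # put the old values back
        elif op_kind == "delete":
            db.restore(table_name, primary_key, previous_row)  # re-create the deleted row
    db.execute("DELETE FROM history WHERE block_number > %s", [last_valid_block])
```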


How to Handle a Chain Reorg in a Blockchain Indexer


The service is built to be flexible. While it currently supports basic operations like insert, upsert, update, and delete, it can be forked and extended to add new operations—like increments—by creating custom modules and handlers that turn these into SQL commands.

Users aren’t tied to just the default SQL sink either. The core service provides the data streams and parallel processing, so users can build their own sinks however they want—whether that’s tweaking the SQL sink or writing something completely new from scratch.

Comparison with Subgraphs

Subgraphs operate as a self-contained package: users provide compiled WebAssembly code that defines all the logic for handling various scenarios—how to respond to events, transactions, and so forth.



Unlike the Substreams approach, subgraphs don’t maintain their own block storage. Every time a block range needs processing, the subgraph queries a node directly. This means there’s no need to host the full data infrastructure described earlier, which is an advantage in terms of simplicity and setup. It also means the entire system can be deployed independently.

However, subgraphs lack data parallelization. They must sync all blocks from scratch, which can be a bottleneck. Additionally, subgraphs are well-suited for networks like Ethereum but are not practical for high-throughput blockchains such as Solana.

Why Indexing Still Holds Back Blockchain Growth

As new L1s and high-performance chains continue to launch, one core challenge remains overlooked: indexing infrastructure. Many emerging networks still lack native tools to support reliable, scalable access to on-chain data. For developers building on top of these chains, this creates a constant struggle—spending time and resources writing custom indexers, working around slow RPCs, or patching together brittle solutions that don’t keep up with modern network throughput.

This creates real ecosystem bottlenecks:

  • Accessing blockchain data at scale is still complex
  • Developers burn time and grant budgets on infrastructure instead of building apps
  • Protocol teams duplicate efforts, solving the same problem over and over
  • Without solid indexing, new networks become harder to build on—slowing adoption

Substreams was designed to address exactly this. It’s a high-throughput data indexing framework that lets blockchains offer production-grade data infrastructure natively. Key benefits include:

  • Real-time and historical data streaming
  • Cursor-based access and parallel processing
  • A modular architecture where devs write their own filters
  • Caching and deduplication to reduce cost and speed up access

By integrating Substreams, chains can offer developer-friendly access to structured, streamable blockchain data—without sacrificing performance or scalability.

About Rock’n’Block

Rock’n’Block is a Web3-native dev shop specializing in blockchain infrastructure and data indexing solutions. We work with projects and protocols to build reliable, scalable systems for real-time and historical blockchain data processing.

With deep expertise in Firehose, Substreams, and custom indexing pipelines, we help developers access and analyze blockchain data efficiently across multiple chains — from EVM networks to Solana and TON. Our focus is on delivering production-ready solutions that handle high throughput, support complex data queries, and adapt to evolving blockchain ecosystems.

Read our case study: How We Built a Blockchain Data Streaming Service for Blum

We’ve supported over 300 projects, powering products used by millions of users worldwide. Our goal is to enable founders and teams to focus on building great applications by solving the complex backend challenges around blockchain data indexing.

