IO tracing

When should I use IO traces?

IO traces can be used to identify slow receipts and to understand why they are slow. Or to detect general inefficiencies in how we use the DB.

The results give you counts of DB requests and some useful statistics such as trie node cache hits. It will NOT give you time measurements, use Grafana to observe those.

The main use cases in the past were to estimate the performance of new storage features (prefetcher, flat state) and to find out why specific contracts produce slow receipts.

Setup

When compiling neard (or the parameter estimator) with feature=io_trace it instruments the binary code with fine-grained database operations tracking.

Aside: We don't enable it by default because we are afraid the overhead could be too much, since we store some information for very hot paths such as trie node cache hits. Although we haven't properly evaluated if it really is a performance problem.

This allows using the --record-io-trace=/path/to/output.io_trace CLI flag on neard. Run it in combination with the subcommands neard run, neard view-state, or runtime-params-estimator and it will record an IO trace. Make sure to provide the flag to neard itself, however, not to the subcommands. (See examples below)

# Example command for normal node
# (Careful! This will quickly fill your disk if you let it run.)
cargo build --release -p neard --features=io_trace
target/release/neard \
    --record-io-trace=/mnt/disks/some_disk_with_enough_space/my.io_trace \
    run

# Example command for state viewer, applying a range of chunks in shard 0
cargo build --release -p neard --features=io_trace
target/release/neard \
    --record-io-trace=75220100-75220101.s0.io_trace \
    view-state apply-range --start-index 75220100 --end-index 75220101 \
    --shard-id 0 sequential

# Example command for params estimator
cargo run --release -p runtime-params-estimator --features=required,io_trace \
-- --accounts-num 200000 --additional-accounts-num 200000 \
--iters 3 --warmup-iters 1 --metric time \
--record-io-trace=tmp.io \
--costs ActionReceiptCreation

IO trace content

Once you have collected an IO trace, you can inspect its content manually, or use existing tools to extract statistics. Let's start with the manual approach.

Simple example trace: Estimator

An estimator trace typically starts something like this:

commit
  SET DbVersion "'VERSION'" size=2
commit
  SET DbVersion "'KIND'" size=3
apply num_transactions=0 shard_cache_miss=7
  GET State "AAAAAAAAAAB3I0MYevRcExi1ql5PSQX+fuObsPH30yswS7ytGPCgyw==" size=46
  GET State "AAAAAAAAAACGDsmYvNoBGZnc8PzDKoF4F2Dvw3N6XoAlRrg8ezA8FA==" size=107
  GET State "AAAAAAAAAAB3I0MYevRcExi1ql5PSQX+fuObsPH30yswS7ytGPCgyw==" size=46
  GET State "AAAAAAAAAACGDsmYvNoBGZnc8PzDKoF4F2Dvw3N6XoAlRrg8ezA8FA==" size=107
  GET State "AAAAAAAAAAB3I0MYevRcExi1ql5PSQX+fuObsPH30yswS7ytGPCgyw==" size=46
  GET State "AAAAAAAAAACGDsmYvNoBGZnc8PzDKoF4F2Dvw3N6XoAlRrg8ezA8FA==" size=107
  GET State "AAAAAAAAAAB3I0MYevRcExi1ql5PSQX+fuObsPH30yswS7ytGPCgyw==" size=46
...

Depending on the source, traces look a bit different at the start. But you should always see some setup at the beginning and per-chunk workload later on.

Indentation is used to display the call hierarchy. The commit keyword shows when a commit starts, and all SET and UPDATE_RC commands that follow with one level deeper indentation belong to that same database transaction commit.

Later, you see a group that starts with an apply header. It groups all IO requests that were performed for a call to fn apply that applies transactions and receipts of a chunk to the previous state root.

In the example, you see a list of GET requests that belong to that apply, each with the DB key used and the size of the value read. Further, you can read in the trace that this specific chunk had 0 transactions and that it cache-missed all 7 of the DB requests it performed to apply this empty chunk.

Example trace: Full mainnet node

Next let's look at an excerpt of an IO trace from a real node on mainnet.

...
GET State "AQAAAAMAAACm9DRx/dU8UFEfbumiRhDjbPjcyhE6CB1rv+8fnu81bw==" size=9
GET State "AQAAAAAAAACLFgzRCUR3inMDpkApdLxFTSxRvprJ51eMvh3WbJWe0A==" size=203
GET State "AQAAAAIAAACXlEo0t345S6PHsvX1BLaGw6NFDXYzeE+tlY2srjKv8w==" size=299
apply_transactions shard_id=3
  process_state_update 
    apply num_transactions=3 shard_cache_hit=207
      process_transaction tx_hash=C4itKVLP5gBoAPsEXyEbi67Gg5dvQVugdMjrWBBLprzB shard_cache_hit=57
        GET FlatState "AGxlb25hcmRvX2RheS12aW5jaGlrLm5lYXI=" size=36
        GET FlatState "Amxlb25hcmRvX2RheS12aW5jaGlrLm5lYXICANLByB1merOzxcGB1HI9/L60QvONzOE6ovF3hjYUbhA8" size=36
      process_transaction tx_hash=GRSXC4QCBJHN4hmJiATAbFGt9g5PiksQDNNRaSk666WX shard_cache_miss=3 prefetch_pending=3 shard_cache_hit=35
        GET FlatState "AnJlbGF5LmF1cm9yYQIA5iq407bcLgisCKxQQi47TByaFNe9FOgQg5y2gpU4lEM=" size=36
      process_transaction tx_hash=6bDPeat12pGqA3KEyyg4tJ35kBtRCuFQ7HtCpWoxr8qx shard_cache_miss=2 prefetch_pending=1 shard_cache_hit=21 prefetch_hit=1
        GET FlatState "AnJlbGF5LmF1cm9yYQIAyKT1vEHVesMEvbp2ICA33x6zxfmBJiLzHey0ZxauO1k=" size=36
      process_receipt receipt_id=GRB3skohuShBvdGAoEoR3SdJJw7MwCxxscJHKLdPoYUC predecessor=1663adeba849fb7c26195678e1c5378278e5caa6325d4672246821d8e61bb160 receiver=token.sweat id=GRB3skohuShBvdGAoEoR3SdJJw7MwCxxscJHKLdPoYUC shard_cache_too_large=1 shard_cache_miss=1 shard_cache_hit=38
        GET FlatState "AXRva2VuLnN3ZWF0" size=36
        GET State "AQAAAAMAAADVYp4vtlIbDoVhji22CZOEaxVWVTJKASq3iMvpNEQVDQ==" size=206835
        input 
        register_len 
        read_register 
        storage_read READ key='STATE' size=70 tn_db_reads=20 tn_mem_reads=0 shard_cache_hit=21
        register_len 
        read_register 
        attached_deposit 
        predecessor_account_id 
        register_len 
        read_register 
        sha256 
        read_register 
        storage_read READ key=dAAxagYMOEb01+56sl9vOM0yHbZRPSaYSL3zBXIfCOi7ow== size=16 tn_db_reads=10 tn_mem_reads=19 shard_cache_hit=11
          GET FlatState "CXRva2VuLnN3ZWF0LHQAMWoGDDhG9NfuerJfbzjNMh22UT0mmEi98wVyHwjou6M=" size=36
        register_len 
        read_register 
        sha256 
        read_register 
        storage_write WRITE key=dAAxagYMOEb01+56sl9vOM0yHbZRPSaYSL3zBXIfCOi7ow== size=16 tn_db_reads=0 tn_mem_reads=30
        ...

Maybe that's a bit much. Let's break it down into pieces.

It start with a few DB get requests that are outside of applying a chunk. It's quite common that we have these kinds of constant overhead requests that are independent of what's inside a chunk. If we see too many such requests, we should take a close look to see if we are wasting performance.

GET State "AQAAAAMAAACm9DRx/dU8UFEfbumiRhDjbPjcyhE6CB1rv+8fnu81bw==" size=9
GET State "AQAAAAAAAACLFgzRCUR3inMDpkApdLxFTSxRvprJ51eMvh3WbJWe0A==" size=203
GET State "AQAAAAIAAACXlEo0t345S6PHsvX1BLaGw6NFDXYzeE+tlY2srjKv8w==" size=299

Next let's look at apply_transactions but limit the depth of items to 3 levels.

apply_transactions shard_id=3
  process_state_update 
    apply num_transactions=3 shard_cache_hit=207
      process_transaction tx_hash=C4itKVLP5gBoAPsEXyEbi67Gg5dvQVugdMjrWBBLprzB shard_cache_hit=57
      process_transaction tx_hash=GRSXC4QCBJHN4hmJiATAbFGt9g5PiksQDNNRaSk666WX shard_cache_miss=3 prefetch_pending=3 shard_cache_hit=35
      process_transaction tx_hash=6bDPeat12pGqA3KEyyg4tJ35kBtRCuFQ7HtCpWoxr8qx shard_cache_miss=2 prefetch_pending=1 shard_cache_hit=21 prefetch_hit=1
      process_receipt receipt_id=GRB3skohuShBvdGAoEoR3SdJJw7MwCxxscJHKLdPoYUC predecessor=1663adeba849fb7c26195678e1c5378278e5caa6325d4672246821d8e61bb160 receiver=token.sweat id=GRB3skohuShBvdGAoEoR3SdJJw7MwCxxscJHKLdPoYUC shard_cache_too_large=1 shard_cache_miss=1 shard_cache_hit=38

Here you can see that before we even get to apply, we go through apply_transactions and process_state_update. The excerpt does not show it but there are DB requests listed further below that belong to these levels but not to apply.

Inside apply, we see 3 transactions being converted to receipts as part of this chunk, and one already existing action receipt getting processed.

Cache hit statistics for each level are also displayed. For example, the first transaction has 57 read requests and all of them hit in the shard cache. For the second transaction, we miss the cache 3 times but the values were already in the process of being prefetched. This would be account data which we fetch in parallel for all transactions in the chunk.

Finally, there are several process_receipt groups, although the excerpt was cut to display only one. Here we see the receiving account (receiver=token.sweat) and the receipt ID to potentially look it up on an explorer, or dig deeper using state viewer commands.

Again, cache hit statistics are included. Here you can see one value missed the cache because it was too large. Usually that's a contract code. We do not include it in the shard cache because it would take up too much space.

Zooming in a bit further, let's look at the DB request at the start of the receipt.

    GET FlatState "AXRva2VuLnN3ZWF0" size=36
    GET State "AQAAAAMAAADVYp4vtlIbDoVhji22CZOEaxVWVTJKASq3iMvpNEQVDQ==" size=206835
    input 
    register_len 
    read_register 
    storage_read READ key='STATE' size=70 tn_db_reads=20 tn_mem_reads=0 shard_cache_hit=21
    register_len 
    read_register

FlatState "AXRva2VuLnN3ZWF0" reads the ValueRef of the contract code, which is 36 bytes in its serialized format. Then, the ValueRef is dereferenced to read the actual code, which happens to be 206kB in size. This happens in the State column because at the time of writing, we still read the actual values from the trie, not from flat state.

What follows are host function calls performed by the SDK. It uses input to check the function call arguments and copies it from a register into WASM memory.

Then the SDK reads the serialized contract state from the hardcoded key "STATE". Note that we charge 20 tn_db_reads for it, since we missed the accounting cache, but we hit everything in the shard cache. Thus, there are no DB requests. If there were DB requests for this tn_db_reads, you would see them listed.

The returned 70 bytes are again copied into WASM memory. Knowing the SDK code a little bit, we can guess that the data is then deserialized into the struct used by the contract for its root state. That's not visible on the trace though, as this happens completely inside the WASM VM.

Next we start executing the actual contract code, which again calls a bunch of host functions. Apparently the code starts by reading the attached deposit and the predecessor account id, presumably to perform some checks.

The sha256 call here is used to shorten implicit account ids. (Link to code for comparison).

Afterwards, a value with 16 bytes (a u128) is fetched from the trie state. To serve this, it required reading 30 trie nodes, 19 of them were cached in the accounting cache and were not charged the full gas cost. And the remaining 11 missed the accounting cache but they hit the shard cache. Nothing needed to be fetched from DB because the Sweatcoin specific prefetcher has already loaded everything into the shard cache.

Note: We see trie node requests despite flat state being used. This is because the trace was collected with a binary that performed a read on both the trie and flat state to do some correctness checks.

    attached_deposit 
    predecessor_account_id 
    register_len 
    read_register 
    sha256 
    read_register 
    storage_read READ key=dAAxagYMOEb01+56sl9vOM0yHbZRPSaYSL3zBXIfCOi7ow== size=16 tn_db_reads=10 tn_mem_reads=19 shard_cache_hit=11
        GET FlatState "CXRva2VuLnN3ZWF0LHQAMWoGDDhG9NfuerJfbzjNMh22UT0mmEi98wVyHwjou6M=" size=36

So that is how to read these traces and dig deep. But maybe you want aggregated statistics instead? Then please continue reading.

Evaluating an IO trace

When you collect an IO trace over an hour of mainnet traffic, it can quickly be above 1GB in uncompressed size. You might be able to sample a few receipts and eyeball them to get a feeling for what's going on. But you can't understand the whole trace without additional tooling.

The parameter estimator command replay can help with that. (See also this readme) Run the following command to see an overview of available commands.

# will print the help page for the IO trace replay command
cargo run --profile dev-release -p runtime-params-estimator -- \
  replay --help

All commands aggregate the information of a trace. Either globally, per chunk, or per receipt. For example, below is the output that gives a list of RocksDB columns that were accessed and how many times, aggregated by chunk.

cargo run --profile dev-release -p runtime-params-estimator -- \
  replay  ./path/to/my.io_trace  chunk-db-stats

apply_transactions shard_id=3 block=DajBgxTgV8NewTJBsR5sTgPhVZqaEv9xGAKVnCiMiDxV
  GET   12 FlatState  4 State  

apply_transactions shard_id=0 block=DajBgxTgV8NewTJBsR5sTgPhVZqaEv9xGAKVnCiMiDxV
  GET   14 FlatState  8 State  

apply_transactions shard_id=2 block=HTptJFZKGfmeWs7y229df6WjMQ3FGfhiqsmXnbL2tpz8
  GET   2 FlatState  

apply_transactions shard_id=3 block=HTptJFZKGfmeWs7y229df6WjMQ3FGfhiqsmXnbL2tpz8
  GET   6 FlatState  2 State  

apply_transactions shard_id=1 block=HTptJFZKGfmeWs7y229df6WjMQ3FGfhiqsmXnbL2tpz8
  GET   50 FlatState  5 State  

...

apply_transactions shard_id=3 block=AUcauGxisMqNmZu5Ln7LLu8Li31H1sYD7wgd7AP6nQZR
  GET   17 FlatState  3 State  

top-level:
  GET   8854 Block  981 BlockHeader  16556 BlockHeight  59155 BlockInfo  2 BlockMerkleTree  330009 BlockMisc  1 BlockOrdinal  31924 BlockPerHeight  863 BlockRefCount  1609 BlocksToCatchup  1557 ChallengedBlocks  4 ChunkExtra  5135 ChunkHashesByHeight  128788 Chunks  35 EpochInfo  1 EpochStart  98361 FlatState  1150 HeaderHashesByHeight  8113 InvalidChunks  263 NextBlockHashes  22 OutgoingReceipts  131114 PartialChunks  1116 ProcessedBlockHeights  968698 State  
  SET   865 BlockHeight  1026 BlockMerkleTree  12428 BlockMisc  1636 BlockOrdinal  865 BlockPerHeight  865 BlockRefCount  3460 ChunkExtra  3446 ChunkHashesByHeight  339142 FlatState  3460 FlatStateDeltas  3460 FlatStateMisc  865 HeaderHashesByHeight  3460 IncomingReceipts  865 NextBlockHashes  3442 OutcomeIds  3442 OutgoingReceipts  863 ProcessedBlockHeights  340093 StateChanges  3460 TrieChanges

The output contains one apply_transactions for each chunk, with the block hash and the shard id. Then it prints one line for each DB operations observed (GET,SET,...) together with a list of columns and an OP count.

See the top-level output at the end? These are all the DB requests that could not be assigned to specific chunks. The way we currently count write operations (SET, UPDATE_RC) they are never assigned to a specific chunk and instead only show up in the top-level list. Clearly, there is some room for improvement here. So far we simply haven't worried about RocksDB write performance so the tooling to debug write performance is naturally lacking.

Guide to Nearcore Development