Introduction

Welcome to the nearcore development guide!

The target audience of this guide are developers of nearcore itself. If you are a user of NEAR (either a contract developer, or validator running a node), please refer to the user docs at https://docs.near.org.

This guide is built with mdBook from sources in the nearcore repository. You can edit it by pressing the "edit" icon in the top right corner, we welcome all contributions. The guide is hosted at https://near.github.io/nearcore/.

The guide is organized as a collection of loosely coupled chapters -- you don't need to read them in order, feel free to peruse the TOC, and focus on the interesting bits. The chapters are classified into three parts:

  • Architecture talks about how the code works. So, for example, if you are interested in how a transaction flows through the system, look there!
  • Practices describe, broadly, how we write code. For example, if you want to learn about code style, issue tracking, or debugging performance problems, this is the chapter for you.
  • Finally, the Misc part holds various assorted bits and pieces. We are trying to bias ourselves towards writing more docs, so, if you want to document something and it doesn't cleanly map to a category above, just put it in misc!

If you are unsure, start with Architecture Overview and then read Run a Node

Overview

This document describes the high-level architecture of nearcore. The focus here is on the implementation of the blockchain protocol, not the protocol itself. For reference documentation of the protocol, please refer to nomicon

Some parts of our architecture are also covered in this video series on YouTube.

Bird's Eye View

If we put the entirety of nearcore onto one picture, we get something like this:

Don't worry if this doesn't yet make a lot of sense: hopefully, by the end of this document the above picture would become much clearer!

Overall Operation

nearcore is a blockchain node -- it's a single binary (neard) which runs on some machine and talks to other similar binaries running elsewhere. Together, the nodes agree (using a distributed consensus algorithm) on a particular sequence of transactions. Once transaction sequence is established, each node applies transactions to the current state. Because transactions are fully deterministic, each node in the network ends up with identical state. To allow greater scalability, NEAR protocol uses sharding, which allows a node to hold only a small subset (shard) of the whole state.

neard is a stateful, restartable process. When neard starts, the node connects to the network and starts processing blocks (block is a batch of transactions, processed together; transactions are batched into blocks for greater efficiency). The results of processing are persisted in the database. RocksDB is used for storage. Usually, the node's data is found in the ~/.near directory. The node can be stopped at any moment and be restarted later. While the node is offline it misses the block, so, after a restart, the sync process kicks in which brings the node up-to-speed with the network by downloading the missing bits of history from more up-to-date peer nodes.

Major components of nearcore:

  • JSON RPC. This HTTP RPC interface is how neard communicates with non-blockchain outside world. For example, to submit a transaction, some client sends an RPC request with it to some node in the network. From that node, the transaction propagates through the network, until it is included in some block. Similarly, a client can send an HTTP request to a node to learn about current state of the blockchain. The JSON RPC interface is documented here.

  • Network. If RPC is aimed "outside" the blockchain, "network" is how peer neard nodes communicate with each other within the blockchain. RPC carries requests from users of the blockchain, while network carries various messages needed to implement consensus. Two directly connected nodes communicate by sending protobuf-encoded messages over TCP. A node also includes logic to route messages for indirect peers through intermediaries. Oversimplifying a lot, it's enough for a new node to know an IP address of just one other network participant. From this bootstrap connection, the node learns how to communicate with any other node in the network.

  • Client. Somewhat confusingly named, client is the logical state of the blockchain. After receiving and decoding a request, both RPC and network usually forward it in the parsed form to the client. Internally, client is split in two somewhat independent components: chain and runtime.

  • Chain. The job of chain, in a nutshell, is to determine a global order of transactions. Chain builds and maintains the blockchain data structure. This includes block and chunk production and processing, consensus, and validator selection. However, chain is not responsible for actually applying transactions and receipts.

  • Runtime. If chain selects the order of transactions, Runtime applies transaction to the state. Chain guarantees that everyone agrees on the order and content of transactions, and Runtime guarantees that each transaction is fully deterministic. It follows that everyone agrees on the "current state" of the blockchain. Some transactions are as simple as "transfer X tokens from Alice to Bob". But a much more powerful class of transactions is supported: "run this arbitrary WebAssembly code in the context of the current state of the chain". Running such "smart contract" transactions securely and efficiently is a major part of what Runtime does. Today, Runtime uses a JIT compiler to do that.

  • Storage. Storage is more of a cross-cutting concern, than an isolated component. Many parts of a node want to durably persist various bits of state to disk. One notable case is the logical state of the blockchain, and, in particular, data associated with each account. Logically, the state of an account on a chain is a key-value map: HashMap<Vec<u8>, Vec<u8>>. But there is a twist: it should be possible to provide a succinct proof that a particular key indeed holds a particular value. To allow that internally the state is implemented as a persistent (in both senses, "functional" and "on disk") merkle-patricia trie.

  • Parameter Estimator. One kind of transaction we support is "run this arbitrary, Turing-complete computation". To protect from a loop {} transaction halting the whole network, Runtime implements resource limiting: each transaction runs with a certain finite amount of "gas", and each operation costs a certain amount of gas to perform. Parameter estimator is essentially a set of benchmarks used to estimate relative gas costs of various operations.

Entry Points

neard/src/main.rs contains the main function that starts a blockchain node. However, this file mostly only contains the logic to parse arguments and dispatch different commands. start_with_config in nearcore/src/lib.rs is the actual entry point and it starts all the actors.

JsonRpcHandler::process in the jsonrpc crate is the RPC entry point. It implements the public API of a node, which is documented here.

PeerManagerActor::spawn in the network is an entry for the other point of contract with the outside world -- the peer-to-peer network.

Runtime::apply in the runtime crate is the entry point for transaction processing logic. This is where state transitions actually happen, after chain decided, according to distributed consensus, which transitions need to happen.

Code Map

This section contains some high-level overview of important crates and data structures.

core/primitives

This crate contains most of the types that are shared across different crates.

core/primitives-core

This crate contains types needed for runtime.

core/store/trie

This directory contains the MPT state implementation. Note that we usually use TrieUpdate to interact with the state.

chain/chain

This crate contains most of the chain logic (consensus, block processing, etc). ChainUpdate::process_block is where most of the block processing logic happens.

State update

The blockchain state of a node can be changed in the following two ways:

  • Applying a chunk. This is how the state is normally updated: through Runtime::apply.
  • State sync. State sync can happen in two cases:
    • A node is far enough behind the most recent block and triggers state sync to fast forward to the state of a very recent block without having to apply blocks in the middle.
    • A node is about to become validator for some shard in the next epoch, but it does not yet have the state for that shard. In this case, it would run state sync through the catchup routine.

chain/chunks

This crate contains most of the sharding logic which includes chunk creation, distribution, and processing. ShardsManager is the main struct that orchestrates everything here.

chain/client

This crate defines two important structs, Client and ViewClient. Client includes everything necessary for the chain (without network and runtime) to function and runs in a single thread. ViewClient is a "read-only" client that answers queries without interfering with the operations of Client. ViewClient runs in multiple threads.

chain/network

This crate contains the entire implementation of the p2p network used by NEAR blockchain nodes.

Two important structs here: PeerManagerActor and Peer. Peer manager orchestrates all the communications from network to other components and from other components to network. Peer is responsible for low-level network communications from and to a given peer (more details in this article). Peer manager runs in one thread while each Peer runs in its own thread.

Architecture Invariant: Network communicates to Client through NetworkClientMessages and to ViewClient through NetworkViewClientMessages. Conversely, Client and ViewClient communicates to network through NetworkRequests.

chain/epoch_manager

This crate is responsible for determining validators and other epoch related information such as epoch id for each epoch.

Note: EpochManager is constructed in NightshadeRuntime rather than in Chain, partially because we had this idea of making epoch manager a smart contract.

chain/jsonrpc

This crate implements JSON-RPC API server to enable submission of new transactions and inspection of the blockchain data, the network state, and the node status. When a request is processed, it generates a message to either ClientActor or ViewClientActor to interact with the blockchain. For queries of blockchain data, such as block, chunk, account, etc, the request usually generates a message to ViewClientActor. Transactions, on the other hand, are sent to ClientActor for further processing.

runtime/runtime

This crate contains the main entry point to runtime -- Runtime::apply. This function takes ApplyState, which contains necessary information passed from chain to runtime, a list of SignedTransaction and a list of Receipt, and returns an ApplyResult, which includes state changes, execution outcomes, etc.

Architecture Invariant: The state update is only finalized at the end of apply. During all intermediate steps state changes can be reverted.

runtime/near-vm-logic

VMLogic contains all the implementations of host functions and is the interface between runtime and wasm. VMLogic is constructed when runtime applies function call actions. In VMLogic, interaction with NEAR blockchain happens in the following two ways:

  • VMContext, which contains lightweight information such as current block hash, current block height, epoch id, etc.
  • External, which is a trait that contains functions to interact with blockchain by either reading some nontrivial data, or writing to the blockchain.

runtime/near-vm-runner

run function in runner.rs is the entry point to the vm runner. This function essentially spins up the vm and executes some function in a contract. It supports different wasm compilers including wasmer0, wasmer2, and wasmtime through compile-time feature flags. Currently we use wasmer0 and wasmer2 in production. The imports module exposes host functions defined in near-vm-logic to WASM code. In other words, it defines the ABI of the contracts on NEAR.

neard

As mentioned before, neard is the crate that contains that main entry points. All the actors are spawned in start_with_config. It is also worth noting that NightshadeRuntime is the struct that implements RuntimeAdapter.

core/store/src/db.rs

This file contains the schema (DBCol) of our internal RocksDB storage - a good starting point when reading the code base.

Cross Cutting Concerns

Observability

The tracing crate is used for structured, hierarchical event output and logging. We also integrate Prometheus for light-weight metric output. See the style documentation for more information on the usage.

Testing

Rust has built-in support for writing unit tests by marking functions with the #[test] directive. Take full advantage of that! Testing not only confirms that what was written works the way it was intended to but also helps during refactoring since it catches unintended behaviour changes.

Not all tests are created equal though and while some may only need milliseconds to run, others may run for several seconds or even minutes. Tests that take a long time should be marked as such by prefixing their name with slow_test_ or ultra_slow_test_:

#![allow(unused)]
fn main() {
#[test]
fn ultra_slow_test_catchup_random_single_part_sync() {
    test_catchup_random_single_part_sync_common(false, false, 13)
}
}

During local development both slow and ultra-slow tests will not run with a typical just nextest invocation. You can run them with just nextest-slow or just nextest-all locally. CI will run the slow tests and the ultra_slow ones are left to run on Nayduck.

Because ultra_slow tests are run on nayduck, they need to be explicitly included in nightly/expensive.txt file; for example:

expensive --timeout=1800 near-client near_client tests::catching_up::ultra_slow_test_catchup_random_single_part_sync
expensive --timeout=1800 near-client near_client tests::catching_up::ultra_slow_test_catchup_random_single_part_sync --features nightly

For more details regarding nightly tests see nightly/README.md.

Note that what counts as a slow test is defined in .config/nextest.toml.

How neard works

This chapter describes how neard works with a focus on implementation details and practical scenarios. To get a better understanding of how the protocol works, please refer to nomicon. For a high-level code map of nearcore, please refer to this document.

High level overview

On the high level, neard is a daemon that periodically receives messages from the network and sends messages to peers based on different triggers. Neard is implemented using an actor framework called actix.

Note: Using actix was decided in the early days of the implementation of nearcore and by no means represents our confidence in actix. On the contrary, we have noticed a number of issues with actix and are considering implementing an actor framework in house.

There are several important actors in neard:

  • PeerActor - Each peer is represented by one peer actor and runs in a separate thread. It is responsible for sending messages to and receiving messages from a given peer. After PeerActor receives a message, it will route it to ClientActor, ViewClientActor, or PeerManagerActor depending on the type of the message.

  • PeerManagerActor - Peer Manager is responsible for receiving messages to send to the network from either ClientActor or ViewClientActor and routing them to the right PeerActor to send the bytes over the wire. It is also responsible for handling some types of network messages received and routed through PeerActor. For the purpose of this document, we only need to know that PeerManagerActor handles RoutedMessages. Peer manager would decide whether the RoutedMessages should be routed to ClientActor or ViewClientActor.

  • ClientActor - Client actor is the “core” of neard. It contains all the main logic including consensus, block and chunk processing, state transition, garbage collection, etc. Client actor is single-threaded.

  • ViewClientActor - View client actor can be thought of as a read-only interface to client. It only accesses data stored in a node’s storage and does not mutate any state. It is used for two purposes:

    • Answering RPC requests by fetching the relevant piece of data from storage.
    • Handling some network requests that do not require any changes to the storage, such as header sync, state sync, and block sync requests.

    ViewClientActor runs in four threads by default but this number is configurable.

Data flow within neard

Flow for incoming messages:

Flow for outgoing messages:

How neard operates when it is fully synced

When a node is fully synced, the main logic of the node operates in the following way (the node is assumed to track all shards, as most nodes on mainnet do today):

  1. A block is produced by some block producer and sent to the node through broadcasting.
  2. The node receives a block and tries to process it. If the node is synced it presumably has the previous block and the state before the current block to apply. It then checks whether it has all the chunks available. If the node is not a validator node, it won’t have any chunk parts and therefore won’t have the chunks available. If the node is a validator node, it may already have chunk parts through chunk parts forwarding from other nodes and therefore may have already reconstructed some chunks. Regardless, if the node doesn’t have all chunks for all shards, it will request them from peers by parts.
  3. The chunk requests are sent and the node waits for enough chunk parts to be received to reconstruct the chunks. For each chunk, 1/3 of all the parts (100) is sufficient to reconstruct a chunk. If new blocks arrive while waiting for chunk parts, they will be put into an OrphanPool, waiting to be processed. If a chunk part request is not responded to within chunk_request_retry_period, which is set to 400ms by default, then a request for the same chunk part would be sent again.
  4. After all chunks are reconstructed, the node processes the current block by applying transactions and receipts from the chunks. Afterwards, it will update the head according to the fork choice rule, which only looks at block height. In other words, if the newly processed block is of higher height than the current head of the node, the head is updated.
  5. The node checks whether any blocks in the OrphanPool are ready to be processed in a BFS order and processes all of them until none can be processed anymore. Note that a block is put into the OrphanPool if and only if its previous block is not accepted.
  6. Upon acceptance of a block, the node would check whether it needs to run garbage collection. If it needs to, it would garbage collect two blocks worth of data at a time. The logic of garbage collection is complicated and could be found here.
  7. If the node is a validator node, it would start a timer after the current block is accepted. After min_block_production_delay which is currently configured to be 1.3s on mainnet, it would send an approval to the block producer of the next block (current block height + 1).

The main logic is illustrated below:

How neard works when it is synchronizing

PeerManagerActor periodically sends a NetworkInfo message to ClientActor to update it on the latest peer information, which includes the height of each peer. Once ClientActor realizes that it is more than sync_height_threshold (which by default is set to 1) behind the highest height among peers, it starts to sync. The synchronization process is done in three steps:

  1. Header sync. The node first identifies the headers it needs to sync through a get_locator calculation. This is essentially an exponential backoff computation that tries to identify commonly known headers between the node and its peers. Then it would request headers from different peers, at most MAX_BLOCK_HEADER_HASHES (which is 512) headers at a time.

  2. After the headers are synced, the node would determine whether it needs to run state sync. The exact condition can be found here but basically a node would do state sync if it is more than 2 epochs behind the head of the network. State sync is a very complex process and warrants its own section. We will give a high level overview here.

    1. First, the node computes sync_hash which is the hash of the block that identifies the state that the node wants to sync. This is guaranteed to be the first block of the most recent epoch. In fact, there is a check on the receiver side that this is indeed the case. The node would also request the block whose hash is sync_hash
    2. The node deletes basically all data (blocks, chunks, state) from its storage. This is not an optimal solution, but it makes the implementation for combining state easier when there is no stale data in storage.
    3. For the state of each shard that the node needs to download, it first requests a header that contains some metadata the node needs to know about. Then the node computes the number of state parts it needs to download and requests those parts from different peers who track the shard.
    4. After all parts are downloaded, the node combines those state parts and then finalizes the state sync by applying the last chunk included in or before the sync block so that the node has the state after applying sync block to be able to apply the next block.
    5. The node resets heads properly after state sync.
  3. Block Sync. The node first gets the block with highest height that is on the canonical chain and request from there MAX_BLOCK_REQUESTS (which is set to 5) blocks from different peers in a round robin order. The block sync routine runs again if head has changed (progress is made) or if a timeout (which is set to 2s) has happened.

Note: when a block is received and its height is no more than 500 + the node’s current head height, then the node would request its previous block automatically. This is called orphan sync and helps to speed up the syncing process. If, on the other hand, the height is more than 500 + the node’s current head height, the block is simply dropped.

How ClientActor works

ClientActor has some periodically running routines that are worth noting:

  • Doomslug timer - This routine runs every doosmslug_step_period (set to 100ms by default) and updates consensus information. If the node is a validator node, it also sends approvals when necessary.
  • Block production - This routine runs every block_production_tracking_delay (which is set to 100ms by default) and checks if the node should produce a block.
  • Log summary - Prints a log line that summarizes block rate, average gas used, the height of the node, etc. every 10 seconds.
  • Resend chunk requests - This routine runs every chunk_request_retry_period (which is set to 400ms). It resends the chunk part requests for those that are not yet responded to.
  • Sync - This routine runs every sync_step_period (which is set to 10ms by default) and checks whether the node needs to sync from its peers and, if needed, also starts the syncing process.
  • Catch up - This routine runs every catchup_step_period (which is set to 100ms by default) and runs the catch up process. This only applies if a node validates shard A in epoch X and is going to validate a different shard B in epoch X+1. In this case, the node would start downloading the state for shard B at the beginning of epoch X. After the state downloading is complete, it would apply all blocks in the current epoch (epoch X) for shard B to ensure that the node has the state needed to validate shard B when epoch X+1 starts.

How Sync Works

Basics

While Sync and Catchup sounds similar - they are actually describing two completely different things.

Sync - is used when your node falls ‘behind’ other nodes in the network (for example because it was down for some time or it took longer to process some blocks etc).

Catchup - is used when you want (or have to) start caring about (a.k.a. tracking) additional shards in the future epochs. Currently it should be a no-op for 99% of nodes (see below).

Tracking shards: as you know our system has multiple shards (currently 4). Currently 99% of nodes are tracking all the shards: validators have to - as they have to validate the chunks from all the shards, and normal nodes mostly also track all the shards as this is default.

But in the future - we will have more and more people tracking only a subset of the shards, so the catchup will be increasingly important.

Sync

If your node is behind the head - it will start the sync process (this code is running periodically in the client_actor and if you’re behind for more than sync_height_threshold (currently 50) blocks - it will enable the sync.

The Sync behavior differs depending on whether you’re an archival node (which means you care about the state of each block) or ‘normal’ node - where you care mostly about the Tip of the network.

Step 1: Header Sync [archival node & normal node*] (“downloading headers”)

The goal of the header sync is to get all the block headers from your current HEAD all the way to the top of the chain.

As headers are quite small, we try to request multiple of them in a single call (currently we ask for 512 headers at once).

image

Step 1a: Epoch Sync [normal node*] // not implemented yet

While currently normal nodes are using Header sync, we could actually allow them to do something faster - “light client sync” a.k.a “epoch sync”.

The idea of the epoch sync, is to read “just” a single block header from each epoch - that has to contain additional information about validators.

This way it would drastically reduce both the time needed for the sync and the db resources.

Implementation target date is TBD.

image

Notice that in the image above - it is enough to only get the ‘last’ header from each epoch. For the ‘current’ epoch, we still need to get all the headers.

Step 2: State sync [normal node]

After header sync - if you notice that you’re too far behind, i.e. the chain head is at least two epochs ahead of your local head - the node will try to do the ‘state sync’.

The idea of the state sync is - rather than trying to process all the blocks - try to ‘jump’ ahead by downloading the freshest state instead - and continue processing blocks from that place in the chain. As a side effect, it is going to create a ‘gap’ in the chunks/state on this node (which is fine - as the data will be garbage collected after 5 epochs anyway). State sync will ONLY sync to the beginning of the epoch - it cannot sync to any random block.

This step is never run on the archival nodes - as these nodes want to have whole history and cannot have any gaps.

image

In this case, we can skip processing transactions that are in the blocks 124 - 128, and start from 129 (after sync state finishes)

See how-to to learn how to configure your node to state sync.

Step 3: Block sync [archival node, normal node] (“downloading blocks”)

The final step is to start requesting and processing blocks as soon as possible, hoping to catch up with the chain.

Block sync will request up to 5 (MAX_BLOCK_REQUESTS) blocks at a time - sending explicit Network BlockRequests for each one.

After the response (Block) is received - the code will execute the ‘standard’ path that tries to add this block to the chain (see section below).

image

In this case, we are processing each transaction for each block - until we catch up with the chain.

Side topic: how blocks are added to the chain?

A node can receive a Block in two ways:

  • Either by broadcasting - when a new block is produced, its contents are broadcasted within the network by the nodes
  • Or by explicitly sending a BlockRequest to another peer - and getting a Block in return.

(in case of broadcasting, the node will automatically reject any Blocks that are more than 500 (BLOCK_HORIZON) blocks away from the current HEAD).

When a given block is received, the node checks if it can be added to the current chain.

If block’s “parent” (prev_block) is not in the chain yet - the block gets added to the orphan list.

If the parent is already in the chain - we can try to add the block as the head of the chain.

Before adding the block, we want to download the chunks for the shards that we are tracking - so in many cases, we’ll call missing_chunks functions that will try to go ahead and request those chunks.

Note: as an optimization, we’re also sometimes trying to fetch chunks for the blocks that are in the orphan pool – but only if they are not more than 3 (NUM_ORPHAN_ANCESTORS_CHECK) blocks away from our head.

We also keep a separate job in client_actor that keeps retrying chunk fetching from other nodes if the original request fails.

After all the chunks for a given block are received (we have a separate HashMap that checks how many chunks are missing for each block) - we’re ready to process the block and attach it to the chain.

Afterwards, we look at other entries in the orphan pool to see if any of them are a direct descendant of the block that we just added - and if yes, we repeat the process.

Catchup

The goal of catchup

Catchup is needed when not all nodes in the network track all shards and nodes can change the shard they are tracking during different epochs.

For example, if a node tracks shard 0 at epoch T and tracks shard 1 at epoch T+1, it actually needs to have the state of shard 1 ready before the beginning of epoch T+1. We make sure this happens by making the node start downloading the state for shard 1 at the beginning of epoch T and applying blocks during epoch T to shard 1’s state. Because downloading state can take time, the node may have already processed some blocks (for shard 0 at this epoch), so when the state finishes downloading, the node needs to “catch up” processing these blocks for shard 1.

Right now, all nodes do track all shards, so technically we shouldn’t need the catchup process, but it is still implemented for the future.

Image below: Example of the node, that tracked only shard 0 in epoch T-1, and will start tracking shard 0 & 1 in epoch T+1.

At the beginning of the epoch T, it will initiate the state download (green) and afterwards will try to ‘catchup’ the blocks (orange). After blocks are caught up, it will continue processing as normal.

image

How catchup interact with normal block processing

The catchup process has two phases: downloading states for shards that we are going to care about in epoch T+1 and catching up blocks that have already been applied.

When epoch T starts, the node will start downloading states of shards that it will track for epoch T+1, which it doesn't track already. Downloading happens in a different thread so ClientActor can still process new blocks. Before the shard states for epoch T+1 are ready, processing new blocks only applies chunks for the shards that the node is tracking in epoch T. When the shard states for epoch T+1 finish downloading, the catchup process needs to reprocess the blocks that have already been processed in epoch T to apply the chunks for the shards in epoch T+1. We assume that it will be faster than regular block processing, because blocks are not full and block production has its own delays, so catchup can finish within an epoch.

In other words, there are three modes for applying chunks and two code paths, either through the normal process_block (blue) or through catchup_blocks (orange). When process_block, either that the shard states for the next epoch are ready, corresponding to IsCaughtUp and all shards the node is tracking in this, or will be tracking in the next, epoch will be applied, or when the states are not ready, corresponding to NotCaughtUp, then only the shards for this epoch will be applied. When catchup_blocks, shards for the next epoch will be applied.

#![allow(unused)]
fn main() {
enum ApplyChunksMode {
    IsCaughtUp,
    CatchingUp,
    NotCaughtUp,
}
}

How catchup works

The catchup process is initiated by process_block, where we check if the block is caught up and if we need to download states. The logic works as follows:

  • For the first block in an epoch T, we check if the previous block is caught up, which signifies if the state of the new epoch is ready. If the previous block is not caught up, the block will be orphaned and not processed for now because it is not ready to be processed yet. Ideally, this case should never happen, because the node will appear stalled until the blocks in the previous epoch are catching up.
  • Otherwise, we start processing blocks for the new epoch T. For the first block, we always mark it as not caught up and will initiate the process for downloading states for shards that we are going to care about in epoch T+1. Info about downloading states is persisted in DBCol::StateDlInfos.
  • For other blocks, we mark them as not caught up if the previous block is not caught up. This info is persisted in DBCol::BlocksToCatchup which stores mapping from previous block to vector of all child blocks to catch up.
  • Chunks for already tracked shards will be applied during process_block, as we said before mentioning ApplyChunksMode.
  • Once we downloaded state, we start catchup. It will take blocks from DBCol::BlocksToCatchup in breadth-first search order and apply chunks for shards which have to be tracked in the next epoch.
  • When catchup doesn't see any more blocks to process, DBCol::BlocksToCatchup is cleared, which means that catchup process is finished.

The catchup process is implemented through the function Client::run_catchup. ClientActor schedules a call to run_catchup every 100ms. However, the call can be delayed if ClientActor has a lot of messages in its actix queue.

Every time run_catchup is called, it checks DBCol::StateDlInfos to see if there are any shard states that should be downloaded. If so, it initiates the syncing process for these shards. After the state is downloaded, run_catchup will start to apply blocks that need to be caught up.

One thing to note is that run_catchup is located at ClientActor, but intensive work such as applying state parts and applying blocks is actually offloaded to SyncJobsActor in another thread, because we don’t want ClientActor to be blocked by this. run_catchup is simply responsible for scheduling SyncJobsActor to do the intensive job. Note that SyncJobsActor is state-less, it doesn’t have write access to the chain. It will return the changes that need to be made as part of the response to ClientActor, and ClientActor is responsible for applying these changes. This is to ensure only one thread (ClientActor) has write access to the chain state. However, this also adds a lot of limits, for example, SyncJobsActor can only be scheduled to apply one block at a time. Because run_catchup is only scheduled to run every 100ms, the speed of catching up blocks is limited to 100ms per block, even when blocks applying can be faster. Similar constraints happen to apply state parts.

Improvements

There are three improvements we can make to the current code.

First, currently we always initiate the state downloading process at the first block of an epoch, even when there are no new states to be downloaded for the new epoch. This is unnecessary.

Second, even though run_catchup is scheduled to run every 100ms, the call can be delayed if ClientActor has messages in its actix queue. A better way to do this is to move the scheduling of run_catchup to check_triggers.

Third, because of how run_catchup interacts with SyncJobsActor, run_catchup can catch up at most one block every 100 ms. This is because we don’t want to write to ChainStore in multiple threads. However, the changes that catching up blocks make do not interfere with regular block processing and they can be processed at the same time. However, to restructure this, we will need to re-implement ChainStore to separate the parts that can be shared among threads and the part that can’t.

Garbage Collection

This document covers the basics of Chain garbage collection.

Currently we run garbage collection only in non-archival nodes, to keep the size of the storage under control. Therefore, we remove blocks, chunks and state that is ‘old’ enough - which in current configuration means 5 epochs ago.

We run a single ‘round’ of GC after a new block is accepted to the chain - and in order not to delay the chain too much, we make sure that each round removes at most 2 blocks from the chain.

For more details look at function clear_data() in file chain/chain/src/chain.rs

How it works:

Imagine the following chain (with 2 forks)

In the pictures below, let’s assume that epoch length is 5 and we keep only 3 epochs (rather than 5 that is currently set in production) - otherwise the image becomes too large 😉.

If head is in the middle of the epoch, the gc_stop will be set to the first block of epoch T-2, and tail & fork_tail will be sitting at the last block of epoch T-3.

(and no GC is happening in this round - as tail is next to gc_stop).

Next block was accepted on the chain (head jumped ahead), but still no GC happening in this round:

Now interesting things will start happening once head ‘crosses’ over to the next epoch.

First, the gc_stop will jump to the beginning of the next epoch.

Then we’ll start the GC of the forks: by first moving the fork_tail to match the gc_stop and going backwards from there.

It will start removing all the blocks that don’t have a successor (a.k.a the tip of the fork). And then it will proceed to lower height.

Will keep going until it ‘hits’ the tail.

In order not to do too much in one go, we’d only remove up to 2 block in each run (that happens after each head update).

Now, the forks are gone, so we can proceed with GCing of the blocks from the canonical chain:

Same as before, we’d remove up to 2 blocks in each run:

Until we catch up to the gc_stop.

How Epoch Works

This short document will tell you all you need to know about Epochs in NEAR protocol.

You can also find additional information about epochs in nomicon.

What is an Epoch?

Epoch is a sequence of consecutive blocks. Within one epoch, the set of validators is fixed, and validator rotation happens at epoch boundaries.

Basically almost all the changes that we do are happening at epoch boundaries:

  • sharding changes
  • protocol version changes
  • validator changes
  • changing tracking shards
  • state sync

Where does the Epoch Id come from?

EpochId for epoch T+2 is the last hash of the block of epoch T.

image

Situation at genesis is interesting. We have three blocks:

dummy ← genesis ← first-block

Where do we set the epoch length?

Epoch length is set in the genesis config. Currently in mainnet it is set to 43200 blocks:

  "epoch_length": 43200

See the mainnet genesis for more details.

This means that each epoch lasts around 15 hours.

Important: sometimes there might be ‘troubles’ on the network, that might result in epoch lasting a little bit longer (if we cannot get enough signatures on the last blocks of the previous epoch).

You can read specific details on our nomicon page.

How do we pick the next validators?

TL;DR: in the last block of the epoch T, we look at the accounts that have highest stake and we pick them to become validators in T+2.

We are deciding on validators for T+2 (and not T+1) as we want to make sure that validators have enough time to prepare for block production and validation (they have to download the state of shards etc).

For more info on how we pick validators please look at nomicon.

Epoch and Sharding

Sharding changes happen only on epoch boundary - that’s why many of the requests (like which shard does my account belong to), require also an epoch_id as a parameter.

As of April 2022 we don’t have dynamic sharding yet, so the whole chain is simply using 4 shards.

How can I get more information about current/previous epochs?

We don’t show much information about Epochs in Explorer. Today, you can use state_viewer (if you have access to the network database).

At the same time, we’re working on a small debug dashboard, to show you the basic information about past epochs - stay tuned.

Technical details

Where do we store epoch info?

We use a couple of columns in the database to store epoch information:

  • ColEpochInfo = 11 - is storing the mapping from EpochId to EpochInfo structure that contains all the details.
  • ColEpochStart = 23 - has a mapping from EpochId to the first block height of that epoch.
  • ColEpochValidatorInfo = 47 - contains validator statistics (blocks produced etc.) for each epoch.

How does epoch info look like?

Here’s the example epoch info from a localnet node. As you can see below, EpochInfo mostly contains information about who is the validator and in which order should they produce the blocks.

EpochInfo.V3(
  epoch_height=7,
  validators=ListContainer([
    validator_stake.V1(account_id='node0', public_key=public_key.ED25519(tuple_data=ListContainer([b'7PGseFbWxvYVgZ89K1uTJKYoKetWs7BJtbyXDzfbAcqX'])), stake=51084320187874404740382878961615),
    validator_stake.V1(account_id='node2', public_key=public_key.ED25519(tuple_data=ListContainer([b'GkDv7nSMS3xcqA45cpMvFmfV1o4fRF6zYo1JRR6mNqg5'])), stake=51084320187874404740382878961615),
    validator_stake.V1(account_id='node1', public_key=public_key.ED25519(tuple_data=ListContainer([b'6DSjZ8mvsRZDvFqFxo8tCKePG96omXW7eVYVSySmDk8e'])), stake=50569171534262067815663761517574)]),

  validator_to_index={'node0': 0, 'node1': 2, 'node2': 1},

  block_producers_settlement=ListContainer([0, 1, 2]),
  chunk_producers_settlement=ListContainer([ListContainer([0, 1, 2]), ListContainer([0, 1, 2]), ListContainer([0, 1, 2]), ListContainer([0, 1, 2]), ListContainer([0, 1, 2])]),

  hidden_validators_settlement=ListContainer([]),
  fishermen=ListContainer([]),
  fishermen_to_index={},
  stake_change={'node0': 51084320187874404740382878961615, 'node1': 50569171534262067815663761517574, 'node2': 51084320187874404740382878961615},
  validator_reward={'near': 37059603312899067633082436, 'node0': 111553789870214657675206177, 'node1': 110428850075662293347329569, 'node2': 111553789870214657675206177},
  validator_kickout={},
  minted_amount=370596033128990676330824359,
  seat_price=24438049905601740367428723111,
  protocol_version=52
)

Transaction Routing

We all know that transactions are ‘added’ to the chain - but how do they get there?

Hopefully by the end of this article, the image below should make total sense.

image

Step 1: Transaction creator/author

The journey starts with the author of the transaction - who creates the transaction object (basically list of commands) - and signs them with their private key.

Basically, they prepare the payload that looks like this:

#![allow(unused)]
fn main() {
pub struct SignedTransaction {
    pub transaction: Transaction,
    pub signature: Signature,
}
}

With such a payload, they can go ahead and send it as a JSON-RPC request to ANY node in the system (they can choose between using ‘sync’ or ‘async’ options).

From now on, they’ll also be able to query the status of the transaction - by using the hash of this object.

Fun fact: the Transaction object also contains some fields to prevent attacks: like nonce to prevent replay attack, and block_hash to limit the validity of the transaction (it must be added within transaction_validity_period (defined in genesis) blocks of block_hash).

Step 2: Inside the node

Our transaction has made it to a node in the system - but most of the nodes are not validators - which means that they cannot mutate the chain.

That’s why the node has to forward it to someone who can - the upcoming validator.

The node, roughly, does the following steps:

  • verify transaction’s metadata - check signatures etc. (we want to make sure that we don’t forward bogus data)
  • forward it to the ‘upcoming’ validator - currently we pick the validators that would be a chunk creator in +2, +3, +4 and +8 blocks (this is controlled by TX_ROUTING_HEIGHT_HORIZON) - and send the transaction to all of them.

Step 3: En-route to validator/producer

Great, the node knows to send (forward) the transaction to the validator, but how does the routing work? How do we know which peer is hosting a validator?

Each validator is regularly (every config.ttl_account_id_router/2 seconds == 30 minutes in production) broadcasting so called AnnounceAccount, which is basically a pair of (account_id, peer_id), to the whole network. This way each node knows which peer_id to send the message to.

Then it asks the routing table about the shortest path to the peer, and sends the ForwardTx message to the peer.

Step 4: Chunk producer

When a validator receives such a forwarded transaction, it double-checks that it is about to produce the block, and if so, it adds the transaction to the mempool (TransactionPool) for this shard, where it waits to be picked up when the chunk is produced.

What happens afterwards will be covered in future episodes/articles.

Additional notes:

Transaction being added multiple times

But such an approach means, that we’re forwarding the same transaction to multiple validators (currently 4) - so can it be added multiple times?

No. Remember that a transaction has a concrete hash which is used as a global identifier. If the validator sees that the transaction is present in the chain, it removes it from its local mempool.

Can transaction get lost?

Yes - they can and they do. Sometimes a node doesn’t have a path to a given validator or it didn’t receive an AnnounceAccount for it, so it doesn’t know where to forward the message. And if this happens to all 4 validators that we try to send to, then the message can be silently dropped.

We’re working on adding some monitoring to see how often this happens.

Transactions, Receipts and Chunk Surprises

We finished the previous article (Transaction routing) where a transaction was successfully added to the soon-to-be block producer’s mempool.

In this article, we’ll cover what happens next: How it is changed into a receipt and executed, potentially creating even more receipts in the process.

First, let’s look at the ‘high-level view’:

image

Transaction vs Receipt

As you can see from the image above:

Transactions are ‘external’ communication - they are coming from the outside.

Receipts are used for ‘internal’ communication (cross shard, cross contract) - they are created by the block/chunk producers.

Life of a Transaction

If we ‘zoom-in‘, the chunk producer's work looks like this:

image

Step 1: Process Transaction into receipt

Once a chunk producer is ready to produce a chunk, it will fetch the transactions from its mempool, check that they are valid, and if so, prepare to process them into receipts.

Note: There are additional restrictions (e.g. making sure that we take them in the right order, that we don’t take too many, etc.) - that you can see in nomicon’s transaction page.

You can see this part in explorer:

image

Step 2: Sending receipt to the proper destination

Once we have a receipt, we have to send it to the proper destination - by adding it to the outgoing_receipt list, which will be forwarded to the chunk producers from the next block.

Note: There is a special case here - if the sender of the receipt is the same as the receiver, then the receipt will be added to the local_receipts queue and executed in the same block.

Step 3: When an incoming receipt arrives

(Note: this happens in the ‘next’ block)

When a chunk producer receives an incoming receipt, it will try to execute its actions (creating accounts, executing function calls etc).

Such actions might generate additional receipts (for example a contract might want to call other contracts). All these outputs are added to the outgoing receipt queue to be executed in the next block.

If the incoming receipt queue is too large to execute in the current chunk, the producer will put the remaining receipts onto the ‘delayed’ queue.

Step 4: Profit

When all the ‘dependant’ receipts are executed for a given transaction, we can consider the transaction to be successful.

[Advanced] But reality is more complex

Caution: In the section below, some things are simplified and do not match exactly how the current code works.

Let’s quickly also check what’s inside a Chunk:

#![allow(unused)]
fn main() {
pub struct ShardChunkV2 {
    pub chunk_hash: ChunkHash,
    pub header: ShardChunkHeader,
    pub transactions: Vec<SignedTransaction>,
    pub receipts: Vec<Receipt>, // outgoing receipts from 'previous' block
}
}

Yes, it is a little bit confusing, that receipts here are NOT the ‘incoming’ ones for this chunk, but instead the ‘outgoing’ ones from the previous block, i.e. all receipts from shard 0, block B are actually found in shard 0, block B+1. Why?!?!

This has to do with performance.

The steps usually followed for producing a block are as follows

  1. Chunk producer executes the receipts and creates a chunk. It sends the chunk to other validators. Note that it's the execution/processing of the receipts that usually takes the most time.
  2. Validators receive the chunk and validate it before signing the chunk. Validation involves executing/processing of the receipts in the chunk.
  3. Once the next block chunk producer receives the validation (signature), only then can it start producing the next chunk.

Simple approach

First, let’s imagine how the system would look like, if chunks contained things that we’d expect:

  • list of transactions
  • list of incoming receipts
  • list of outgoing receipts
  • hash of the final state

This means, that the chunk producer has to compute all this information first, before sending the chunk to other validators.

image

Once the other validators receive the chunk, they can start their own processing to verify those outgoing receipts/final state - and then do the signing. Only then, can the next chunk producer start creating the next chunk.

While this approach does work, we can do it faster.

Faster approach

What if the chunk didn’t contain the ‘output’ state? This changes our ‘mental’ model a little bit, as now when we’re singing the chunk, we’d actually be verifying the previous chunk - but that’s the topic for the next article (to be added).

For now, imagine if the chunk only had:

  • a list of transactions
  • a list of incoming receipts

In this case, the chunk producer could send the chunk a lot earlier, and validators (and chunk producer) could do their processing at the same time:

image

Now the last mystery: Why do we have ‘outgoing’ receipts from previous chunks rather than incoming to the current one?

This is yet another optimization. This way the chunk producer can send out the chunk a little bit earlier - without having to wait for all the other shards.

But that’s a topic for another article (to be added).

Cross shard transactions - deep dive

In this article, we'll look deeper into how cross-shard transactions are working on the simple example of user shard0 transferring money to user shard1.

These users are on separate shards (shard0 is on shard 0 and shard1 is on shard 1).

Imagine, we run the following command in the command line:

$ NEAR_ENV=local near send shard0 shard1 500

What happens under the hood? How is this transaction changed into receipts and processed by near?

From Explorer perspective

If you look at a simple token transfer in explorer (example), you can see that it is broken into three separate sections:

  • convert transaction into receipt ( executed in block B )
  • receipt that transfers tokens ( executed in block B+1 )
  • receipt that refunds gas ( executed in block B+2 )

But under the hood, the situation is a little bit more complex, as there is actually one more receipt (that is created after converting the transaction). Let's take a deeper look.

Internal perspective (Transactions & Receipts)

One important thing to remember is that NEAR is sharded - so in all our designs, we have to assume that each account is on a separate shard. So that the fact that some of them are colocated doesn't give any advantage.

Step 1 - Transaction

This is the part which we receive from the user (SignedTransaction) - it has 3 parts:

  • signer (account + key) who signed the transaction
  • receiver (in which account context should we execute this)
  • payload - a.k.a Actions to execute.

As the first step, we want to change this transaction into a Receipt (a.k.a 'internal' message) - but before doing that, we must verify that:

  • the message signature matches (that is - that this message was actually signed by this key)
  • that this key is authorized to act on behalf of that account (so it is a full access key to this account - or a valid function key).

The last point above means, that we MUST execute this (Transaction to Receipt) transition within the shard that the signer belongs to (as other shards don't know the state that belongs to signer - so they don't know which keys it has).

So actually if we look inside the chunk 0 (where shard0 belongs) at block B, we'll see the transaction:

Chunk: Ok(
    V2(
        ShardChunkV2 {
            chunk_hash: ChunkHash(
                8mgtzxNxPeEKfvDcNdFisVq8TdeqpCcwfPMVk219zRfV,
            ),
            header: V3(
                ShardChunkHeaderV3 {
                    inner: V2(
                        ShardChunkHeaderInnerV2 {
                            prev_block_hash: CgTJ7FFwmawjffrMNsJ5XhvoxRtQPXdrtAjrQjG91gkQ,
                            prev_state_root: 99pXnYjQbKE7bEf277urcxzG3TaN79t2NgFJXU5NQVHv,
                            outcome_root: 11111111111111111111111111111111,
                            encoded_merkle_root: 67zdyWTvN7kB61EgTqecaNgU5MzJaCiRnstynerRbmct,
                            encoded_length: 187,
                            height_created: 1676,
                            shard_id: 0,
                            gas_used: 0,
                            gas_limit: 1000000000000000,
                            balance_burnt: 0,
                            outgoing_receipts_root: 8s41rye686T2ronWmFE38ji19vgeb6uPxjYMPt8y8pSV,
                            tx_root: HyS6YfQbfBRniVSbWRnxsxEZi9FtLqHwyzNivrF6aNAM,
                            validator_proposals: [],
                        },
                    ),
                    height_included: 0,
                    signature: ed25519:uUvmvDV2cRVf1XW93wxDU8zkYqeKRmjpat4UUrHesJ81mmr27X43gFvFuoiJHWXz47czgX68eyBN38ejwL1qQTD,
                    hash: ChunkHash(
                        8mgtzxNxPeEKfvDcNdFisVq8TdeqpCcwfPMVk219zRfV,
                    ),
                },
            ),
            transactions: [
                SignedTransaction {
                    transaction: Transaction {
                        signer_id: AccountId(
                            "shard0",
                        ),
                        public_key: ed25519:Ht8EqXGUnY8B8x7YvARE1LRMEpragRinqA6wy5xSyfj5,
                        nonce: 11,
                        receiver_id: AccountId(
                            "shard1",
                        ),
                        block_hash: 6d5L1Vru2c4Cwzmbskm23WoUP4PKFxBHSP9AKNHbfwps,
                        actions: [
                            Transfer(
                                TransferAction {
                                    deposit: 500000000000000000000000000,
                                },
                            ),
                        ],
                    },
                    signature: ed25519:63ssFeMyS2N1khzNFyDqiwSELFaUqMFtAkRwwwUgrPbd1DU5tYKxz9YL2sg1NiSjaA71aG8xSB7aLy5VdwgpvfjR,
                    hash: 6NSJFsTTEQB4EKNKoCmvB1nLuQy4wgSKD51rfXhmgjLm,
                    size: 114,
                },
            ],
            receipts: [],
        },
    ),
)

Side note: When we're converting the transaction into a receipt, we also use this moment to deduct prepaid gas fees and transferred tokens from the 'signer' account. The details on how much gas is charged can be found at https://nomicon.io/RuntimeSpec/Fees/.

Step 2 - cross shard receipt

After transaction was changed into a receipt, this receipt must now be sent to the shard where the receiver is (in our example shard1 is on shard 1).

We can actually see this in the chunk of the next block:

Chunk: Ok(
    V2(
        ShardChunkV2 {
            chunk_hash: ChunkHash(
                DoF7yoCzyBSNzB8R7anWwx6vrimYqz9ZbEmok4eqHZ3m,
            ),
            header: V3(
                ShardChunkHeaderV3 {
                    inner: V2(
                        ShardChunkHeaderInnerV2 {
                            prev_block_hash: 82dKeRnE262qeVf31DXaxHvbYEugPUDvjGGiPkjm9Rbp,
                            prev_state_root: DpsigPFeVJDenQWVueGKyTLVYkQuQjeQ6e7bzNSC7JVN,
                            outcome_root: H34BZknAfWrPCcppcHSqbXwFvAiD9gknG8Vnrzhcc4w,
                            encoded_merkle_root: 3NDvQBrcRSAsWVPWkUTTrBomwdwEpHhJ9ofEGGaWsBv9,
                            encoded_length: 149,
                            height_created: 1677,
                            shard_id: 0,
                            gas_used: 223182562500,
                            gas_limit: 1000000000000000,
                            balance_burnt: 22318256250000000000,
                            outgoing_receipts_root: Co1UNMcKnuhXaHZz8ozMnSfgBKPqyTKLoC2oBtoSeKAy,
                            tx_root: 11111111111111111111111111111111,
                            validator_proposals: [],
                        },
                    ),
                    height_included: 0,
                    signature: ed25519:32hozA7GMqNqJzscEWzYBXsTrJ9RDhW5Ly4sp7FXP1bmxoCsma8Usxry3cjvSuywzMYSD8HvGntVtJh34G2dKJpE,
                    hash: ChunkHash(
                        DoF7yoCzyBSNzB8R7anWwx6vrimYqz9ZbEmok4eqHZ3m,
                    ),
                },
            ),
            transactions: [],
            receipts: [
                Receipt {
                    predecessor_id: AccountId(
                        "shard0",
                    ),
                    receiver_id: AccountId(
                        "shard1",
                    ),
                    receipt_id: 3EtEcg7QSc2CYzuv67i9xyZTyxBD3Dvx6X5yf2QgH83g,
                    receipt: Action(
                        ActionReceipt {
                            signer_id: AccountId(
                                "shard0",
                            ),
                            signer_public_key: ed25519:Ht8EqXGUnY8B8x7YvARE1LRMEpragRinqA6wy5xSyfj5,
                            gas_price: 103000000,
                            output_data_receivers: [],
                            input_data_ids: [],
                            actions: [
                                Transfer(
                                    TransferAction {
                                        deposit: 500000000000000000000000000,
                                    },
                                ),
                            ],
                        },
                    ),
                },
            ],
        },
    ),
)

Side comment: notice that the receipt itself no longer has a signer field, but a predecessor_id one.

Such a receipt is sent to the destination shard (we'll explain this process in a separate article) where it can be executed.

3. Gas refund.

When shard 1 processes the receipt above, it is then ready to refund the unused gas to the original account (shard0). So it also creates the receipt, and puts it inside the chunk. This time it is in shard 1 (as that's where it was executed).

Chunk: Ok(
    V2(
        ShardChunkV2 {
            chunk_hash: ChunkHash(
                8sPHYmBFp7cfnXDAKdcATFYfh9UqjpAyqJSBKAngQQxL,
            ),
            header: V3(
                ShardChunkHeaderV3 {
                    inner: V2(
                        ShardChunkHeaderInnerV2 {
                            prev_block_hash: Fj7iu26Yy9t5e9k9n1fSSjh6ZoTafWyxcL2TgHHHskjd,
                            prev_state_root: 4y6VL9BoMJg92Z9a83iqKSfVUDGyaMaVU1RNvcBmvs8V,
                            outcome_root: 7V3xRUeWgQa7D9c8s5jTq4dwdRcyTuY4BENRmbWaHiS5,
                            encoded_merkle_root: BnCE9LZgnFEjhQv1fSYpxPNw56vpcLQW8zxNmoMS8H4u,
                            encoded_length: 149,
                            height_created: 1678,
                            shard_id: 1,
                            gas_used: 223182562500,
                            gas_limit: 1000000000000000,
                            balance_burnt: 22318256250000000000,
                            outgoing_receipts_root: HYjZzyTL5JBfe1Ar4C4qPKc5E6Vbo9xnLHBKLVAqsqG2,
                            tx_root: 11111111111111111111111111111111,
                            validator_proposals: [],
                        },
                    ),
                    height_included: 0,
                    signature: ed25519:4FzcDw2ay2gAGosNpFdTyEwABJhhCwsi9g47uffi77N21EqEaamCg9p2tALbDt5fNeCXXoKxjWbHsZ1YezT2cL94,
                    hash: ChunkHash(
                        8sPHYmBFp7cfnXDAKdcATFYfh9UqjpAyqJSBKAngQQxL,
                    ),
                },
            ),
            transactions: [],
            receipts: [
                Receipt {
                    predecessor_id: AccountId(
                        "system",
                    ),
                    receiver_id: AccountId(
                        "shard0",
                    ),
                    receipt_id: 6eei79WLYHGfv5RTaee4kCmzFx79fKsX71vzeMjCe6rL,
                    receipt: Action(
                        ActionReceipt {
                            signer_id: AccountId(
                                "shard0",
                            ),
                            signer_public_key: ed25519:Ht8EqXGUnY8B8x7YvARE1LRMEpragRinqA6wy5xSyfj5,
                            gas_price: 0,
                            output_data_receivers: [],
                            input_data_ids: [],
                            actions: [
                                Transfer(
                                    TransferAction {
                                        deposit: 669547687500000000,
                                    },
                                ),
                            ],
                        },
                    ),
                },
            ],
        },
    ),
)

Such gas refund receipts are a little bit special - as we'll set the predecessor_id to be system - but the receiver is what we expect (shard0 account).

Note: system is a special account that doesn't really belong to any shard. As you can see in this example, the receipt was created within shard 1.

So putting it all together would look like this:

image

But wait - NEAR was saying that transfers are happening with 2 blocks - but here I see that it took 3 blocks. What's wrong?

The image above is a simplification, and reality is a little bit trickier - especially as receipts in a given chunks are actually receipts received as a result from running a PREVIOUS chunk from this shard.

We'll explain it more in the next section.

Advanced: What's actually going on?

As you could have read in Transactions And Receipts - the 'receipts' field in the chunk is actually representing 'outgoing' receipts from the previous block.

So our image should look more like this:

image

In this example, the black boxes are representing the 'processing' of the chunk, and red arrows are cross-shard communication.

So when we process Shard 0 from block 1676, we read the transaction, and output the receipt - which later becomes the input for shard 1 in block 1677.

But you might still be wondering - so why didn't we add the Receipt (transfer) to the list of receipts of shard0 1676?

That's because the shards & blocks are set BEFORE we do any computation. So the more correct image would look like this:

image

Here you can clearly see that chunk processing (black box), is happening AFTER the chunk is set.

In this example, the blue arrows are showing the part where we persist the result (receipt) into next block's chunk.

In a future article, we'll discuss how the actual cross-shard communication works (red arrows) in the picture, and how we could guarantee that a given shard really gets all the red arrows, before it starts processing.

Gas

This page describes the technical details around gas during the lifecycle of a transaction(*) while giving an intuition for why things are the way they are from a technical perspective. For a more practical, user-oriented angle, please refer to the gas section in the official protocol documentation.

(*) For this page, a transaction shall refer to the set of all recursively generated receipts by a SignedTransaction. When referring to only the original transaction object, we write SignedTransaction.

The topic is split into several sections.

  1. Gas Flow
  2. Gas Price:
  3. Tracking Gas: How the system keeps track of purchased gas during the transaction execution.

Gas Flow

On the highest level, gas is bought by the signer, burnt during execution, and contracts receive a part of the burnt gas as a reward. We will discuss each step in more details.

Buying Gas for a Transaction

A signer pays all the gas required for a transaction upfront. However, there is no explicit act of buying gas. Instead, the fee is subtracted directly in NEAR tokens from the balance of the signer's account. If we ignore all the details explained further down, the fee is calculated as gas amount * gas price.

The gas amount is not a field of SignedTransaction, nor is it something the signer can choose. It is only a virtual field that is computed on-chain following the protocol's rules.

The gas price is a variable that may change during the execution of the transaction. The way it is implemented today, a single transaction can be charged a different gas price for different receipts.

Already we can see a fundamental problem: Gas is bought once at the beginning but the gas price may change during execution. To solve this incompatibility, the protocol calculates a pessimistic gas price for the initial purchase. Later on, the delta between real and pessimistic gas prices is refunded at the end of every receipt execution.

An alternative implementation would instead charge the gas at every receipt, instead of once at the start. However, remember that the execution may happen on a different shard than the signer account. Therefore we cannot access the signer's balance while executing.

Burning Gas

Buying gas immediately removes a part of the signer's tokens from the total supply. However, the equivalent value in gas still exists in the form of the receipt and the unused gas will be converted back to tokens as a refund.

The gas spent on execution on the other hand is burnt and removed from total supply forever. Unlike gas in other chains, none of it goes to validators. This is roughly equivalent to the base fee burning mechanism which Ethereum added in EIP-1559. But in Near Protocol, the entire fee is burnt because there is no priority fee that Ethereum pays out to validators.

The following diagram shows how gas flows through the execution of a transaction. The transaction consists of a function call performing a cross contract call, hence two function calls in sequence. (Note: This diagram is heavily simplified, more accurate diagrams are further down.)

Very Simplified Gas Flow Diagram

Gas in Contract Calls

A function call has a fixed gas cost to be initiated. Then the execution itself draws gas from the attached_gas, sometimes also called prepaid_gas, until it reaches zero, at which point the function call aborts with a GasExceeded error. No changes are persisted on chain.

(Note on naming: If you see prepaid_fee: Balance in the nearcore code base, this is NOT only the fee for prepaid_gas. It also includes prepaid fees for other gas costs. However, prepaid_gas: Gas is used the same in the code base as described in this document.)

Attaching gas to function calls is the primary way for end-users and contract developers to interact with gas. All other gas fees are implicitly computed and are hidden from the users except for the fact that the equivalent in tokens is removed from their account balance.

To attach gas, the signer sets the gas field of the function call action. Wallets and CLI tools expose this to the users in different ways. Usually just as a gas field, which makes users believe this is the maximum gas the transaction will consume. Which is not true, the maximum is the specified number plus the fixed base cost.

Contract developers also have to pick the attached gas values when their contract calls another contract. They cannot buy additional gas, they have to work with the unspent gas attached to the current call. They can check how much gas is left by subtracting the used_gas() from the prepaid_gas() host function results. But they cannot use all the available gas, since that would prevent the current function call from executing to the end.

The gas attached to a function can be at most max_total_prepaid_gas, which is 300 Tgas since the mainnet launch. Note that this limit is per SignedTransaction, not per function call. In other words, batched function calls share this limit.

There is also a limit to how much single call can burn, max_gas_burnt, which used to be 200 Tgas but has been increased to 300 Tgas in protocol version 52. (Note: When attaching gas to an outgoing function call, this is not counted as gas burnt.) However, given a call can never burn more than was attached anyway, this second limit is obsolete with the current configuration where the two limits are equal.

Since protocol version 53, with the stabilization of NEP-264, contract developers do not have to specify the absolute amount of gas to attach to calls. promise_batch_action_function_call_weight allows to specify a ratio of unspent gas that is computed after the current call has finished. This allows attaching 100% of unspent gas to a call. If there are multiple calls, this allows attaching an equal fraction to each, or any other split as defined by the weight per call.

Contract Reward

A rather unique property of Near Protocol is that a part of the gas fee goes to the contract owner. This "smart contract gets paid" model is pretty much the opposite design choice from the "smart contract pays" model that for example Cycles in the Internet Computer implement.

The idea is that it gives contract developers a source of income and hence an incentive to create useful contracts that are commonly used. But there are also downsides, such as when implementing a free meta-transaction relayer one has to be careful not to be susceptible to faucet-draining attacks where an attacker extracts funds from the relayer by making calls to a contract they own.

How much contracts receive from execution depends on two things.

  1. How much gas is burnt on the function call execution itself. That is, only the gas taken from the attached_gas of a function call is considered for contract rewards. The base fees paid for creating the receipt, including the action_function_call fee, are burnt 100%.
  2. The remainder of the burnt gas is multiplied by the runtime configuration parameter burnt_gas_reward which currently is at 30%.

During receipt execution, nearcore code tracks the gas_burnt_for_function_call separately from other gas burning to enable this contract reward calculations.

In the (still simplified) flow diagram, the contract reward looks like this. For brevity, gas_burnt_for_function_call in the diagram is denoted as wasm fee.

Slightly Simplified Gas Flow Diagram

Gas Price

Gas pricing is a surprisingly deep and complicated topic. Usually, we only think about the value of the gas_price field in the block header. However, to understand the internals, this is not enough.

Block-Level Gas Price

gas_price is a field in the block header. It determines how much it costs to burn gas at the given block height. Confusingly, this is not the same price at which gas is purchased. (See Effective Gas Purchase Price.)

The price is measured in NEAR tokens per unit of gas. It dynamically changes in the range between 0.1 NEAR per Pgas and 2 NEAR per Pgas, based on demand. (1 Pgas = 1000 Tgas corresponds to a full chunk.)

The block producer has to set this field following the exact formula as defined by the protocol. Otherwise, the produced block is invalid.

Intuitively, the formula checks how much gas was used compared to the total capacity. If it exceeds 50%, the gas price increases exponentially within the limits. When the demand is below 50%, it decreases exponentially. In practice, it stays at the bottom most of the time.

Note that all shards share the same gas price. Hence, if one out of four shards is at 100% capacity, this will not cause the price to increase. The 50% capacity is calculated as an average across all shards.

Going slightly off-topic, it should also be mentioned that chunk capacity is not constant. Chunk producers can change it by 0.1% per chunk. The nearcore client does not currently make use of this option, so it really is a nitpick only relevant in theory. However, any client implementation such as nearcore must compute the total capacity as the sum of gas limits stored in the chunk headers to be compliant. Using a hard-coded 1000 Tgas * num_shards would lead to incorrect block header validation.

Pessimistic Gas Price

The pessimistic gas price calculation uses the fact that any transaction can only have a limited depth in the generated receipt DAG. For most actions, the depth is a constant 1 or 2. For function call actions, it is limited to a hand-wavy attached_gas / min gas per function call. (Note: attached_gas is a property of a single action and is only a part of the total gas costs of a receipt.)

Once the maximum depth is known, the protocol assumes that the gas price will not change more than 3% per receipt. This is not a guarantee since receipts can be delayed for virtually unlimited blocks.

The final formula for the pessimistic gas price is the following.

pessimistic(current_gas_price, max_depth) = current_gas_price × 1.03^max_depth

This still is not the price at which gas is purchased. But we are very close.

Effective Gas Purchase Cost

When a transaction is converted to its root action receipt, the gas costs are calculated in two parts.

Part one contains all the gas which is burnt immediately. Namely, the send costs for a receipt and all the actions it includes. This is charged at the current block-level gas price.

Part two is everything else, from execution costs of actions that are statically known such as CreateAccount all the way to attached_gas for function calls. All of this is purchased at the same pessimistic gas price, even if some actions inside might have a lower maximum call depth than others.

The deducted tokens are the sum of these two parts. If the account has insufficient balance to pay for this pessimistic pricing, it will fail with a NotEnoughBalance error, with the required balance included in the error message.

Inserting the pessimistic gas pricing into the flow diagram, we finally have a complete picture. Note how an additional refund receipt is required. Also, check out the updated formula for the effective purchase price at the top left and the resulting higher number.

Complete Gas Flow Diagram

Tracking Gas in Receipts

The previous section explained how gas is bought and what determines its price. This section details the tracking that enables correct refunds.

First, when a SignedTransaction is converted to a receipt, the pessimistic gas price is written to the receipt's gas_price field.

Later on, when the receipt has been executed, a gas refund is created at the value of receipt.gas_burnt * (block_header.gas_price - receipt.gas_price).

Some gas goes attaches to outgoing receipts. We commonly refer to this as used gas that was not burnt, yet. The refund excludes this gas. But it includes the receipt send cost.

Finally, unspent gas is refunded at the full receipt.gas_price. This refund is merged with the refund for burnt gas of the same receipt outcome to reduce the number of spawned system receipts. But it makes it a bit harder to interpret refunds when backtracking for how much gas a specific refund receipt covers.

Receipt Congestion

Near Protocol executes transactions in multiple steps, or receipts. Once a transaction is accepted, the system has committed to finish all those receipts even if it does not know ahead of time how many receipts there will be or on which shards they will execute.

This naturally leads to the problem that if shards just keep accepting more transactions, we might accept workload at a higher rate than we can execute.

Cross-shard congestion as flow problem

For a quick formalized discussion on congestion, let us model the Near Protocol transaction execution as a flow network.

Each shard has a source that accepts new transactions and a sink for burning receipts. The flow is measured in gas. Edges to sinks have a capacity of 1000 Tgas. (Technically, it should be 1300 but let's keep it simple for this discussion.)

graph

The edges between shards are not limited in this model. In reality, we are eventually limited by the receipt sizes and what we can send within a block time through the network links. But if we only look at that limit, we can send very many receipts with a lot of gas attached to them. Thus, the model considers it unlimited.

Okay, we have the capacities of the network modeled. Now let's look at how a receipt execution maps onto it.

Let's say a receipt starts at shard 1 with 300 Tgas. While executing, it burns 100 Tgas and creates an outgoing receipts with 200 Tgas to another shard. We can represent this in the flow network with 100 Tgas to the sink of shard 1 and 200 Tgas to shard 2.

graph

Note: The graph includes the execution of the next block with the 200 Tgas to the sink of shard 2. This should be interpreted as if we continue sending the exact same workload on all shards every block. Then we reach this steady state where we continue to have these gas assignments per edge.

Now we can do some flow analysis. It is immediately obvious that the total outflow per is limited to N * 1000 Tgas but the incoming flow is unlimited.

For a finite amount of time, we can accept more inflow than outflow, we just have to add buffers to store what we cannot execute, yet. But to stay within finite memory requirements, we need to fall back to a flow diagram where outflows are greater or equal to inflows within a finite time frame.

Next, we look at ideas one at a time before combining some of them into the cross-shard congestion design proposed in NEP-539.

Idea 1: Compute the minimum max-flow and stay below that limit

One approach to solve congestion would be to never allow more work into the system than we can execute.

But this is not ideal. Just consider this example where everybody tries to access a contract on the same shard.

graph

In this workload where everyone want to use the capacity of the same shard, the max-flow of the system is essentially the 1000 Tgas that shard 3 can execute. No matter how many additional shards we add, this 1000 Tgas does not increase.

Consequently, if we want to limit inflow to be the same or lower than the outflow, we cannot accept more than 1000 Tgas / NUM_SHARDS of new transactions per chunk.

graph

So, can we just put a constant limit on sources that's 1000 Tgas / NUM_SHARDS? Not really, as this limit is hardly practical. It means we limit global throughput to that of a single shard. Then why would we do sharding in the first place?

The sad thing is, there is no way around it in the most general case. A congestion control strategy that does not apply this limit to this workload will always have infinitely sized queues.

Of course, we won't give up. We are not limited to a constant capacity limit, we can instead adjust it dynamically. We simply have to find a strategy that detects such workload and eventually applies the required limit.

Most of these strategies can be gamed by malicious actors and probably that means we eventually fall back to the minimum of 1000 Tgas / NUM_SHARDS. But at this stage our ambition isn't to have 100% utilization under all malicious cases. We are instead trying to find a solution that can give 100% utilization for normal operation and then falls back to 1000 Tgas / NUM_SHARDS when it has to, in order to prevent out-of-memory crashes.

Idea 2: Limit transactions when we use too much memory

What if we have no limit at the source until we notice we are above the memory threshold we are comfortable with? Then we can reduce the source capacity in steps, potentially down to 0, until buffers are getting emptier and we use less memory again.

If we do that, we can decide between either applying a global limit on all sources (allow only 1000 Tgas / NUM_SHARDS new transactions on all shards like in idea 1) or applying the limit only to transactions that go to the shard with the congestion problem.

The first choice is certainly safe. But it means that a single congested shard leads to all shards slowing down, even if they could keep working faster without ever sending receipts to the congested shard. This is a hit to utilization we want to avoid. So let's try the second way.

In that case we filter transactions by receiver and keep accepting transactions that go to non-congested shards. This would work fine, if all transactions would only have depth 1.

But receipts produced by an accepted transaction can produce more receipts to any other shard. Therefore, we might end up accepting more inflow that indirectly requires bandwidth on the congested shard.

graph

Crucially, when accepting a transaction, we don't know ahead of time which shards will be affected by the full directed graph of receipts in a transaction. We only know the first step. For multi-hop transactions, there is no easy way out.

But it is worth mentioning, that in practice the single-hop function call is the most common case. And this case can be handled nicely by rejecting incoming transactions to congested shards.

Idea 3: Apply backpressure to stop all flows to a congested shard

On top of stopping transactions to congested shards, we can also stop receipts if they have a congested shard as the receiver. We simply put them in a buffer of the sending shard and keep them there until the congested shard has space again for the receipts.

graph

The problem with this idea is that it leads to deadlocks where all receipts in the system are waiting in outgoing buffers but cannot make progress because the receiving shard already has too high memory usage.

graph

Idea 4: Keep minimum incoming queue length to avoid deadlocks

This is the final idea we need. To avoid deadlocks, we ensure that we can always send receipts to a shard that does not have enough work in the delayed receipts queue already.

Basically, the backpressure limits from idea 3 are only applied to incoming receipts but not for the total size. This guarantees that in the congested scenario that previously caused a deadlock, we always have something in the incoming queue to work on, otherwise there wouldn't be backpressure at all.

graph

We decided to measure the incoming congestion level using gas rather than bytes, because it is here to maximize utilization, not to minimize memory consumption. And utilization is best measured in gas. If we have a queue of 10_000 Tgas waiting, even if only 10% of that is burnt in this step of the transaction, we still have 1000 Tgas of useful work we can contribute to the total flow. Thus under the assumption that at least 10% of gas is being burnt, we have 100% utilization.

A limit in bytes would be better to argue how much memory we need exactly. But in some sense, the two are equivalent, as producing large receipts should cost a linear amount of gas. What exactly the conversion rate is, is rather complicated and warrants its own investigation with potential protocol changes to lower the ratio in the most extreme cases. And this is important regardless of how congestion is handled, given that network bandwidth is becoming more and more important as we add more shards. Issue #8214 tracks our effort on estimating what that cost should be and #9378 tracks our best progress on calculating what it is today.

Of course, we can increase the queue to have even better utility guarantees. But it comes at the cost of longer delays for every transaction or receipt that goes through a congested shard.

This strategy also preserves the backpressure property in the sense that all shards on a path from sources to sinks that contribute to congestion will eventually end up with full buffers. Combined with idea 2, eventually all transactions to those shards are rejected. All of this without affecting shards that are not on the critical path.

Putting it all together

The proposal in NEP-539 combines all ideas 2, 3, and 4.

We have a limit of how much memory we consider to be normal operations (for example 500 MB). Then we stop new transaction coming in to that shard but still allow more incoming transactions to other shards if those are not congested. That alone already solves all problems with single-hop transactions.

In the congested shard itself, we also keep accepting transactions to other shards. But we heavily reduce the gas allocated for new transactions, in order to have more capacity to work on finishing the waiting receipts. This is technically not necessary for any specific property, but it should make sense intuitively that this helps to reduce congestion quicker and therefore lead to a better user experience. This is why we added this feature. And our simulations also support this intuition.

Then we apply backpressure for multi-hop receipts and avoid deadlocks by only applying the backpressure when we still have enough work queued up that holding it back cannot lead to a slowed down global throughput.

Another design decision was to linearly interpolate the limits, as opposed to binary on and off states. This way, we don't have to be too precise in finding the right parameters, as the system should balance itself around a specific limit that works for each workload.

Meta Transactions

NEP-366 introduced the concept of meta transactions to Near Protocol. This feature allows users to execute transactions on NEAR without owning any gas or tokens. In order to enable this, users construct and sign transactions off-chain. A third party (the relayer) is used to cover the fees of submitting and executing the transaction.

The MVP for meta transactions is currently in the stabilization process. Naturally, the MVP has some limitations, which are discussed in separate sections below. Future iterations have the potential to make meta transactions more flexible.

Overview

Flow chart of meta transactions Credits for the diagram go to the NEP authors Alexander Fadeev and Egor Uleyskiy.

The graphic shows an example use case for meta transactions. Alice owns an amount of the fungible token $FT. She wants to transfer some to John. To do that, she needs to call ft_transfer("john", 10) on an account named FT.

In technical terms, ownership of $FT is an entry in the FT contract's storage that tracks the balance for her account. Note that this is on the application layer and thus not a part of Near Protocol itself. But FT relies on the protocol to verify that the ft_transfer call actually comes from Alice. The contract code checks that predecessor_id is "Alice" and if that is the case then the call is legitimately from Alice, as only she could create such a receipt according to the Near Protocol specification.

The problem is, Alice has no NEAR tokens. She only has a NEAR account that someone else funded for her and she owns the private keys. She could create a signed transaction that would make the ft_transfer("john", 10) call. But validator nodes will not accept it, because she does not have the necessary Near token balance to purchase the gas.

With meta transactions, Alice can create a DelegateAction, which is very similar to a transaction. It also contains a list of actions to execute and a single receiver for those actions. She signs the DelegateAction and forwards it (off-chain) to a relayer. The relayer wraps it in a transaction, of which the relayer is the signer and therefore pays the gas costs. If the inner actions have an attached token balance, this is also paid for by the relayer.

On chain, the SignedDelegateAction inside the transaction is converted to an action receipt with the same SignedDelegateAction on the relayer's shard. The receipt is forwarded to the account from Alice, which will unpacked the SignedDelegateAction and verify that it is signed by Alice with a valid Nonce etc. If all checks are successful, a new action receipt with the inner actions as body is sent to FT. There, the ft_transfer call finally executes.

Relayer

Meta transactions only work with a relayer. This is an application layer concept, implemented off-chain. Think of it as a server that accepts a SignedDelegateAction, does some checks on them and eventually forwards it inside a transaction to the blockchain network.

A relayer may choose to offer their service for free but that's not going to be financially viable long-term. But they could easily have the user pay using other means, outside of Near blockchain. And with some tricks, it can even be paid using fungible tokens on Near.

In the example visualized above, the payment is done using $FT. Together with the transfer to John, Alice also adds an action to pay 0.1 $FT to the relayer. The relayer checks the content of the SignedDelegateAction and only processes it if this payment is included as the first action. In this way, the relayer will be paid in the same transaction as John.

Note that the payment to the relayer is still not guaranteed. It could be that Alice does not have sufficient $FT and the transfer fails. To mitigate, the relayer should check the $FT balance of Alice first.

Unfortunately, this still does not guarantee that the balance will be high enough once the meta transaction executes. The relayer could waste NEAR gas without compensation if Alice somehow reduces her $FT balance in just the right moment. Some level of trust between the relayer and its user is therefore required.

The vision here is that there will be mostly application-specific relayers. A general-purpose relayer is difficult to implement with just the MVP. See limitations below.

Limitation: Single receiver

A meta transaction, like a normal transaction, can only have one receiver. It's possible to chain additional receipts afterwards. But crucially, there is no atomicity guarantee and no roll-back mechanism.

For normal transactions, this has been widely accepted as a fact for how Near Protocol works. For meta transactions, there was a discussion around allowing multiple receivers with separate lists of actions per receiver. While this could be implemented, it would only create a false sense of atomicity. Since each receiver would require a separate action receipt, there is no atomicity, the same as with chains of receipts.

Unfortunately, this means the trick to compensate the relayer in the same meta transaction as the serviced actions only works if both happen on the same receiver. In the example, both happen on FT and this case works well. But it would not be possible to send $FT1 and pay the relayer in $FT2. Nor could one deploy a contract code on Alice and pay in $FT in one meta transaction. It would require two separate meta transactions to do that. Due to timing problems, this again requires some level of trust between the relayer and Alice.

A potential solution could involve linear dependencies between the action receipts spawned from a single meta transaction. Only if the first succeeds, will the second start executing, and so on. But this quickly gets too complicated for the MVP and is therefore left open for future improvements.

Constraints on the actions inside a meta transaction

A transaction is only allowed to contain one single delegate action. Nested delegate actions are disallowed and so are delegate actions next to each other in the same receipt.

Nested delegate actions have no known use case and it would be complicated to implement. Consequently, it was omitted.

For delegate actions beside each other, there was a bit of back and forth during the NEP-366 design phase. The potential use case here is essentially the same as having multiple receivers in a delegate action. Naturally, it runs into all the same complications (false sense of atomicity) and ends with the same conclusion: Omitted from the MVP and left open for future improvement.

Limitation: Accounts must be initialized

Any transaction, including meta transactions, must use NONCEs to avoid replay attacks. The NONCE must be chosen by Alice and compared to a NONCE stored on chain. This NONCE is stored on the access key information that gets initialized when creating an account.

Implicit accounts don't need to be initialized in order to receive NEAR tokens, or even $FT. This means users could own $FT but no NONCE is stored on chain for them. This is problematic because we want to enable this exact use case with meta transactions, but we have no NONCE to create a meta transaction.

For the MVP, the proposed solution, or work-around, is that the relayer will have to initialize the account of Alice once if it does not exist. Note that this cannot be done as part of the meta transaction. Instead, it will be a separate transaction that executes first. Only then can Alice even create a SignedDelegateAction with a valid NONCE.

Once again, some trust is required. If Alice wanted to abuse the relayer's helpful service, she could ask the relayer to initialize her account. Afterwards, she does not sign a meta transaction, instead she deletes her account and cashes in the small token balance reserved for storage. If this attack is repeated, a significant amount of tokens could be stolen from the relayer.

One partial solution suggested here was to remove the storage staking cost from accounts. This means there is no financial incentive for Alice to delete her account. But it does not solve the problem that the relayer has to pay for the account creation and Alice can simply refuse to send a meta transaction afterwards. In particular, anyone creating an account would have financial incentive to let a relayer create it for them instead of paying out of their own pockets. This would still be better than Alice stealing tokens but fundamentally, there still needs to be some trust.

An alternative solution discussed is to do NONCE checks on the relayer's access key. This prevents replay attacks and allows implicit accounts to be used in meta transactions without even initializing them. The downside is that meta transactions share the same NONCE counter(s). That means, a meta transaction sent by Bob may invalidate a meta transaction signed by Alice that was created and sent to the relayer at the same time. Multiple access keys by the relayer and coordination between relayer and user could potentially alleviate this problem. But for the MVP, nothing along those lines has been approved.

Gas costs for meta transactions

Meta transactions challenge the traditional ways of charging gas for actions. To see why, let's first list the normal flow of gas, outside of meta transactions.

  1. Gas is purchased (by deducting NEAR from the transaction signer account), when the transaction is converted into a receipt. The amount of gas is implicitly defined by the content of the receipt. For function calls, the caller decides explicitly how much gas is attached on top of the minimum required amount. The NEAR token price per gas unit is dynamically adjusted on the blockchain. In today's nearcore code base, this happens as part of verify_and_charge_transaction which gets called in process_transaction.
  2. For all actions listed inside the transaction, the SEND cost is burned immediately. Depending on the condition sender == receiver, one of two possible SEND costs is chosen. The EXEC cost is not burned, yet. But it is implicitly part of the transaction cost. The third and last part of the transaction cost is the gas attached to function calls. The attached gas is also called prepaid gas. (Not to be confused with total_prepaid_exec_fees which is the implicitly prepaid gas for EXEC action costs.)
  3. On the receiver shard, EXEC costs are burned before the execution of an action starts. Should the execution fail and abort the transaction, the remaining gas will be refunded to the signer of the transaction.

Ok, now adapt for meta transactions. Let's assume Alice uses a relayer to execute actions with Bob as the receiver.

  1. The relayer purchases the gas for all inner actions, plus the gas for the delegate action wrapping them.
  2. The cost of sending the inner actions and the delegate action from the relayer to Alice's shard will be burned immediately. The condition relayer == Alice determines which action SEND cost is taken (sir or not_sir). Let's call this SEND(1).
  3. On Alice's shard, the delegate action is executed, thus the EXEC gas cost for it is burned. Alice sends the inner actions to Bob's shard. Therefore, we burn the SEND fee again. This time based on Alice == Bob to figure out sir or not_sir. Let's call this SEND(2).
  4. On Bob's shard, we execute all inner actions and burn their EXEC cost.

Each of these steps should make sense and not be too surprising. But the consequence is that the implicit costs paid at the relayer's shard are SEND(1) + SEND(2) + EXEC for all inner actions plus SEND(1) + EXEC for the delegate action. This might be surprising but hopefully with this explanation it makes sense now!

Gas refunds in meta transactions

Gas refund receipts work exactly like for normal transaction. At every step, the difference between the pessimistic gas price and the actual gas price at that height is computed and refunded. At the end of the last step, additionally all remaining gas is also refunded at the original purchasing price. The gas refunds go to the signer of the original transaction, in this case the relayer. This is only fair, since the relayer also paid for it.

Balance refunds in meta transactions

Unlike gas refunds, the protocol sends balance refunds to the predecessor (a.k.a. sender) of the receipt. This makes sense, as we deposit the attached balance to the receiver, who has to explicitly reattach a new balance to new receipts they might spawn.

In the world of meta transactions, this assumption is also challenged. If an inner action requires an attached balance (for example a transfer action) then this balance is taken from the relayer.

The relayer can see what the cost will be before submitting the meta transaction and agrees to pay for it, so nothing wrong so far. But what if the transaction fails execution on Bob's shard? At this point, the predecessor is Alice and therefore she receives the token balance refunded, not the relayer. This is something relayer implementations must be aware of since there is a financial incentive for Alice to submit meta transactions that have high balances attached but will fail on Bob's shard.

Function access keys in meta transactions

Assume alice sends a meta transaction and signs with a function access key. How exactly are permissions applied in this case?

Function access keys can limit the allowance, the receiving contract, and the contract methods. The allowance limitation acts slightly strange with meta transactions.

But first, both the methods and the receiver will be checked as expected. That is, when the delegate action is unwrapped on Alice's shard, the access key is loaded from the DB and compared to the function call. If the receiver or method is not allowed, the function call action fails.

For allowance, however, there is no check. All costs have been covered by the relayer. Hence, even if the allowance of the key is insufficient to make the call directly, indirectly through meta transaction it will still work.

This behavior is in the spirit of allowance limiting how much financial resources the user can use from a given account. But if someone were to limit a function access key to one trivial action by setting a very small allowance, that is circumventable by going through a relayer. An interesting twist that comes with the addition of meta transactions.

Serialization: Borsh, Json, ProtoBuf

If you spent some time looking at NEAR code, you’ll notice that we have different methods of serializing structures into strings. So in this article, we’ll compare these different approaches, and explain how and where we’re using them.

ProtocolSchema

All structs which need to be persisted or sent over the network must derive the ProtocolSchema trait:

#![allow(unused)]
fn main() {
// First, the schema checksums (hashes) are calculated at compile time and
// require `TypeId` for cross-navigation. However, it is a nightly feature,
// so we enable it manually by putting this in lib.rs:

#![cfg_attr(enable_const_type_id, feature(const_type_id))]

// Then, import schema calculation functionality by putting in Cargo.toml:
// near-schema-checker-lib.workspace = true

[features]
protocol_schema = [
  "near-schema-checker-lib/protocol_schema",
  ...the same feature in all dependent crates...
]
 
// Then, mark your struct with `#[derive(ProtocolSchema)]`:

use near_schema_checker_lib::ProtocolSchema;

#[derive(ProtocolSchema)]
pub struct BlockHeader {
  pub hash: CryptoHash,
  pub height: BlockHeight,
}

// Lastly, mark your crate in `tools/protocol-schema-check/Cargo.toml`
// as dependency with `protocol_schema` feature enabled.

}

This is done to protect structures from accidental changes that could corrupt the database or disrupt the protocol. Dedicated CI check is responsible to check the consistency of the schema. See README for more details.

All these structures are likely to implement BorshSerialize and BorshDeserialize (see below).

Borsh

Borsh is our custom serializer (link), that we use mostly for things that have to be hashed.

The main feature of Borsh is that, there are no two binary representations that deserialize into the same object.

You can read more on how Borsh serializes the data, by looking at the Specification tab on borsh.io.

The biggest pitfall/risk of Borsh, is that any change to the structure, might cause previous data to no longer be parseable.

For example, inserting a new enum ‘in the middle’:

#![allow(unused)]
fn main() {
pub enum MyCar {
  Bmw,
  Ford,
}

If we change our enum to this:

pub enum MyCar {
  Bmw,
  Citroen,
  Ford, // !! WRONG - Ford objects cannot be deserialized anymore
}
}

This is especially tricky if we have conditional compilation:

#![allow(unused)]
fn main() {
pub enum MyCar {
  Bmw,
  #[cfg(feature = "french_cars")]
  Citroen,
  Ford,
}
}

Is such a scenario - some of the objects created by binaries with this feature enabled, will not be parseable by binaries without this feature.

Removing and adding fields to structures is also dangerous.

Basically - the only ‘safe’ thing that you can do with Borsh - is add a new Enum value at the end.

JSON

JSON doesn’t need much introduction. We’re using it for external APIs (jsonrpc) and configuration. It is a very popular, flexible and human-readable format.

Proto (Protocol Buffers)

We started using proto recently - and we plan to use it mostly for our network communication. Protocol buffers are strongly typed - they require you to create a .proto file, where you describe the contents of your message.

For example:

message HandshakeFailure {
  // Reason for rejecting the Handshake.
  Reason reason = 1;

  // Data about the peer.
  PeerInfo peer_info = 2;
  // GenesisId of the NEAR chain that the peer belongs to.
  GenesisId genesis_id = 3;
}

Afterwards, such a proto file is fed to protoc ‘compiler’ that returns auto-generated code (in our case Rust code) - that can be directly imported into your library.

The main benefit of protocol buffers is their backwards compatibility (as long as you adhere to the rules and don’t reuse the same field ids).

Summary

So to recap what we’ve learned:

JSON - mostly used for external APIs - look for serde::Serialize/Deserialize

Proto - currently being developed to be used for network connections - objects have to be specified in proto file.

Borsh - for things that we hash (and currently also for all the things that we store on disk - but we might move to proto with this in the future). Look for BorshSerialize/BorshDeserialize

Questions

Why don’t you use JSON for everything?

While this is a tempting option, JSON has a few drawbacks:

  • size (json is self-describing, so all the field names etc are included every time)
  • non-canonical: JSON doesn’t specify strict ordering of the fields, so we’d have to do additional restrictions/rules on that - otherwise the same ‘conceptual’ message would end up with different hashes.

Ok - so how about proto for everything?

There are couple risks related with using proto for things that have to be hashed. A Serialized protocol buffer can contain additional data (for example fields with tag ids that you’re not using) and still successfully parse (that’s how it achieves backward compatibility).

For example, in this proto:

message First {
  string foo = 1;
  string bar = 2;
}
message Second {
  string foo = 1;
}

Every ‘First’ message will be successfully parsed as ‘Second’ message - which could lead to some programmatic bugs.

Advanced section - RawTrieNode

There is one more place in the code where we use a ‘custom’ encoding: RawTrieNodeWithSize defined in store/src/trie/raw_node.rs. While the format uses Borsh derives and API, there is a difference in how branch children ([Option<CryptoHash>; 16]) are encoded. Standard Borsh encoding would encode Option<CryptoHash> sixteen times. Instead, RawTrieNodeWithSize uses a bitmap to indicate which elements are set resulting in a different layout.

Imagine a children vector like this:

#![allow(unused)]
fn main() {
[Some(0x11), None, Some(0x12), None, None, …]
}

Here, we have children at index 0 and 2 which has a bitmap of 101

Custom encoder:

// Number of children determined by the bitmask
[16 bits bitmask][32 bytes child][32 bytes child]
[5][0x11][0x12]
// Total size: 2 + 32 + 32 = 68 bytes

Borsh:

[8 bits - 0 or 1][32 bytes child][8 bits 0 or 1][8 bits ]
[1][0x11][0][1][0x11][0][0]...
// Total size: 16 + 32 + 32 = 80 bytes

Code for encoding children is given in BorshSerialize implementation for ChildrenRef type and code for decoding in BorshDeserialize implementation for Children. All of that is in aforementioned store/src/trie/raw_node.rs file.

Proofs

“Don’t trust, but verify” - let’s talk about proofs

Was your transaction included?

How do you know that your transaction was actually included in the blockchain? Sure, you can “simply” ask the RPC node, and it might say “yes”, but is it enough?

The other option would be to ask many nodes - hoping that at least one of them would be telling the truth. But what if that is not enough?

The final solution would be to run your own node - this way you’d check all the transactions yourself, and then you could be sure - but this can become a quite expensive endeavor - especially when many shards are involved.

But there is actually a better solution - that doesn’t require you to trust the single (or many) RPC nodes, and to verify, by yourself, that your transaction was actually executed.

Let’s talk about proofs (merkelization):

Imagine you have 4 values that you’d like to store, in such a way, that you can easily prove that a given value is present.

image

One way to do it, would be to create a binary tree, where each node would hold a hash:

  • leaves would hold the hashes that represent the hash of the respective value.
  • internal nodes would hold the hash of “concatenation of hashes of their children”
  • the top node would be called a root node (in this image it is the node n7)

With such a setup, you can prove that a given value exists in this tree, by providing a “path” from the corresponding leaf to the root, and including all the siblings.

For example to prove that value v[1] exists, we have to provide all the nodes marked as green, with the information about which sibling (left or right) they are:

image

# information needed to verify that node v[1] is present in a tree
# with a given root (n7)
[(Left, n0), (Right, n6)]

# Verification
assert_eq!(root, hash(hash(n0, hash(v[1])), n6))

We use the technique above (called merkelization) in a couple of places in our protocol, but for today’s article, I’d like to focus on receipts & outcome roots.

Merkelization, receipts and outcomes

In order to prove that a given receipt belongs to a given block, we will need to fetch some additional information.

As NEAR is sharded, the receipts actually belong to “Chunks” not Blocks themselves, so the first step is to find the correct chunk and fetch its ChunkHeader.

ShardChunkHeaderV3 {
    inner: V2(
        ShardChunkHeaderInnerV2 {
            prev_block_hash: `C9WnNCbNvkQvnS7jdpaSGrqGvgM7Wwk5nQvkNC9aZFBH`,
            prev_state_root: `5uExpfRqAoZv2dpkdTxp1ZMcids1cVDCEYAQwAD58Yev`,
            outcome_root: `DBM4ZsoDE4rH5N1AvCWRXFE9WW7kDKmvcpUjmUppZVdS`,
            encoded_merkle_root: `2WavX3DLzMCnUaqfKPE17S1YhwMUntYhAUHLksevGGfM`,
            encoded_length: 425,
            height_created: 417,
            shard_id: 0,
            gas_used: 118427363779280,
            gas_limit: 1000000000000000,
            balance_burnt: 85084341232595000000000,
            outgoing_receipts_root: `4VczEwV9rryiVSmFhxALw5nCe9gSohtRpxP2rskP3m1s`,
            tx_root: `11111111111111111111111111111111`,
            validator_proposals: [],
        },
    ),
}

The field that we care about is called outcome_root. This value represents the root of the binary merkle tree, that is created based on all the receipts that were processed in this chunk.

Note: You can notice that we also have a field here called encoded_merkle_root - this is another case where we use merkelization in our chain - this field is a root of a tree that holds hashes of all the "partial chunks" into which we split the chunk to be distributed over the network.

So, in order to verify that a given receipt/transaction was really included, we have to compute its hash (see details below), get the path to the root, and voila, we can confirm that it was really included.

But how do we get the siblings on the path to the root? This is actually something that RPC nodes do return in their responses.

If you ever looked closely at NEAR’s tx-status response, you can notice a "proof" section there. For every receipt, you'd see something like this:

proof: [
    {
        direction: 'Right',
        hash: '2wTFCh2phFfANicngrhMV7Po7nV7pr6gfjDfPJ2QVwCN'
    },
    {
        direction: 'Right',
        hash: '43ei4uFk8Big6Ce6LTQ8rotsMzh9tXZrjsrGTd6aa5o6'
    },
    {
        direction: 'Left',
        hash: '3fhptxeChNrxWWCg8woTWuzdS277u8cWC9TnVgFviu3n'
    },
    {
        direction: 'Left',
        hash: '7NTMqx5ydMkdYDFyNH9fxPNEkpgskgoW56Y8qLoVYZf7'
    }
]

And the values in there are exactly the siblings (plus info on which side of the tree the sibling is), on the path to the root.

Note: proof section doesn’t contain the root itself and also doesn’t include the hash of the receipt.

[Advanced section]: Let’s look at a concrete example

Imagine that we have the following receipt:

{
  block_hash: '7FtuLHR3VSNhVTDJ8HmrzTffFWoWPAxBusipYa2UfrND',
  id: '6bdKUtGbybhYEQ2hb2BFCTDMrtPBw8YDnFpANZHGt5im',
  outcome: {
    executor_id: 'node0',
    gas_burnt: 223182562500,
    logs: [],
    metadata: { gas_profile: [], version: 1 },
    receipt_ids: [],
    status: { SuccessValue: '' },
    tokens_burnt: '0'
  },
  proof: [
    {
      direction: 'Right',
      hash: 'BWwZ4wHuzaUxdDSrhAEPjFQtDgwzb8K4zoNzfX9A3SkK'
    },
    {
      direction: 'Left',
      hash: 'Dpg4nQQwbkBZMmdNYcZiDPiihZPpsyviSTdDZgBRAn2z'
    },
    {
      direction: 'Right',
      hash: 'BruTLiGx8f71ufoMKzD4H4MbAvWGd3FLL5JoJS3XJS3c'
    }
  ]
}

Remember that the outcomes of the execution will be added to the NEXT block, so let’s find the next block hash, and the proper chunk.

(in this example, I’ve used the view-state chain from neard)

417 7FtuLHR3VSNhVTDJ8HmrzTffFWoWPAxBusipYa2UfrND |      node0 | parent: 416 C9WnNCbNvkQvnS7jdpaSGrqGvgM7Wwk5nQvkNC9aZFBH | .... 0: E6pfD84bvHmEWgEAaA8USCn2X3XUJAbFfKLmYez8TgZ8 107 Tgas |1: Ch1zr9TECSjDVaCjupNogLcNfnt6fidtevvKGCx8c9aC 104 Tgas |2: 87CmpU6y7soLJGTVHNo4XDHyUdy5aj9Qqy4V7muF5LyF   0 Tgas |3: CtaPWEvtbV4pWem9Kr7Ex3gFMtPcKL4sxDdXD4Pc7wah   0 Tgas
418 J9WQV9iRJHG1shNwGaZYLEGwCEdTtCEEDUTHjboTLLmf |      node0 | parent: 417 7FtuLHR3VSNhVTDJ8HmrzTffFWoWPAxBusipYa2UfrND | .... 0: 7APjALaoxc8ymqwHiozB5BS6mb3LjTgv4ofRkKx2hMZZ   0 Tgas |1: BoVf3mzDLLSvfvsZ2apPSAKjmqNEHz4MtPkmz9ajSUT6   0 Tgas |2: Auz4FzUCVgnM7RsQ2noXsHW8wuPPrFxZToyLaYq6froT   0 Tgas |3: 5ub8CZMQmzmZYQcJU76hDC3BsajJfryjyShxGF9rzpck   1 Tgas

I know that the receipt should belong to Shard 3 so let’s fetch the chunk header:

$ neard view-state chunks --chunk-hash 5ub8CZMQmzmZYQcJU76hDC3BsajJfryjyShxGF9rzpck
ShardChunkHeaderV3 {
  inner: V2(
      ShardChunkHeaderInnerV2 {
          prev_block_hash: `7FtuLHR3VSNhVTDJ8HmrzTffFWoWPAxBusipYa2UfrND`,
          prev_state_root: `6rtfqVEXx5STLv5v4zwLVqAfq1aRAvLGXJzZPK84CPpa`,
          outcome_root: `2sZ81kLj2cw5UHTjdTeMxmaWn2zFeyr5pFunxn6aGTNB`,
          encoded_merkle_root: `6xxoqYzsgrudgaVRsTV29KvdTstNYVUxis55KNLg6XtX`,
          encoded_length: 8,
          height_created: 418,
          shard_id: 3,
          gas_used: 1115912812500,
          gas_limit: 1000000000000000,
          balance_burnt: 0,
          outgoing_receipts_root: `8s41rye686T2ronWmFE38ji19vgeb6uPxjYMPt8y8pSV`,
          tx_root: `11111111111111111111111111111111`,
          validator_proposals: [],
      },
  ),
  height_included: 0,
  signature: ed25519:492i57ZAPggqWEjuGcHQFZTh9tAKuQadMXLW7h5CoYBdMRnfY4g7A749YNXPfm6yXnJ3UaG1ahzcSePBGm74Uvz3,
  hash: ChunkHash(
      `5ub8CZMQmzmZYQcJU76hDC3BsajJfryjyShxGF9rzpck`,
  ),
}

So the outcome_root is 2sZ81kLj2cw5UHTjdTeMxmaWn2zFeyr5pFunxn6aGTNB - let’s verify it then.

Our first step is to compute the hash of the receipt, which is equal to hash([receipt_id, hash(borsh(receipt_payload)])

# this is a borsh serialized ExecutionOutcome struct.
# computing this, we leave as an exercise for the reader :-)
receipt_payload_hash = "7PeGiDjssz65GMCS2tYPHUm6jYDeBCzpuPRZPmLNKSy7"

receipt_hash = base58.b58encode(hashlib.sha256(struct.pack("<I", 2) + base58.b58decode("6bdKUtGbybhYEQ2hb2BFCTDMrtPBw8YDnFpANZHGt5im") + base58.b58decode(receipt_payload_hash)).digest())

And then we can start reconstructing the tree:

def combine(a, b):
   return hashlib.sha256(a + b).digest()

# one node example
# combine(receipt_hash, "BWwZ4wHuzaUxdDSrhAEPjFQtDgwzb8K4zoNzfX9A3SkK")
# whole tree
combine(combine("Dpg4nQQwbkBZMmdNYcZiDPiihZPpsyviSTdDZgBRAn2z", combine(receipt_hash, "BWwZ4wHuzaUxdDSrhAEPjFQtDgwzb8K4zoNzfX9A3SkK")), "BruTLiGx8f71ufoMKzD4H4MbAvWGd3FLL5JoJS3XJS3c")
# result == 2sZ81kLj2cw5UHTjdTeMxmaWn2zFeyr5pFunxn6aGTNB

And success - our result is matching the outcome root, so it means that our receipt was indeed processed by the blockchain.

Resharding V2

DISCLAIMER: This document describes Resharding V2 and it may not be up to date with recent changes to nearcore.

Resharding

Resharding is the process in which the shard layout changes. The primary purpose of resharding is to keep the shards small so that a node meeting minimum hardware requirements can safely keep up with the network while tracking some set minimum number of shards.

Specification

The resharding is described in more detail in the following NEPs:

Shard layout

The shard layout determines the number of shards and the assignment of accounts to shards (as single account cannot be split between shards).

There are two versions of the ShardLayout enum.

  • v0 - maps the account to a shard taking hash of the account id modulo number of shards
  • v1 - maps the account to a shard by looking at a set of predefined boundary accounts and selecting the shard where the accounts fits by using alphabetical order

At the time of writing there are three pre-defined shard layouts but more can be added in the future.

  • v0 - The first shard layout that contains only a single shard encompassing all the accounts.
  • simple nightshade - Splits the accounts into 4 shards.
  • simple nightshade v2 - Splits the accounts into 5 shards.

IMPORTANT: Using alphabetical order applies to the full account name, so a.near could belong to shard 0, while z.a.near to shard 3.

Currently in mainnet & testnet, we use the fixed shard split (which is defined in get_simple_nightshade_layout):

vec!["aurora", "aurora-0", "kkuuue2akv_1630967379.near"]

In the near future we are planning on switching to simple nightshade v2 (which is defined in get_simple_nightshade_layout_v2)

vec!["aurora", "aurora-0", "kkuuue2akv_1630967379.near", "tge-lockup.sweat"]

Shard layout changes

Shard Layout is determined at epoch level in the AllEpochConfig based on the protocol version of the epoch.

The shard layout can change at the epoch boundary. Currently in order to change the shard layout it is necessary to manually determine the new shard layout and setting it for the desired protocol version in the AllEpochConfig.

Deeper technical details

It all starts in preprocess_block - if the node sees, that the block it is about to preprocess is the first block of the epoch (X+1) - it calls get_state_sync_info, which is responsible for figuring out which shards will be needed in next epoch (X+2).

This is the moment, when node can request new shards that it didn't track before (using StateSync) - and if it detects that the shard layout would change in the next epoch, it also involves the StateSync - but skips the download part (as it already has the data) - and starts from resharding.

StateSync in this phase would send the ReshardingRequest to the SyncJobsActor (you can think about the SyncJobsActor as a background thread).

We'd use the background thread to perform resharding: the goal is to change the one trie (that represents the state of the current shard) - to multiple tries (one for each of the new shards).

In order to split a trie into children tries we use a snapshot of the flat storage. We iterate over all of the entries in the flat storage and we build the children tries by inserting the parent entry into either of the children tries.

Extracting of the account from the key happens in parse_account_id_from_raw_key - and we do it for all types of data that we store in the trie (contract code, keys, account info etc) EXCEPT for Delayed receipts. Then, we figure out the shard that this account is going to belong to, and we add this key/value to that new trie.

This way, after going over all the key/values from the original trie, we end up with X new tries (one for each new shard).

IMPORTANT: in the current code, we only support such 'splitting' (so a new shard can have just one parent).

Why delayed receipts are special?

For all the other columns, there is no dependency between entries, but in case of delayed receipts - we are forming a 'queue'. We store the information about the first index and the last index (in DelayedReceiptIndices struct).

Then, when receipt arrives, we add it as the 'DELAYED_RECEIPT + last_index' key (and increment last_index by 1).

That is why we cannot move this trie entry type in the same way as others where account id is part of the key. Instead we do it by iterating over this queue and inserting entries to the queue of the relevant child shard.

Constraints

The state sync of the parent shard, the resharing and the catchup of the children shards must all complete within a single epoch.

Rollout

Flow

The resharding will be initiated by having it included in a dedicated protocol version together with neard. Here is the expected flow of events:

  • A new neard release is published and protocol version upgrade date is set to D, roughly a week from the release.
  • All node operators upgrade their binaries to the newly released version within the given timeframe, ideally as soon as possible but no later than D.
  • The protocol version upgrade voting takes place at D in an epoch E and nodes vote in favour of switching to the new protocol version in epoch E+2.
  • The resharding begins at the beginning of epoch E+1.
  • The network switches to the new shard layout in the first block of epoch E+2.

Monitoring

Resharding exposes a number of metrics and logs that allow for monitoring the resharding process as it is happening. Resharding requires manual recovery in case anything goes wrong and should be monitored in order to ensure smooth node operation.

  • near_resharding_status is the primary metric that should be used for tracking the progress of resharding. It's tagged with a shard_uid label of the parent shard. It's set to corresponding ReshardingStatus enum and can take one of the following values
    • 0 - Scheduled - resharding is scheduled and waiting to be executed.
    • 1 - Building - resharding is running. Only one shard at a time can be in that state while the rest will be either finished or waiting in the Scheduled state.
    • 2 - Finished - resharding is finished.
    • -1 - Failed - resharding failed and manual recovery action is required. The node will operate as usual until the end of the epoch but will then stop being able to process blocks.
  • near_resharding_batch_size and near_resharding_batch_count - those two metrics show how much data has been resharded. Both metrics should progress with the near_resharding_status as follows.
    • While in the Scheduled state both metrics should remain 0.
    • While in the Building state both metrics should be gradually increasing.
    • While in the Finished state both metrics should remain at the same value.
  • near_resharding_batch_prepare_time_bucket, near_resharding_batch_apply_time_bucket and near_resharding_batch_commit_time_bucket - those three metrics can be used to track the performance of resharding and fine tune throttling if needed. As a rule of thumb the combined time of prepare, apply and commit for a batch should remain at the 100ms-200ms level on average. Higher batch processing time may lead to disruptions in block processing, missing chunks and blocks.

Here are some example metric values when finished for different shards and networks. The duration column reflects the duration of the building phase. Those were captured in production like environment in November 2023 and actual times at the time of resharding in production may be slightly higher.

mainnetdurationbatch countbatch size
total2h23min
shard 032min12,5106.6GB
shard 130min12,2036.1GB
shard 226min10,6196.0GB
shard 355min21,07011.5GB
testnetdurationbatch countbatch size
total5h32min
shard 021min10,25910.9GB
shard 118min7,0343.5GB
shard 22h31min75,52975.6GB
shard 32h22min63,62149.2GB

Here is an example of what that may look like in a grafana dashboard. Please keep in mind that the values and duration is not representative as the sample data below is captured in a testing environment with different configuration.

Screenshot 2023-12-01 at 10 10 20 Screenshot 2023-12-01 at 10 10 50 Screenshot 2023-12-01 at 10 10 42

Throttling

The resharding process can be quite resource intensive and affect the regular operation of a node. In order to mitigate that as well as limit any need for increasing hardware specifications of the nodes throttling was added. Throttling slows down resharding to not have it impact other node operations. Throttling can be configured by adjusting the resharding_config in the node config file.

  • batch_size - controls the size of batches in which resharding moves data around. Setting a smaller batch size will slow down the resharding process and make it less resource-consuming.
  • batch_delay - controls the delay between processing of batches. Setting a smaller batch delay will speed up the resharding process and make it more resource-consuming.

The remaining fields in the ReshardingConfig are only intended for testing purposes and should remain set to their default values.

The default configuration for ReshardingConfig should provide a good and safe setting for resharding in the production networks. There is no need for node operators to make any changes to it unless they observe issues.

The resharding config can be adjusted at runtime, without restarting the node. The config needs to be updated first and then a SIGHUP signal should be sent to the neard process. When received the signal neard will update the config and print a log message showing what fields were changed. It's recommended to check the log to make sure the relevant config change was correctly picked up.

Future possibilities

Localize resharding to a single shard

Currently when resharding we need to move the data for all shards even if only a single shard is being split. That is due to having the version field in the storage key that needs to be updated when changing shard layout version.

This can be improved by changing how ShardUId works e.g. removing the version and instead using globally unique shard ids.

Dynamic resharding

The current implementation relies on having the shard layout determined offline and manually added to the node implementation.

The dynamic resharding would mean that the network itself can automatically determine that resharding is needed, what should be the new shard layout and schedule the resharding.

Support different changes to shard layout

The current implementation only supports splitting a shard. In the future we can consider adding support for other operations such as merging two shards or moving an existing boundary account.

How neard will work

The documents under this chapter are talking about the future of NEAR - what we're planning on improving and how.

(This also means that they can get out of date quickly :-).

If you have comments, suggestions or want to help us designing and implementing some of these things here - please reach out on Zulip or github.

This document is still a DRAFT.

This document covers our improvement plans for state sync and catchup. Before reading this doc, you should take a look at How sync works

State sync is used in two situations:

  • when your node is behind for more than 2 epochs (and it is not an archival node) - then rather than trying to apply block by block (that can take hours) - you 'give up' and download the fresh state (a.k.a state sync) and apply blocks from there.
  • when you're a block (or chunk) producer - and in the upcoming epoch, you'll have to track a shard that you are not currently tracking.

In the past (and currently) - the state sync was mostly used in the first scenario (as all block & chunk producers had to track all the shards for security reasons - so they didn't actually have to do catchup at all).

As we progress towards phase 2 and keep increasing number of shards - the catchup part starts being a lot more critical. When we're running a network with a 100 shards, the single machine is simply not capable of tracking (a.k.a applying all transactions) of all shards - so it will have to track just a subset. And it will have to change this subset almost every epoch (as protocol rebalances the shard-to-producer assignment based on the stakes).

This means that we have to do some larger changes to the state sync design, as requirements start to differ a lot:

  • catchups are high priority (the validator MUST catchup within 1 epoch - otherwise it will not be able to produce blocks for the new shards in the next epoch - and therefore it will not earn rewards).
  • a lot more catchups in progress (with lots of shards basically every validator would have to catchup at least one shard at each epoch boundary) - this leads to a lot more potential traffic on the network
  • malicious attacks & incentives - the state data can be large and can cause a lot of network traffic. At the same time it is quite critical (see point above), so we'll have to make sure that the nodes are incentivised to provide the state parts upon request.
  • only a subset of peers will be available to request the state sync from (as not everyone from our peers will be tracking the shard that we're interested in).

Things that we're actively analysing

Performance of state sync on the receiver side

We're looking at the performance of state sync:

  • how long does it take to create the parts,
  • pro-actively creating the parts as soon as epoch starts
  • creating them in parallel
  • allowing user to ask for many at once
  • allowing user to provide a bitmask of parts that are required (therefore allowing the server to return only the ones that it already cached).

Better performance on the requestor side

Currently the parts are applied only once all of them are downloaded - instead we should try to apply them in parallel - after each part is received.

When we receive a part, we should announce this information to our peers - so that they know that they can request it from us if they need it.

Ideas - not actively working on them yet

Better networking (a.k.a Tier 3)

Currently our networking code is picking the peers to connect at random (as most of them are tracking all the shards). With phase2 it will no longer be the case, so we should work on improvements of our peer-selection mechanism.

In general - we should make sure that we have direct connection to at least a few nodes that are tracking the same shards that we're tracking right now (or that we'll want to track in the near future).

Dedicated nodes optimized towards state sync responses

The idea is to create a set of nodes that would specialize in state sync responses (similar to how we have archival nodes today).

The sub-idea of this, is to store such data on one of the cloud providers (AWS, GCP).

Sending deltas instead of full state syncs

In case of catchup, the requesting node might have tracked that shard in the past. So we could consider just sending a delta of the state rather than the whole state.

While this helps with the amount of data being sent - it might require the receiver to do a lot more work (as the data that it is about to send cannot be easily cached).

Malicious producers in phase 2 of sharding.

In this document, we'll compare the impact of the hypothetical malicious producer on the NEAR system (both in the current setup and how it will work when phase2 is implemented).

Current state (Phase 1)

Let's assume that a malicious chunk producer C1 has produced a bad chunk and sent it to the block producer at this height B1.

The block producer IS going to add the chunk to the block (as we don't validate the chunks before adding to blocks - but only when signing the block - see Transactions and receipts - last section).

After this block is produced, it is sent to all the validators to get the signatures.

As currently all the validators are tracking all the shards - they will quickly notice that the chunk is invalid, so they will not sign the block.

Therefore the next block producer B2 is going to ignore B1's block, and select block from B0 as a parent instead.

So TL;DR - a bad chunk would not be added to the chain.

Phase 2 and sharding

Unfortunately things get a lot more complicated, once we scale.

Let's assume the same setup as above (a single chunk producer C1 being malicious). But this time, we have 100 shards - each validator is tracking just a few (they cannot track all - as today - as they would have to run super powerful machines with > 100 cores).

So in the similar scenario as above - C1 creates a malicious chunks, and sends it to B1, which includes it in the block.

And here's where the complexity starts - as most of the validators will NOT track the shard which C1 was producing - so they will still sign the block.

The validators that do track that shard will of course (assuming that they are non-malicious) refuse the sign. But overall, they will be a small majority - so the block is going to get enough signatures and be added to the chain.

Challenges, Slashing and Rollbacks

So we're in a pickle - as a malicious chunk was just added to the chain. And that's why need to have mechanisms to automatically recover from such situations: Challenges, Slashing and Rollbacks.

Challenge

Challenge is a self-contained proof, that something went wrong in the chunk processing. It must contain all the inputs (with their merkle proof), the code that was executed, and the outputs (also with merkle proofs).

Such a challenge allows anyone (even nodes that don't track that shard or have any state) to verify the validity of the challenge.

When anyone notices that a current chain contains a wrong transition - they submit such challenge to the next block producer, which can easily verify it and it to the next block.

Then the validators do the verification themselves, and if successful, they sign the block.

When such block is successfully signed, the protocol automatically slashes malicious nodes (more details below) and initiates the rollback to bring the state back to the state before the bad chunk (so in our case, back to the block produced by B0).

Slashing

Slashing is the process of taking away the part of the stake from validators that are considered malicious.

In the example above, we'll definitely need to slash the C1 - and potentially also any validators that were tracking that shard and did sign the bad block.

Things that we'll have to figure out in the future:

  • how much do we slash? all of the stake? some part?
  • what happens to the slashed stake? is it burned? does it go to some pool?

State rollbacks

// TODO: add

Problems with the current Phase 2 design

Is slashing painful enough?

In the example above, we'd successfully slash the C1 producer - but was it
enough?

Currently (with 4 shards) you need around 20k NEAR to become a chunk producer. If we increase the number of shards to 100, it would drop the minimum stake to around 1k NEAR.

In such scenario, by sacrificing 1k NEAR, the malicious node can cause the system to rollback a couple blocks (potentially having bad impact on the bridge contracts etc).

On the other side, you could be a non-malicious chunk producer with a corrupted database (or a nasty bug in the code) - and the effect would be the same - the chunk that you produced would be marked as malicious, and you'd lose your stake (which will be a super-scary even for any legitimate validator).

So the open question is - can we do something 'smarter' in the protocol to detect the case, where there is 'just a single' malicious (or buggy) chunk producer and avoid the expensive rollback?

Storage

This is our work-in-progress storage documentation. Things are raw and incomplete. You are encouraged to help improve it, in any capacity you can!

Flow

Here we present the flow of a single read or write request from the transaction runtime all the way to the OS. As you can see, there are many layers of read-caching and write-buffering involved.

Blue arrow means a call triggered by read.

Red arrow means a call triggered by write.

Black arrow means a non-trivial data dependency. For example:

  • Nodes which are read on TrieStorage go to TrieRecorder to generate proof, so they are connected with black arrow.
  • Memtrie lookup needs current state of accounting cache to compute costs. When query completes, accounting cache is updated with memtrie nodes. So they are connected with bidirectional black arrow.

Diagram with read and write request flow

Trie

We use Merkle-Patricia Trie to store blockchain state. Trie is persistent, which means that insertion of new node actually leads to creation of a new path to this node, and thus root of Trie after insertion will also be represented by a new object.

Here we describe its implementation details which are closely related to Runtime.

Main structures

Trie

Trie stores the state - accounts, contract codes, access keys, etc. Each state item corresponds to the unique trie key. You can read more about this structure on Wikipedia.

There are two ways to access trie - from memory and from disk. The first one is currently the main one, where only the loading stage requires disk, and the operations are fully done in memory. The latter one relies only on disk with several layers of caching. Here we describe the disk trie.

Disk trie is stored in the RocksDB, which is persistent across node restarts. Trie communicates with database using TrieStorage. On the database level, data is stored in key-value format in DBCol::State column. There are two kinds of records:

  • trie nodes, for the which key is constructed from shard id and RawTrieNodeWithSize hash, and value is a RawTrieNodeWithSize serialized by a custom algorithm;
  • values (encoded contract codes, postponed receipts, etc.), for which the key is constructed from shard id and the hash of value, which maps to the encoded value.

So, value can be obtained from TrieKey as follows:

  • start from the hash of RawTrieNodeWithSize corresponding to the root;
  • descend to the needed node using nibbles from TrieKey;
  • extract underlying RawTrieNode;
  • if it is a Leaf or Branch, it should contain the hash of the value;
  • get value from storage by its hash and shard id.

Note that Trie is almost never called directly from Runtime, modifications are made using TrieUpdate.

TrieUpdate

Provides a way to access storage and record changes to commit in the future. Update is prepared as follows:

  • changes are made using set and remove methods, which are added to prospective field,
  • call commit method which moves prospective changes to committed,
  • call finalize method which prepares TrieChanges and state changes based on committed field.

Prospective changes correspond to intermediate state updates, which can be discarded if the transaction is considered invalid (because of insufficient balance, invalidity, etc.). While they can't be applied yet, they must be cached this way if the updated keys are accessed again in the same transaction.

Committed changes are stored in memory across transactions and receipts. Similarly, they must be cached if the updated keys are accessed across transactions. They can be discarded only if the chunk is discarded.

Note that finalize, Trie::insert and Trie::update do not update the database storage. These functions only modify trie nodes in memory. Instead, these functions prepare the TrieChanges object, and Trie is actually updated when ShardTries::apply_insertions is called, which puts new values to DBCol::State part of the key-value database.

TrieStorage

Stores all Trie nodes and allows to get serialized nodes by TrieKey hash using the retrieve_raw_bytes method.

There are two major implementations of TrieStorage:

  • TrieCachingStorage - caches all big values ever read by retrieve_raw_bytes.
  • TrieMemoryPartialStorage - used for validating recorded partial storage.

Note that these storages use database keys, which are retrieved using hashes of trie nodes using the get_key_from_shard_id_and_hash method.

ShardTries

This is the main struct that is used to access all Tries. There's usually only a single instance of this and it contains stores and caches. We use this to gain access to the Trie for a single shard by calling the get_trie_for_shard or equivalent methods.

Each shard within ShardTries has their own cache and view_cache. The cache stores the most frequently accessed nodes and is usually used during block production. The view_cache is used to serve user request to get data, which usually come in via network. It is a good idea to have an independent cache for this as we can have patterns in accessing user data independent of block production.

Primitives

TrieChanges

Stores result of updating Trie.

  • old_root: root before updating Trie, i.e. inserting new nodes and deleting old ones,
  • new_root: root after updating Trie,
  • insertions, deletions: vectors of TrieRefcountChange, describing all inserted and deleted nodes.

This way to update trie allows to add new nodes to storage and remove old ones separately. The former corresponds to saving new block, the latter - to garbage collection of old block data which is no longer needed.

TrieRefcountChange

Because we remove unused nodes during garbage collection, we need to track the reference count (rc) for each node. Another reason is that we can dedup values. If the same contract is deployed 1000 times, we only store one contract binary in storage and track its count.

This structure is used to update rc in the database:

  • trie_node_or_value_hash - hash of the trie node or value, used for uniting with shard id to get DB key,
  • trie_node_or_value - serialized trie node or value,
  • rc - change of reference count.

Note that for all reference-counted records, the actual value stored in DB is the concatenation of trie_node_or_value and rc. The reference count is updated using a custom merge operation refcount_merge.

On-Disk Database Format

We store the database in RocksDB. This document is an attempt to give hints about how to navigate it.

RocksDB

  • The column families are defined in DBCol, defined in core/store/src/columns.rs
  • The column families are seen on the rocksdb side as per the col_name function defined in core/store/src/db/rocksdb.rs

The Trie (col5)

  • The trie is stored in column family State, number 5
  • In this family, each key is of the form ShardUId | CryptoHash where ShardUId: u64 and CryptoHash: [u8; 32]

All Historical State Changes (col35)

  • The state changes are stored in column family StateChanges, number 35
  • In this family, each key is of the form BlockHash | Column | AdditionalInfo where:
    • BlockHash: [u8; 32] is the block hash for this change
    • Column: u8 is defined near the top of core/primitives/src/trie_key.rs
    • AdditionalInfo depends on Column and it can be found in the code for the TrieKey struct, same file as Column

Contract Deployments

  • Contract deployments happen with Column = 0x01
  • AdditionalInfo is the account id for which the contract is being deployed
  • The key value contains the contract code alongside other pieces of data. It is possible to extract the contract code by removing everything until the wasm magic number, 0061736D01000000
  • As such, it is possible to dump all the contracts that were ever deployed on-chain using this command on an archival node:
    ldb --db=~/.near/data scan --column_family=col35 --hex | \
        grep -E '^0x.{64}01' | \
        sed 's/0061736D01000000/x/' | \
        sed 's/^.*x/0061736D01000000/' | \
        grep -v ' : '
    
    (Note that the last grep is required because not every such value appears to contain contract code) We should implement a feature to state-viewer that’d allow better visualization of this data, but in the meantime this seems to work.

High level overview

When mainnet launched, the neard client stored all the chain's state in a single RocksDB column DBCol::State. This column embeds the entire NEAR state trie directly in the key-value database, using roughly hash(borsh_encode(trie_node)) as the key to store a trie_node. This gives a content-addressed storage system that can easily self-verify.

Flat storage is a bit like a database index for the values stored in the trie. It stores a copy of the data in a more accessible way to speed up the lookup time.

Drastically oversimplified, flat storage uses a hashmap instead of a trie. This reduces the lookup time from O(d) to O(1) where d is the tree depth of the value.

But the devil is in the detail. Below is a high-level summary of implementation challenges, which are the reasons why the above description is an oversimplification.

Time dimension of the state

The blockchain state is modified with every chunk. Neard must be able to travel back in time and resolve key lookups for older versions, as well as travel to alternative universes to resolve requests for chunks that belong to a different fork.

Using the full trie embedding in RocksDB, this is almost trivial. We only need to know the state root for a chunk and we can start traversing from the root to any key. As long as we do not delete (garbage collect) unused trie nodes, the data remains available. The overhead is minimal, too, since all the trie nodes that have not been changed are shared between tries.

Enter flat storage territory: A simple hashmap only stores a snapshot. When we write a new value to the same key, the old value is overwritten and no longer accessible. A solution to access different versions on each shard is required.

The minimal solution only tracks the final head and all forks building on top. A full implementation would also allow replaying older chunks and doing view calls. But even this minimal solution pulls in all complicated details regarding consensus and multiplies them with all the problems listed below.

Fear of data corruption

State and FlatState keep a copy of the same data and need to be in sync at all times. This is a source for errors which any implementation needs to test properly. Ideally, there are also tools to quickly compare the two and verify which is correct.

Note: The trie storage is verifiable by construction of the hashes. The flat state is not directly verifiable. But it is possible to reconstruct the full trie just from the leaf values and use that for verification.

Gas accounting requires a protocol change

The trie path we take in a trie lookup affects the gas costs, and hence the balance subtracted from the user. In other words, the lookup algorithm leaks into the protocol. Hence, we cannot switch between different ways of looking up state without a protocol change.

This makes things a whole lot more complicated. We have to do the data migration and prepare flat storage, while still using trie storage. Keeping flat storage up to date at this point is pure overhead. And then, at the epoch switch where the new protocol version begins, we have to start using the new storage in all clients simultaneously. Anyone that has not finished migration, yet, will fail to produce a chunk due to invalid gas results.

In an ideal future, we want to make gas costs independent of the position in the trie and then this would no longer be a problem.

Performance reality check

Going from O(d) to O(1) sounds great. But let's look at actual numbers.

A flat state lookup requires exactly 2 database requests. One for finding the ValueRef and one for dereferencing the value. (Dereferencing only happens if the value is present and if enough gas was paid to cover reading the potentially large value, otherwise we short-circuit.)

A trie lookup requires exactly d trie node lookups where d is the depth in the trie, plus one more for dereferencing the final value.

Clearly, d + 1 is worse than 1 + 1, right? Well, as it turns out, we can cache the trie nodes with surprisingly high effectiveness. In mainnet workloads (which were often optimized to work well with the trie shape) we observe a 99% cache hit rate in many cases.

Combine that with the fact that a typical value for d is somewhere between 10 and 20. Then we may conclude that, in expectation, a trie lookup (d * 0.01 + 1 requests) requires less DB requests than a flat state lookup (1 + 1 requests).

In practice, however, flat state still has an edge over accessing trie storage directly. And that is due to two reasons.

  1. DB keys are in better order, leading to more cache hits in RocksDB's block cache.
  2. The flat state column is much smaller than the state column.

We observed a speedup of 100x and beyond for reading a single value from DBCol::FlatState compared to reading it from DBCol::State. So that is clearly a win. But one that is due to the data layout inside RocksDB, not due to algorithmic improvements.

Updating the Merkle tree

When we update a value in the blockchain state, all ancestor nodes change their value due to the recursive nature of Merkle trees.

Updating flat state is easy enough but then we would not know the new state root. Annoyingly, it is rather important to have the state root in due time, as it is included in the chunk header.

To update the state root, we need to read all nodes between the root and all changed values. At which point, we are doing all the same reads we were trying to avoid in the first place.

This makes flat state for writes algorithmically useless, as we still have to do O(d) requests, no matter what. But there are potential benefits from data locality, as flat state stores values sorted by a trie key instead of a perfectly random hash.

For that, we would need a fast index not only for the state value but also for all intermediate trie nodes. At this point, we would be back to a full embedding of the trie in a key-value store / hashmap. Just changing the keys we use for the database.

Note that we can enjoy the benefits of read-only flat storage to improve read heavy contracts. But a read-modify-write pattern using this hybrid solution is strictly worse than the original implementation without flat storage.

Implementation status and future ideas

As of March 2023, we have implemented a read-only flat storage that only works for the frontier of non-final blocks and the final block itself. Archival calls and view calls still use the trie directly.

Things we have solved

We are fairly confident we have solved the time-travel issues, the data migration / protocol upgrade problems, and got a decent handle on avoiding accidental data corruption.

This improves the worst case (no cache hits) for reads dramatically. And it paves the way for further improvements, as we have now a good understanding of all these seemingly-simple but actually-really-hard problems.

Flat state for writes

How to use flat storage for writes is not fully designed, yet, but we have some rough ideas on how to do it. But we don't know the performance we should expect. Algorithmically, it can only get worse but the speedup on RocksDB we found with the read-only flat storage is promising. But one has to wonder if there are not also simpler ways to achieve better data locality in RocksDB.

Inlining values

We hope to get a jump in performance by avoiding dereferencing ValueRef after the flat state lookup. At least for small values we could store the value itself also in flat state. (to be defined what small means)

This seems very promising because the value dereferencing still happens in the much slower DBCol::State. If we assume small values are the common case, we would thus expect huge performance improvements for the average case.

It is not clear yet, if we can also optimize large lookups somehow. If not, we could at least charge them at a higher rate than we do today, to reflect the real DB cost better.

Code guide

Here we describe structures used for flat storage implementation.

FlatStorage

This is the main structure which owns information about ValueRefs for all keys from some fixed shard for some set of blocks. It is shared by multiple threads, so it is guarded by RwLock:

  • Chain thread, because it sends queries like:
    • "new block B was processed by chain" - supported by add_block
    • "flat storage head can be moved forward to block B" - supported by update_flat_head
  • Thread that applies a chunk, because it sends read queries "what is the ValueRef for key for block B"
  • View client (not fully decided)

Requires ChainAccessForFlatStorage on creation because it needs to know the tree of blocks after the flat storage head, to support getting queries correctly.

FlatStorageManager

It holds all FlatStorages which NightshadeRuntime knows about and:

  • provides views for flat storage for some fixed block - supported by new_flat_state_for_shard
  • sets initial flat storage state for genesis block - set_flat_storage_for_genesis
  • adds/removes/gets flat storage if we started/stopped tracking a shard or need to create a view - create_flat_storage_for_shard, etc.

FlatStorageChunkView

Interface for getting ValueRefs from flat storage for some shard for some fixed block, supported by get_ref method.

Other notes

Chain dependency

If storage is fully empty, then we need to create flat storage from scratch. FlatStorage is stored inside NightshadeRuntime, and it itself is stored inside Chain, so we need to create them in the same order and dependency hierarchy should be the same. But at the same time, we parse genesis file only during Chain creation. That’s why FlatStorageManager has set_flat_storage_for_genesis method which is called during Chain creation.

Regular block processing vs. catchups

For these two usecases we have two different flows: first one is handled in Chain.postprocess_block, the second one in Chain.block_catch_up_postprocess. Both, when results of applying chunk are ready, should call Chain.process_apply_chunk_result → RuntimeAdapter.get_flat_storage_for_shard → FlatStorage.add_block, and when results of applying ALL processed/postprocessed chunks are ready, should call RuntimeAdapter.get_flat_storage_for_shard → FlatStorage.update_flat_head.

(because applying some chunk may result in error and we may need to exit there without updating flat head - ?)

This document describes how our network works. At this moment, it is known to be somewhat outdated, as we are in the process of refactoring the network protocol somewhat significantly.

1. Overview

Near Protocol uses its own implementation of a custom peer-to-peer network. Peers who join the network are represented by nodes and connections between them by edges.

The purpose of this document is to describe the inner workings of the near-network package; and to be used as reference by future engineers to understand the network code without any prior knowledge.

2. Code structure

near-network runs on top of the actor framework called Actix. Code structure is split between 4 actors PeerManagerActor, PeerActor, RoutingTableActor, EdgeValidatorActor

2.1 EdgeValidatorActor (currently called EdgeVerifierActor in the code)

EdgeValidatorActor runs on separate thread. The purpose of this actor is to validate edges, where each edge represents a connection between two peers, and it's signed with a cryptographic signature of both parties. The process of edge validation involves verifying cryptographic signatures, which can be quite expensive, and therefore was moved to another thread.

Responsibilities:

  • Validating edges by checking whenever cryptographic signatures match.

2.2 RoutingTableActor

RoutingTableActor maintains a view of the P2P network represented by a set of nodes and edges.

In case a message needs to be sent between two nodes, that can be done directly through a TCP connection. Otherwise, RoutingTableActor is responsible for pinging the best path between them.

Responsibilities:

  • Keep set of all edges of P2P network called routing table.
  • Connects to EdgeValidatorActor, and asks for edges to be validated, when needed.
  • Has logic related to exchanging edges between peers.

2.3 PeerActor

Whenever a new connection gets accepted, an instance of PeerActor gets created. Each PeerActor keeps a physical TCP connection to exactly one peer.

Responsibilities:

  • Maintaining physical connection.
  • Reading messages from peers, decoding them, and then forwarding them to the right place.
  • Encoding messages, sending them to peers on physical layer.
  • Routing messages between PeerManagerActor and other peers.

2.4 PeerManagerActor

PeerManagerActor is the main actor of near-network crate. It acts as a bridge connecting to the world outside, the other peers, and ClientActor and ClientViewActor, which handle processing any operations on the chain. PeerManagerActor maintains information about p2p network via RoutingTableActor, and indirectly, through PeerActor, connections to all nodes on the network. All messages going to other nodes, or coming from other nodes will be routed through this Actor. PeerManagerActor is responsible for accepting incoming connections from the outside world and creating PeerActors to manage them.

Responsibilities:

  • Accepting new connections.
  • Maintaining the list of PeerActors, creating, deleting them.
  • Routing information about new edges between PeerActors and RoutingTableManager.
  • Routing messages between ViewClient, ViewClientActor and PeerActors, and consequently other peers.
  • Maintains RouteBack structure, which has information on how to send replies to messages.

3. Code flow - initialization

First, the PeerManagerActor actor gets started. PeerManagerActor opens the TCP server, which listens to incoming connections. It starts the RoutingTableActor, which then starts the EdgeValidatorActor. When an incoming connection gets accepted, it starts a new PeerActor on its own thread.

4. NetworkConfig

near-network reads configuration from NetworkConfig, which is a part of client config.

Here is a list of features read from config:

  • boot_nodes - list of nodes to connect to on start.
  • addr - listening address.
  • max_num_peers - by default we connect up to 40 peers, current implementation supports up to 128.

5. Connecting to other peers.

Each peer maintains a list of known peers. They are stored in the database. If the database is empty, the list of peers, called boot nodes, will be read from the boot_nodes option in the config. The peer to connect to is chosen at random from a list of known nodes by the PeerManagerActor::sample_random_peer method.

6. Edges & network - in code representation

P2P network is represented by a list of peers, where each peer is represented by a structure PeerId, which is defined by the peer's public key PublicKey, and a list of edges, where each edge is represented by the structure Edge.

Both are defined below.

6.1 PublicKey

We use two types of public keys:

  • a 256 bit ED25519 public key.
  • a 512 bit Secp256K1 public key.

Public keys are defined in the PublicKey enum, which consists of those two variants.

#![allow(unused)]
fn main() {
pub struct ED25519PublicKey(pub [u8; 32]);
pub struct Secp256K1PublicKey([u8; 64]);
pub enum PublicKey {
    ED25519(ED25519PublicKey),
    SECP256K1(Secp256K1PublicKey),
}
}

6.2 PeerId

Each peer is uniquely defined by its PublicKey, and represented by PeerId struct.

#![allow(unused)]
fn main() {
pub struct PeerId(PublicKey);
}

6.3 Edge

Each edge is represented by the Edge structure. It contains the following:

  • pair of nodes represented by their public keys.
  • nonce - a unique number representing the state of an edge. Starting with 1. Odd numbers represent an active edge. Even numbers represent an edge in which one of the nodes, confirmed that the edge is removed.
  • Signatures from both peers for active edges.
  • Signature from one peer in case an edge got removed.

6.4 Graph representation

RoutingTableActor is responsible for storing and maintaining the set of all edges. They are kept in the edge_info data structure of the type HashSet<Edge>.

#![allow(unused)]
fn main() {
pub struct RoutingTableActor {
    /// Collection of edges representing P2P network.
    /// It's indexed by `Edge::key()` key and can be search through by calling `get()` function
    /// with `(PeerId, PeerId)` as argument.
    pub edges_info: HashSet<Edge>,
    /// ...
}
}

7. Code flow - connecting to a peer - handshake

When PeerManagerActor starts, it starts to listen to a specific port.

7.1 - Step 1 - monitor_peers_trigger runs

PeerManager checks if we need to connect to another peer by running the PeerManager::is_outbound_bootstrap_needed method. If true we will try to connect to a new node. Let's call the current node, node A.

7.2 - Step 2 - choosing the node to connect to

Method PeerManager::sample_random_peer will be called, and it returns node B that we will try to connect to.

7.3 - Step 3 - OutboundTcpConnect message

PeerManagerActor will send itself a message OutboundTcpConnect in order to connect to node B.

#![allow(unused)]
fn main() {
pub struct OutboundTcpConnect {
    /// Peer information of the outbound connection
    pub target_peer_info: PeerInfo,
}
}

7.4 - Step 4 - OutboundTcpConnect message

On receiving the message the handle_msg_outbound_tcp_connect method will be called, which calls TcpStream::connect to create a new connection.

7.5 - Step 5 - Connection gets established

Once connection with the outgoing peer gets established. The try_connect_peer method will be called. And then a new PeerActor will be created and started. Once the PeerActor starts it will send a Handshake message to the outgoing node B over a tcp connection.

This message contains protocol_version, node A's metadata, as well as all information necessary to create an Edge.

#![allow(unused)]
fn main() {
pub struct Handshake {
    /// Current protocol version.
    pub(crate) protocol_version: u32,
    /// Oldest supported protocol version.
    pub(crate) oldest_supported_version: u32,
    /// Sender's peer id.
    pub(crate) sender_peer_id: PeerId,
    /// Receiver's peer id.
    pub(crate) target_peer_id: PeerId,
    /// Sender's listening addr.
    pub(crate) sender_listen_port: Option<u16>,
    /// Peer's chain information.
    pub(crate) sender_chain_info: PeerChainInfoV2,
    /// Represents new `edge`. Contains only `none` and `Signature` from the sender.
    pub(crate) partial_edge_info: PartialEdgeInfo,
}
}

7.6 - Step 6 - Handshake arrives at node B

Node B receives a Handshake message. Then it performs various validation checks. That includes:

  • Check signature of edge from the other peer.
  • Whenever nonce is the edge, send matches.
  • Check whether the protocol is above the minimum OLDEST_BACKWARD_COMPATIBLE_PROTOCOL_VERSION.
  • Other node view of chain state.

If everything is successful, PeerActor will send a RegisterPeer message to PeerManagerActor. This message contains everything needed to add PeerActor to the list of active connections in PeerManagerActor.

Otherwise, PeerActor will be stopped immediately or after some timeout.

#![allow(unused)]
fn main() {
pub struct RegisterPeer {
    pub(crate) actor: Addr<PeerActor>,
    pub(crate) peer_info: PeerInfo,
    pub(crate) peer_type: PeerType,
    pub(crate) chain_info: PeerChainInfoV2,
    // Edge information from this node.
    // If this is None it implies we are outbound connection, so we need to create our
    // EdgeInfo part and send it to the other peer.
    pub(crate) this_edge_info: Option<EdgeInfo>,
    // Edge information from other node.
    pub(crate) other_edge_info: EdgeInfo,
    // Protocol version of new peer. May be higher than ours.
    pub(crate) peer_protocol_version: ProtocolVersion,
}
}

7.7 - Step 7 - PeerManagerActor receives RegisterPeer message - node B

In the handle_msg_consolidate method, the RegisterPeer message will be validated. If successful, the register_peer method will be called, which adds the PeerActor to the list of connected peers.

Each connected peer is represented in PeerActorManager in ActivePeer the data structure.

#![allow(unused)]
fn main() {
/// Contains information relevant to an active peer.
struct ActivePeer { // will be renamed to `ConnectedPeer` see #5428
    addr: Addr<PeerActor>,
    full_peer_info: FullPeerInfo,
    /// Number of bytes we've received from the peer.
    received_bytes_per_sec: u64,
    /// Number of bytes we've sent to the peer.
    sent_bytes_per_sec: u64,
    /// Last time requested peers.
    last_time_peer_requested: Instant,
    /// Last time we received a message from this peer.
    last_time_received_message: Instant,
    /// Time where the connection was established.
    connection_established_time: Instant,
    /// Who started connection. Inbound (other) or Outbound (us).
    peer_type: PeerType,
}
}

7.8 - Step 8 - Exchange routing table part 1 - node B

At the end of the register_peer method node B will perform a RoutingTableSync sync. Sending the list of known edges representing a full graph, and a list of known AnnounceAccount. Those will be covered later, in their dedicated sections see sections (to be added).

message: PeerMessage::RoutingTableSync(SyncData::edge(new_edge)),
#![allow(unused)]
fn main() {
/// Contains metadata used for routing messages to particular `PeerId` or `AccountId`.
pub struct RoutingTableSync { // also known as `SyncData` (#5489)
    /// List of known edges from `RoutingTableActor::edges_info`.
    pub(crate) edges: Vec<Edge>,
    /// List of known `account_id` to `PeerId` mappings.
    /// Useful for `send_message_to_account` method, to route message to particular account.
    pub(crate) accounts: Vec<AnnounceAccount>,
}
}

7.9 - Step 9 - Exchange routing table part 2 - node A

Upon receiving a RoutingTableSync message. Node A will reply with its own RoutingTableSync message.

7.10 - Step 10 - Exchange routing table part 2 - node B

Node B will get the message from A and update its routing table.

8. Adding new edges to routing tables

This section covers the process of adding new edges, received from another node, to the routing table. It consists of several steps covered below.

8.1 Step 1

PeerManagerActor receives RoutingTableSync message containing list of new edges to add. RoutingTableSync contains list of edges of the P2P network. This message is then forwarded to RoutingTableActor.

8.2 Step 2

PeerManagerActor forwards those edges to RoutingTableActor inside of the ValidateEdgeList struct.

ValidateEdgeList contains:

  • list of edges to verify.
  • peer who sent us the edges.

8.3 Step 3

RoutingTableActor gets the ValidateEdgeList message. Filters out edges that have already been verified, those that are already in RoutingTableActor::edges_info.

Then, it updates edge_verifier_requests_in_progress to mark that edge verifications are in progress, and edges shouldn't be pruned from Routing Table (see section (to be added)).

Then, after removing already validated edges, the modified message is forwarded to EdgeValidatorActor.

8.4 Step 4

EdgeValidatorActor goes through the list of all edges. It checks whether all edges are valid (their cryptographic signatures match, etc.).

If any edge is not valid, the peer will be banned.

Edges that are validated are written to a concurrent queue ValidateEdgeList::sender. This queue is used to transfer edges from EdgeValidatorActor back to PeerManagerActor.

8.5 Step 5

broadcast_validated_edges_trigger runs, and gets validated edges from EdgeVerifierActor.

Every new edge will be broadcast to all connected peers.

And then, all validated edges received from EdgeVerifierActor will be sent again to RoutingTableActor inside AddVerifiedEdges.

8.5 Step 6

When RoutingTableActor receives RoutingTableMessages::AddVerifiedEdges, the method add_verified_edges_to_routing_table will be called. It will add edges to RoutingTableActor::edges_info struct, and mark routing table, that it needs a recalculation (see RoutingTableActor::needs_routing_table_recalculation).

9 Routing table computation

Routing table computation does a few things:

  • For each peer B, calculates set of peers |C_b|, such that each peer is on the shortest path to B.
  • Removes unreachable edges from memory and stores them to disk.
  • The distance is calculated as the minimum number of nodes on the path from given node A, to each other node on the network. That is, A has a distance of 0 to itself. Its neighbors will have a distance of 1. The neighbors of their neighbors will have a distance of 2, etc.

9.1 Step 1

PeerManagerActor runs a update_routing_table_trigger every UPDATE_ROUTING_TABLE_INTERVAL seconds.

RoutingTableMessages::RoutingTableUpdate message is sent to RoutingTableActor to request routing table re-computation.

9.2 Step 2

RoutingTableActor receives the message, and then:

  • calls recalculate_routing_table method, which computes RoutingTableActor::peer_forwarding: HashMap<PeerId, Vec<PeerId>>. For each PeerId on the network, gives a list of connected peers, which are on the shortest path to the destination. It marks reachable peers in the peer_last_time_reachable struct.
  • calls prune_edges which removes from memory all the edges that were not reachable for at least 1 hour, based on the peer_last_time_reachable data structure. Those edges are then stored to disk.

9.3 Step 3

RoutingTableActor sends a RoutingTableUpdateResponse message back to PeerManagerActor.

PeerManagerActor keeps a local copy of edges_info, called local_edges_info containing only edges adjacent to current node.

  • RoutingTableUpdateResponse contains a list of local edges, which PeerManagerActor should remove.
  • peer_forwarding which represents how to route messages in the P2P network
  • peers_to_ban represents a list of peers to ban for sending us edges, which failed validation in EdgeVerifierActor.

9.4 Step 4

PeerManagerActor receives RoutingTableUpdateResponse and then:

  • updates local copy of peer_forwarding, used for routing messages.
  • removes local_edges_to_remove from local_edges_info.
  • bans peers, who sent us invalid edges.

10. Message transportation layers.

This section describes different protocols of sending messages currently used in Near.

10.1 Messages between Actors.

Near is built on Actix's actor framework. Usually each actor runs on its own dedicated thread. Some, like PeerActor have one thread per each instance. Only messages implementing actix::Message, can be sent using between threads. Each actor has its own queue; Processing of messages happens asynchronously.

We should not leak implementation details into the spec.

Actix messages can be found by looking for impl actix::Message.

10.2 Messages sent through TCP

Near is using borsh serialization to exchange messages between nodes (See borsh.io). We should be careful when making changes to them. We have to maintain backward compatibility. Only messages implementing BorshSerialize, BorshDeserialize can be sent. We also use borsh for database storage.

10.3 Messages sent/received through chain/jsonrpc

Near runs a json REST server. (See actix_web::HttpServer). All messages sent and received must implement serde::Serialize and serde::Deserialize.

11. Code flow - routing a message

This is the example of the message that is being sent between nodes RawRoutedMessage.

Each of these methods have a target - that is either the account_id or peer_id or hash (which seems to be used only for route back...). If target is the account - it will be converted using routing_table.account_owner to the peer.

Upon receiving the message, the PeerManagerActor will sign it and convert into RoutedMessage (which also have things like TTL etc.).

Then it will use the routing_table, to find the route to the target peer (add route_back if needed) and then send the message over the network as PeerMessage::Routed. Details about routing table computations are covered in section 8.

When Peer receives this message (as PeerMessage::Routed), it will pass it to PeerManager (as RoutedMessageFrom), which would then check if the message is for the current PeerActor. (if yes, it would pass it to the client) and if not - it would pass it along the network.

All these messages are handled by receive_client_message in Peer. (NetworkClientMessages) - and transferred to ClientActor in (chain/client/src/client_actor.rs)

NetworkRequests to PeerManager actor trigger the RawRoutedMessage for messages that are meant to be sent to another peer.

lib.rs (ShardsManager) has a network_adapter - coming from the client’s network_adapter that comes from ClientActor that comes from the start_client call that comes from start_with_config (that creates PeerManagerActor - that is passed as target to network_recipent).

12. Database

12.1 Storage of deleted edges

Every time a group of peers becomes unreachable at the same time; We store edges belonging to them in components. We remove all of those edges from memory, and save them to the database. If any of them were to be reachable again, we would re-add them. This is useful in case there is a network split, to recover edges if needed.

Each component is assigned a unique nonce, where first one is assigned nonce 0. Each new component gets assigned a consecutive integer.

To store components, we have the following columns in the DB.

  • DBCol::LastComponentNonce Stores component_nonce: u64, which is the last used nonce.
  • DBCol::ComponentEdges Mapping from component_nonce to a list of edges.
  • DBCol::PeerComponent Mapping from peer_id to the last component nonce it belongs to.

12.2 Storage of account_id to peer_id mapping

ColAccountAnouncements -> Stores a mapping from account_id to a tuple (account_id, peer_id, epoch_id, signature).

Gas Cost Parameters

Gas in NEAR Protocol solves two problems.

  1. To avoid spam, validator nodes only perform work if a user's tokens are burned. Tokens are automatically converted to gas using the current gas price.
  2. To synchronize shards, they must all produce chunks following a strict schedule of 1 second execution time. Gas is used to measure how heavy the workload of a transaction is, so that the number of transactions that fit in a block can be deterministically computed by all nodes.

In other words, each transaction costs a fixed amount of gas. This gas cost determines how much a user has to pay and how much time nearcore has to execute the transaction.

What happens if nearcore executes a transaction too slowly? Chunk production for the shard gets delayed, which delays block production for the entire blockchain, increasing latency and reducing throughput for everybody. If the chunk is really late, the block producer will decide to not include the chunk at all and inserts an empty chunk. The chunk may be included in the next block.

By now, you probably wonder how we can know the time it takes to execute a transaction, given that validators use hardware of their choice. Getting these timings right is indeed a difficult problem. Or flipping the problem, assuming the timings are already known, then we must implement nearcore such that it guarantees to operate within the given time constraints. How we tackle this is the topic of this chapter.

If you want to learn more about Gas from a user perspective, Gas basic concepts, Gas advanced concepts, and the runtime fee specification are good places to dig deeper.

Hardware and Timing Assumptions

For timing to make sense at all, we must first define hardware constraints. The official hardware requirements for a validator are published on near-nodes.io/validator/hardware-validator. They may change over time but the main principle is that a moderately configured, cloud-hosted virtual machine suffices.

For our gas computation, we assume the minimum required hardware. Then we define 1015 gas to be executed in at most 1s. We commonly use 1 Tgas (= 1012 gas) in conversation, which corresponds to 1ms execution time.

Obviously, this definition means that a validator running more powerful hardware will execute the transactions faster. That is perfectly okay, as far as the protocol is concerned we just need to make sure the chunk is available in time. If it is ready in even less time, no problem.

Less obviously, this means that even a minimally configured validator is often idle. Why is that? Well, the hardware must be prepared to execute chunks that are always full. But that is rarely the case, as the gas price increases exponentially when chunks are full, which would cause traffic to go back eventually.

Furthermore, the hardware has to be ready for transactions of all types, including transactions chosen by a malicious actor selecting only the most complex transactions. Those transactions can also be unbalanced in what bottlenecks they hit. For example, a chunk can be filled with transactions that fully utilize the CPU's floating point units. Or they could be using all the available disk IO bandwidth.

Because the minimum required hardware needs to meet the timing requirements for any of those scenarios, the typical, more balanced case is usually computed faster than the gas rule states.

Transaction Gas Cost Model

A transaction is essentially just a list of actions to be executed on the same account. For example it could be CreateAccount combined with FunctionCall("hello_world").

The reference for available actions shows the conclusive list of possible actions. The protocol defines fixed fees for each of them. More details on actions fees follow below.

Fixed fees are an important design decision. It means that a given action will always cost the exact same amount of gas, no matter on what hardware it executes. But the content of the action can impact the cost, for example a DeployContract action's cost scales with the size of the contract code.

So, to be more precise, the protocol defines fixed gas cost parameters for each action, together with a formula to compute the gas cost for the action. All actions today either use a single fixed gas cost or they use a base cost and a linear scaling parameter. With one important exception, FunctionCall, which shall be discussed further below.

There is an entire section on Parameter Definitions that explains how to find the source of truth for parameter values in the nearcore repository, how they can be referenced in code, and what steps are necessary to add a new parameter.

Let us dwell a bit more on the linear scaling factors. The fact that contract deployment cost, which includes code compilation, scales linearly limits the compiler to use only algorithms of linear complexity. Either that, or the parameters must be set to match the 1ms = 1Tgas rule at the largest possible contract size. Today, we limit ourselves to linear-time algorithms in the compiler.

Likewise, an action that has no scaling parameters must only use constant time to execute. Taking the CreateAccount action as an example, with a cost of 0.1 Tgas, it has to execute within 0.1ms. Technically, the execution time depends ever so slightly on the account name length. But there is a fairly low upper limit on that length and it makes sense to absorb all the cost in the constant base cost.

This concept of picking parameters according to algorithmic complexity is key. If you understand this, you know how to think about gas as a nearcore developer. This should be enough background to understand what the estimator does.

The runtime parameter estimator is a separate binary within the nearcore repository. It contains benchmarking-like code used to validate existing parameter values against the 1ms = 1 Tgas rule. When implementing new features, code should be added there to estimate the safe values of the new parameters. This section is for you if you are adding new features such as a new pre-compiled method or other host functions.

Next up are more details on the specific costs that occur when executing NEAR transactions, which help to understand existing parameters and how they are organized.

Action Costs

Actions are executed in two steps. First, an action is verified and inserted to an action receipt, which is sent to the receiver of the action. The send fee is paid for this. It is charged either in fn process_transaction(..) if the action is part of a fresh transaction, or inside logic.rs through fn pay_action_base(..) if the action is generated by a function call. The send fee is meant to cover the cost to validate an action and transmit it over the network.

The second step is action execution. It is charged in fn apply_action(..). The execution cost has to cover everything required to apply the action to the blockchain's state.

These two steps are done on the same shard for local receipts. Local receipts are defined as those where the sender account is also the receiver, abbreviated as sir which stands for "sender is receiver".

For remote receipts, which is any receipt where the sender and receiver accounts are different, we charge a different fee since sending between shards is extra work. Notably, we charge that extra work even if the accounts are on the same shard. In terms of gas costs, each account is conceptually its own shard. This makes dynamic resharding possible without user-observable impact.

When the send step is performed, the minimum required gas to start execution of that action is known. Thus, if the receipt does not have enough gas, it can be aborted instead of forwarding it. Here we have to introduce the concept of used gas.

gas_used is different from gas_burnt. The former includes the gas that needs to be reserved for the execution step whereas the latter only includes the gas that has been burnt in the current chunk. The difference between the two is sometimes also called prepaid gas, as this amount of gas is paid for during the send step and it is available in the execution step for free.

If execution fails, the prepaid cost that has not been burned will be refunded. But this is not the reason why it must burn on the receiver shard instead of the sender shard. The reason is that we want to properly compute the gas limits on the chunk that does the execution work.

In conclusion, each action parameter is split into three costs, send_sir, send_not_sir, and execution. Local receipts charge the first and last parameters, remote receipts charge the second and third. They should be estimated, defined, and charged separately. But the reality is that today almost all actions are estimated as a whole and the parameters are split 50/50 between send and execution cost, without discrimination on local vs remote receipts i.e. send_sir cost is the same as send_not_sir.

The Gas Profile section goes into more details on how gas costs of a transaction are tracked in nearcore.

Dynamic Function Call Costs

Costs that occur while executing a function call on a deployed WASM app (a.k.a. smart contract) are charged only at the receiver. Thus, they have only one value to define them, in contrast to action costs.

The most fundamental dynamic gas cost is wasm_regular_op_cost. It is multiplied by the exact number of WASM operations executed. You can read about Gas Instrumentation if you are curious how we count WASM ops.

Currently, all operations are charged the same, although it could be more efficient to charge less for opcodes like i32.add compared to f64.sqrt.

The remaining dynamic costs are for work done during host function calls. Each host function charges a base cost. Either the general wasm_base cost, or a specific cost such as wasm_utf8_decoding_base, or sometimes both. New host function calls should define a separate base cost and not charge wasm_base.

Additional host-side costs can be scaled per input byte, such as wasm_sha256_byte, or costs related to moving data between host and guest, or any other cost that is specific to the host function. Each host function must clearly define what its costs are and how they depend on the input.

Non-gas parameters

Not all runtime parameters are directly related to gas costs. Here is a brief overview.

  • Gas economics config: Defines the conversion rate when purchasing gas with NEAR tokens and how gas rewards are split.
  • Storage usage config: Costs in tokens, not gas, for storing data on chain.
  • Account creation config: Rules for account creation.
  • Smart contract limits: Rules for WASM execution.

None of the above define any gas costs directly. But there can be interplay between those parameters and gas costs. For example, the limits on smart contracts changes the assumptions for how slow a contract compilation could be, hence it affects the deploy action costs.

Parameter Definitions

Gas parameters are a subset of runtime parameters that are defined in runtime_configs/parameters.yaml. IMPORTANT: This is not the final list of parameters, it contains the base values which can be overwritten per protocol version. For example, 53.yaml changes several parameters starting from version 53. You can see the final list of parameters in runtime_configs/parameters.snap. This file is automatically updated whenever any of the parameters changes. To see all parameter values for a specific version, check out the list of JSON snapshots generated in this directory: parameters/src/snapshots.

Using Parameters in Code

As the introduction on this page already hints at it, parameter values are versioned. In other words, they can change if the protocol version changes. A nearcore binary has to support multiple versions and choose the correct parameter value at runtime.

To make this easy, there is RuntimeConfigStore. It contains a sparse map from protocol versions to complete runtime configurations (BTreeMap<ProtocolVersion, Arc<RuntimeConfig>>). The runtime then uses store.get_config(protocol_version) to access a runtime configuration for a specific version.

It is crucial to always use this runtime config store. Never hard-code parameter values. Never look them up in a different way.

In practice, this usually translates to a &RuntimeConfig argument for any function that depends on parameter values. This config object implicitly defines the protocol version. It should therefore not be cached. It should be read from the store once per chunk and then passed down to all functions that need it.

How to Add a New Parameter

First and foremost, if you are feeling lost, open a topic in our Zulip chat (pagoda/contract-runtime). We are here to help.

Principles

Before adding anything, please review the basic principles for gas parameters.

  • A parameter must correspond to a clearly defined workload.
  • When the workload is scalable by a factor N that depends on user input, it will likely require a base parameter and a second parameter that is multiplied by N. (Example: N = number of bytes when reading a value from storage.)
  • Charge gas before executing the workload.
  • Parameters should be independent of specific implementation choices in nearcore.
  • Ideally, contract developers can easily understand what the cost is simply by reading the name in a gas profile.

The section on Gas Profiles explains how to charge gas, please also consider that when defining a new parameter.

Necessary Code Changes

Adding the parameter in code involves several steps.

  1. Define the parameter by adding it to the list in core/primitives/res/runtime_configs/parameters.yaml.
  2. Update the Rust view of parameters by adding a variant to enum Parameter in core/primitives-core/src/parameter.rs. In the same file, update enum FeeParameter if you add an action cost or update ext_costs() if you add a cost inside function calls.
  3. Update RuntimeConfig, the configuration used to reference parameters in code. Depending on the type of parameter, you will need to update RuntimeFeesConfig (for action costs) or ExtCostsConfig (for gas costs).
  4. Update the list used for gas profiles. This is defined by enum Cost in core/primitives-core/src/profile.rs. You need to add a variant to either enum ActionCosts or enum ExtCost. Please also update fn index() that maps each profile entry to a unique position in serialized gas profiles.
  5. The parameter should be available to use in the code section you need it in. Now is a good time to ensure cargo check and cargo test --no-run pass. Most likely you have to update some testing code, such as ExtCostsConfig::test().
  6. To merge your changes into nearcore, you will have to hide your parameter behind a feature flag. Add the feature to the Cargo.toml of each crate touched in steps 3 and 4 and hide the code behind #[cfg(feature = "protocol_feature_MY_NEW_FEATURE")]. Do not hide code in step 2 so that non-nightly builds can still read parameters.yaml. Also, add your feature as a dependency on nightly in core/primitives/Cargo.toml to make sure it gets included when compiling for nightly. After that, check cargo check and cargo test --no-run with and without features=nightly.

What Gas Value Should the Parameter Have?

For a first draft, the exact gas value used in the parameter is not crucial. Make sure the right set of parameters exists and try to set a number that roughly makes sense. This should be enough to enable discussions on the NEP around the feasibility and usefulness of the proposed feature. If you are not sure, a good rule of thumb is 0.1 Tgas for each disk operation and at least 1 Tgas for each ms of CPU time. Then round it up generously.

The value will have to be refined later. This is usually the last step after the implementation is complete and reviewed. Have a look at the section on estimating gas parameters in the book.

Gas Profile

What if you want to understand the exact gas spending of a smart contract call? It would be very complicated to predict exactly how much gas executing a piece of WASM code will require, including all host function calls and actions. An easier approach is to just run the code on testnet and see how much gas it burns. Gas profiles allow one to dig deeper and understand the breakdown of the gas costs per parameter.

Gas profiles are not very reliable, in that they are often incomplete and the details of how they are computed can change without a protocol version bump.

Example Transaction Gas Profile

You can query the gas profile of a transaction with NEAR CLI.

NEAR_ENV=mainnet near tx-status 8vYxsqYp5Kkfe8j9LsTqZRsEupNkAs1WvgcGcUE4MUUw  \
  --accountId app.nearcrowd.near  \
  --nodeUrl https://archival-rpc.mainnet.near.org  # Allows to retrieve older transactions.
Transaction app.nearcrowd.near:8vYxsqYp5Kkfe8j9LsTqZRsEupNkAs1WvgcGcUE4MUUw
{
  receipts_outcome: [
    {
      block_hash: '2UVQKpxH6PhEqiKr6zMggqux4hwMrqqjpsbKrJG3vFXW',
      id: '14bwmJF21PXY9YWGYN1jpjF3BRuyCKzgVWfhXhZBKH4u',
      outcome: {
        executor_id: 'app.nearcrowd.near',
        gas_burnt: 5302170867180,
        logs: [],
        metadata: {
          gas_profile: [
            {
              cost: 'BASE',
              cost_category: 'WASM_HOST_COST',
              gas_used: '15091782327'
            },
            {
              cost: 'CONTRACT_LOADING_BASE',
              cost_category: 'WASM_HOST_COST',
              gas_used: '35445963'
            },
            {
              cost: 'CONTRACT_LOADING_BYTES',
              cost_category: 'WASM_HOST_COST',
              gas_used: '117474381750'
            },
            {
              cost: 'READ_CACHED_TRIE_NODE',
              cost_category: 'WASM_HOST_COST',
              gas_used: '615600000000'
            },
            # ...
            # skipping entries for presentation brevity
            # ...
            {
              cost: 'WRITE_REGISTER_BASE',
              cost_category: 'WASM_HOST_COST',
              gas_used: '48713882262'
            },
            {
              cost: 'WRITE_REGISTER_BYTE',
              cost_category: 'WASM_HOST_COST',
              gas_used: '4797573768'
            }
          ],
          version: 2
        },
        receipt_ids: [ '46Qsorkr6hy36ZzWmjPkjbgG28ko1iwz1NT25gvia51G' ],
        status: { SuccessValue: 'ZmFsc2U=' },
        tokens_burnt: '530217086718000000000'
      },
      proof: [ ... ]
    },
    { ... }
  ],
  status: { SuccessValue: 'ZmFsc2U=' },
  transaction: { ... },
  transaction_outcome: {
    block_hash: '7MgTTVi3aMG9LiGV8ezrNvoorUwQ7TwkJ4Wkbk3Fq5Uq',
    id: '8vYxsqYp5Kkfe8j9LsTqZRsEupNkAs1WvgcGcUE4MUUw',
    outcome: {
      executor_id: 'evgeniya.near',
      gas_burnt: 2428068571644,
      ...
      tokens_burnt: '242806857164400000000'
    },
  }
}

The gas profile is in receipts_outcome.outcome.metadata.gas_profile. It shows gas costs per parameter and with associated categories such as WASM_HOST_COST or ACTION_COST. In the example, all costs are of the former category, which is gas expended on smart contract execution. The latter is for gas spent on actions.

To be complete, the output above should also have a gas profile entry for the function call action. But currently this is not included since gas profiles only work properly on function call receipts. Improving this is planned, see nearcore#8261.

The tx-status query returns one gas profile for each receipt. The output above contains a single gas profile because the transaction only spawned one receipt. If there was a chain of cross contract calls, there would be multiple profiles.

Besides receipts, also note the transaction_outcome in the output. It contains the gas cost for converting the transaction into a receipt. To calculate the full gas cost, add up the transaction cost with all receipt costs.

The transaction outcome currently does not have a gas profile, it only shows the total gas spent converting the transaction. Arguably, it is not necessary to provide the gas profile since the costs only depend on the list of actions. With sufficient understanding of the protocol, one could reverse-engineer the exact breakdown simply by looking at the action list. But adding the profile would still make sense to make it easier to understand.

Gas Profile Versions

Depending on the version in receipts_outcome.outcome.metadata.version, you should expect a different format of the gas profile. Version 1 has no profile data at all. Version 2 has a detailed profile but some parameters are conflated, so you cannot extract the exact gas spending in some cases. Version 3 will have the cost exactly per parameter.

Which version of the profile an RPC node returns depends on the version it had when it first processed the transaction. The profiles are stored in the database with one version and never updated. Therefore, older transactions will usually only have old profiles. However, one could replay the chain from genesis with a new nearcore client and generate the newest profile for all transactions in this way.

Note: Due to bugs, some nodes will claim they send version 1 but actually send version 2. (Did I mention that profiles are unreliable?)

How Gas Profiles are Created

The transaction runtime charges gas in various places around the code. ActionResult keeps a summary of all costs for an action. The gas_burnt and gas_used fields track the total gas burned and reserved for spawned receipts. These two fields are crucial for the protocol to function correctly, as they are used to determine when execution runs out of gas.

Additionally, ActionResult also has a profile field which keeps a detailed breakdown of the gas spending per parameter. Profiles are not stored on chain but RPC nodes and archival nodes keep them in their databases. This is mostly a debug tool and has no direct impact on the correct functioning of the protocol.

Charging Gas

Generally speaking, gas is charged right before the computation that it pays for is executed. It has to be before to avoid cheap resource exhaustion attacks. Imagine the user has only 1 gas unit left, but if we start executing an expensive step, we would waste a significant duration of computation on all validators without anyone paying for it.

When charging gas for an action, the ActionResult can be updated directly. But when charging WASM costs, it would be too slow to do a context switch each time, Therefore, a fast gas counter exists that can be updated from within the VM. (See gas_counter.rs) At the end of a function call execution, the gas counter is read by the host and merged into the ActionResult.

Runtime Parameter Estimator

The runtime parameter estimator is a byzantine benchmarking suite. Byzantine benchmarking is not a commonly used term but I feel it describes it quite well. It measures the performance assuming that up to a third of validators and all users collude to make the system as slow as possible.

This benchmarking suite is used to check that the gas parameters defined in the protocol are correct. Correct in this context means, a chunk filled with 1 Pgas (Peta gas) will take at most 1 second to be applied. Or more generally, per 1 Tgas of execution, we spend no more than 1ms wall-clock time.

For now, nearcore timing is the only one that matters. Things will become more complicated once there are multiple client implementations. But knowing that nearcore can serve requests fast enough proves that it is possible to be at least as fast. However, we should be careful not to couple costs too tightly with the specific implementation of nearcore to allow for innovation in new clients.

The estimator code is part of the nearcore repository in the directory runtime/runtime-params-estimator.

For a practical guide on how to run the estimator, please take a look at Running the Estimator in the workflows chapter.

Code Structure

The estimator contains a binary and a library module. The main.rs contains the CLI arguments parsing code and logic to fill the test database.

The interesting code lives in lib.rs and its submodules. The comments at the top of that file provide a high-level overview of how estimations work. More details on specific estimations are available as comments on the enum variants of Cost in costs.rs.

If you roughly understand the three files above, you already have a great overview of the estimator. estimator_context.rs is another central file. A full estimation run creates a single EstimatorContext. Each estimation will use it to spawn a new Testbed with a fresh database that contains the same data as the setup in the estimator context.

Most estimations fill blocks with transactions to be executed and hand them to Testbed::measure_blocks. To allow for easy repetitions, the block is usually filled by an instance of the TransactionBuilder, which can be retrieved from a testbed.

But even filling blocks with transactions becomes repetitive since many parameters are estimated similarly. utils.rs has a collection of helpful functions that let you write estimations very quickly.

Estimation Metrics

The estimation code is generally not concerned with the metric used to estimate gas. We use let clock = GasCost::measure(); and clock.elapsed() to measure the cost in whatever metric has been specified in the CLI argument --metric. But when you run estimations and especially when you want to interpret the results, you want to understand the metric used. Available metrics are time and icount.

Starting with time, this is a simple wall-clock time measurement. At the end of the day, this is what counts in a validator setup. But unfortunately, this metric is very dependent on the specific hardware and what else is running on that hardware right now. Dynamic voltage and frequency scaling (DVFS) also plays a role here. To a certain degree, all these factors can be controlled. But it requires full control over a system (often not the case when running on cloud-hosted VMs) and manual labor to set it up.

The other supported metric icount is much more stable. It uses qemu to emulate an x86 CPU. We then insert a custom TCG plugin (counter.c) that counts the number of executed x86 instructions. It also intercepts system calls and counts the number of bytes seen in sys_read, sys_write and their variations. This gives an approximation for IO bytes, as seen on the interface between the operating system and nearcore. To convert to gas, we use three constants to multiply with instruction count, read bytes, and write bytes.

We run qemu inside a Docker container using the Podman runtime, to make sure the qemu and qemu plugin versions match with system libraries. Make sure to add --containerize when running with --metric icount.

The great thing about icount is that you can run it on different machines and it will always return the same result. It is not 100% deterministic but very close, so it can usually detect code changes that degrade performance in major ways.

The problem with icount is how unrepresentative it is for real-life performance. First, x86 instructions are not all equally complex. Second, how many of them are executed per cycle depends on instruction level pipelining, branch prediction, memory prefetching, and more CPU features like that which are just not captured by an emulator like qemu. Third, the time it takes to serve bytes in system calls depends less on the sum of all bytes and more on data locality and how it can be cached in the OS page cache. But regardless of all these inaccuracies, it can still be useful to compare different implementations both measured using icount.

From Estimations to Parameter Values

To calculate the final gas parameter values, there is more to be done than just running a single command. After all, these parameters are part of the protocol specification. They cannot be changed easily. And setting them to a wrong value can cause severe system instability.

Our current strategy is to run estimations with two different metrics and do so on standardized cloud hardware. The output is then sanity checked manually by several people. Based on that, the final gas parameter value is determined. Usually, it will be the higher output of the two metrics rounded up.

The PR #8031 to set the ed25519 verification gas parameters is a good example of how such an analysis and report could look like.

More details on the process will be added to this document in due time.

Overview

This chapter describes various development processes and best practices employed at nearcore.

Rust 🦀

This short chapter collects various useful general resources about the Rust programming language. If you are already familiar with Rust, skip this chapter. Otherwise, this chapter is for you!

Getting Help

Rust community actively encourages beginners to ask questions, take advantage of that!

We have a dedicated stream for Rust questions on our Zulip: Rust 🦀.

There's a general Rust forum at https://users.rust-lang.org.

For a more interactive chat, take a look at Discord: https://discord.com/invite/rust-lang.

Reference Material

Rust is very well documented. It's possible to learn the whole language and most of the idioms by just reading the official docs. Starting points are

Alternatives are:

Rust has some great tooling, which is also documented:

  • Cargo, the build system. Worth at least skimming through!
  • For IDE support, see IntelliJ Rust if you like JetBrains products or rust-analyzer if you use any other editor (fun fact: NEAR was one of the sponsors of rust-analyzer!).
  • Rustup manages versions of Rust itself. It's unobtrusive, so feel free to skip this.

Cheat Sheet

This is a thing in its category, do check it out:

https://cheats.rs

Language Mastery

  • Rust for Rustaceans — the book to read after "The Book".
  • Tokio docs explain asynchronous programming in Rust (async/await).
  • Rust API Guidelines codify rules for idiomatic Rust APIs. Note that guidelines apply to semver surface of libraries, and most of the code in nearcore is not on the semver boundary. Still, a lot of insight there!
  • Rustonomicon explains unsafe. (any resemblance to https://nomicon.io is purely coincidental)

Selected Blog Posts

A lot of finer knowledge is hidden away in various dusty corners of Web-2.0. Here are some favorites:

And on the easiest topic of error handling specifically:

Finally, as a dessert, the first rust slide deck: http://venge.net/graydon/talks/rust-2012.pdf.

Workflows

This chapter documents various ways you can run neard during development: running a local net, joining a test net, doing benchmarking and load testing.

Run a Node

This chapter focuses on the basics of running a node you've just built from source. It tries to explain how the thing works under the hood and pays relatively little attention to the various shortcuts we have.

Building the Node

Start with the following command:

$ cargo run --profile dev-release -p neard -- --help

This command builds neard and asks it to show --help. Building neard takes a while, take a look at Fast Builds chapter to learn how to speed it up.

Let's dissect the command:

  • cargo run asks Cargo, the package manager/build tool, to run our application. If you don't have cargo, install it via https://rustup.rs
  • --profile dev-release is our custom profile to build a somewhat optimized version of the code. The default debug profile is faster to compile, but produces a node that is too slow to participate in a real network. The --release profile produces a fully optimized node, but that's very slow to compile. So --dev-release is a sweet spot for us! However, never use it for actual production nodes.
  • -p neard asks to build the neard package. We use cargo workspaces to organize our code. The neard package in the top-level /neard directory is the final binary that ties everything together.
  • -- tells cargo to pass the rest of the arguments through to neard.
  • --help instructs neard to list available CLI arguments and subcommands.

Note: Building neard might fail with an openssl or CC error. This means that you lack some non-rust dependencies we use (openssl and rocksdb mainly). We currently don't have docs on how to install those, but (basically) you want to sudo apt install (or whichever distro/package manager you use) missing bits.

Preparing Tiny Network

Typically, you want neard to connect to some network, like mainnet or testnet. We'll get there in time, but we'll start small. For the current chapter, we will run a network consisting of just a single node -- our own.

The first step there is creating the required configuration. Run the init command to create config files:

$ cargo run --profile dev-release -p neard -- init
INFO neard: version="trunk" build="1.1.0-3091-ga8964d200-modified" latest_protocol=57
INFO near: Using key ed25519:B41GMfqE2jWHVwrPLbD7YmjZxxeQE9WA9Ua2jffP5dVQ for test.near
INFO near: Using key ed25519:34d4aFJEmc2A96UXMa9kQCF8g2EfzZG9gCkBAPcsVZaz for node
INFO near: Generated node key, validator key, genesis file in ~/.near

As the log output says, we are just generating some things in ~/.near. Let's take a look:

$ ls ~/.near
config.json
genesis.json
node_key.json
validator_key.json

The most interesting file here is perhaps genesis.json -- it specifies the initial state of our blockchain. There are a bunch of hugely important fields there, which we'll ignore here. The part we'll look at is the .records, which contains the actual initial data:

$ cat ~/.near/genesis.json | jq '.records'
[
  {
    "Account": {
      "account_id": "test.near",
      "account": {
        "amount": "1000000000000000000000000000000000",
        "locked": "50000000000000000000000000000000",
        "code_hash": "11111111111111111111111111111111",
        "storage_usage": 0,
        "version": "V1"
      }
    }
  },
  {
    "AccessKey": {
      "account_id": "test.near",
      "public_key": "ed25519:B41GMfqE2jWHVwrPLbD7YmjZxxeQE9WA9Ua2jffP5dVQ",
      "access_key": {
        "nonce": 0,
        "permission": "FullAccess"
      }
    }
  },
  {
    "Account": {
      "account_id": "near",
      "account": {
        "amount": "1000000000000000000000000000000000",
        "locked": "0",
        "code_hash": "11111111111111111111111111111111",
        "storage_usage": 0,
        "version": "V1"
      }
    }
  },
  {
    "AccessKey": {
      "account_id": "near",
      "public_key": "ed25519:546XB2oHhj7PzUKHiH9Xve3Ze5q1JiW2WTh6abXFED3c",
      "access_key": {
        "nonce": 0,
        "permission": "FullAccess"
      }
    }
  }

(I am using the jq utility here)

We see that we have two accounts here, and we also see their public keys (but not the private ones).

One of these accounts is a validator:

$ cat ~/.near/genesis.json | jq '.validators'
[
  {
    "account_id": "test.near",
    "public_key": "ed25519:B41GMfqE2jWHVwrPLbD7YmjZxxeQE9WA9Ua2jffP5dVQ",
    "amount": "50000000000000000000000000000000"
  }
]

Now, if we

$ cat ~/.near/validator_key.json

we'll see

{
  "account_id": "test.near",
  "public_key": "ed25519:B41GMfqE2jWHVwrPLbD7YmjZxxeQE9WA9Ua2jffP5dVQ",
  "secret_key": "ed25519:3x2dUQgBoEqNvKwPjfDE8zDVJgM8ysqb641PYHV28mGPu61WWv332p8keMDKHUEdf7GVBm4f6z4D1XRgBxnGPd7L"
}

That is, we have a secret key for the sole validator in our network, how convenient.

To recap, neard init without arguments creates a config for a new network that starts with a single validator, for which we have the keys.

You might be wondering what ~/.near/node_key.json is. That's not too important, but, in our network, there's no 1-1 correspondence between machines participating in the peer-to-peer network and accounts on the blockchain. So the node_key specifies the keypair we'll use when signing network packets. These packets internally will contain messages signed with the validator's key, and these internal messages will drive the evolution of the blockchain state.

Finally, ~/.near/config.json contains various configs for the node itself. These are configs that don't affect the rules guiding the evolution of the blockchain state, but rather things like timeouts, database settings and such.

The only field we'll look at is boot_nodes:

$ cat ~/.near/config.json | jq '.network.boot_nodes'
""

It's empty! The boot_nodes specify IPs of the initial nodes our node will try to connect to on startup. As we are looking into running a single-node network, we want to leave it empty. But, if you would like to connect to mainnet, you'd have to set this to some nodes from the mainnet you already know. You'd also have to ensure that you use the same genesis as the mainnet though -- if the node tries to connect to a network with a different genesis, it is rejected.

Running the Network

Finally,

$ cargo run --profile dev-release -p neard -- run
INFO neard: version="trunk" build="1.1.0-3091-ga8964d200-modified" latest_protocol=57
INFO near: Creating a new RocksDB database path=/home/matklad/.near/data
INFO db: Created a new RocksDB instance. num_instances=1
INFO stats: #       0 4xecSHqTKx2q8JNQNapVEi5jxzewjxAnVFhMd4v5LqNh Validator | 1 validator 0 peers ⬇ 0 B/s ⬆ 0 B/s NaN bps 0 gas/s CPU: 0%, Mem: 50.8 MB
INFO near_chain::doomslug: ready to produce block @ 1, has enough approvals for 59.907Âľs, has enough chunks
INFO near_chain::doomslug: ready to produce block @ 2, has enough approvals for 40.732Âľs, has enough chunks
INFO near_chain::doomslug: ready to produce block @ 3, has enough approvals for 65.341Âľs, has enough chunks
INFO near_chain::doomslug: ready to produce block @ 4, has enough approvals for 51.916Âľs, has enough chunks
INFO near_chain::doomslug: ready to produce block @ 5, has enough approvals for 37.155Âľs, has enough chunks
...

🎉 it's alive!

So, what's going on here?

Our node is running a single-node network. As the network only has a single validator, and the node has the keys for the validator, the node can produce blocks by itself. Note the increasing @ 1, @ 2, ... numbers. That means that our network grows.

Let's stop the node with ^C and look around

INFO near_chain::doomslug: ready to produce block @ 42, has enough approvals for 56.759Âľs, has enough chunks
^C WARN neard: SIGINT, stopping... this may take a few minutes.
INFO neard: Waiting for RocksDB to gracefully shutdown
INFO db: Waiting for remaining RocksDB instances to shut down num_instances=1
INFO db: All RocksDB instances shut down
$

The main change now is that we have a ~/.near/data directory which holds the state of the network in various rocksdb tables:

$ ls ~/.near/data
 000004.log
 CURRENT
 IDENTITY
 LOCK
 LOG
 MANIFEST-000005
 OPTIONS-000107
 OPTIONS-000109

It doesn't matter what those are, "rocksdb stuff" is a fine level of understanding here. The important bit here is that the node remembers the state of the network, so, when we restart it, it continues from around the last block:

$ cargo run --profile dev-release -p neard -- run
INFO neard: version="trunk" build="1.1.0-3091-ga8964d200-modified" latest_protocol=57
INFO db: Created a new RocksDB instance. num_instances=1
INFO db: Dropped a RocksDB instance. num_instances=0
INFO near: Opening an existing RocksDB database path=/home/matklad/.near/data
INFO db: Created a new RocksDB instance. num_instances=1
INFO stats: #       5 Cfba39eH7cyNfKn9GoKTyRg8YrhoY1nQxQs66tLBYwRH Validator | 1 validator 0 peers ⬇ 0 B/s ⬆ 0 B/s NaN bps 0 gas/s CPU: 0%, Mem: 49.4 MB
INFO near_chain::doomslug: not ready to produce block @ 43, need to wait 366.58789ms, has enough approvals for 78.776Âľs
INFO near_chain::doomslug: not ready to produce block @ 43, need to wait 265.547148ms, has enough approvals for 101.119518ms
INFO near_chain::doomslug: not ready to produce block @ 43, need to wait 164.509153ms, has enough approvals for 202.157513ms
INFO near_chain::doomslug: not ready to produce block @ 43, need to wait 63.176926ms, has enough approvals for 303.48974ms
INFO near_chain::doomslug: ready to produce block @ 43, has enough approvals for 404.41498ms, does not have enough chunks
INFO near_chain::doomslug: ready to produce block @ 44, has enough approvals for 50.07Âľs, has enough chunks
INFO near_chain::doomslug: ready to produce block @ 45, has enough approvals for 45.093Âľs, has enough chunks

Interacting With the Node

Ok, now our node is running, let's poke it! The node exposes a JSON RPC interface which can be used to interact with the node itself (to, e.g., do a health check) or with the blockchain (to query information about the blockchain state or to submit a transaction).

$ http get http://localhost:3030/status
HTTP/1.1 200 OK
access-control-allow-credentials: true
access-control-expose-headers: accept-encoding, accept, connection, host, user-agent
content-length: 1010
content-type: application/json
date: Tue, 15 Nov 2022 13:58:13 GMT
vary: Origin, Access-Control-Request-Method, Access-Control-Request-Headers

{
    "chain_id": "test-chain-rR8Ct",
    "latest_protocol_version": 57,
    "node_key": "ed25519:71QRP9qKcYRUYXTLNnrmRc1NZSdBaBo9nKZ88DK5USNf",
    "node_public_key": "ed25519:5A5QHyLayA9zksJZGBzveTgBRecpsVS4ohuxujMAFLLa",
    "protocol_version": 57,
    "rpc_addr": "0.0.0.0:3030",
    "sync_info": {
        "earliest_block_hash": "6gJLCnThQENYFbnFQeqQvFvRsTS5w87bf3xf8WN1CMUX",
        "earliest_block_height": 0,
        "earliest_block_time": "2022-11-15T13:45:53.062613669Z",
        "epoch_id": "6gJLCnThQENYFbnFQeqQvFvRsTS5w87bf3xf8WN1CMUX",
        "epoch_start_height": 501,
        "latest_block_hash": "9JC9o3rZrDLubNxVr91qMYvaDiumzwtQybj1ZZR9dhbK",
        "latest_block_height": 952,
        "latest_block_time": "2022-11-15T13:58:13.185721125Z",
        "latest_state_root": "9kEYQtWczrdzKCCuFzPDX3Vtar1pFPXMdLU5HJyF8Ght",
        "syncing": false
    },
    "uptime_sec": 570,
    "validator_account_id": "test.near",
    "validator_public_key": "ed25519:71QRP9qKcYRUYXTLNnrmRc1NZSdBaBo9nKZ88DK5USNf",
    "validators": [
        {
            "account_id": "test.near",
            "is_slashed": false
        }
    ],
    "version": {
        "build": "1.1.0-3091-ga8964d200-modified",
        "rustc_version": "1.65.0",
        "version": "trunk"
    }
}

(I am using HTTPie here)

Note how "latest_block_height": 952 corresponds to @ 952 we see in the logs.

Let's query the blockchain state:

$ http post http://localhost:3030/ method=query jsonrpc=2.0 id=1 \
     params:='{"request_type": "view_account", "finality": "final", "account_id": "test.near"}'
Îť http post http://localhost:3030/ method=query jsonrpc=2.0 id=1 \
           params:='{"request_type": "view_account", "finality": "final", "account_id": "test.near"}'

HTTP/1.1 200 OK
access-control-allow-credentials: true
access-control-expose-headers: content-length, accept, connection, user-agent, accept-encoding, content-type, host
content-length: 294
content-type: application/json
date: Tue, 15 Nov 2022 14:04:54 GMT
vary: Origin, Access-Control-Request-Method, Access-Control-Request-Headers

{
    "id": "1",
    "jsonrpc": "2.0",
    "result": {
        "amount": "1000000000000000000000000000000000",
        "block_hash": "Hn4v5CpfWf141AJi166gdDK3e3khCxgfeDJ9dSXGpAVi",
        "block_height": 1611,
        "code_hash": "11111111111111111111111111111111",
        "locked": "50003138579594550524246699058859",
        "storage_paid_at": 0,
        "storage_usage": 182
    }
}

Note how we use an HTTP post method when we interact with the blockchain RPC. The full set of RPC endpoints is documented at

https://docs.near.org/api/rpc/introduction

Sending Transactions

Transactions are submitted via RPC as well. Submitting a transaction manually with http is going to be cumbersome though — transactions are borsh encoded to bytes, then signed, then encoded in base64 for JSON.

So we will use the official NEAR CLI utility.

Install it via npm:

$ npm install -g near-cli
$ near -h
Usage: near <command> [options]

Commands:
  near create-account <accountId>    create a new developer account
....

Note that, although you install near-cli, the name of the utility is near.

As a first step, let's redo the view_account call we did with raw httpie with near-cli:

$ NEAR_ENV=local near state test.near
Loaded master account test.near key from ~/.near/validator_key.json with public key = ed25519:71QRP9qKcYRUYXTLNnrmRc1NZSdBaBo9nKZ88DK5USNf
Account test.near
{
  amount: '1000000000000000000000000000000000',
  block_hash: 'ESGN7H1kVLp566CTQ9zkBocooUFWNMhjKwqHg4uCh2Sg',
  block_height: 2110,
  code_hash: '11111111111111111111111111111111',
  locked: '50005124762657986708532525400812',
  storage_paid_at: 0,
  storage_usage: 182,
  formattedAmount: '1,000,000,000'
}

NEAR_ENV=local tells near-cli to use our local network, rather than the mainnet.

Now, let's create a couple of accounts and send tokes between them:

$ NEAR_ENV=local near create-account alice.test.near --masterAccount test.near
NOTE: In most cases, when connected to network "local", masterAccount will end in ".node0"
Loaded master account test.near key from /home/matklad/.near/validator_key.json with public key = ed25519:71QRP9qKcYRUYXTLNnrmRc1NZSdBaBo9nKZ88DK5USNf
Saving key to 'undefined/local/alice.test.near.json'
Account alice.test.near for network "local" was created.

$ NEAR_ENV=local near create-account bob.test.near --masterAccount test.near
NOTE: In most cases, when connected to network "local", masterAccount will end in ".node0"
Loaded master account test.near key from /home/matklad/.near/validator_key.json with public key = ed25519:71QRP9qKcYRUYXTLNnrmRc1NZSdBaBo9nKZ88DK5USNf
Saving key to 'undefined/local/bob.test.near.json'
Account bob.test.near for network "local" was created.

$ NEAR_ENV=local near send alice.test.near bob.test.near 10
Sending 10 NEAR to bob.test.near from alice.test.near
Loaded master account test.near key from /home/matklad/.near/validator_key.json with public key = ed25519:71QRP9qKcYRUYXTLNnrmRc1NZSdBaBo9nKZ88DK5USNf
Transaction Id BBPndo6gR4X8pzoDK7UQfoUXp5J8WDxkf8Sq75tK5FFT
To see the transaction in the transaction explorer, please open this url in your browser
http://localhost:9001/transactions/BBPndo6gR4X8pzoDK7UQfoUXp5J8WDxkf8Sq75tK5FFT

Note: You can export the variable NEAR_ENV in your shell if you are planning to do multiple commands to avoid repetition:

$ export NEAR_ENV=local

NEAR CLI printouts are not always the most useful or accurate, but this seems to work.

Note that near automatically creates keypairs and stores them at .near-credentials:

$ ls ~/.near-credentials/local
  alice.test.near.json
  bob.test.near.json

To verify that this did work, and that near-cli didn't cheat us, let's query the state of accounts manually:

$ http post http://localhost:3030/ method=query jsonrpc=2.0 id=1 \
    params:='{"request_type": "view_account", "finality": "final", "account_id": "alice.test.near"}' \
    | jq '.result.amount'
"89999955363487500000000000"

14:30:52|~
Îť http post http://localhost:3030/ method=query jsonrpc=2.0 id=1 \
    params:='{"request_type": "view_account", "finality": "final", "account_id": "bob.test.near"}' \
    | jq '.result.amount'
"110000000000000000000000000"

Indeed, some amount of tokes was transferred from alice to bob, and then some amount of tokens was deducted to account for transaction fees.

Recap

Great! So we've learned how to run our very own single-node NEAR network using a binary we've built from source. The steps are:

  • Create configs with cargo run --profile dev-release -p neard -- init
  • Run the node with cargo run --profile dev-release -p neard -- run
  • Poke the node with httpie or
  • Install near-cli via npm install -g near-cli
  • Submit transactions via NEAR_ENV=local near create-account ...

In the next chapter, we'll learn how to deploy a simple WASM contract.

Deploy a Contract

In this chapter, we'll learn how to build, deploy, and call a minimal smart contract on our local node.

Preparing Ground

Let's start with creating a fresh local network with an account to which we'll deploy a contract. You might want to re-read how to run a node to understand what's going on here:

$ cargo run --profile dev-release -p neard -- init
$ cargo run --profile dev-release -p neard -- run
$ NEAR_ENV=local near create-account alice.test.near --masterAccount test.near

As a sanity check, querying the state of alice.test.near account should work:

$ NEAR_ENV=local near state alice.test.near
Loaded master account test.near key from /home/matklad/.near/validator_key.json with public key = ed25519:7tU4NtFozPWLotcfhbT9KfBbR3TJHPfKJeCri8Me6jU7
Account alice.test.near
{
  amount: '100000000000000000000000000',
  block_hash: 'EEMiLrk4ZiRzjNJXGdhWPJfKXey667YBnSRoJZicFGy9',
  block_height: 24,
  code_hash: '11111111111111111111111111111111',
  locked: '0',
  storage_paid_at: 0,
  storage_usage: 182,
  formattedAmount: '100'
}

Minimal Contract

NEAR contracts are WebAssembly blobs of bytes. To create a contract, a contract developer typically uses an SDK for some high-level programming language, such as JavaScript, which takes care of producing the right .wasm.

In this guide, we are interested in how things work under the hood, so we'll do everything manually, and implement a contract in Rust without any help from SDKs.

As we are looking for something simple, let's create a contract with a single "method", hello, which returns a "hello world" string. To "define a method", a wasm module should export a function. To "return a value", the contract needs to interact with the environment to say "hey, this is the value I am returning". Such "interactions" are carried through host functions, which are quite a bit like syscalls in traditional operating systems.

The set of host functions that the contract can import is defined in imports.rs.

In this particular case, we need the value_return function:

value_return<[value_len: u64, value_ptr: u64] -> []>

This means that the value_return function takes a pointer to a slice of bytes, the length of the slice, and returns nothing. If the contract calls this function, the slice would be considered a result of the function.

To recap, we want to produce a .wasm file with roughly the following content:

(module
  (import "env" "value_return" (func $value_return (param i64 i64)))
  (func (export "hello") ... ))

Cargo Boilerplate

Armed with this knowledge, we can write Rust code to produce the required WASM. Before we start doing that, some amount of setup code is required.

Let's start with creating a new crate:

$ cargo new hello-near --lib

To compile to wasm, we also need to add a relevant rustup toolchain:

$ rustup toolchain add wasm32-unknown-unknown

Then, we need to tell Cargo that the final artifact we want to get is a WebAssembly module.

This requires the following cryptic spell in Cargo.toml:

# hello-near/Cargo.toml

[lib]
crate-type = ["cdylib"]

Here, we ask Cargo to build a "C dynamic library". When compiling for wasm, that'll give us a .wasm module. This part is a bit confusing, sorry about that :(

Next, as we are aiming for minimalism here, we need to disable optional bits of the Rust runtime. Namely, we want to make our crate no_std (this means that we are not going to use the Rust standard library), set panic=abort as our panic strategy and define a panic handler to abort execution.

# hello-near/Cargo.toml

[package]
name = "hello-near"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["cdylib"]

[profile.release]
panic = "abort"
#![allow(unused)]
fn main() {
// hello-near/src/lib.rs

#![no_std]

#[panic_handler]
fn panic_handler(_info: &core::panic::PanicInfo) -> ! {
    core::arch::wasm32::unreachable()
}
}

At this point, we should be able to compile our code to wasm, and it should be fairly small. Let's do that:

$ cargo b -r --target wasm32-unknown-unknown
   Compiling hello-near v0.1.0 (~/hello-near)
    Finished release [optimized] target(s) in 0.24s
$ ls target/wasm32-unknown-unknown/release/hello_near.wasm
.rwxr-xr-x 106 matklad 15 Nov 15:34 target/wasm32-unknown-unknown/release/hello_near.wasm

106 bytes is pretty small! Let's see what's inside. For that, we'll use the wasm-tools suite of CLI utilities.

$ cargo install wasm-tools
Îť wasm-tools print target/wasm32-unknown-unknown/release/hello_near.wasm
(module
  (memory (;0;) 16)
  (global $__stack_pointer (;0;) (mut i32) i32.const 1048576)
  (global (;1;) i32 i32.const 1048576)
  (global (;2;) i32 i32.const 1048576)
  (export "memory" (memory 0))
  (export "__data_end" (global 1))
  (export "__heap_base" (global 2))
)

Rust Contract

Finally, let's implement an actual contract. We'll need an extern "C" block to declare the value_return import, and a #[unsafe(no_mangle)] extern "C" function to declare the hello export:

#![allow(unused)]
fn main() {
// hello-near/src/lib.rs

#![no_std]

extern "C" {
    fn value_return(len: u64, ptr: u64);
}

#[unsafe(no_mangle)]
pub extern "C" fn hello() {
    let msg = "hello world";
    unsafe { value_return(msg.len() as u64, msg.as_ptr() as u64) }
}

#[panic_handler]
fn panic_handler(_info: &core::panic::PanicInfo) -> ! {
    core::arch::wasm32::unreachable()
}
}

After building the contract, the output wasm shows us that it's roughly what we want:

$ cargo b -r --target wasm32-unknown-unknown
   Compiling hello-near v0.1.0 (/home/matklad/hello-near)
    Finished release [optimized] target(s) in 0.05s
$ wasm-tools print target/wasm32-unknown-unknown/release/hello_near.wasm
(module
  (type (;0;) (func (param i64 i64)))
  (type (;1;) (func))
  (import "env" "value_return"        (; <- Here's our import. ;)
    (func $value_return (;0;) (type 0)))
  (func $hello (;1;) (type 1)
    i64.const 11
    i32.const 1048576
    i64.extend_i32_u
    call $value_return
  )
  (memory (;0;) 17)
  (global $__stack_pointer (;0;) (mut i32) i32.const 1048576)
  (global (;1;) i32 i32.const 1048587)
  (global (;2;) i32 i32.const 1048592)
  (export "memory" (memory 0))
  (export "hello" (func $hello))      (; <- And export! ;)
  (export "__data_end" (global 1))
  (export "__heap_base" (global 2))
  (data $.rodata (;0;) (i32.const 1048576) "hello world")
)

Deploying the Contract

Now that we have the WASM, let's deploy it!

$ NEAR_ENV=local near deploy alice.test.near \
    ./target/wasm32-unknown-unknown/release/hello_near.wasm
Loaded master account test.near key from /home/matklad/.near/validator_key.json with public key = ed25519:ChLD1qYic3G9qKyzgFG3PifrJs49CDYeERGsG58yaSoL
Starting deployment. Account id: alice.test.near, node: http://127.0.0.1:3030, helper: http://localhost:3000, file: ./target/wasm32-unknown-unknown/release/hello_near.wasm
Transaction Id GDbTLUGeVaddhcdrQScVauYvgGXxSssEPGUSUVAhMWw8
To see the transaction in the transaction explorer, please open this url in your browser
http://localhost:9001/transactions/GDbTLUGeVaddhcdrQScVauYvgGXxSssEPGUSUVAhMWw8
Done deploying to alice.test.near

And, finally, let's call our contract:

$ NEAR_ENV=local $near call alice.test.near hello --accountId alice.test.near
Scheduling a call: alice.test.near.hello()
Loaded master account test.near key from /home/matklad/.near/validator_key.json with public key = ed25519:ChLD1qYic3G9qKyzgFG3PifrJs49CDYeERGsG58yaSoL
Doing account.functionCall()
Transaction Id 9WMwmTf6pnFMtj1KBqjJtkKvdFXS4kt3DHnYRnbFpJ9e
To see the transaction in the transaction explorer, please open this url in your browser
http://localhost:9001/transactions/9WMwmTf6pnFMtj1KBqjJtkKvdFXS4kt3DHnYRnbFpJ9e
'hello world'

Note that we pass alice.test.near twice: the first time to specify which contract we are calling, the second time to determine who calls the contract. That is, the second account is the one that spends tokens. In the following example bob spends NEAR to call the contact deployed to the alice account:

$ NEAR_ENV=local $near call alice.test.near hello --accountId bob.test.near
Scheduling a call: alice.test.near.hello()
Loaded master account test.near key from /home/matklad/.near/validator_key.json with public key = ed25519:ChLD1qYic3G9qKyzgFG3PifrJs49CDYeERGsG58yaSoL
Doing account.functionCall()
Transaction Id 4vQKtP6zmcR4Xaebw8NLF6L5YS96gt5mCxc5BUqUcC41
To see the transaction in the transaction explorer, please open this url in your browser
http://localhost:9001/transactions/4vQKtP6zmcR4Xaebw8NLF6L5YS96gt5mCxc5BUqUcC41
'hello world'

Running the Estimator

This workflow describes how to run the gas estimator byzantine-benchmark suite. To learn about its background and purpose, refer to Runtime Parameter Estimator in the architecture chapter.

Type this in your console to quickly run estimations on a couple of action costs.

cargo run -p runtime-params-estimator --features required -- \
    --accounts-num 20000 --additional-accounts-num 20000 \
    --iters 3 --warmup-iters 1 --metric time \
    --costs=ActionReceiptCreation,ActionTransfer,ActionCreateAccount,ActionFunctionCallBase

You should get an output like this.

[elapsed 00:00:17 remaining 00:00:00] Writing into storage ████████████████████   20000/20000
ActionReceiptCreation         4_499_673_502_000 gas [  4.499674ms]    (computed in 7.22s)
ActionTransfer                  410_122_090_000 gas [   410.122Âľs]    (computed in 4.71s)
ActionCreateAccount             237_495_890_000 gas [   237.496Âľs]    (computed in 4.64s)
ActionFunctionCallBase          770_989_128_914 gas [   770.989Âľs]    (computed in 4.65s)


Finished in 40.11s, output saved to:

    /home/you/near/nearcore/costs-2022-11-11T11:11:11Z-e40863c9b.txt

This shows how much gas a parameter should cost to satisfy the 1ms = 1Tgas rule. It also shows how much time that corresponds to and how long it took to compute each of the estimations.

Note that the above does not produce very accurate results and it can have high variance as well. It runs an unoptimized binary, the state is small, and the metric used is wall-clock time which is always prone to variance in hardware and can be affected by other processes currently running on your system.

Once your estimation code is ready, it is better to run it with a larger state and an optimized binary.

cargo run --release -p runtime-params-estimator --features required -- \
    --accounts-num 20000 --additional-accounts-num 2000000 \
    --iters 3 --warmup-iters 1 --metric time \
    --costs=ActionReceiptCreation,ActionTransfer,ActionCreateAccount,ActionFunctionCallBase

You might also want to run a hardware-agnostic estimation using the following command. It uses podman and qemu under the hood, so it will be quite a bit slower. You will need to install podman to run this command.

cargo run --release -p runtime-params-estimator --features required -- \
    --accounts-num 20000 --additional-accounts-num 2000000 \
    --iters 3 --warmup-iters 1 --metric icount --containerize \
    --costs=ActionReceiptCreation,ActionTransfer,ActionCreateAccount,ActionFunctionCallBase

Note how the output looks a bit different now. The i, r and w values show instruction count, read IO bytes, and write IO bytes respectively. The IO byte count is known to be inaccurate.

+ /host/nearcore/runtime/runtime-params-estimator/emu-cost/counter_plugin/qemu-x86_64 -plugin file=/host/nearcore/runtime/runtime-params-estimator/emu-cost/counter_plugin/libcounter.so -cpu Haswell-v4 /host/nearcore/target/release/runtime-params-estimator --home /.near --accounts-num 20000 --iters 3 --warmup-iters 1 --metric icount --costs=ActionReceiptCreation,ActionTransfer,ActionCreateAccount,ActionFunctionCallBase --skip-build-test-contract --additional-accounts-num 0 --in-memory-db
ActionReceiptCreation         214_581_685_500 gas [  1716653.48i 0.00r 0.00w]     (computed in 6.11s)
ActionTransfer                 21_528_212_916 gas [   172225.70i 0.00r 0.00w]     (computed in 4.71s)
ActionCreateAccount            26_608_336_250 gas [   212866.69i 0.00r 0.00w]     (computed in 4.67s)
ActionFunctionCallBase         12_193_364_898 gas [    97546.92i 0.00r 0.00w]     (computed in 2.39s)


Finished in 17.92s, output saved to:

    /host/nearcore/costs-2022-11-01T16:27:36Z-e40863c9b.txt

The difference between the metrics is discussed in the Estimation Metrics chapter.

You should now be all set up for running estimations on your local machine. Also, check cargo run -p runtime-params-estimator --features required -- --help for the list of available options.

Running near localnet on 2 machines

Quick instructions on how to run a localnet on 2 separate machines.

Setup

  • Machine1: "pc" - 192.168.0.1
  • Machine2: "laptop" - 192.168.0.2

Run on both machines (make sure that they are using the same version of the code):

cargo build -p neard

Then on machine1 run the command below, which will generate the configurations:

./target/debug/neard --home ~/.near/localnet_multi localnet --shards 3 --v 2

This command has generated configuration for 3 shards and 2 validators (in directories ~/.near/localnet_multi/node0 and ~/.near/localnet_multi/node1).

Now - copy the contents of node1 directory to the machine2

rsync -r ~/.near/localnet_multi/node1 192.168.0.2:~/.near/localnet_multi/node1

Now open the config.json file on both machines (node0/config.json on machine1 and node1/config.json on machine2) and:

  • for rpc->addr and network->addr:
    • Change the address from 127.0.0.1 to 0.0.0.0 (this means that the port will be accessible from other computers)
    • Remember the port numbers (they are generated randomly).
  • Also write down the node0's node_key (it is probably: "ed25519:7PGseFbWxvYVgZ89K1uTJKYoKetWs7BJtbyXDzfbAcqX")

Running

On machine1:

./target/debug/neard --home ~/.near/localnet_multi/node0 run

On machine2:

./target/debug/neard --home ~/.near/localnet_multi/node1 run --boot-nodes ed25519:7PGseFbWxvYVgZ89K1uTJKYoKetWs7BJtbyXDzfbAcqX@192.168.0.1:37665

The boot node address should be the IP of the machine1 + the network addr port from the node0/config.json

And if everything goes well, the nodes should communicate and start producing blocks.

Troubleshooting

The debug mode is enabled by default, so you should be able to see what's going on by going to http://machine1:RPC_ADDR_PORT/debug

If node keeps saying "waiting for peers"

See if you can see the machine1's debug page from machine2. (if not - there might be a firewall blocking the connection).

Make sure that you set the right ports (it should use node0's NETWORK port) and that you set the ip add there to 0.0.0.0

Resetting the state

Simply stop both nodes, and remove the data subdirectory (~/.near/localnet_multi/node0/data and ~/.near/localnet_multi/node1/data).

Then after restart, the nodes will start the blockchain from scratch.

IO tracing

When should I use IO traces?

IO traces can be used to identify slow receipts and to understand why they are slow. Or to detect general inefficiencies in how we use the DB.

The results give you counts of DB requests and some useful statistics such as trie node cache hits. It will NOT give you time measurements, use Graphana to observe those.

The main use cases in the past were to estimate the performance of new storage features (prefetcher, flat state) and to find out why specific contracts produce slow receipts.

Setup

When compiling neard (or the parameter estimator) with feature=io_trace it instruments the binary code with fine-grained database operations tracking.

Aside: We don't enable it by default because we are afraid the overhead could be too much, since we store some information for very hot paths such as trie node cache hits. Although we haven't properly evaluated if it really is a performance problem.

This allows using the --record-io-trace=/path/to/output.io_trace CLI flag on neard. Run it in combination with the subcommands neard run, neard view-state, or runtime-params-estimator and it will record an IO trace. Make sure to provide the flag to neard itself, however, not to the subcommands. (See examples below)

# Example command for normal node
# (Careful! This will quickly fill your disk if you let it run.)
cargo build --release -p neard --features=io_trace
target/release/neard \
    --record-io-trace=/mnt/disks/some_disk_with_enough_space/my.io_trace \
    run
# Example command for state viewer, applying a range of chunks in shard 0
cargo build --release -p neard --features=io_trace
target/release/neard \
    --record-io-trace=75220100-75220101.s0.io_trace \
    view-state apply-range --start-index 75220100 --end-index 75220101 \
    --shard-id 0 sequential
# Example command for params estimator
cargo run --release -p runtime-params-estimator --features=required,io_trace \
-- --accounts-num 200000 --additional-accounts-num 200000 \
--iters 3 --warmup-iters 1 --metric time \
--record-io-trace=tmp.io \
--costs ActionReceiptCreation

IO trace content

Once you have collected an IO trace, you can inspect its content manually, or use existing tools to extract statistics. Let's start with the manual approach.

Simple example trace: Estimator

An estimator trace typically starts something like this:

commit
  SET DbVersion "'VERSION'" size=2
commit
  SET DbVersion "'KIND'" size=3
apply num_transactions=0 shard_cache_miss=7
  GET State "AAAAAAAAAAB3I0MYevRcExi1ql5PSQX+fuObsPH30yswS7ytGPCgyw==" size=46
  GET State "AAAAAAAAAACGDsmYvNoBGZnc8PzDKoF4F2Dvw3N6XoAlRrg8ezA8FA==" size=107
  GET State "AAAAAAAAAAB3I0MYevRcExi1ql5PSQX+fuObsPH30yswS7ytGPCgyw==" size=46
  GET State "AAAAAAAAAACGDsmYvNoBGZnc8PzDKoF4F2Dvw3N6XoAlRrg8ezA8FA==" size=107
  GET State "AAAAAAAAAAB3I0MYevRcExi1ql5PSQX+fuObsPH30yswS7ytGPCgyw==" size=46
  GET State "AAAAAAAAAACGDsmYvNoBGZnc8PzDKoF4F2Dvw3N6XoAlRrg8ezA8FA==" size=107
  GET State "AAAAAAAAAAB3I0MYevRcExi1ql5PSQX+fuObsPH30yswS7ytGPCgyw==" size=46
...

Depending on the source, traces look a bit different at the start. But you should always see some setup at the beginning and per-chunk workload later on.

Indentation is used to display the call hierarchy. The commit keyword shows when a commit starts, and all SET and UPDATE_RC commands that follow with one level deeper indentation belong to that same database transaction commit.

Later, you see a group that starts with an apply header. It groups all IO requests that were performed for a call to fn apply that applies transactions and receipts of a chunk to the previous state root.

In the example, you see a list of GET requests that belong to that apply, each with the DB key used and the size of the value read. Further, you can read in the trace that this specific chunk had 0 transactions and that it cache-missed all 7 of the DB requests it performed to apply this empty chunk.

Example trace: Full mainnet node

Next let's look at an excerpt of an IO trace from a real node on mainnet.

...
GET State "AQAAAAMAAACm9DRx/dU8UFEfbumiRhDjbPjcyhE6CB1rv+8fnu81bw==" size=9
GET State "AQAAAAAAAACLFgzRCUR3inMDpkApdLxFTSxRvprJ51eMvh3WbJWe0A==" size=203
GET State "AQAAAAIAAACXlEo0t345S6PHsvX1BLaGw6NFDXYzeE+tlY2srjKv8w==" size=299
apply_transactions shard_id=3
  process_state_update 
    apply num_transactions=3 shard_cache_hit=207
      process_transaction tx_hash=C4itKVLP5gBoAPsEXyEbi67Gg5dvQVugdMjrWBBLprzB shard_cache_hit=57
        GET FlatState "AGxlb25hcmRvX2RheS12aW5jaGlrLm5lYXI=" size=36
        GET FlatState "Amxlb25hcmRvX2RheS12aW5jaGlrLm5lYXICANLByB1merOzxcGB1HI9/L60QvONzOE6ovF3hjYUbhA8" size=36
      process_transaction tx_hash=GRSXC4QCBJHN4hmJiATAbFGt9g5PiksQDNNRaSk666WX shard_cache_miss=3 prefetch_pending=3 shard_cache_hit=35
        GET FlatState "AnJlbGF5LmF1cm9yYQIA5iq407bcLgisCKxQQi47TByaFNe9FOgQg5y2gpU4lEM=" size=36
      process_transaction tx_hash=6bDPeat12pGqA3KEyyg4tJ35kBtRCuFQ7HtCpWoxr8qx shard_cache_miss=2 prefetch_pending=1 shard_cache_hit=21 prefetch_hit=1
        GET FlatState "AnJlbGF5LmF1cm9yYQIAyKT1vEHVesMEvbp2ICA33x6zxfmBJiLzHey0ZxauO1k=" size=36
      process_receipt receipt_id=GRB3skohuShBvdGAoEoR3SdJJw7MwCxxscJHKLdPoYUC predecessor=1663adeba849fb7c26195678e1c5378278e5caa6325d4672246821d8e61bb160 receiver=token.sweat id=GRB3skohuShBvdGAoEoR3SdJJw7MwCxxscJHKLdPoYUC shard_cache_too_large=1 shard_cache_miss=1 shard_cache_hit=38
        GET FlatState "AXRva2VuLnN3ZWF0" size=36
        GET State "AQAAAAMAAADVYp4vtlIbDoVhji22CZOEaxVWVTJKASq3iMvpNEQVDQ==" size=206835
        input 
        register_len 
        read_register 
        storage_read READ key='STATE' size=70 tn_db_reads=20 tn_mem_reads=0 shard_cache_hit=21
        register_len 
        read_register 
        attached_deposit 
        predecessor_account_id 
        register_len 
        read_register 
        sha256 
        read_register 
        storage_read READ key=dAAxagYMOEb01+56sl9vOM0yHbZRPSaYSL3zBXIfCOi7ow== size=16 tn_db_reads=10 tn_mem_reads=19 shard_cache_hit=11
          GET FlatState "CXRva2VuLnN3ZWF0LHQAMWoGDDhG9NfuerJfbzjNMh22UT0mmEi98wVyHwjou6M=" size=36
        register_len 
        read_register 
        sha256 
        read_register 
        storage_write WRITE key=dAAxagYMOEb01+56sl9vOM0yHbZRPSaYSL3zBXIfCOi7ow== size=16 tn_db_reads=0 tn_mem_reads=30
        ...

Maybe that's a bit much. Let's break it down into pieces.

It start with a few DB get requests that are outside of applying a chunk. It's quite common that we have these kinds of constant overhead requests that are independent of what's inside a chunk. If we see too many such requests, we should take a close look to see if we are wasting performance.

GET State "AQAAAAMAAACm9DRx/dU8UFEfbumiRhDjbPjcyhE6CB1rv+8fnu81bw==" size=9
GET State "AQAAAAAAAACLFgzRCUR3inMDpkApdLxFTSxRvprJ51eMvh3WbJWe0A==" size=203
GET State "AQAAAAIAAACXlEo0t345S6PHsvX1BLaGw6NFDXYzeE+tlY2srjKv8w==" size=299

Next let's look at apply_transactions but limit the depth of items to 3 levels.

apply_transactions shard_id=3
  process_state_update 
    apply num_transactions=3 shard_cache_hit=207
      process_transaction tx_hash=C4itKVLP5gBoAPsEXyEbi67Gg5dvQVugdMjrWBBLprzB shard_cache_hit=57
      process_transaction tx_hash=GRSXC4QCBJHN4hmJiATAbFGt9g5PiksQDNNRaSk666WX shard_cache_miss=3 prefetch_pending=3 shard_cache_hit=35
      process_transaction tx_hash=6bDPeat12pGqA3KEyyg4tJ35kBtRCuFQ7HtCpWoxr8qx shard_cache_miss=2 prefetch_pending=1 shard_cache_hit=21 prefetch_hit=1
      process_receipt receipt_id=GRB3skohuShBvdGAoEoR3SdJJw7MwCxxscJHKLdPoYUC predecessor=1663adeba849fb7c26195678e1c5378278e5caa6325d4672246821d8e61bb160 receiver=token.sweat id=GRB3skohuShBvdGAoEoR3SdJJw7MwCxxscJHKLdPoYUC shard_cache_too_large=1 shard_cache_miss=1 shard_cache_hit=38

Here you can see that before we even get to apply, we go through apply_transactions and process_state_update. The excerpt does not show it but there are DB requests listed further below that belong to these levels but not to apply.

Inside apply, we see 3 transactions being converted to receipts as part of this chunk, and one already existing action receipt getting processed.

Cache hit statistics for each level are also displayed. For example, the first transaction has 57 read requests and all of them hit in the shard cache. For the second transaction, we miss the cache 3 times but the values were already in the process of being prefetched. This would be account data which we fetch in parallel for all transactions in the chunk.

Finally, there are several process_receipt groups, although the excerpt was cut to display only one. Here we see the receiving account (receiver=token.sweat) and the receipt ID to potentially look it up on an explorer, or dig deeper using state viewer commands.

Again, cache hit statistics are included. Here you can see one value missed the cache because it was too large. Usually that's a contract code. We do not include it in the shard cache because it would take up too much space.

Zooming in a bit further, let's look at the DB request at the start of the receipt.

    GET FlatState "AXRva2VuLnN3ZWF0" size=36
    GET State "AQAAAAMAAADVYp4vtlIbDoVhji22CZOEaxVWVTJKASq3iMvpNEQVDQ==" size=206835
    input 
    register_len 
    read_register 
    storage_read READ key='STATE' size=70 tn_db_reads=20 tn_mem_reads=0 shard_cache_hit=21
    register_len 
    read_register 

FlatState "AXRva2VuLnN3ZWF0" reads the ValueRef of the contract code, which is 36 bytes in its serialized format. Then, the ValueRef is dereferenced to read the actual code, which happens to be 206kB in size. This happens in the State column because at the time of writing, we still read the actual values from the trie, not from flat state.

What follows are host function calls performed by the SDK. It uses input to check the function call arguments and copies it from a register into WASM memory.

Then the SDK reads the serialized contract state from the hardcoded key "STATE". Note that we charge 20 tn_db_reads for it, since we missed the accounting cache, but we hit everything in the shard cache. Thus, there are no DB requests. If there were DB requests for this tn_db_reads, you would see them listed.

The returned 70 bytes are again copied into WASM memory. Knowing the SDK code a little bit, we can guess that the data is then deserialized into the struct used by the contract for its root state. That's not visible on the trace though, as this happens completely inside the WASM VM.

Next we start executing the actual contract code, which again calls a bunch of host functions. Apparently the code starts by reading the attached deposit and the predecessor account id, presumably to perform some checks.

The sha256 call here is used to shorten implicit account ids. (Link to code for comparison).

Afterwards, a value with 16 bytes (a u128) is fetched from the trie state. To serve this, it required reading 30 trie nodes, 19 of them were cached in the accounting cache and were not charged the full gas cost. And the remaining 11 missed the accounting cache but they hit the shard cache. Nothing needed to be fetched from DB because the Sweatcoin specific prefetcher has already loaded everything into the shard cache.

Note: We see trie node requests despite flat state being used. This is because the trace was collected with a binary that performed a read on both the trie and flat state to do some correctness checks.

    attached_deposit 
    predecessor_account_id 
    register_len 
    read_register 
    sha256 
    read_register 
    storage_read READ key=dAAxagYMOEb01+56sl9vOM0yHbZRPSaYSL3zBXIfCOi7ow== size=16 tn_db_reads=10 tn_mem_reads=19 shard_cache_hit=11
        GET FlatState "CXRva2VuLnN3ZWF0LHQAMWoGDDhG9NfuerJfbzjNMh22UT0mmEi98wVyHwjou6M=" size=36

So that is how to read these traces and dig deep. But maybe you want aggregated statistics instead? Then please continue reading.

Evaluating an IO trace

When you collect an IO trace over an hour of mainnet traffic, it can quickly be above 1GB in uncompressed size. You might be able to sample a few receipts and eyeball them to get a feeling for what's going on. But you can't understand the whole trace without additional tooling.

The parameter estimator command replay can help with that. (See also this readme) Run the following command to see an overview of available commands.

# will print the help page for the IO trace replay command
cargo run --profile dev-release -p runtime-params-estimator -- \
  replay --help

All commands aggregate the information of a trace. Either globally, per chunk, or per receipt. For example, below is the output that gives a list of RocksDB columns that were accessed and how many times, aggregated by chunk.

cargo run --profile dev-release -p runtime-params-estimator -- \
  replay  ./path/to/my.io_trace  chunk-db-stats
apply_transactions shard_id=3 block=DajBgxTgV8NewTJBsR5sTgPhVZqaEv9xGAKVnCiMiDxV
  GET   12 FlatState  4 State  

apply_transactions shard_id=0 block=DajBgxTgV8NewTJBsR5sTgPhVZqaEv9xGAKVnCiMiDxV
  GET   14 FlatState  8 State  

apply_transactions shard_id=2 block=HTptJFZKGfmeWs7y229df6WjMQ3FGfhiqsmXnbL2tpz8
  GET   2 FlatState  

apply_transactions shard_id=3 block=HTptJFZKGfmeWs7y229df6WjMQ3FGfhiqsmXnbL2tpz8
  GET   6 FlatState  2 State  

apply_transactions shard_id=1 block=HTptJFZKGfmeWs7y229df6WjMQ3FGfhiqsmXnbL2tpz8
  GET   50 FlatState  5 State  

...

apply_transactions shard_id=3 block=AUcauGxisMqNmZu5Ln7LLu8Li31H1sYD7wgd7AP6nQZR
  GET   17 FlatState  3 State  

top-level:
  GET   8854 Block  981 BlockHeader  16556 BlockHeight  59155 BlockInfo  2 BlockMerkleTree  330009 BlockMisc  1 BlockOrdinal  31924 BlockPerHeight  863 BlockRefCount  1609 BlocksToCatchup  1557 ChallengedBlocks  4 ChunkExtra  5135 ChunkHashesByHeight  128788 Chunks  35 EpochInfo  1 EpochStart  98361 FlatState  1150 HeaderHashesByHeight  8113 InvalidChunks  263 NextBlockHashes  22 OutgoingReceipts  131114 PartialChunks  1116 ProcessedBlockHeights  968698 State  
  SET   865 BlockHeight  1026 BlockMerkleTree  12428 BlockMisc  1636 BlockOrdinal  865 BlockPerHeight  865 BlockRefCount  3460 ChunkExtra  3446 ChunkHashesByHeight  339142 FlatState  3460 FlatStateDeltas  3460 FlatStateMisc  865 HeaderHashesByHeight  3460 IncomingReceipts  865 NextBlockHashes  3442 OutcomeIds  3442 OutgoingReceipts  863 ProcessedBlockHeights  340093 StateChanges  3460 TrieChanges  

The output contains one apply_transactions for each chunk, with the block hash and the shard id. Then it prints one line for each DB operations observed (GET,SET,...) together with a list of columns and an OP count.

See the top-level output at the end? These are all the DB requests that could not be assigned to specific chunks. The way we currently count write operations (SET, UPDATE_RC) they are never assigned to a specific chunk and instead only show up in the top-level list. Clearly, there is some room for improvement here. So far we simply haven't worried about RocksDB write performance so the tooling to debug write performance is naturally lacking.

Profiling neard

Sampling performance profiling

It is a common task to need to look where neard is spending time. Outside of instrumentation we've also been successfully using sampling profilers to gain an intuition over how the code works and where it spends time. It is a very quick way to get some baseline understanding of the performance characteristics, but due to its probabilistic nature it is also not particularly precise when it comes to small details.

Linux's perf has been a tool of choice in most cases, although tools like Intel VTune could be used too. In order to use either, first prepare your system:

$ sudo sysctl kernel.perf_event_paranoid=0
$ sudo sysctl kernel.kptr_restrict=0

Beware that this gives access to certain kernel state and environment to the unprivileged user. Once investigation is over either set these properties back to the more restricted settings or, better yet, reboot.

Definitely do not run untrusted code after running these commands.

Then collect a profile as such:

$ perf record -e cpu-clock -F1000 -g --call-graph dwarf,65528 YOUR_COMMAND_HERE
# or attach to a running process:
$ perf record -e cpu-clock -F1000 -g --call-graph dwarf,65528 -p NEARD_PID

This command will use the CPU time clock to determine when to trigger a sampling process and will do such sampling roughly 1000 times (the -F argument) every CPU second.

Once terminated, this command will produce a profile file in the current working directory. Although you can inspect the profile already with perf report, we've had much better experience with using Firefox Profiler as the viewer. Although Firefox Profiler supports perf and many other different data formats, for perf in particular a conversion step is necessary:

$ perf script -F +pid > mylittleprofile.script

Then, load this mylittleprofile.script file with the profiler.

Low overhead stack frame collection

The command above uses -g --call-graph dwarf,65528 parameter to instruct perf to collect stack trace for each sample using DWARF unwinding metadata. This will work no matter how neard is built, but is expensive and not super precise (e.g. it has trouble with JIT code.) If you have an ability to build a profiling-tuned build of neard, you can use higher quality stack frame collection.

$ cargo build --release --config .cargo/config.profiling.toml -p neard

Then, replace the --call-graph dwarf with --call-graph fp:

$ perf record -e cpu-clock -F1000 -g --call-graph fp,65528 YOUR_COMMAND_HERE

Profiling with hardware counters

As mentioned earlier, sampling profiler is probabilistic and the data it produces is only really suitable for a broad overview. Any attempt to analyze the performance of the code at the microarchitectural level (which you might want to do if investigating how to speed up a small but frequently invoked function) will be severely hampered by the low quality of data.

For a long time now, CPUs are able to expose information about how it operates on the code at a very fine grained level: how many cycles have passed, how many instructions have been processed, how many branches have been taken, how many predictions were incorrect, how many cycles were spent waiting of a memory accesses and many more. These allow a much better look at how the code behaves.

Until recently, use of these detailed counters was still sampling based -- the CPU would produce some information at a certain cadence of these counters (e.g. every 1000 instructions or cycles) which still shares a fair number of the same downsides as sampling cpu-clock. In order to address this downside, recent CPUs from both Intel and AMD have implemented a list of recent branches taken -- Last Branch Record or LBR. This is available on reasonably recent Intel architectures as well as starting with Zen 4 on the side of AMD. With LBRs profilers are able to gather information about the cycle counts between each branch, giving an accurate and precise evaluation of the performance at a basic block or function call level.

It all sounds really nice, so why are we not using these mechanisms all the time? That's because GCP VMs don't allow access to these counters! In order to access them the code has to be run on your own hardware, or a VM instance that provides direct access to the hardware, such as the (quite expensive) c3-highcpu-192-metal type.

Once everything is set up, though, the following command can gather some interesting information for you.

$ perf record -e cycles:u -b -g --call-graph fp,65528 YOUR_COMMAND_HERE

Analyzing this data is, unfortunately, not as easy as chucking it away to Firefox Profiler. I'm not aware of any other ways to inspect the data other than using perf report:

$ perf report -g --branch-history
$ perf report -g --branch-stack

You may also be able to gather some interesting results if you use --call-graph lbr and the relevant reporting options as well.

Memory usage profiling

neard is a pretty memory-intensive application with many allocations occurring constantly. Although Rust makes it pretty hard to introduce memory problems, it is still possible to leak memory or to inadvertently retain too much of it.

Unfortunately, “just” throwing a random profiler at neard does not work for many reasons. Valgrind for example is introducing enough slowdown to significantly alter the behaviour of the run, not to mention that to run it successfully and without crashing it will be necessary to comment out neard’s use of jemalloc for yet another substantial slowdown.

So far the only tool that worked out well out of the box was bytehound. Using it is quite straightforward, but needs Linux, and ability to execute privileged commands.

First, checkout and build the profiler (you will need to have nodejs yarn thing available as well):

$ git clone git@github.com:koute/bytehound.git
$ cargo build --release -p bytehound-preload
$ cargo build --release -p bytehound-cli

You will also need a build of your neard, once you have that, give it some ambient cabapilities necessary for profiling:

$ sudo sysctl kernel.perf_event_paranoid=0
$ sudo setcap 'CAP_SYS_ADMIN+ep' /path/to/neard
$ sudo setcap 'CAP_SYS_ADMIN+ep' /path/to/libbytehound.so

And finally run the program with the profiler enabled (in this case neard run command is used):

$ /lib64/ld-linux-x86-64.so.2 --preload /path/to/libbytehound.so /path/to/neard run

Viewing the profile

Do note that you will need about twice the amount of RAM as the size of the input file in order to load it successfully.

Once enough profiling data has been gathered, terminate the program. Use the bytehound CLI tool to operate on the profile. I recommend bytehound server over directly converting to e.g. heaptrack format using other subcommands as each invocation will read and parse the profile data from scratch. This process can take quite some time. server parses the inputs once and makes conversions and other introspection available as interactive steps.

You can use server interface to inspect the profile in one of the few ways, download a flamegraph or a heaptrack file. Heaptrack in particular provides some interesting additional visualizations and has an ability to show memory use over time from different allocation sources.

I personally found it a bit troublesome to figure out how to open the heaptrack file from the GUI. However, heaptrack myexportedfile worked perfectly. I recommend opening the file exactly this way.

Troubleshooting

No output file

  1. Set a higher profiler logging level. Verify that the profiler gets loaded at all. If you're not seeing any log messages, then something about your working environment is preventing the loader from including the profiler library.
  2. Try specifying an exact output file with e.g. environment variables that the profiler reads.

Crashes

If the profiled neard crashes in your tests, there are a couple of things you can try to get past it. First, make sure your binary has the necessary ambient capabilities (setcap command above needs to be executed every time binary is replaced!)

Another thing to try is disabling jemalloc. Comment out this code in neard/src/main.rs:

#![allow(unused)]
fn main() {
#[global_allocator]
static ALLOC: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
}

The other thing you can try is different profilers, different versions of the profilers or different options made available (in particular disabling the shadow stack in bytehound), although I don't have specific recommendations here.

We don't know what exactly it is about neard that leads to it crashing under the profiler as easily as it does. I have seen valgrind reporting that we have libraries that are deallocating with a wrong size class, so that might be the reason? Do definitely look into this if you have time.

What to profile?

This section provides some ideas on programs you could consider profiling if you are not sure where to start.

First and foremost you could go shotgun and profile a full running neard node that's operating on mainnet or testnet traffic. There are a couple ways to set up such a node: search for my-own-mainnet or a forknet based tooling.

From there either attach to a running neard run process or stop the running one and start a new instance under the profiler.

This approach will give you a good overview of the entire system, but at the same time the information might be so dense, it might be difficult to derive any signal from the noise. There are alternatives that isolate certain components of the runtime:

Runtime::apply

Profiling just the Runtime::apply is going to include the work done by transaction runtime and the contract runtime only. A smart use of the tools already present in the neard binary can achieve that today.

First, make sure all deltas in flat storage are applied and written:

neard view-state --readwrite apply-range --shard-id $SHARD_ID --storage flat sequential

You will need to do this for all shards you're interested in profiling. Then pick a block or a range of blocks you want to re-apply and set the flat head to the specified height:

neard flat-storage move-flat-head --shard-id 0 --version 0 back --blocks 17

Finally the following commands will apply the block or blocks from the height in various different ways. Experiment with the different modes and flags to find the best fit for your task. Don't forget to run these commands under the profiler :)

# Apply blocks from current flat head to the highest known block height in sequence
# using the memtrie storage (note that after this you will need to move the flat head again)
neard view-state --readwrite apply-range --shard-id 0 --storage memtrie sequential
# Same but with flat storage
neard view-state --readwrite apply-range --shard-id 0 --storage flat sequential

# Repeatedly apply a single block at the flat head using the memtrie storage.
# Will not modify the storage on the disk.
neard view-state apply-range --shard-id 0 --storage memtrie benchmark
# Same but with flat storage
neard view-state apply-range --shard-id 0 --storage flat benchmark

Working with OpenTelemetry Traces

neard is instrumented in a few different ways. From the code perspective we have two major ways of instrumenting code:

  • Prometheus metrics – by computing various metrics in code and expoding them via the prometheus crate.
  • Execution tracing – this shows up in the code as invocations of functionality provided by the tracing crate.

The focus of this document is to provide information on how to effectively work with the data collected by the execution tracing approach to instrumentation.

Gathering and Viewing the Traces

Tracing the execution of the code produces two distinct types of data: spans and events. These then are exposed as either logs (representing mostly the events) seen in the standard output of the neard process or sent onwards to an opentelemetry collector.

When deciding how to instrument a specific part of the code, consider the following decision tree:

  1. Do I need execution timing information? If so, use a span; otherwise
  2. Do I need call stack information? If so, use a span; otherwise
  3. Do I need to preserve information about inputs or outputs to a specific section of the code? If so, use key-values on a pre-existing span or an event; otherwise
  4. Use an event if it represents information applicable to a single point of execution trace.

As of writing (February 2024) our codebase uses spans somewhat sparsely and relies on events heavily to expose information about the execution of the code. This is largely a historical accident due to the fact that for a long time stdout logs were the only reasonable way to extract information out of the running executable.

Today we have more tools available to us. In production environments and environments replicating said environment (i.e. GCP environments such as mocknet) there's the ability to push this data to Grafana Loki (for events) and Tempo (for spans and events alike), so long as the amount of data is within reason. For that reason it is critical that the event and span levels are chosen appropriately and in consideration with the frequency of invocations. In local environments developers can use projects like Jaeger, or set up the Grafana stack if they wish to use a consistent interfaces.

It is still more straightforward to skip all the setup necessary for tracing, but relying exclusively on logs only increases noise for the other developers and makes it ever so slightly harder to extract signal in the future. Keep this trade off in mind.

Spans

We have a style guide section on the use of Spans, please make yourself familiar with it.

Every tracing::debug_span!() creates a new span, and usually it is attached to its parent automatically.

However, a few corner cases exist.

  • do_apply_chunks() starts 4 sub-tasks in parallel and waits for their completion. To make it work, the parent span is passed explicitly to the sub-tasks.

  • Messages to actix workers. If you do nothing, that the traces are limited to work done in a single actor. But that is very restrictive and not useful enough. To workaround that, each actix message gets attached opentelemetry::Context. That context somehow represents the information about the parent span. This mechanism is the reason you see annoying .with_span_context() function calls whenever you send a message to an actix Actor.

  • Inter-process tracing is theoretically available, but I have never tested it. The plan was to test it as soon as the Canary images get updated 😭 Therefore it most likely doesn’t work. Each PeerMessage is injected with TraceContext (1, 2) and the receiving node extracts that context and all spans generated in handling that message should be parented to the trace from another node.

  • Some spans are created using info_span!() but they are few and mostly for the logs. Exporting only info-level spans doesn’t give any useful tracing information in Grafana.

  • actix::Actor::handle() deserves a special note. The design choice was to provide a macro that lets us easily annotate every implementation of actix::Actor::handle(). This macro sets the following span attributes:

    • actor to the name of the struct that implements actix::Actor
    • handler to the name of the message struct

    And it lets you provide more span attributes. In the example, ClientActor specifies msg_type, which in all cases is identical to handler.

Configuration

The Tracing documentation page in nearone's Outline documents the steps necessary to start moving the trace data from the node to Nearone's Grafana Cloud instance. Once you set up your nodes, you can use the explore page to verify that the traces are coming through.

Image displaying the Grafana explore page interacting with the grafana-nearinc-traces data source, with Service Name filter set to =~"neard:mocknet-mainnet-94194484-nagisa-10402-test-vzx2.near|neard:mocknet-mainnet-94194484-nagisa-10402-test-xdwp.near" and showing some traces having been found

If the traces are not coming through quite yet, consider using the ability to set logging configuration at runtime. Create $NEARD_HOME/log_config.json file with the following contents:

{ "opentelemetry": "info" }

Or optionally with rust_log setting to reduce logging on stdout:

{ "opentelemetry": "info", "rust_log": "WARN" }

and invoke sudo pkill -HUP neard. Double check that the collector is running as well.

Good to know: You can modify the event/span/log targets you’re interested in just like when setting the RUST_LOG environment variable, including target filters. If you're setting verbose levels, consider selecting specific targets you're interested in too. This will help to keep trace ingest costs down.

For more information about the dynamic settings refer to core/dyn-configs code in the repository.

Local development

TODO: the setup is going to depend on whether one would like to use grafana stack or just jaeger or something else. We should document setting either of these up, including the otel collector and such for a full end-to-end setup. Success criteria: running integration tests should allow you to see the traces in your grafana/jaeger. This may require code changes as well.

Using the Grafana Stack here gives the benefit of all of the visualizations that are built-in. Any dashboards you build are also portable between the local environment and the Grafana Cloud instance. Jaeger may give a nicer interactive exploration ability. You can also set up both if you wish.

Visualization

Now that the data is arriving into the databases, it is time to visualize the data to determine what you want to know about the node. The only general advise I have here is to check that the data source is indeed tempo or loki.

Explore

Initial exploration is best done with Grafana's Explore tool or some other mechanism to query and display individual traces.

The query builder available in Grafana makes the process quite straightforward to start with, but is also somewhat limited. Underlying TraceQL has many more features that are not available through the builder. For example, you can query data in somewhat of a relational manner, such as this query below queries only spans named process_receipt that take 50ms when run as part of new_chunk processing for shard 3!

{ name="new_chunk" && span.shard_id = "3" } >> { name="process_receipt" && duration > 50ms }

Good to know: When querying, keep in mind the "Options" dropdown that allows you to specify the limit of results and the format in which these results are presented! In particular, the "Traces/Spans" toggle will affect the durations shown in the result table.

Once you click on a span of interest, Grafana will open you a view with the trace that contains said span, where you can inspect both the overall trace and the properties of the span:

Image displaying a specific trace with two of the spans expanded to show their details

Dashboards

Once you have arrived at an interesting query, you may be inclined to create a dashboard that summarizes the data without having to dig into individual traces and spans.

As an example the author was interested in checking the execution speed before and after a change in a component. To make the comparison visual, the span of interest was graphed using the histogram visualization in order to obtain the following result. In this graph the Y axis displays the number of occurrences for spans that took X-axis long to complete.

In general most of the panels work with tracing results directly but some of the most interesting ones do not. It is necessary to experiment with certain options and settings to have grafana panels start showing data. Some notable examples:

  1. Time series – a “Prepare time series” data transformation with “Multi-frame time series” has to be added;
  2. Histogram – make sure to use "spans" table format option;
  3. Heatmap - set “Calculate from data” option to “Yes”;
  4. Bar chart – works out of the box, but x axis won't be readable ever.

You can also add a panel that shows all the trace events in a log-like representation using the log or table visualization.

Multiple nodes

One frequently asked question is whether Grafana lets you distinguish between nodes that export tracing information.

The answer is yes.

In addition to span attributes, each span has resource attributes. There you'll find properties like node_id which uniquely identify a node.

  • account_id is the account_id from validator_key.json;
  • chain_id is taken from genesis.json;
  • node_id is the public key from node_key.json;
  • service.name is account_id if that is available, otherwise it is node_id.

Code Style

This document specifies the code style to use in the nearcore repository. The primary goal here is to achieve consistency, maintain it over time, and cut down on the mental overhead related to style choices.

Right now, nearcore codebase is not perfectly consistent, and the style acknowledges this. It guides newly written code and serves as a tie breaker for decisions. Rewriting existing code to conform 100% to the style is not a goal. Local consistency is more important: if new code is added to a specific file, it's more important to be consistent with the file rather than with this style guide.

This is a live document, which intentionally starts in a minimal case. When doing code-reviews, consider if some recurring advice you give could be moved into this document.

Formatting

Use rustfmt for minor code formatting decisions. This rule is enforced by CI

Rationale: rustfmt style is almost always good enough, even if not always perfect. The amount of bikeshedding saved by rustfmt far outweighs any imperfections.

Idiomatic Rust

While the most important thing is to solve the problem at hand, we strive to implement the solution in idiomatic Rust, if possible. To learn what is considered idiomatic Rust, a good start are the Rust API guidelines (but keep in mind that nearcore is not a library with public API, not all advice applies literally).

When in doubt, ask question in the Rust 🦀 Zulip stream or during code review.

Rationale:

  • Consistency: there's usually only one idiomatic solution amidst many non-idiomatic ones.
  • Predictability: you can use the APIs without consulting documentation.
  • Performance, ergonomics and correctness: language idioms usually reflect learned truths, which might not be immediately obvious.

Style

This section documents all micro-rules which are not otherwise enforced by rustfmt.

Avoid AsRef::as_ref

When you have some concrete type, prefer .as_str, .as_bytes, .as_path over generic .as_ref. Only use .as_ref when the type in question is a generic T: AsRef<U>.

#![allow(unused)]
fn main() {
// GOOD
fn log_validator(account_id: AccountId) {
    metric_for(account_id.as_str())
       .increment()
}

// BAD
fn log_validator(account_id: AccountId) {
    metric_for(account_id.as_ref())
       .increment()
}
}

Note that Option::as_ref, Result::as_ref are great, do use them!

Rationale: Readability and churn-resistance. There might be more than one AsRef<U> implementation for a given type (with different Us). If a new implementation is added, some of the .as_ref() calls might break. See also this issue.

Avoid references to Copy-types

Various generic APIs in Rust often return references to data (&T). When T is a small Copy type like i32, you end up with &i32 while many API expect i32, so dereference has to happen somewhere. Prefer dereferencing as early as possible, typically in a pattern:

#![allow(unused)]
fn main() {
// GOOD
fn compute(map: HashMap<&'str, i32>) {
    if let Some(&value) = map.get("key") {
        process(value)
    }
}
fn process(value: i32) { ... }

// BAD
fn compute(map: HashMap<&'str, i32>) {
    if let Some(value) = map.get("key") {
        process(*value)
    }
}
fn process(value: i32) { ... }
}

Rationale: If the value is used multiple times, dereferencing in the pattern saves keystrokes. If the value is used exactly once, we just want to be consistent. Additional benefit of early deref is reduced scope of borrow.

Note that for some big Copy types, notably CryptoHash, we sometimes use references for performance reasons. As a rule of thumb, T is considered big if size_of::<T>() > 2 * size_of::<usize>().

Prefer for loops over for_each and try_for_each methods

Iterators offer for_each and try_for_each methods which allow executing a closure over all items of the iterator. This is similar to using a for loop but comes with various complications and may lead to less readable code. Prefer using a loop rather than those methods, for example:

#![allow(unused)]
fn main() {
// GOOD
for outcome_with_id in result? {
    *total_gas_burnt =
        safe_add_gas(*total_gas_burnt, outcome_with_id.outcome.gas_burnt)?;
    outcomes.push(outcome_with_id);
}

// BAD
result?.into_iter().try_for_each(
    |outcome_with_id: ExecutionOutcomeWithId| -> Result<(), RuntimeError> {
        *total_gas_burnt =
            safe_add_gas(*total_gas_burnt, outcome_with_id.outcome.gas_burnt)?;
        outcomes.push(outcome_with_id);
        Ok(())
    },
)?;
}

Rationale: The for_each and try_for_each method don’t play nice with break and continue statements nor do they mesh well with async IO (since .await inside of the closure isn’t possible). And while try_for_each allows for the use of question mark operator, one may end up having to uses it twice: once inside the closure and second time outside the call to try_for_each. Furthermore, usage of the functions often introduce some minor syntax noise.

There are situations when those methods may lead to more readable code. Common example are long call chains. Even then such code may evolve with the closure growing and leading to less readable code. If advantages of using the methods aren’t clear cut, it’s usually better to err on side of more imperative style.

Lastly, anecdotally the methods (e.g. when used with chain or flat_map) may lead to faster code. This intuitively makes sense but it’s worth to keep in mind that compilers are pretty good at optimising and in practice may generate optimal code anyway. Furthermore, optimising code for readability may be more important (especially outside of hot path) than small performance gains.

Prefer to_string to format!("{}")

Prefer calling to_string method on an object rather than passing it through format!("{}") if all you’re doing is converting it to a String.

#![allow(unused)]
fn main() {
// GOOD
let hash = block_hash.to_string();
let msg = format!("{}: failed to open", path.display());

// BAD
let hash = format!("{block_hash}");
let msg = path.display() + ": failed to open";
}

Rationale: to_string is shorter to type and also faster.

Import Granularity

Group imports by module, but not deeper:

#![allow(unused)]
fn main() {
// GOOD
use std::collections::{hash_map, BTreeSet};
use std::sync::Arc;

// BAD - nested groups.
use std::{
    collections::{hash_map, BTreeSet},
    sync::Arc,
};

// BAD - not grouped together.
use std::collections::BTreeSet;
use std::collections::hash_map;
use std::sync::Arc;
}

This corresponds to "rust-analyzer.imports.granularity.group": "module" setting in rust-analyzer (docs).

Rationale: Consistency, matches existing practice.

Import Blocks

Do not separate imports into groups with blank lines. Write a single block of imports and rely on rustfmt to sort them.

#![allow(unused)]
fn main() {
// GOOD
use crate::types::KnownPeerState;
use borsh::BorshSerialize;
use near_primitives::utils::to_timestamp;
use near_store::{DBCol::Peers, Store};
use rand::seq::SliceRandom;
use std::collections::HashMap;
use std::net::SocketAddr;

// BAD -- several groups of imports
use std::collections::HashMap;
use std::net::SocketAddr;

use borsh::BorshSerialize;
use rand::seq::SliceRandom;

use near_primitives::utils::to_timestamp;
use near_store::{DBCol::Peers, Store};

use crate::types::KnownPeerState;
}

Rationale: Consistency, ease of automatic enforcement. Today, stable rustfmt can't split imports into groups automatically, and doing that manually consistently is a chore.

Derives

When deriving an implementation of a trait, specify a full path to the traits provided by the external libraries:

#![allow(unused)]
fn main() {
// GOOD
#[derive(Copy, Clone, serde::Serialize, thiserror::Error, strum::Display)]
struct Grapefruit;

// BAD
use serde::Serialize;
use thiserror::Error;
use strum::Display;

#[derive(Copy, Clone, Serialize, Error, Display)]
struct Banana;
}

As an exception to this rule, it is okay to use either style when the derived trait already includes the name of the library (as would be the case for borsh::BorshSerialize.)

Rationale: Specifying a full path to the externally provided derivations here makes it straightforward to differentiate between the built-in derivations and those provided by the external crates. The surprise factor for derivations sharing a name with the standard library traits (Display) is reduced and it also acts as natural mechanism to tell apart names prone to collision (Serialize), all without needing to look up the list of imports.

Arithmetic integer operations

Use methods with an appropriate overflow handling over plain arithmetic operators (+-*/%) when dealing with integers.

// GOOD
a.wrapping_add(b);
c.saturating_sub(2);
d.widening_mul(3);   // NB: unstable at the time of writing
e.overflowing_div(5);
f.checked_rem(7);

// BAD
a + b
c - 2
d * 3
e / 5
f % 7

If you’re confident the arithmetic operation cannot fail, x.checked_[add|sub|mul|div](y).expect("explanation why the operation is safe") is a great alternative, as it neatly documents not just the infallibility, but also why that is the case.

This convention may be enforced by the clippy::arithmetic_side_effects and clippy::integer_arithmetic lints.

Rationale: By default the outcome of an overflowing computation in Rust depends on a few factors, most notably the compilation flags used. The quick explanation is that in debug mode the computations may panic (cause side effects) if the result has overflowed, and when built with optimizations enabled, these computations will wrap-around instead.

For nearcore and neard we have opted to enable the panicking behaviour regardless of the optimization level. By doing it this we hope to prevent accidental stabilization of protocol mis-features that depend on incorrect handling of these overflows or similarly scary silent bugs. The downside to this approach is that any such arithmetic operation now may cause a node to crash, much like indexing a vector with a[idx] may cause a crash when idx is out-of-bounds. Unlike indexing, however, developers and reviewers are not used to treating integer arithmetic operations with the due suspicion. Having to make a choice, and explicitly spell out, how an overflow case ought to be handled will result in an easier to review and understand code and a more resilient project overall.

Standard Naming

  • Use - rather than _ in crate names and in corresponding folder names.
  • Avoid single-letter variable names especially in long functions. Common i, j etc. loop variables are somewhat of an exception but since Rust encourages use of iterators those cases aren’t that common anyway.
  • Follow standard Rust naming patterns such as:
    • Don’t use get_ prefix for getter methods. A getter method is one which returns (a reference to) a field of an object.
    • Use set_ prefix for setter methods. An exception are builder objects which may use different a naming style.
    • Use into_ prefix for methods which consume self and to_ prefix for methods which don’t.
  • Use get_block_header rather than get_header for methods which return a block header.
  • Don’t use _by_hash suffix for methods which lookup chain objects (blocks, chunks, block headers etc.) by their hash (i.e. their primary identifier).
  • Use _by_height and similar suffixes for methods which lookup chain objects (blocks, chunks, block headers etc.) by their height or other property which is not their hash.

Rationale: Consistency.

Documentation

When writing documentation in .md files, wrap lines at approximately 80 columns.

<!-- GOOD -->
Manually reflowing paragraphs is tedious. Luckily, most editors have this
functionality built in or available via extensions. For example, in Emacs you
can use `fill-paragraph` (<kbd>M-q</kbd>), (neo)vim allows rewrapping with `gq`,
and VS Code has `stkb.rewrap` extension.

<!-- BAD -->
One sentence per-line is also occasionally used for technical writing.
We avoid that format though.
While convenient for editing, it may be poorly legible in unrendered form

<!-- BAD -->
Definitely don't use soft-wrapping. While markdown mostly ignores source level line breaks, relying on soft wrap makes the source completely unreadable, especially on modern wide displays.

Tracing

When emitting events and spans with tracing prefer adding variable data via tracing's field mechanism.

#![allow(unused)]
fn main() {
// GOOD
debug!(
    target: "client",
    validator_id = self.client.validator_signer.get().map(|vs| {
        tracing::field::display(vs.validator_id())
    }),
    %hash,
    "block.previous_hash" = %block.header().prev_hash(),
    "block.height" = block.header().height(),
    %peer_id,
    was_requested,
    "Received block",
);
}

Most apparent violation of this rule will be when the event message utilizes any form of formatting, as seen in the following example:

#![allow(unused)]
fn main() {
// BAD
debug!(
    target: "client",
    "{:?} Received block {} <- {} at {} from {}, requested: {}",
    self.client.validator_signer.get().map(|vs| vs.validator_id()),
    hash,
    block.header().prev_hash(),
    block.header().height(),
    peer_id,
    was_requested
);
}

Always specify the target explicitly. A good default value to use is the crate name, or the module path (e.g. chain::client) so that events and spans common to a topic can be grouped together. This grouping can later be used for customizing which events to output.

Rationale: This makes the events structured – one of the major value add propositions of the tracing ecosystem. Structured events allow for immediately actionable data without additional post-processing, especially when using some of the more advanced tracing subscribers. Of particular interest would be those that output events as JSON, or those that publish data to distributed event collection systems such as opentelemetry. Maintaining this rule will also usually result in faster execution (when logs at the relevant level are enabled.)

Spans

Use the spans to introduce context and grouping to and between events instead of manually adding such information as part of the events themselves. Most of the subscribers ingesting spans also provide a built-in timing facility, so prefer using spans for measuring the amount of time a section of code needs to execute.

Give spans simple names that make them both easy to trace back to code, and to find a particular span in logs or other tools ingesting the span data. If a span begins at the top of a function, prefer giving it a name of that function, otherwise prefer a snake_case name.

When instrumenting asynchronous functions the #[tracing::instrument] macro or the Future::instrument is required. Using Span::entered or a similar method that is not aware of yield points will result in incorrect span data and could lead to difficult to troubleshoot issues such as stack overflows.

Always explicitly specify the level, target, and skip_all options and do not rely on the default values. skip_all avoids adding all function arguments as span fields which can lead recording potentially unnecessary and expensive information. Carefully consider which information needs recording and the cost of recording the information when using the fields option.

#![allow(unused)]
fn main() {
#[tracing::instrument(
    level = "trace",
    target = "network",
    "handle_sync_routing_table",
    skip_all
)]
async fn handle_sync_routing_table(
    clock: &time::Clock,
    network_state: &Arc<NetworkState>,
    conn: Arc<connection::Connection>,
    rtu: RoutingTableUpdate,
) {
    ...
}
}

In regular synchronous code it is fine to use the regular span API if you need to instrument portions of a function without affecting the code structure:

#![allow(unused)]
fn main() {
fn compile_and_serialize_wasmer(code: &[u8]) -> Result<wasmer::Module> {
    // Some code...
    {
        let _span = tracing::debug_span!(target: "vm", "compile_wasmer").entered();
        // ...
        // _span will be dropped when this scope ends, terminating the span created above.
        // You can also `drop` it manually, to end the span early with `drop(_span)`.
    }
    // Some more code...
}
}

Rationale: Much as with events, this makes the information provided by spans structured and contextual. This information can then be output to tooling in an industry standard format, and can be interpreted by an extensive ecosystem of tracing subscribers.

Event and span levels

The INFO level is enabled by default, use it for information useful for node operators. The DEBUG level is enabled on the canary nodes, use it for information useful in debugging testnet failures. The TRACE level is not generally enabled, use it for arbitrary debug output.

Metrics

Consider adding metrics to new functionality. For example, how often each type of error was triggered, how often each message type was processed.

Rationale: Metrics are cheap to increment, and they often provide a significant insight into operation of the code, almost as much as logging. But unlike logging metrics don't incur a significant runtime cost.

Naming

Prefix all nearcore metrics with near_. Follow the Prometheus naming convention for new metrics.

Rationale: The near_ prefix makes it trivial to separate metrics exported by nearcore from other metrics, such as metrics about the state of the machine that runs neard.

Performance

In most cases incrementing a metric is cheap enough never to give it a second thought. However accessing a metric with labels on a hot path needs to be done carefully.

If a label is based on an integer, use a faster way of converting an integer to the label, such as the itoa crate.

For hot code paths, re-use results of with_label_values() as much as possible.

Rationale: We've encountered issues caused by the runtime costs of incrementing metrics before. Avoid runtime costs of incrementing metrics too often.

Documentation

This chapter describes nearcore's approach to documentation. There are three primary types of documentation to keep in mind:

  • The NEAR Protocol Specification (source code) is the formal description of the NEAR protocol. The reference nearcore implementation and any other NEAR client implementations must follow this specification.
  • User docs (source code) explain what is NEAR and how to participate in the network. In particular, they contain information pertinent to the users of NEAR: validators and smart contract developers.
  • Documentation for nearcore developers (source code) is the book you are reading right now! The target audience here is the contributors to the main implementation of the NEAR protocol (nearcore).

Overview

The bulk of the internal docs is within this book. If you want to write some kind of document, add it here! The architecture and practices chapters are intended for somewhat up-to-date normative documents. The misc chapter holds everything else.

This book is not intended for user-facing documentation, so don't worry about proper English, typos, or beautiful diagrams -- just write stuff! It can easily be improved over time with pull requests. For docs, we use a lightweight review process and try to merge any improvement as quickly as possible. Rather than blocking a PR on some stylistic changes, just merge it and submit a follow-up.

Note the "edit" button at the top-right corner -- super useful for fixing any typos you spot!

In addition to the book, we also have some "inline" documentation in the code. For Rust, it is customary to have a per-crate README.md file and include it as a doc comment via #![doc = include_str!("../README.md")] in lib.rs. We don't require every item to be documented, but we certainly encourage documenting as much as possible. If you spend some time refactoring or fixing a function, consider adding a doc comment (///) to it as a drive-by improvement.

We currently don't render rustdoc, see #7836.

Book How To

We use mdBook to render a bunch of markdown files as a static website with a table of contents, search and themes. Full docs are here, but the basics are very simple.

To add a new page to the book:

  1. Add a .md file somewhere in the ./docs folder.
  2. Add a link to that page to the SUMMARY.md.
  3. Submit a PR (again, we promise to merge it without much ceremony).

The doc itself is in vanilla markdown.

To render documentation locally:

# Install mdBook
$ cargo install mdbook
$ mdbook serve --open ./docs

This will generate the book from the docs folder, open it in a browser and start a file watcher to rebuild the book every time the source files change.

Note that GitHub's default rendering mostly works just as well, so you don't need to go out of your way to preview your changes when drafting a page or reviewing pull requests to this book.

The book is deployed via the book GitHub Action workflow. This workflow runs mdBook and then deploys the result to GitHub Pages.

For internal docs, you often want to have pretty pictures. We don't currently have a recommended workflow, but here are some tips:

  • Don't add binary media files to Git to avoid inflating repository size. Rather, upload images as comments to this super-secret issue #7821, and then link to the images as

    ![image](https://user-images.githubusercontent.com/1711539/195626792-7697129b-7f9c-4953-b939-0b9bcacaf72c.png)
    

    Use a single comment per page with multiple images.

  • Google Docs is an OK way to create technical drawings, you can add a link to the doc with source to that secret issue as well.

  • There's some momentum around using mermaid.js for diagramming, and there's an appropriate plugin for that. Consider if that's something you might want to use.

Tracking issues

nearcore uses so-called "tracking issues" to coordinate larger pieces of work (e.g. implementation of new NEPs). Such issues are tagged with the C-tracking-issue label.

The goal of tracking issues is to serve as a coordination point. They can help new contributors and other interested parties come up to speed with the current state of projects. As such, they should link to things like design docs, todo-lists of sub-issues, existing implementation PRs, etc.

One can further use tracking issues to:

  • get a feeling for what's happening in nearcore by looking at the set of open tracking issues.
  • find larger efforts to contribute to as tracking issues usually contain up-for-grab to-do lists.
  • follow the progress of specific features by subscribing to the issue on GitHub.

If you are leading or participating in a larger effort, please create a tracking issue for your work.

Guidelines

  • Tracking issues should be maintained in the nearcore repository. If the projects are security sensitive, then they should be maintained in the nearcore-private repository.
  • The issues should be kept up-to-date. At a minimum, all new context should be added as comments, but preferably the original description should be edited to reflect the current status.
  • The issues should contain links to all the relevant design documents which should also be kept up-to-date.
  • The issues should link to any relevant NEP if applicable.
  • The issues should contain a list of to-do tasks that should be kept up-to-date as new work items are discovered and other items are done. This helps others gauge progress and helps lower the barrier of entry for others to participate.
  • The issues should contain links to relevant Zulip discussions. Prefer open forums like Zulip for discussions. When necessary, closed forums like video calls can also be used but care should be taken to document a summary of the discussions.
  • For security-sensitive discussions, use the appropriate private Zulip streams.

This issue is a good example of how tracking issues should be maintained.

Background

The idea of tracking issues is also used to track project work in the Rust language. See this post for a rough description and these issues for how they are used in Rust.

Security Vulnerabilities

The intended audience of the information presented here is developers working on the implementation of NEAR.

Are you a security researcher? Please report security vulnerabilities to security@near.org.

As nearcore is open source, all of its issues and pull requests are also publicly tracked on GitHub. However, from time to time, if a security-sensitive issue is discovered, it cannot be tracked publicly on GitHub. However, we should promote as similar a development process to work on such issues as possible. To enable this, below is the high-level process for working on security-sensitive issues.

  1. There is a private fork of nearcore on GitHub. Access to this repository is restricted to the set of people who are trusted to work on and have knowledge about security-sensitive issues in nearcore.

    This repository can be manually synced with the public nearcore repository using the following commands:

    $ git remote add nearcore-public git@github.com:near/nearcore
    $ git remote add nearcore-private git@github.com:near/nearcore-private
    $ git fetch nearcore-public
    $ git push nearcore-private nearcore-public/master:master
    
  2. All security-sensitive issues must be created on the private nearcore repository. You must also assign one of the [P-S0, P-S1] labels to the issue to indicate the severity of the issue. The two criteria to use to help you judge the severity are the ease of carrying out the attack and the impact of the attack. An attack that is easy to do or can have a huge impact should have the P-S0 label and P-S1 otherwise.

  3. All security-sensitive pull requests should also be created on the private nearcore repository. Note that once a PR has been approved, it should not be merged into the private repository. Instead, it should be first merged into the public repository and then the private fork should be updated using the steps above.

  4. Once work on a security issue is finished, it needs to be deployed to all the impacted networks. Please contact the node team for help with this.

Fast Builds

nearcore is implemented in Rust and is a fairly sizable project, so it takes a while to build. This chapter collects various tips to make the development process faster.

Optimizing build times is a bit of a black art, so please do benchmarks on your machine to verify that the improvements work for you. Changing some configuration and making a typo, which prevents it from improving build times is an extremely common failure mode!

Rust Perf Book contains a section on compilation times as well!

cargo build --release is obviously slower than cargo build. We enable full lto (link-time optimization), so our -r builds are very slow, use a lot of RAM, and don't utilize the available parallelism fully.

As debug builds are much too slow at runtime for many purposes, we have a custom profile --profile dev-release which is equivalent to -r, except that the time-consuming options such as LTO are disabled, and debug assertions are enabled.

Use --profile dev-release for most local development, or when connecting a locally built node to a network. Use -r for production, or if you want to get absolute performance numbers.

Linker

By default, rustc uses the default system linker, which tends to be quite slow. Using lld (LLVM linker) or mold (very new, very fast linker) provides big wins for many setups.

I don't know what's the official source of truth for using alternative linkers, I usually refer to this comment.

Usually, adding

[build]
rustflags = ["-C", "link-arg=-fuse-ld=lld"]

to ~/.cargo/config is the most convenient approach.

lld itself can be installed with sudo apt install lld (or the equivalent in the distro/package manager of your choice).

Prebuilt RocksDB

By default, we compile RocksDB (a C++ project) from source during the neard build. By linking to a prebuilt copy of RocksDB this work can be avoided entirely. This is a huge win, especially if you clean the ./target directory frequently.

To use a prebuilt RocksDB, set the ROCKSDB_LIB_DIR environment variable to a location containing librocksdb.a:

$ export ROCKSDB_LIB_DIR=/usr/lib/x86_64-linux-gnu
$ cargo build -p neard

Note, that the system must provide a recent version of the library which, depending on which operating system you’re using, may require installing packages from a testing branch. For example, on Debian it requires installing librocksdb-dev from the experimental repository:

Note: Based on which distro you are using this process will look different. Please refer to the documentation of the package manager you are using.

echo 'deb http://ftp.debian.org/debian experimental main contrib non-free' |
    sudo tee -a /etc/apt/sources.list
sudo apt update
sudo apt -t experimental install librocksdb-dev

ROCKSDB_LIB_DIR=/usr/lib/x86_64-linux-gnu
export ROCKSDB_LIB_DIR

Global Compilation Cache

By default, Rust compiles incrementally, with the incremental cache and intermediate outputs stored in the project-local ./target directory.

The sccache utility can be used to share these artifacts between machines or checkouts within the same machine. sccache works by intercepting calls to rustc and will fetch the cached outputs from the global cache whenever possible. This tool can be set up as such:

$ cargo install sccache
$ export RUSTC_WRAPPER="sccache"
$ export SCCACHE_CACHE_SIZE="30G"
$ cargo build -p neard

Refer to the project’s README for further configuration options.

IDEs Are Bad For Environment Handling

Generally, the knobs in this section are controlled either via global configuration in ~/.cargo/config or environment variables.

Environment variables are notoriously easy to lose, especially if you are working both from a command line and a graphical IDE. Double-check that the environment within which builds are executed is identical to avoid nasty failure modes such as full cache invalidation when switching from the CLI to an IDE or vice-versa.

direnv sometimes can be used to conveniently manage project-specific environment variables.

General principles

  1. Every PR needs to have test coverage in place. Sending the code change and deferring tests for a future change is not acceptable.
  2. Tests need to either be sufficiently simple to follow or have good documentation to explain why certain actions are made and conditions are expected.
  3. When implementing a PR, make sure to run the new tests with the change disabled and confirm that they fail! It is extremely common to have tests that pass without the change that is being tested.
  4. The general rule of thumb for a reviewer is to first review the tests, and ensure that they can convince themselves that the code change that passes the tests must be correct. Only then the code should be reviewed.
  5. Have the assertions in the tests as specific as possible, however do not make the tests change-detectors of the concrete implementation. (assert only properties which are required for correctness). For example, do not do assert!(result.is_err()), expect the specific error instead.

Tests hierarchy

In the NEAR Reference Client we largely split tests into three categories:

  1. Relatively cheap sanity or fast fuzz tests: It includes all the #[test] Rust tests not decorated by features. Our repo is configured in such a way that all such tests are run on every PR and failing at least one of them is blocking the PR from being merged.

To run such tests locally run cargo nextest run --all. It requires nextest harness which can be installed by running cargo install cargo-nextest first.

  1. Expensive tests: This includes all the fuzzy tests that run many iterations, as well as tests that spin up multiple nodes and run them until they reach a certain condition. Such tests are decorated with #[cfg(feature="expensive-tests")]. It is not trivial to enable features that are not declared in the top-level crate, and thus the easiest way to run such tests is to enable all the features by passing --all-features to cargo nextest run, e.g:

cargo nextest run --package near-client -E 'test(=tests::cross_shard_tx::test_cross_shard_tx)' --all-features

  1. Python tests: We have an infrastructure to spin up nodes, both locally and remotely, in python, and interact with them using RPC. The infrastructure and the tests are located in the pytest folder. The infrastructure is relatively straightforward, see for example block_production.py here. See the Test infrastructure section below for details.

Expensive and python tests are not part of CI, and are run by a custom nightly runner. The results of the latest runs are available here. Today, test runs launch approximately every 5-6 hours. For the latest results look at the second run, since the first one has some tests still scheduled to run.

Test infrastructure

Different levels of the reference implementation have different infrastructures available to test them.

Client

The Client is separated from the runtime via a RuntimeAdapter trait. In production, it uses NightshadeRuntime which uses real runtime and epoch managers. To test the client without instantiating runtime and epoch manager, we have a mock runtime KeyValueRuntime.

Most of the tests in the client work by setting up either a single node (via setup_mock()) or multiple nodes (via setup_mock_all_validators()) and then launching the nodes and waiting for a particular message to occur, with a predefined timeout.

For the most basic example of using this infrastructure see produce_two_blocks in tests/process_blocks.rs.

  1. The callback (Box::new(move |msg, _ctx, _| { ...) is what is executed whenever the client sends a message. The return value of the callback is sent back to the client, which allows for testing relatively complex scenarios. The tests generally expect a particular message to occur, in this case, the tests expect two blocks to be produced. System::current().stop(); is the way to stop the test and mark it as passed.
  2. near_network::test_utils::wait_or_panic(5000); is how the timeout for the test is set (in milliseconds).

For an example of a test that launches multiple nodes, see chunks_produced_and_distributed_common in tests/chunks_management.rs. The setup_mock_all_validators function is the key piece of infrastructure here.

Runtime

Tests for Runtime are listed in tests/test_cases_runtime.rs.

To run a test, usually, a mock RuntimeNode is created via create_runtime_node(). In its constructor, the Runtime is created in the get_runtime_and_trie_from_genesis function.

Inside a test, an abstracted User is used for sending specific actions to the runtime client. The helper functions function_call, deploy_contract, etc. eventually lead to the Runtime.apply method call.

For setting usernames during playing with transactions, use default names alice_account, bob_account, eve_dot_alice_account, etc.

Network

Chain, Epoch Manager, Runtime and other low-level changes

When building new features in the chain, epoch_manager and network crates, make sure to build new components sufficiently abstract so that they can be tested without relying on other components.

For example, see tests for doomslug here, for network cache here, or for promises in runtime here.

Python tests

See this page for detailed coverage of how to write a python test.

We have a python library that allows one to create and run python tests.

To run python tests, from the nearcore repo the first time, do the following:

cd pytest
virtualenv . --python=python3
. .env/bin/activate
pip install -r requirements.txt
python tests/sanity/block_production.py

This will create a python virtual environment, activate the environment, install all the required packages specified in the requirements.txt file and run the tests/sanity/block_production.py file. After the first time, we only need to activate the environment and can then run the tests:

cd pytest
. .env/bin/activate
python tests/sanity/block_production.py

Use pytest/tests/sanity/block_production.py as the basic example of starting a cluster with multiple nodes, and doing RPC calls.

See pytest/tests/sanity/deploy_call_smart_contract.py to see how contracts can be deployed, or transactions called.

See pytest/tests/sanity/staking1.py to see how staking transactions can be issued.

See pytest/tests/sanity/state_sync.py to see how to delay the launch of the whole cluster by using init_cluster instead of start_cluster, and then launching nodes manually.

Enabling adversarial behavior

To allow testing adversarial behavior, or generally, behaviors that a node should not normally exercise, we have certain features in the code decorated with #[cfg(feature="adversarial")]. The binary normally is compiled with the feature disabled, and when compiled with the feature enabled, it traces a warning on launch.

The nightly runner runs all the python tests against the binary compiled with the feature enabled, and thus the python tests can make the binary perform actions that it normally would not perform.

The actions can include lying about the known chain height, producing multiple blocks for the same height, or disabling doomslug.

See all the tests under pytest/tests/adversarial for some examples.

Python Tests

To simplify writing integration tests for nearcore we have a python infrastructure that allows writing a large variety of tests that run small local clusters, remove clusters, or run against full-scale live deployments.

Such tests are written in python and not in Rust (in which the nearcore itself, and most of the sanity and fuzz tests, are written) due to the availability of libraries to easily connect to, remove nodes and orchestrate cloud instances.

Nearcore itself has several features guarded by a feature flag that allows the python tests to invoke behaviors otherwise impossible to be exercised by an honest actor.

Basics

The infrastructure is located in {nearcore}/pytest/lib and the tests themselves are in subdirectories of {nearcore}/pytest/tests. To prepare the local machine to run the tests you'd need python3 (python 3.7), and have several dependencies installed, for which we recommend using virtualenv:

cd pytest
virtualenv .env --python=python3
. .env/bin/activate
pip install -r requirements.txt

The tests are expected to be ran from the pytest dir itself. For example, once the virtualenv is configured:

cd pytest
. .env/bin/activate
python tests/sanity/block_production.py

This will run the most basic tests that spin up a small cluster locally and wait until it produces several blocks.

Compiling the client for tests

The local tests by default expect the binary to be in the default location for a debug build ({nearcore}/target/debug). Some tests might also expect test-specific features guarded by a feature flag to be available. To compile the binary with such features run:

cargo build -p neard --features=adversarial

The feature is called adversarial to highlight that the many functions it enables, outside of tests, would constitute malicious behavior. The node compiled with such a flag will not start unless an environment variable ADVERSARY_CONSENT=1 is set and prints a noticeable warning when it starts, thus minimizing the chance that an honest participant accidentally launches a node compiled with such functionality.

You can change the way the tests run (locally or using Google Cloud), and where the local tests look for the binary by supplying a config file. For example, if you want to run tests against a release build, you can create a file with the following config:

{"local": true, "near_root": "../target/release/"}

and run the test with the following command:

NEAR_PYTEST_CONFIG=<path to config> python tests/sanity/block_production.py

Writing tests

We differentiate between "regular" tests, or tests that spin up their cluster, either local or on the cloud, and "mocknet" tests, or tests that run against an existing live deployment of NEAR.

In both cases, the test starts by importing the infrastructure and starting or connecting to a cluster

Starting a cluster

In the simplest case a regular test starts by starting a cluster. The cluster will run locally by default but can be spun up on the cloud by supplying the corresponding config.

import sys
sys.path.append('lib')
from cluster import start_cluster

nodes = start_cluster(4, 0, 4, None, [["epoch_length", 10], ["block_producer_kickout_threshold", 80]], {})

In the example above the first three parameters are num_validating_nodes, num_observers and num_shards. The third parameter is a config, which generally should be None, in which case the config is picked up from the environment variable as shown above.

start_cluster will spin up num_validating_nodes nodes that are block producers (with pre-staked tokens), num_observers non-validating nodes and will configure the system to have num_shards shards. The fifth argument changes the genesis config. Each element is a list of some length n where the first n-1 elements are a path in the genesis JSON file, and the last element is the value. You'd often want to significantly reduce the epoch length, so that your test triggers epoch switches, and reduce the kick-out threshold since with shorter epochs it is easier for a block producer to get kicked out.

The last parameter is a dictionary from the node ordinal to changes to their local config.

Note that start_cluster spins up all the nodes right away. Some tests (e.g. tests that test syncing) might want to configure the nodes but delay their start. In such a case you will initialize the cluster by calling init_cluster and will run the nodes manually, for example, see state_sync.py

Connecting to a mocknet

Nodes that run against a mocknet would connect to an existing cluster instead of running their own.

import sys
sys.path.append('lib')
from cluster import connect_to_mocknet

nodes, accounts = connect_to_mocknet(None)

The only parameter is a config, with None meaning to use the config from the environment variable. The config should have the following format:

{
    "nodes": [
        {"ip": "(some_ip)", "port": 3030},
        {"ip": "(some_ip)", "port": 3030},
        {"ip": "(some_ip)", "port": 3030},
        {"ip": "(some_ip)", "port": 3030}
    ],
    "accounts": [
        {"account_id": "node1", "pk": "ed25519:<public key>", "sk": "edd25519:<secret key>"},
        {"account_id": "node2", "pk": "ed25519:<public key>", "sk": "edd25519:<secret key>"}
    ]
}

Manipulating nodes

The nodes returned by start_cluster and init_cluster have certain convenience functions. You can see the full interface in {nearcore}/pytest/lib/cluster.py.

start(boot_public_key, (boot_ip, boot_port)) starts the node. If both arguments are None, the node will start as a boot node (note that the concept of a "boot node" is relatively vague in a decentralized system, and from the perspective of the tests the only requirement is that the graph of "node A booted from node B" is connected).

The particular way to get the boot_ip and boot_port when launching node1 with node2 being its boot node is the following:

node1.start(node2.node_key.pk, node2.addr())

kill() shuts down the node by sending it SIGKILL

reset_data() cleans up the data dir, which could be handy between the calls to kill and start to see if a node can start from a clean state.

Nodes on the mocknet do not expose start, kill and reset_data.

Issuing RPC calls

Nodes in both regular and mocknet tests expose an interface to issue RPC calls. In the most generic case, one can just issue raw JSON RPC calls by calling the json_rpc method:

validator_info = nodes[0].json_rpc('validators', [<some block_hash>])

For the most popular calls, there are convenience functions:

  • send_tx sends a signed transaction asynchronously
  • send_tx_and_wait sends a signed transaction synchronously
  • get_status returns the current status (the output of the /status/endpoint), which contains e.g. last block hash and height
  • get_tx returns a transaction by the transaction hash and the recipient ID.

See all the methods in {nearcore}/pytest/lib/cluster.rs after the definition of the json_rpc method.

Signing and sending transactions

There are two ways to send a transaction. A synchronous way (send_tx_and_wait) sends a tx and blocks the test execution until either the TX is finished, or the timeout is hit. An asynchronous way (send_tx + get_tx) sends a TX and then verifies its result later. Here's an end-to-end example of sending a transaction:

# the tx needs to include one of the recent hashes
last_block_hash = nodes[0].get_status()['sync_info']['latest_block_hash']
last_block_hash_decoded = base58.b58decode(last_block_hash.encode('utf8'))

# sign the actual transaction
# `fr` and `to` in this case are instances of class `Key`.
# In mocknet tests the list `Key`s for all the accounts are returned by `connect_to_mocknet`
# In regular tests each node is associated with a single account, and its key is stored in the
# `signer_key` field (e.g. `nodes[0].signer_key`)
# `15` in the example below is the nonce. Nonces needs to increase for consecutive transactions
# for the same sender account.
tx = sign_payment_tx(fr, to.account_id, 100, 15, last_block_hash_decoded)

# Sending the transaction synchronously. `10` is the timeout in seconds. If after 10 seconds the
# outcome is not ready, throws an exception
if want_sync:
    outcome = nodes[0].send_tx_and_wait(tx, 10)

# Sending the transaction asynchronously.
if want_async:
    tx_hash = nodes[from_ordinal % len(nodes)].send_tx(tx)['result']

    # and then sometime later fetch the result...
    resp = nodes[0].get_tx(tx_hash, to.account_id, timeout=1)
    # and see if the tx has finished
    finished = 'result' in resp and 'receipts_outcome' in resp['result'] and len(resp['result']['receipts_outcome']) > 0

See rpc_tx_forwarding.py for an example of signing and submitting a transaction.

Adversarial behavior

Some tests need certain nodes in the cluster to exercise behavior that is impossible to be invoked by an honest node. For such tests, we provide functionality that is protected by an "adversarial" feature flag.

It's an advanced feature, and more thorough documentation is a TODO. Most of the tests that depend on the feature flag enabled are under {nearcore}/pytest/tests/adversarial, refer to them for how such features can be used. Search for code in the nearcore codebase guarded by the "adversarial" feature flag for an example of how such features are added and exposed.

Interfering with the network

We have a library that allows running a proxy in front of each node that would intercept all the messages between nodes, deserialize them in python and run a handler on each one. The handler can then either let the message pass (return True), drop it (return False) or replace it (return <new message>).

This technique can be used to both interfere with the network (by dropping or replacing messages), and to inspect messages that flow through the network without interfering with them. For the latter, note that the handler for each node runs in a separate Process, and thus you need to use multiprocessing primitives if you want the handlers to exchange information with the main test process, or between each other.

See the tests that match tests/sanity/proxy_*.py for examples.

Contributing tests

We always welcome new tests, especially python tests that use the above infrastructure. We have a list of test requests here, but also welcome any other tests that test aspects of the network we haven't thought about.

Cheat sheet/overview of testing utils

This page covers the different testing utils/libraries that we have for easier unit testing in Rust.

Basics

CryptoHash

To create a new crypto hash:

#![allow(unused)]
fn main() {
"ADns6sqVyFLRZbSMCGdzUiUPaDDtjTmKCWzR8HxWsfDU".parse().unwrap();
}

Account

Also, prefer doing parse + unwrap:

#![allow(unused)]
fn main() {
let alice: AccountId = "alice.near".parse().unwrap();
}

Signatures

In memory signer (generates the key based on a seed). There is a slight preference to use the seed that is matching the account name.

This will create a signer for account 'test' using 'test' as a seed.

#![allow(unused)]
fn main() {
let signer: InMemoryValidatorSigner = create_test_signer("test");
}

Block

Use TestBlockBuilder to create the block that you need. This class allows you to set custom values for most of the fields.

#![allow(unused)]
fn main() {
let test_block = test_utils::TestBlockBuilder::new(prev, signer).height(33).build();
}

Store

Use the in-memory test store in tests:

#![allow(unused)]
fn main() {
let store = create_test_store();
}

EpochManager

See usages of MockEpochManager. Note that this is deprecated. Try to use EpochManager itself wherever possible.

Runtime

You can use the KeyValueRuntime (instead of the Nightshade one):

#![allow(unused)]
fn main() {
KeyValueRuntime::new(store, &epoch_manager);
}

Chain

No fakes or mocks.

Client

TestEnv - for testing multiple clients (without network):

#![allow(unused)]
fn main() {
TestEnvBuilder::new(genesis).client(vec!["aa"]).validators(..).epoch_managers(..).build();
}

Network

PeerManager

To create a PeerManager handler:

#![allow(unused)]
fn main() {
let pm = peer_manager::testonly::start(...).await;
}

To connect to others:

#![allow(unused)]
fn main() {
pm.connect_to(&pm2.peer_info).await;
}

Events handling

To wait/handle a given event (as a lot of network code is running in an async fashion):

#![allow(unused)]
fn main() {
pm.events.recv_util(|event| match event {...}).await;
}

End to End

chain, runtime, signer

In chain/chain/src/test_utils.rs:

#![allow(unused)]
fn main() {
// Creates 1-validator (test):  chain, KVRuntime and a signer
let (chain, runtime, signer) = setup();
}

block, client actor, view client

In chain/client/src/test_utils.rs

#![allow(unused)]
fn main() {
let (block, client, view_client) = setup(MANY_FIELDS);
}

Test Coverage

In order to focus the testing effort where it is most needed, we have a few ways we track test coverage.

Codecov

The main one is Codecov. Coverage is visible on this webpage, and displays the total coverage, including unit and integration tests. Codecov is especially interesting for its PR comments. The PR comments, in particular, can easily show which diff lines are being tested and which are not.

However, sometimes Codecov gives too rough estimates, and this is where artifact results come in.

Artifact Results

We also push artifacts, as a result of each CI run. You can access them here:

  1. Click "Details" on one of the CI actions run on your PR (literally any one of the actions is fine, you can also access CI actions runs on any CI)
  2. Click "Summary" on the top left of the opening page
  3. Scroll to the bottom of the page
  4. In the "Artifacts" section, just above the "Summary" section, there is a coverage-html link (there is also coverage-lcov for people who use eg. the coverage gutters vscode integration)
  5. Downloading it will give you a zip file with the interesting files.

In there, you can find:

  • Two -diff files, that contain code coverage for the diff of your PR, to easily see exactly which lines are covered and which are not
  • Two -full folders, that contain code coverage for the whole repository
  • Each of these exists in one unit- variant, that only contains the unit tests, and one integration- variant, that contains all the tests we currently have

To check that your PR is properly tested, if you want better quality coverage than what codecov "requires," you can have a look at unit-diff, because we agreed that we want unit tests to be able to detect most bugs due to the troubles of debugging failing integration tests.

To find a place that would deserve adding more tests, look at one of the -full directories on master, pick one not-well-tested file, and add (ideally unit) tests for the lines that are missing.

The presentation is unfortunately less easy to access than codecov, and less eye-catchy. On the other hand, it should be more precise. In particular, the -full variants show region-based coverage. It can tell you that eg. the ? branch is not covered properly by highlighting it red.

One caveat to be aware of: the -full variants do not highlight covered lines in green, they just highlight non-covered lines in red.

Protocol Upgrade

This document describes the entire cycle of how a protocol upgrade is done, from the initial PR to the final release. It is important for everyone who contributes to the development of the protocol and its client(s) to understand this process.

Background

At NEAR, we use the term protocol version to mean the version of the blockchain protocol and is separate from the version of some specific client (such as nearcore), since the protocol version defines the protocol rather than some specific implementation of the protocol. More concretely, for each epoch, there is a corresponding protocol version that is agreed upon by validators through a voting mechanism. Our upgrade scheme dictates that protocol version X is backward compatible with protocol version X-1 so that nodes in the network can seamlessly upgrade to the new protocol. However, there is no guarantee that protocol version X is backward compatible with protocol version X-2.

Despite the upgrade mechanism, rolling out a protocol change can be scary, especially if the change is invasive. For those changes, we may want to have several months of testing before we are confident that the change itself works and that it doesn't break other parts of the system.

Protocol version voting and upgrade

When a new neard version, containing a new protocol version, is released, all node maintainers need to upgrade their binary. That typically means stopping neard, downloading or compiling the new neard binary and restarting neard. However the protocol version of the whole network is not immediately bumped to the new protocol version. Instead a process called voting takes place and determines if and when the protocol version upgrade will take place.

Voting is a fully automated process in which all block producers across the network vote in support or against upgrading the protocol version. The voting happens in the last block every epoch. Upgraded nodes will begin voting in favour of the new protocol version after a predetermined date. The voting date is configured by the release owner like this. Once at least 80% of the stake votes in favour of the protocol change in the last block of epoch X, the protocol version will be upgraded in the first block of epoch X+2.

For mainnet releases, the release on github typically happens on a Monday or Tuesday, the voting typically happens a week later and the protocol version upgrade happens 1-2 epochs after the voting. This gives the node maintainers enough time to upgrade their neard nodes. The node maintainers can upgrade their nodes at any time between the release and the voting but it is recommended to upgrade soon after the release. This is to accommodate for any database migrations or miscellaneous delays.

Starting a neard node with protocol version voting in the future in a network that is already operating at that protocol version is supported as well. This is useful in the scenario where there is a mainnet security release where mainnet has not yet voted or upgraded to the new version. That same binary with protocol voting date in the future can be released in testnet even though it has already upgraded to the new protocol version.

Nightly Protocol features

To make protocol upgrades more robust, we introduce the concept of a nightly protocol version together with the protocol feature flags to allow easy testing of the cutting-edge protocol changes without jeopardizing the stability of the codebase overall. The use of the nightly and nightly_protocol for new features is mandatory while the use of dedicated rust features for new protocol features is optional and only recommended when necessary. Adding rust features leads to conditional compilation which is generally not developer friendly. In Cargo.toml file of the crates we have in nearcore, we introduce rust compile-time features nightly_protocol and nightly:

nightly_protocol = []
nightly = [
    "nightly_protocol",
    ...
]

where nightly_protocol is a marker feature that indicates that we are on nightly protocol whereas nightly is a collection of new protocol features which also implies nightly_protocol.

When it is not necessary to use a rust feature for the new protocol feature the Cargo.toml file will remain unchanged.

When it is necessary to use a rust feature for the new protocol feature, it can be added to the Cargo.toml, to the nightly features. For example, when we introduce EVM as a new protocol change, suppose the current protocol version is 40, then we would do the following change in Cargo.toml:

nightly_protocol = []
nightly = [
    "nightly_protocol",
    "protocol_features_evm",
    ...
]

In core/primitives/src/version.rs, we would change the protocol version by:

#![allow(unused)]
fn main() {
#[cfg(feature = “nightly_protocol”)]
pub const PROTOCOL_VERSION: u32 = 100;
#[cfg(not(feature = “nightly_protocol”)]
pub const PROTOCOL_VERSION: u32 = 40;
}

This way the stable versions remain unaffected after the change. Note that nightly protocol version intentionally starts at a much higher number to make the distinction between the stable protocol and nightly protocol clearer.

To determine whether a protocol feature is enabled, we do the following:

  • We maintain a ProtocolFeature enum where each variant corresponds to some protocol feature. For nightly protocol features, the variant may optionally be gated by the corresponding rust compile-time feature.
  • We implement a function protocol_version to return, for each variant, the corresponding protocol version in which the feature is enabled.
  • When we need to decide whether to use the new feature based on the protocol version of the current network, we can simply compare it to the protocol version of the feature. To make this simpler, we also introduced a macro checked_feature

For more details, please refer to core/primitives/src/version.rs.

Feature Gating

It is worth mentioning that there are two types of checks related to protocol features:

  • Runtime checks that compare the protocol version of the current epoch and the protocol version of the feature. Those runtime checks must be used for both stable and nightly features.
  • Compile time checks that check if the rust feature corresponding with the protocol feature is enabled. This check is optional and can only be used for nightly features.

Testing

Nightly protocol features allow us to enable the most bleeding-edge code in some testing environments. We can choose to enable all nightly protocol features by

#![allow(unused)]
fn main() {
cargo build -p neard --release --features nightly
}

or enable some specific protocol feature by

#![allow(unused)]
fn main() {
cargo build -p neard --release --features nightly_protocol,<protocol_feature>
}

In practice, we have all nightly protocol features enabled for Nayduck tests and on betanet, which is updated daily.

Feature Stabilization

New protocol features are introduced first as nightly features and when the author of the feature thinks that the feature is ready to be stabilized, they should submit a pull request to stabilize the feature using this template. In this pull request, they should do the feature gating, increase the PROTOCOL_VERSION constant (if it hasn't been increased since the last release), and change the protocol_version implementation to map the stabilized features to the new protocol version.

A feature stabilization request must be approved by at least two nearcore code owners. Unless it is a security-related fix, a protocol feature cannot be included in any release until at least one week after its stabilization. This is to ensure that feature implementation and stabilization are not rushed.

This document describes the advanced network options that you can configure by modifying the "network" section of your "config.json" file:

{
  // ...
  "network": {
    // ...
    "public_addrs": [],
    "allow_private_ip_in_public_addrs": false,
    "experimental": {
      "inbound_disabled": false,
      "connect_only_to_boot_nodes": false,
      "skip_sending_tombstones_seconds": 0,
      "tier1_enable_inbound": true,
      "tier1_enable_outbound": false,
      "tier1_connect_interval": {
        "secs": 60,
        "nanos": 0
      },
      "tier1_new_connections_per_attempt": 50
    }
  },
  // ...
}

TIER1 network

Participants of the BFT consensus (block & chunk producers) now can establish direct (aka TIER1) connections between each other, which will optimize the communication latency and minimize the number of dropped chunks. If you are a validator, you can enable TIER1 connections by setting the following fields in the config:

  • public_addrs
    • this is a list of the public addresses (in the format "<node public key>@<IP>:<port>") of trusted nodes, which are willing to route messages to your node
    • this list will be broadcasted to the network so that other validator nodes can connect to your node.
    • if your node has a static public IP, set public_addrs to a list with a single entry with the public key and address of your node, for example: "public_addrs": ["ed25519:86EtEy7epneKyrcJwSWP7zsisTkfDRH5CFVszt4qiQYw@31.192.22.209:24567"].
    • if your node doesn't have a public IP (for example, it is hidden behind a NAT), set public_addrs to a list (<=10 entries) of proxy nodes that you trust (arbitrary nodes with static public IPs).
    • support for nodes with dynamic public IPs is not implemented yet.
  • experimental.tier1_enable_outbound
    • makes your node actively try to establish outbound TIER1 connections (recommended) once it learns about the public addresses of other validator nodes. If disabled, your node won't try to establish outbound TIER1 connections, but it still may accept incoming TIER1 connections from other nodes.
    • currently false by default, but will be changed to true by default in the future
  • experimental.tier1_enable_inbound
    • makes your node accept inbound TIER1 connections from other validator nodes.
    • disable both tier1_enable_inbound and tier1_enable_outbound if you want to opt-out from the TIER1 communication entirely
    • disable tier1_enable_inbound if you are not a validator AND you don't want your node to act as a proxy for validators.
    • true by default

Starting a test chain from state taken from mainnet or testnet

Purpose

For testing purposes, it is often desirable to start a test chain with a starting state that looks like mainnet or testnet. This is usually done for the purpose of testing changes to neard itself, but it's also possible to do this if you're a contract developer and want to see what a change to your contract would look like on top of the current mainnet state. At the end of the process described here, you'll have a set of genesis records that can be used to start your own test chain, that'll be like any other test chain like the ones generated by the neard localnet command, except with account balances and data taken from mainnet

How-to

The first step is to obtain an RPC node home directory for the chain you'd like to spoon. So if you want to use mainnet state, you can follow the instructions here to obtain a recent snapshot of a mainnet node's home directory. Once you have your node's home directory set up, run the following state-viewer command to generate a dump of the chain's state:

$ neard --home $NEAR_HOME_DIRECTORY view-state dump-state --stream

This command will take a while (possibly many hours) to run. But at the end you should have genesis.json and records.json files in $NEAR_HOME_DIRECTORY/output. This records file represents all of the chain's current state, and is what we'll use to start our chain.

From here, we need to make some changes to the genesis.json that was generated in $NEAR_HOME_DIRECTORY/output. To see why, note that the validators field of this genesis file lists all the current mainnet validators and their keys. So that means if we were to try and start a test chain from the generated genesis and records files as-is, it would work, but our node would expect the current mainnet validators to be producing blocks and chunks (which they definitely won't be! Because we're the only ones who know or care about this new test chain).

So we need to select a new list of validators to start off our chain. Suppose that we want our chain to have two validators, validator0.near and validator1.near. Let's make a new directory where we'll be storing intermediate files during this process:

$ mkdir ~/test-chain-scratch

then using your favorite editor, lay out the validators you want in the test chain as a JSON list in the same format as the validators field in genesis.json, maybe in the file ~/test-chain-scratch/validators.json

[
  {
    "account_id": "validator0.near",
    "public_key": "ed25519:GRAFkrqEkJAbdbWUgc6fDnNpCTE83C3pzdJpjAHkMEhq",
    "amount": "100000000000000000000000000000000"
  },
  {
    "account_id": "validator1.near",
    "public_key": "ed25519:5FxQQTC9mk5kLAhTF9ffDMTXiyYrDXyGYskgz46kHMdd",
    "amount": "100000000000000000000000000000000"
  }
]

These validator keys should be keys you've already generated. So for the rest of this document, we'll assume you've run:

$ neard --home ~/near-test-chain/validator0 init --account-id validator0.near
$ neard --home ~/near-test-chain/validator1 init --account-id validator1.near

This is also a good time to think about what extra accounts you might want in your test chain. Since all accounts in the test chain will have the same keys as they do on mainnet, you'll only have access to the accounts that you have access to on mainnet. If you want to add an account with a large balance to properly test things out, you can write them out in a file as a JSON list of state records (in the same format as they appear in records.json). For example, you could put the following in ~/test-chain-scratch/extra-records.json:

[
  {
    "Account": {
      "account_id": "my-test-account.near",
      "account": {
        "amount": "10000000000000000000000000000000000",
        "locked": "0",
        "code_hash": "11111111111111111111111111111111",
        "storage_usage": 182,
        "version": "V1"
      }
    }
  },
  {
    "AccessKey": {
      "account_id": "my-test-account.near",
      "public_key": "ed25519:Eo9W44tRMwcYcoua11yM7Xfr1DjgR4EWQFM3RU27MEX8",
      "access_key": {
        "nonce": 0,
        "permission": "FullAccess"
      }
    }
  }
]

You'll want to include an access key here, otherwise you won't be able to do anything with the account. Note that here you can also add access keys for any mainnet account you want, so you'll be able to control it in the test chain.

Now to make these changes to the genesis and records files, you can use the neard amend-genesis command like so:

# mkdir ~/near-test-chain/
$ neard amend-genesis --genesis-file-in $NEAR_HOME_DIRECTORY/output/genesis.json --records-file-in $NEAR_HOME_DIRECTORY/output/records.json --validators ~/test-chain-scratch/validators.json --extra-records ~/test-chain-scratch/extra-records.json --chain-id $TEST_CHAIN_ID --records-file-out ~/near-test-chain/records.json --genesis-file-out ~/near-test-chain/genesis.json

Starting the network

After running the previous steps you should have the files genesis.json and records.json in ~/near-test-chain/. Assuming you've started it with the two validators validator0.near and validator1.near as described above, you'll want to run at least two nodes, one for each of these validator accounts. If you're working with multiple computers or VMs that can connect to each other over the internet, you'll be able to run your test network over the internet as is done with the "real" networks (mainnet, testnet, etc.). But for now let's assume that you want to run this on only one machine.

So assuming you've initialized home directories for each of the validators with the init command described above, you'll want to copy the records and genesis files generated in the previous step to each of these:

$ cp ~/near-test-chain/records.json ~/near-test-chain/validator0
$ cp ~/near-test-chain/genesis.json ~/near-test-chain/validator0
$ cp ~/near-test-chain/records.json ~/near-test-chain/validator1
$ cp ~/near-test-chain/genesis.json ~/near-test-chain/validator1

Now we'll need to make a few config changes to each of ~/near-test-chain/validator0/config.json and ~/near-test-chain/validator1/config.json:

changes to ~/near-test-chain/validator0/config.json:

{
  "genesis_records_file": "records.json",
  "rpc": {
    "addr": "0.0.0.0:3030"
  },
  "network": {
    "addr": "0.0.0.0:24567",
    "boot_nodes": "ed25519:Dk4A7NPBYFPwKWouiSUoyZ15igbLSrcPEJqUqDX4grb7@127.0.0.1:24568",
    "skip_sync_wait": false,
  },
  "consensus": {
    "min_num_peers": 1
  },
  "tracked_shards": [0],
}

changes to ~/near-test-chain/validator1/config.json:

{
  "genesis_records_file": "records.json",
  "rpc": {
    "addr": "0.0.0.0:3031"
  },
  "network": {
    "addr": "0.0.0.0:24568",
    "boot_nodes": "ed25519:6aR4xVQedQ7Z9URrASgwBY8bedpaYzgH8u5NqEHp2hBv@127.0.0.1:24567",
    "skip_sync_wait": false,
  },
  "consensus": {
    "min_num_peers": 1
  },
  "tracked_shards": [0],
}

Here we make sure to have each node listen on different ports, while telling each about the other via network.boot_nodes. In this boot_nodes string, we set the public key not to the validator key, but to whatever key is present in the node_key.json file you got when you initialized the home directory. So for validator0's config, we set its boot node to validator1's node key, followed by the address of the socket it should be listening on. We also want to drop the minimum required number of peers, since we're just running a small test network locally. We set skip_sync_wait to false, because otherwise we get strange behavior that will often make your network stall.

After making these changes, you can try running one neard process for each of your validators:

$ neard --home ~/near-test-chain/validator0 run
$ neard --home ~/near-test-chain/validator1 run

Now these nodes will begin by taking the records laid out in records.json and turning them into a genesis state. At the time of this writing, using the latest nearcore version from the master branch, this will take a couple hours. But your validators should begin producing blocks after that's done.

Overview

This chapter holds various assorted bits of docs. If you want to document something, but don't know where to put it, put it here!

Crate Versioning and Publishing

While all the crates in the workspace are directly unversioned (v0.0.0), they all share a unified variable version in the workspace manifest. This keeps versions consistent across the workspace and informs their versions at the moment of publishing.

We also have CI infrastructure set up to automate the publishing process to crates.io. So, on every merge to master, if there's a version change, it is automatically applied to all the crates in the workspace and it attempts to publish the new versions of all non-private crates. All crates that should be exempt from this process should be marked private. That is, they should have the publish = false specification in their package manifest.

This process is managed by cargo-workspaces, with a bit of magic sprinkled on top.

Issue Labels

Issue labels are of the following format <type>-<content> where <type> is a capital letter indicating the type of the label and <content> is a hyphened phrase indicating what this label is about. For example, in the label C-bug, C means category and bug means that the label is about bugs. Common types include C, which means category, A, which means area and T, which means team.

An issue can have multiple labels including which area it touches, which team should be responsible for the issue, and so on. Each issue should have at least one label attached to it after it is triaged and the label could be a general one, such as C-enhancement or C-bug.

Experimental: Dump State to External Storage

Purpose

State Sync is being reworked.

A new version is available for experimental use. This version gets state parts from external storage. The following kinds of external storage are supported:

  • Local filesystem
  • Google Cloud Storage
  • Amazon S3

A new version of decentralized state sync is work in progress.

How-to

neard release 1.36.0-rc.1 adds an experimental option to sync state from external storage.

See how-to how to configure your node to State Sync from External Storage.

In case you would like to manage your own dumps of State, keep reading.

Google Cloud Storage

To enable Google Cloud Storage as your external storage, add this to your config.json file:

"state_sync": {
  "dump": {
    "location": {
      "GCS": {
        "bucket": "my-gcs-bucket",
      }
    }
  }
}

And run your node with an environment variable SERVICE_ACCOUNT or GOOGLE_APPLICATION_CREDENTIALS pointing to the credentials json file

SERVICE_ACCOUNT=/path/to/file ./neard run

Amazon S3

To enable Amazon S3 as your external storage, add this to your config.json file:

"state_sync": {
  "dump": {
    "location": {
      "S3": {
        "bucket": "my-aws-bucket",
        "region": "my-aws-region"
      }
    }    
  }
}

And run your node with environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY:

AWS_ACCESS_KEY_ID="MY_ACCESS_KEY" AWS_SECRET_ACCESS_KEY="MY_AWS_SECRET_ACCESS_KEY" ./neard run

Dump to a local filesystem

Add this to your config.json file to dump state of every epoch to local filesystem:

"state_sync": {
  "dump": {
    "location": {
      "Filesystem": {
        "root_dir": "/tmp/state-dump"
      }
    }    
  }
}

In this case you don't need any extra environment variables. Simply run your node:

./neard run

Archival node - Recovery of missing data

Incident description

In early 2024 there have been a few incidents on archival node storage. As result of these incidents archival node might have lost some data from January to March.

The practical effect of this issue is that requests querying the state of an account may fail, returning instead an internal server error.

Check if my node has been impacted

The simplest way to check whether or not a node has suffered data loss is to run one of the following queries:

replace <RPC_URL> with the correct URL (example: http://localhost:3030)

curl -X POST <RPC_URL> \
        -H "Content-Type: application/json" \
        -d '
        { "id": "dontcare", "jsonrpc": "2.0", "method": "query", "params": { "account_id": "b001b461c65aca5968a0afab3302a5387d128178c99ff5b2592796963407560a", "block_id": 109913260, "request_type": "view_account" } }'
curl -X POST <RPC_URL> \
        -H "Content-Type: application/json" \
        -d '
        { "id": "dontcare", "jsonrpc": "2.0", "method": "query", "params": { "account_id": "token2.near", "block_id": 114580308, "request_type": "view_account" } }'
curl -X POST <RPC_URL> \
        -H "Content-Type: application/json" \
        -d '
        { "id": "dontcare", "jsonrpc": "2.0", "method": "query", "params": { "account_id": "timpanic.tg", "block_id": 115185110, "request_type": "view_account" } }'
curl -X POST <RPC_URL> \
        -H "Content-Type: application/json" \
        -d '
        { "id": "dontcare", "jsonrpc": "2.0", "method": "query", "params": { "account_id": "01.near", "block_id": 115514400, "request_type": "view_account" } }'

If for any of the above requests you get an error of the kind MissingTrieValue it means that the node presents the issue.

Remediation steps

  1. Stop neard

  2. Delete the existing hot and cold databases. Example assuming default configuration:

    rm -rf ~/.near/hot-data && rm -rf ~/.near/cold-data
    
  3. Download an up-to-date snapshot, following this guide: Storage snapshots

Option B: manually run recovery commands

Follow the instructions below if, for any reason, you prefer to perform the manual recovery steps on your node instead of downloading a new snapshot.

Requirements:

  • neard must be stopped while recovering data
  • cold storage must be mounted on an SSD disk or better
  • The config resharding_config.batch_delay must be set to 0.

After the recovery is finished the configuration changes can be undone and a hard disk drive can be used to mount the cold storage.

Important considerations:

  • The recovery procedure will, most likely, take more than one week
    • Since neard must be stopped in order to execute the commands, the node won't be functional during this period of time
  • The node must catch up to the chain's head after completing the data recovery, this could take days as well
    • During catch up the node can answer RPC queries. However, the chain head is still at the point where the node was stopped; for this reason recent blocks won't be available immediately

We published a reference recovery script in the nearcore repository. Your neard setup might be different, so the advice is to thoroughly check the script before running it. For completeness, here we include the set of commands to run:

neard view-state -t cold --readwrite apply-range --start-index 109913254 --end-index 110050000 --shard-id 2 --storage trie-free --save-state cold sequential
RUST_LOG=debug neard database resharding --height 114580307 --shard-id 0 --restore
RUST_LOG=debug neard database resharding --height 114580307 --shard-id 1 --restore
RUST_LOG=debug neard database resharding --height 114580307 --shard-id 2 --restore
RUST_LOG=debug neard database resharding --height 114580307 --shard-id 3 --restore
RUST_LOG=debug neard database resharding --height 115185107 --shard-id 0 --restore
RUST_LOG=debug neard database resharding --height 115185107 --shard-id 1 --restore
RUST_LOG=debug neard database resharding --height 115185107 --shard-id 2 --restore
RUST_LOG=debug neard database resharding --height 115185107 --shard-id 3 --restore
RUST_LOG=debug neard database resharding --height 115185107 --shard-id 4 --restore

Verify if remediation has been successful

Run the queries specified in the section: Check if my node has been impacted. All of them should return a successful response now.