Introduction
Welcome to the nearcore development guide!
The target audience of this guide are developers of nearcore itself. If you are a user of NEAR (either a contract developer, or validator running a node), please refer to the user docs at https://docs.near.org.
This guide is built with mdBook from sources in the nearcore repository. You can edit it by pressing the "edit" icon in the top right corner; we welcome all contributions. The guide is hosted at https://near.github.io/nearcore/.
The guide is organized as a collection of loosely coupled chapters -- you don't need to read them in order, feel free to peruse the TOC, and focus on the interesting bits. The chapters are classified into three parts:
- Architecture talks about how the code works. So, for example, if you are interested in how a transaction flows through the system, look there!
- Practices describe, broadly, how we write code. For example, if you want to learn about code style, issue tracking, or debugging performance problems, this is the chapter for you.
- Finally, the Misc part holds various assorted bits and pieces. We are trying to bias ourselves towards writing more docs, so, if you want to document something and it doesn't cleanly map to a category above, just put it in misc!
If you are unsure, start with the Architecture Overview and then read Run a Node.
Overview
This document describes the high-level architecture of nearcore. The focus here is on the implementation of the blockchain protocol, not the protocol itself. For reference documentation of the protocol, please refer to nomicon.
Some parts of our architecture are also covered in this video series on YouTube.
Bird's Eye View
If we put the entirety of nearcore onto one picture, we get something like this:
Don't worry if this doesn't yet make a lot of sense: hopefully, by the end of this document the above picture will become much clearer!
Overall Operation
`nearcore` is a blockchain node -- it's a single binary (`neard`) which runs on some machine and talks to other similar binaries running elsewhere. Together, the nodes agree (using a distributed consensus algorithm) on a particular sequence of transactions. Once the transaction sequence is established, each node applies the transactions to the current state. Because transactions are fully deterministic, each node in the network ends up with identical state. To allow greater scalability, the NEAR protocol uses sharding, which allows a node to hold only a small subset (shard) of the whole state.
`neard` is a stateful, restartable process. When `neard` starts, the node connects to the network and starts processing blocks (a block is a batch of transactions processed together; transactions are batched into blocks for greater efficiency). The results of processing are persisted in the database; RocksDB is used for storage. Usually, the node's data is found in the `~/.near` directory. The node can be stopped at any moment and restarted later. While the node is offline it misses blocks, so, after a restart, the sync process kicks in and brings the node up to speed with the network by downloading the missing bits of history from more up-to-date peer nodes.
Major components of nearcore:
- JSON RPC. This HTTP RPC interface is how `neard` communicates with the outside, non-blockchain world. For example, to submit a transaction, some client sends an RPC request with it to some node in the network. From that node, the transaction propagates through the network until it is included in some block. Similarly, a client can send an HTTP request to a node to learn about the current state of the blockchain. The JSON RPC interface is documented here.
- Network. If RPC is aimed "outside" the blockchain, "network" is how peer `neard` nodes communicate with each other within the blockchain. RPC carries requests from users of the blockchain, while the network carries various messages needed to implement consensus. Two directly connected nodes communicate by sending protobuf-encoded messages over TCP. A node also includes logic to route messages for indirect peers through intermediaries. Oversimplifying a lot, it's enough for a new node to know the IP address of just one other network participant. From this bootstrap connection, the node learns how to communicate with any other node in the network.
- Client. Somewhat confusingly named, client is the logical state of the blockchain. After receiving and decoding a request, both RPC and network usually forward it in parsed form to the client. Internally, the client is split into two somewhat independent components: chain and runtime.
- Chain. The job of chain, in a nutshell, is to determine a global order of transactions. Chain builds and maintains the blockchain data structure. This includes block and chunk production and processing, consensus, and validator selection. However, chain is not responsible for actually applying transactions and receipts.
- Runtime. If chain selects the order of transactions, Runtime applies transactions to the state. Chain guarantees that everyone agrees on the order and content of transactions, and Runtime guarantees that each transaction is fully deterministic. It follows that everyone agrees on the "current state" of the blockchain. Some transactions are as simple as "transfer X tokens from Alice to Bob". But a much more powerful class of transactions is supported: "run this arbitrary WebAssembly code in the context of the current state of the chain". Running such "smart contract" transactions securely and efficiently is a major part of what Runtime does. Today, Runtime uses a JIT compiler to do that.
- Storage. Storage is more of a cross-cutting concern than an isolated component. Many parts of a node want to durably persist various bits of state to disk. One notable case is the logical state of the blockchain, and, in particular, the data associated with each account. Logically, the state of an account on a chain is a key-value map: `HashMap<Vec<u8>, Vec<u8>>`. But there is a twist: it should be possible to provide a succinct proof that a particular key indeed holds a particular value. To allow that, internally the state is implemented as a persistent (in both senses, "functional" and "on disk") merkle-patricia trie.
- Parameter Estimator. One kind of transaction we support is "run this arbitrary, Turing-complete computation". To protect from a `loop {}` transaction halting the whole network, Runtime implements resource limiting: each transaction runs with a certain finite amount of "gas", and each operation costs a certain amount of gas to perform. The parameter estimator is essentially a set of benchmarks used to estimate the relative gas costs of various operations.
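To make the gas-metering idea concrete, here is a minimal, illustrative sketch; the types, costs, and limit are made up for the example and do not mirror the actual gas accounting code in nearcore:

```rust
// Illustrative sketch only: names, costs, and the limit are invented for
// the example and are not the real nearcore gas accounting.
#[derive(Debug)]
struct GasExceeded;

struct GasCounter {
    burnt: u64,
    limit: u64,
}

impl GasCounter {
    fn charge(&mut self, cost: u64) -> Result<(), GasExceeded> {
        // Every operation pays up front; once the prepaid gas is gone,
        // execution stops instead of looping forever.
        let new_burnt = self.burnt.saturating_add(cost);
        if new_burnt > self.limit {
            return Err(GasExceeded);
        }
        self.burnt = new_burnt;
        Ok(())
    }
}

fn main() {
    let mut counter = GasCounter { burnt: 0, limit: 1_000 };
    // Even an "infinite" loop terminates once the gas runs out.
    loop {
        if counter.charge(100).is_err() {
            println!("out of gas after burning {}", counter.burnt);
            break;
        }
    }
}
```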
Entry Points
`neard/src/main.rs` contains the main function that starts a blockchain node. However, this file mostly contains the logic to parse arguments and dispatch different commands. `start_with_config` in `nearcore/src/lib.rs` is the actual entry point and it starts all the actors.
`JsonRpcHandler::process` in the `jsonrpc` crate is the RPC entry point. It implements the public API of a node, which is documented here.
`PeerManagerActor::spawn` in the `network` crate is the entry point for the other point of contact with the outside world -- the peer-to-peer network.
`Runtime::apply` in the `runtime` crate is the entry point for the transaction processing logic. This is where state transitions actually happen, after the chain has decided, according to distributed consensus, which transitions need to happen.
Code Map
This section contains some high-level overview of important crates and data structures.
core/primitives
This crate contains most of the types that are shared across different crates.
core/primitives-core
This crate contains types needed for runtime.
core/store/trie
This directory contains the MPT state implementation. Note that we usually use `TrieUpdate` to interact with the state.
chain/chain
This crate contains most of the chain logic (consensus, block processing, etc).
`ChainUpdate::process_block` is where most of the block processing logic happens.
State update
The blockchain state of a node can be changed in the following two ways:
- Applying a chunk. This is how the state is normally updated: through `Runtime::apply`.
- State sync. State sync can happen in two cases:
  - A node is far enough behind the most recent block and triggers state sync to fast forward to the state of a very recent block without having to apply blocks in the middle.
  - A node is about to become a validator for some shard in the next epoch, but it does not yet have the state for that shard. In this case, it would run state sync through the `catchup` routine.
chain/chunks
This crate contains most of the sharding logic, which includes chunk creation, distribution, and processing. `ShardsManager` is the main struct that orchestrates everything here.
chain/client
This crate defines two important structs, `Client` and `ViewClient`. `Client` includes everything necessary for the chain (without network and runtime) to function and runs in a single thread. `ViewClient` is a "read-only" client that answers queries without interfering with the operations of `Client`. `ViewClient` runs in multiple threads.
chain/network
This crate contains the entire implementation of the p2p network used by NEAR blockchain nodes.
Two important structs here: `PeerManagerActor` and `Peer`. Peer manager orchestrates all the communications from the network to other components and from other components to the network. `Peer` is responsible for low-level network communications from and to a given peer (more details in this article). Peer manager runs in one thread while each `Peer` runs in its own thread.
Architecture Invariant: Network communicates to `Client` through `NetworkClientMessages` and to `ViewClient` through `NetworkViewClientMessages`. Conversely, `Client` and `ViewClient` communicate to the network through `NetworkRequests`.
chain/epoch_manager
This crate is responsible for determining validators and other epoch-related information, such as the epoch id, for each epoch.
Note: `EpochManager` is constructed in `NightshadeRuntime` rather than in `Chain`, partially because we had this idea of making the epoch manager a smart contract.
chain/jsonrpc
This crate implements the JSON-RPC API server to enable submission of new transactions and inspection of the blockchain data, the network state, and the node status. When a request is processed, it generates a message to either `ClientActor` or `ViewClientActor` to interact with the blockchain. For queries of blockchain data, such as block, chunk, account, etc., the request usually generates a message to `ViewClientActor`. Transactions, on the other hand, are sent to `ClientActor` for further processing.
runtime/runtime
This crate contains the main entry point to the runtime -- `Runtime::apply`. This function takes `ApplyState`, which contains the necessary information passed from chain to runtime, a list of `SignedTransaction` and a list of `Receipt`, and returns an `ApplyResult`, which includes state changes, execution outcomes, etc.
Architecture Invariant: The state update is only finalized at the end of `apply`. During all intermediate steps state changes can be reverted.
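For orientation, here is a much-simplified sketch of the shape of this entry point; all types below are hypothetical stubs, and the real `Runtime::apply` has more parameters and a richer return type:

```rust
// Hypothetical stubs that only illustrate the inputs and outputs described above.
struct ApplyState;        // context passed from chain to runtime (height, gas price, ...)
struct SignedTransaction; // new transactions included in the chunk
struct Receipt;           // incoming receipts produced by earlier chunks
struct ApplyResult;       // state changes, outgoing receipts, execution outcomes, ...

struct Runtime;

impl Runtime {
    // Simplified sketch: apply transactions and receipts on top of the current
    // state; intermediate changes can be rolled back until `apply` finishes.
    fn apply(
        &self,
        _apply_state: &ApplyState,
        _transactions: &[SignedTransaction],
        _receipts: &[Receipt],
    ) -> ApplyResult {
        ApplyResult
    }
}

fn main() {
    let runtime = Runtime;
    let _result = runtime.apply(&ApplyState, &[], &[]);
}
```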
runtime/near-vm-logic
`VMLogic` contains all the implementations of host functions and is the interface between the runtime and wasm. `VMLogic` is constructed when the runtime applies function call actions. In `VMLogic`, interaction with the NEAR blockchain happens in the following two ways:
- `VMContext`, which contains lightweight information such as the current block hash, current block height, epoch id, etc.
- `External`, which is a trait that contains functions to interact with the blockchain by either reading some nontrivial data, or writing to the blockchain.
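As a rough mental model (the actual definitions in nearcore are much larger), the split looks something like this hypothetical sketch:

```rust
// Hypothetical, heavily trimmed stand-ins for the two interfaces described
// above; the real VMContext and External have many more fields and methods.
struct VMContext {
    current_block_height: u64,
    // ... block hash, epoch id, predecessor/signer accounts, attached deposit, etc.
}

trait External {
    // Read some non-trivial data from the blockchain state.
    fn storage_get(&self, key: &[u8]) -> Option<Vec<u8>>;
    // Write back to the blockchain state.
    fn storage_set(&mut self, key: &[u8], value: &[u8]);
}

fn main() {
    let ctx = VMContext { current_block_height: 0 };
    println!("block height passed to the contract: {}", ctx.current_block_height);
}
```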
runtime/near-vm-runner
The `run` function in `runner.rs` is the entry point to the VM runner. This function essentially spins up the VM and executes some function in a contract. It supports different wasm compilers, including wasmer0, wasmer2, and wasmtime, through compile-time feature flags. Currently we use wasmer0 and wasmer2 in production. The `imports` module exposes host functions defined in `near-vm-logic` to WASM code. In other words, it defines the ABI of the contracts on NEAR.
neard
As mentioned before, `neard` is the crate that contains the main entry points. All the actors are spawned in `start_with_config`. It is also worth noting that `NightshadeRuntime` is the struct that implements `RuntimeAdapter`.
core/store/src/db.rs
This file contains the schema (`DBCol`) of our internal RocksDB storage - a good starting point when reading the code base.
Cross Cutting Concerns
Observability
The tracing crate is used for structured, hierarchical event output and logging. We also integrate Prometheus for lightweight metric output. See the style documentation for more information on usage.
Testing
Rust has built-in support for writing unit tests by marking functions with the `#[test]` attribute. Take full advantage of that! Testing not only confirms that what was written works the way it was intended to but also helps during refactoring since it catches unintended behaviour changes.
Not all tests are created equal though, and while some may only need milliseconds to run, others may run for several seconds or even minutes. Tests that take a long time should be marked as such by prefixing their name with `slow_test_` or `ultra_slow_test_`:
```rust
#[test]
fn ultra_slow_test_catchup_random_single_part_sync() {
    test_catchup_random_single_part_sync_common(false, false, 13)
}
```
During local development both slow and ultra-slow tests will not run with a typical `just nextest` invocation. You can run them with `just nextest-slow` or `just nextest-all` locally. CI will run the slow tests, and the ultra-slow ones are left to run on Nayduck.
Because ultra-slow tests are run on Nayduck, they need to be explicitly included in the `nightly/expensive.txt` file; for example:
```
expensive --timeout=1800 near-client near_client tests::catching_up::ultra_slow_test_catchup_random_single_part_sync
expensive --timeout=1800 near-client near_client tests::catching_up::ultra_slow_test_catchup_random_single_part_sync --features nightly
```
For more details regarding nightly tests see `nightly/README.md`.
Note that what counts as a slow test is defined in `.config/nextest.toml`.
How neard works
This chapter describes how neard works with a focus on implementation details and practical scenarios. To get a better understanding of how the protocol works, please refer to nomicon. For a high-level code map of nearcore, please refer to this document.
High level overview
At a high level, neard is a daemon that periodically receives messages from the network and sends messages to peers based on different triggers. Neard is implemented using an actor framework called actix.
Note: Using actix was decided in the early days of the implementation of nearcore and by no means represents our confidence in actix. On the contrary, we have noticed a number of issues with actix and are considering implementing an actor framework in house.
There are several important actors in neard:
- `PeerActor` - Each peer is represented by one peer actor and runs in a separate thread. It is responsible for sending messages to and receiving messages from a given peer. After `PeerActor` receives a message, it will route it to `ClientActor`, `ViewClientActor`, or `PeerManagerActor` depending on the type of the message.
- `PeerManagerActor` - Peer Manager is responsible for receiving messages to send to the network from either `ClientActor` or `ViewClientActor` and routing them to the right `PeerActor` to send the bytes over the wire. It is also responsible for handling some types of network messages received and routed through `PeerActor`. For the purpose of this document, we only need to know that `PeerManagerActor` handles `RoutedMessage`s. Peer manager would decide whether the `RoutedMessage`s should be routed to `ClientActor` or `ViewClientActor`.
- `ClientActor` - Client actor is the "core" of neard. It contains all the main logic including consensus, block and chunk processing, state transition, garbage collection, etc. Client actor is single-threaded.
- `ViewClientActor` - View client actor can be thought of as a read-only interface to the client. It only accesses data stored in a node's storage and does not mutate any state. It is used for two purposes:
  - Answering RPC requests by fetching the relevant piece of data from storage.
  - Handling some network requests that do not require any changes to the storage, such as header sync, state sync, and block sync requests.
`ViewClientActor` runs in four threads by default but this number is configurable.
Data flow within neard
Flow for incoming messages:
Flow for outgoing messages:
How neard operates when it is fully synced
When a node is fully synced, the main logic of the node operates in the following way (the node is assumed to track all shards, as most nodes on mainnet do today):
- A block is produced by some block producer and sent to the node through broadcasting.
- The node receives a block and tries to process it. If the node is synced, it presumably has the previous block and the state before the current block to apply. It then checks whether it has all the chunks available. If the node is not a validator node, it won't have any chunk parts and therefore won't have the chunks available. If the node is a validator node, it may already have chunk parts through chunk parts forwarding from other nodes and therefore may have already reconstructed some chunks. Regardless, if the node doesn't have all chunks for all shards, it will request them from peers by parts.
- The chunk requests are sent and the node waits for enough chunk parts to be received to reconstruct the chunks. For each chunk, 1/3 of all the parts (100) is sufficient to reconstruct a chunk. If new blocks arrive while waiting for chunk parts, they will be put into an `OrphanPool`, waiting to be processed. If a chunk part request is not responded to within `chunk_request_retry_period`, which is set to 400ms by default, then a request for the same chunk part would be sent again.
- After all chunks are reconstructed, the node processes the current block by applying transactions and receipts from the chunks. Afterwards, it will update the head according to the fork choice rule, which only looks at block height: if the newly processed block is of higher height than the current head of the node, the head is updated (see the sketch after this list).
- The node checks whether any blocks in the `OrphanPool` are ready to be processed in a BFS order and processes all of them until none can be processed anymore. Note that a block is put into the `OrphanPool` if and only if its previous block is not accepted.
- Upon acceptance of a block, the node would check whether it needs to run garbage collection. If it needs to, it would garbage collect two blocks worth of data at a time. The logic of garbage collection is complicated and can be found here.
- If the node is a validator node, it would start a timer after the current block is accepted. After `min_block_production_delay`, which is currently configured to be 1.3s on mainnet, it would send an approval to the block producer of the next block (current block height + 1).
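To illustrate the fork choice rule mentioned in the list above, here is a minimal sketch; `BlockInfo` and the helper are illustrative stand-ins, not the actual nearcore types:

```rust
// Illustrative only: the head moves to a newly processed block exactly when
// that block is higher than the current head.
#[derive(Clone, Copy)]
struct BlockInfo {
    height: u64,
}

fn maybe_update_head(head: &mut BlockInfo, new_block: BlockInfo) {
    if new_block.height > head.height {
        *head = new_block;
    }
}

fn main() {
    let mut head = BlockInfo { height: 100 };
    maybe_update_head(&mut head, BlockInfo { height: 101 });
    assert_eq!(head.height, 101);
    // A lower block (e.g. from a fork) does not move the head.
    maybe_update_head(&mut head, BlockInfo { height: 99 });
    assert_eq!(head.height, 101);
}
```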
The main logic is illustrated below:
How neard works when it is synchronizing
`PeerManagerActor` periodically sends a `NetworkInfo` message to `ClientActor` to update it on the latest peer information, which includes the height of each peer. Once `ClientActor` realizes that it is more than `sync_height_threshold` (which by default is set to 1) behind the highest height among peers, it starts to sync. The synchronization process is done in three steps:
- Header sync. The node first identifies the headers it needs to sync through a `get_locator` calculation. This is essentially an exponential backoff computation that tries to identify commonly known headers between the node and its peers. Then it would request headers from different peers, at most `MAX_BLOCK_HEADER_HASHES` (which is 512) headers at a time.
- After the headers are synced, the node would determine whether it needs to run state sync. The exact condition can be found here, but basically a node would do state sync if it is more than 2 epochs behind the head of the network. State sync is a very complex process and warrants its own section. We will give a high level overview here.
  - First, the node computes `sync_hash`, which is the hash of the block that identifies the state that the node wants to sync. This is guaranteed to be the first block of the most recent epoch. In fact, there is a check on the receiver side that this is indeed the case. The node would also request the block whose hash is `sync_hash`.
  - The node deletes basically all data (blocks, chunks, state) from its storage. This is not an optimal solution, but it makes the implementation for combining state easier when there is no stale data in storage.
  - For the state of each shard that the node needs to download, it first requests a header that contains some metadata the node needs to know about. Then the node computes the number of state parts it needs to download and requests those parts from different peers who track the shard.
  - After all parts are downloaded, the node combines those state parts and then finalizes the state sync by applying the last chunk included in or before the sync block, so that the node has the state after applying the sync block and is able to apply the next block.
  - The node resets heads properly after state sync.
- Block Sync. The node first gets the block with the highest height that is on the canonical chain and requests from there `MAX_BLOCK_REQUESTS` (which is set to 5) blocks from different peers in a round robin order. The block sync routine runs again if the head has changed (progress is made) or if a timeout (which is set to 2s) has happened.
Note: when a block is received and its height is no more than 500 + the node's current head height, then the node would request its previous block automatically. This is called orphan sync and helps to speed up the syncing process. If, on the other hand, the height is more than 500 + the node's current head height, the block is simply dropped.
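A minimal sketch of that acceptance rule; the constant and helper below are illustrative stand-ins, not the actual nearcore code:

```rust
// Illustrative sketch of the "orphan sync" horizon described in the note above.
const BLOCK_HORIZON: u64 = 500;

fn should_accept_broadcast_block(block_height: u64, head_height: u64) -> bool {
    // Blocks too far ahead of our head are simply dropped; anything within
    // the horizon is kept and its missing ancestors are requested.
    block_height <= head_height + BLOCK_HORIZON
}

fn main() {
    assert!(should_accept_broadcast_block(1_200, 1_000));
    assert!(!should_accept_broadcast_block(1_600, 1_000));
}
```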
How ClientActor works
ClientActor has some periodically running routines that are worth noting:
- Doomslug timer - This routine runs every `doomslug_step_period` (set to 100ms by default) and updates consensus information. If the node is a validator node, it also sends approvals when necessary.
- Block production - This routine runs every `block_production_tracking_delay` (which is set to 100ms by default) and checks if the node should produce a block.
- Log summary - Prints a log line that summarizes block rate, average gas used, the height of the node, etc. every 10 seconds.
- Resend chunk requests - This routine runs every `chunk_request_retry_period` (which is set to 400ms). It resends the chunk part requests for those that are not yet responded to.
- Sync - This routine runs every `sync_step_period` (which is set to 10ms by default) and checks whether the node needs to sync from its peers and, if needed, also starts the syncing process.
- Catch up - This routine runs every `catchup_step_period` (which is set to 100ms by default) and runs the catch up process. This only applies if a node validates shard A in epoch X and is going to validate a different shard B in epoch X+1. In this case, the node would start downloading the state for shard B at the beginning of epoch X. After the state download is complete, it would apply all blocks in the current epoch (epoch X) for shard B to ensure that the node has the state needed to validate shard B when epoch X+1 starts.
How Sync Works
Basics
While Sync and Catchup sound similar, they actually describe two completely different things.
Sync - is used when your node falls "behind" other nodes in the network (for example because it was down for some time or it took longer to process some blocks, etc.).
Catchup - is used when you want (or have to) start caring about (a.k.a. tracking) additional shards in the future epochs. Currently it should be a no-op for 99% of nodes (see below).
Tracking shards: as you know our system has multiple shards (currently 4). Currently 99% of nodes are tracking all the shards: validators have to - as they have to validate the chunks from all the shards, and normal nodes mostly also track all the shards as this is the default.
But in the future - we will have more and more people tracking only a subset of the shards, so the catchup will be increasingly important.
Sync
If your node is behind the head, it will start the sync process. (This code runs periodically in the client_actor, and if you're behind by more than `sync_height_threshold` (currently 50) blocks, it will enable the sync.)
The Sync behavior differs depending on whether you're an archival node (which means you care about the state of each block) or a "normal" node, where you care mostly about the tip of the network.
Step 1: Header Sync [archival node & normal node*] ("downloading headers")
The goal of the header sync is to get all the block headers from your current HEAD all the way to the top of the chain.
As headers are quite small, we try to request multiple of them in a single call (currently we ask for 512 headers at once).
Step 1a: Epoch Sync [normal node*] // not implemented yet
While currently normal nodes are using Header sync, we could actually allow them to do something faster: "light client sync" a.k.a. "epoch sync".
The idea of the epoch sync is to read "just" a single block header from each epoch - that has to contain additional information about validators.
This way it would drastically reduce both the time needed for the sync and the db resources.
Implementation target date is TBD.
Notice that in the image above it is enough to only get the "last" header from each epoch. For the "current" epoch, we still need to get all the headers.
Step 2: State sync [normal node]
After header sync, if you notice that you're too far behind, i.e. the chain head is at least two epochs ahead of your local head, the node will try to do the "state sync".
The idea of the state sync is, rather than trying to process all the blocks, to "jump" ahead by downloading the freshest state instead, and continue processing blocks from that place in the chain. As a side effect, it is going to create a "gap" in the chunks/state on this node (which is fine, as the data will be garbage collected after 5 epochs anyway). State sync will ONLY sync to the beginning of the epoch - it cannot sync to any random block.
This step is never run on the archival nodes - as these nodes want to have whole history and cannot have any gaps.
In this case, we can skip processing transactions that are in blocks 124 - 128, and start from 129 (after state sync finishes).
See how-to to learn how to configure your node to state sync.
Step 3: Block sync [archival node, normal node] ("downloading blocks")
The final step is to start requesting and processing blocks as soon as possible, hoping to catch up with the chain.
Block sync will request up to 5 (`MAX_BLOCK_REQUESTS`) blocks at a time - sending explicit Network BlockRequests for each one.
After the response (Block) is received, the code will execute the "standard" path that tries to add this block to the chain (see the section below).
In this case, we are processing each transaction for each block - until we catch up with the chain.
Side topic: how blocks are added to the chain?
A node can receive a Block in two ways:
- Either by broadcasting - when a new block is produced, its contents are broadcasted within the network by the nodes
- Or by explicitly sending a BlockRequest to another peer - and getting a Block in return.
(in case of broadcasting, the node will automatically reject any Blocks that are more than 500 (`BLOCK_HORIZON`) blocks away from the current HEAD).
When a given block is received, the node checks if it can be added to the current chain.
If the block's "parent" (`prev_block`) is not in the chain yet, the block gets added to the orphan list.
If the parent is already in the chain - we can try to add the block as the head of the chain.
Before adding the block, we want to download the chunks for the shards that we are tracking - so in many cases, we'll call `missing_chunks` functions that will try to go ahead and request those chunks.
Note: as an optimization, we're also sometimes trying to fetch chunks for the blocks that are in the orphan pool - but only if they are not more than 3 (`NUM_ORPHAN_ANCESTORS_CHECK`) blocks away from our head.
We also keep a separate job in client_actor that keeps retrying chunk fetching from other nodes if the original request fails.
After all the chunks for a given block are received (we have a separate HashMap that checks how many chunks are missing for each block), we're ready to process the block and attach it to the chain.
Afterwards, we look at other entries in the orphan pool to see if any of them are a direct descendant of the block that we just added - and if yes, we repeat the process.
Catchup
The goal of catchup
Catchup is needed when not all nodes in the network track all shards and nodes can change the shard they are tracking during different epochs.
For example, if a node tracks shard 0 at epoch T and tracks shard 1 at epoch T+1, it actually needs to have the state of shard 1 ready before the beginning of epoch T+1. We make sure this happens by making the node start downloading the state for shard 1 at the beginning of epoch T and applying blocks during epoch T to shard 1's state. Because downloading state can take time, the node may have already processed some blocks (for shard 0 at this epoch), so when the state finishes downloading, the node needs to "catch up" processing these blocks for shard 1.
Right now, all nodes do track all shards, so technically we shouldn't need the catchup process, but it is still implemented for the future.
Image below: example of a node that tracked only shard 0 in epoch T-1, and will start tracking shard 0 & 1 in epoch T+1.
At the beginning of epoch T, it will initiate the state download (green) and afterwards will try to "catch up" the blocks (orange). After blocks are caught up, it will continue processing as normal.
How catchup interacts with normal block processing
The catchup process has two phases: downloading states for shards that we are going to care about in epoch T+1 and catching up blocks that have already been applied.
When epoch T starts, the node will start downloading states of shards that it will track for epoch T+1, which it doesn't track already. Downloading happens in a different thread so `ClientActor` can still process new blocks. Before the shard states for epoch T+1 are ready, processing new blocks only applies chunks for the shards that the node is tracking in epoch T. When the shard states for epoch T+1 finish downloading, the catchup process needs to reprocess the blocks that have already been processed in epoch T to apply the chunks for the shards in epoch T+1. We assume that it will be faster than regular block processing, because blocks are not full and block production has its own delays, so catchup can finish within an epoch.
In other words, there are three modes for applying chunks and two code paths: the normal `process_block` (blue) and `catchup_blocks` (orange). In `process_block`, if the shard states for the next epoch are ready (corresponding to `IsCaughtUp`), chunks are applied for all shards that the node is tracking in this epoch or will be tracking in the next epoch; if the states are not ready (corresponding to `NotCaughtUp`), only the shards for this epoch are applied. In `catchup_blocks`, chunks for the next epoch's shards are applied.
```rust
enum ApplyChunksMode {
    IsCaughtUp,
    CatchingUp,
    NotCaughtUp,
}
```
How catchup works
The catchup process is initiated by `process_block`, where we check if the block is caught up and if we need to download states. The logic works as follows:
- For the first block in an epoch T, we check if the previous block is caught up, which signifies whether the state of the new epoch is ready. If the previous block is not caught up, the block will be orphaned and not processed for now, because it is not ready to be processed yet. Ideally, this case should never happen: the node would appear stalled until the blocks in the previous epoch are caught up.
- Otherwise, we start processing blocks for the new epoch T. For the first block, we always mark it as not caught up and will initiate the process for downloading states for shards that we are going to care about in epoch T+1. Info about downloading states is persisted in `DBCol::StateDlInfos`.
- For other blocks, we mark them as not caught up if the previous block is not caught up. This info is persisted in `DBCol::BlocksToCatchup`, which stores a mapping from a previous block to a vector of all child blocks to catch up.
- Chunks for already tracked shards will be applied during `process_block`, as we said before when mentioning `ApplyChunksMode`.
- Once we have downloaded the state, we start catchup. It will take blocks from `DBCol::BlocksToCatchup` in breadth-first search order and apply chunks for the shards which have to be tracked in the next epoch (see the sketch after this list).
- When catchup doesn't see any more blocks to process, `DBCol::BlocksToCatchup` is cleared, which means that the catchup process is finished.
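To make the traversal concrete, here is a simplified sketch of processing blocks to catch up in breadth-first order; block hashes are modeled as strings and the map is a stand-in for the `DBCol::BlocksToCatchup` mapping:

```rust
// Simplified illustration only: the real column stores hashes and is read
// and cleared through the chain store, not an in-memory map.
use std::collections::{HashMap, VecDeque};

fn catchup_order(blocks_to_catchup: &HashMap<String, Vec<String>>, start: &str) -> Vec<String> {
    let mut order = Vec::new();
    let mut queue = VecDeque::from([start.to_string()]);
    // Breadth-first: process all children of a block before their children.
    while let Some(block) = queue.pop_front() {
        if let Some(children) = blocks_to_catchup.get(&block) {
            for child in children {
                order.push(child.clone());
                queue.push_back(child.clone());
            }
        }
    }
    order
}

fn main() {
    let mut map = HashMap::new();
    map.insert("A".to_string(), vec!["B".to_string()]);
    map.insert("B".to_string(), vec!["C".to_string(), "D".to_string()]);
    assert_eq!(catchup_order(&map, "A"), vec!["B", "C", "D"]);
}
```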
The catchup process is implemented through the function `Client::run_catchup`. `ClientActor` schedules a call to `run_catchup` every 100ms. However, the call can be delayed if `ClientActor` has a lot of messages in its actix queue.
Every time `run_catchup` is called, it checks `DBCol::StateDlInfos` to see if there are any shard states that should be downloaded. If so, it initiates the syncing process for these shards. After the state is downloaded, `run_catchup` will start to apply blocks that need to be caught up.
One thing to note is that `run_catchup` is located at `ClientActor`, but intensive work such as applying state parts and applying blocks is actually offloaded to `SyncJobsActor` in another thread, because we don't want `ClientActor` to be blocked by this. `run_catchup` is simply responsible for scheduling `SyncJobsActor` to do the intensive job. Note that `SyncJobsActor` is stateless; it doesn't have write access to the chain. It will return the changes that need to be made as part of the response to `ClientActor`, and `ClientActor` is responsible for applying these changes. This is to ensure that only one thread (`ClientActor`) has write access to the chain state. However, this also adds a lot of limits; for example, `SyncJobsActor` can only be scheduled to apply one block at a time. Because `run_catchup` is only scheduled to run every 100ms, the speed of catching up blocks is limited to 100ms per block, even when applying blocks could be faster. Similar constraints apply to applying state parts.
Improvements
There are three improvements we can make to the current code.
First, currently we always initiate the state downloading process at the first block of an epoch, even when there are no new states to be downloaded for the new epoch. This is unnecessary.
Second, even though `run_catchup` is scheduled to run every 100ms, the call can be delayed if `ClientActor` has messages in its actix queue. A better way to do this is to move the scheduling of `run_catchup` to `check_triggers`.
Third, because of how `run_catchup` interacts with `SyncJobsActor`, `run_catchup` can catch up at most one block every 100ms. This is because we don't want to write to `ChainStore` in multiple threads. However, the changes that catching up blocks make do not interfere with regular block processing, and they could be processed at the same time. To restructure this, we will need to re-implement `ChainStore` to separate the parts that can be shared among threads and the parts that can't.
Garbage Collection
This document covers the basics of Chain garbage collection.
Currently we run garbage collection only in non-archival nodes, to keep the size of the storage under control. Therefore, we remove blocks, chunks and state that is "old" enough - which in the current configuration means 5 epochs ago.
We run a single "round" of GC after a new block is accepted to the chain - and in order not to delay the chain too much, we make sure that each round removes at most 2 blocks from the chain.
For more details look at the function `clear_data()` in file `chain/chain/src/chain.rs`.
How it works:
Imagine the following chain (with 2 forks)
In the pictures below, let's assume that the epoch length is 5 and we keep only 3 epochs (rather than the 5 that are currently set in production) - otherwise the image becomes too large.
If the head is in the middle of the epoch, the `gc_stop` will be set to the first block of epoch T-2, and `tail` & `fork_tail` will be sitting at the last block of epoch T-3.
(and no GC is happening in this round - as the tail is next to `gc_stop`).
Next block was accepted on the chain (head jumped ahead), but still no GC happening in this round:
Now interesting things will start happening once the head "crosses" over to the next epoch.
First, the `gc_stop` will jump to the beginning of the next epoch.
Then we'll start the GC of the forks: by first moving the `fork_tail` to match the `gc_stop` and going backwards from there.
It will start removing all the blocks that don't have a successor (a.k.a. the tip of the fork). And then it will proceed to lower heights.
It will keep going until it "hits" the tail.
In order not to do too much in one go, we'd only remove up to 2 blocks in each run (that happens after each head update).
Now, the forks are gone, so we can proceed with GCing of the blocks from the canonical chain:
Same as before, we'd remove up to 2 blocks in each run:
Until we catch up to the `gc_stop`.
How Epoch Works
This short document will tell you all you need to know about Epochs in NEAR protocol.
You can also find additional information about epochs in nomicon.
What is an Epoch?
Epoch is a sequence of consecutive blocks. Within one epoch, the set of validators is fixed, and validator rotation happens at epoch boundaries.
Basically almost all the changes that we do are happening at epoch boundaries:
- sharding changes
- protocol version changes
- validator changes
- changing tracking shards
- state sync
Where does the Epoch Id come from?
`EpochId` for epoch T+2 is the hash of the last block of epoch T.
The situation at genesis is interesting. We have three blocks:
dummy ← genesis ← first-block
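As a toy illustration of this rule (strings stand in for real block hashes):

```rust
// Toy data only: epoch_id(T + 2) == hash of the last block of epoch T.
use std::collections::HashMap;

fn main() {
    // Last block hash of each epoch, indexed by epoch height.
    let last_block_hash: HashMap<u64, &str> = HashMap::from([
        (10, "hash_of_last_block_of_epoch_10"),
        (11, "hash_of_last_block_of_epoch_11"),
    ]);

    let epoch_id_12 = last_block_hash[&10u64];
    let epoch_id_13 = last_block_hash[&11u64];
    println!("epoch 12 id = {}, epoch 13 id = {}", epoch_id_12, epoch_id_13);
}
```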
Where do we set the epoch length?
Epoch length is set in the genesis config. Currently in mainnet it is set to 43200 blocks:
"epoch_length": 43200
See the mainnet genesis for more details.
This means that each epoch lasts around 15 hours (43,200 blocks at roughly 1.3 s per block, the mainnet `min_block_production_delay` mentioned earlier, gives about 15.6 hours).
Important: sometimes there might be "troubles" on the network that might result in an epoch lasting a little bit longer (if we cannot get enough signatures on the last blocks of the previous epoch).
You can read specific details on our nomicon page.
How do we pick the next validators?
TL;DR: in the last block of the epoch T, we look at the accounts that have highest stake and we pick them to become validators in T+2.
We are deciding on validators for T+2 (and not T+1) as we want to make sure that validators have enough time to prepare for block production and validation (they have to download the state of shards etc).
For more info on how we pick validators please look at nomicon.
Epoch and Sharding
Sharding changes happen only on epoch boundaries - that's why many of the requests (like which shard does my account belong to) also require an `epoch_id` as a parameter.
As of April 2022 we don't have dynamic sharding yet, so the whole chain is simply using 4 shards.
How can I get more information about current/previous epochs?
We don't show much information about Epochs in Explorer. Today, you can use `state_viewer` (if you have access to the network database).
At the same time, we're working on a small debug dashboard, to show you the basic information about past epochs - stay tuned.
Technical details
Where do we store epoch info?
We use a couple of columns in the database to store epoch information:
- ColEpochInfo = 11 - is storing the mapping from EpochId to EpochInfo structure that contains all the details.
- ColEpochStart = 23 - has a mapping from EpochId to the first block height of that epoch.
- ColEpochValidatorInfo = 47 - contains validator statistics (blocks produced etc.) for each epoch.
What does epoch info look like?
Here's an example epoch info from a localnet node. As you can see below, EpochInfo mostly contains information about who the validators are and in which order they should produce the blocks.
EpochInfo.V3(
epoch_height=7,
validators=ListContainer([
validator_stake.V1(account_id='node0', public_key=public_key.ED25519(tuple_data=ListContainer([b'7PGseFbWxvYVgZ89K1uTJKYoKetWs7BJtbyXDzfbAcqX'])), stake=51084320187874404740382878961615),
validator_stake.V1(account_id='node2', public_key=public_key.ED25519(tuple_data=ListContainer([b'GkDv7nSMS3xcqA45cpMvFmfV1o4fRF6zYo1JRR6mNqg5'])), stake=51084320187874404740382878961615),
validator_stake.V1(account_id='node1', public_key=public_key.ED25519(tuple_data=ListContainer([b'6DSjZ8mvsRZDvFqFxo8tCKePG96omXW7eVYVSySmDk8e'])), stake=50569171534262067815663761517574)]),
validator_to_index={'node0': 0, 'node1': 2, 'node2': 1},
block_producers_settlement=ListContainer([0, 1, 2]),
chunk_producers_settlement=ListContainer([ListContainer([0, 1, 2]), ListContainer([0, 1, 2]), ListContainer([0, 1, 2]), ListContainer([0, 1, 2]), ListContainer([0, 1, 2])]),
hidden_validators_settlement=ListContainer([]),
fishermen=ListContainer([]),
fishermen_to_index={},
stake_change={'node0': 51084320187874404740382878961615, 'node1': 50569171534262067815663761517574, 'node2': 51084320187874404740382878961615},
validator_reward={'near': 37059603312899067633082436, 'node0': 111553789870214657675206177, 'node1': 110428850075662293347329569, 'node2': 111553789870214657675206177},
validator_kickout={},
minted_amount=370596033128990676330824359,
seat_price=24438049905601740367428723111,
protocol_version=52
)
Transaction Routing
We all know that transactions are "added" to the chain - but how do they get there?
Hopefully by the end of this article, the image below should make total sense.
Step 1: Transaction creator/author
The journey starts with the author of the transaction - who creates the transaction object (basically a list of commands) and signs it with their private key.
Basically, they prepare the payload that looks like this:
```rust
pub struct SignedTransaction {
    pub transaction: Transaction,
    pub signature: Signature,
}
```
With such a payload, they can go ahead and send it as a JSON-RPC request to ANY node in the system (they can choose between using "sync" or "async" options).
From now on, they'll also be able to query the status of the transaction - by using the hash of this object.
Fun fact: the `Transaction` object also contains some fields to prevent attacks: like `nonce` to prevent replay attacks, and `block_hash` to limit the validity of the transaction (it must be added within `transaction_validity_period` (defined in genesis) blocks of `block_hash`).
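Here is a hedged sketch of that validity window; block heights stand in for the real block-hash lookup, and the constant is a made-up value rather than the actual genesis parameter:

```rust
// Illustrative only: real code resolves `block_hash` to a height and reads
// the validity period from the genesis config.
const TRANSACTION_VALIDITY_PERIOD: u64 = 86_400; // hypothetical value

fn is_transaction_still_valid(anchor_block_height: u64, current_height: u64) -> bool {
    // The transaction is only accepted while the chain is still within the
    // validity window of the block it anchored itself to.
    current_height <= anchor_block_height + TRANSACTION_VALIDITY_PERIOD
}

fn main() {
    assert!(is_transaction_still_valid(1_000, 50_000));
    assert!(!is_transaction_still_valid(1_000, 200_000));
}
```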
Step 2: Inside the node
Our transaction has made it to a node in the system - but most of the nodes are not validators - which means that they cannot mutate the chain.
That's why the node has to forward it to someone who can - the upcoming validator.
The node, roughly, does the following steps:
- verify the transaction's metadata - check signatures etc. (we want to make sure that we don't forward bogus data)
- forward it to the "upcoming" validator - currently we pick the validators that would be a chunk creator in +2, +3, +4 and +8 blocks (this is controlled by `TX_ROUTING_HEIGHT_HORIZON`) - and send the transaction to all of them (see the sketch below).
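An illustrative sketch of picking those target heights follows; the constant's value and the helper are assumptions for the example rather than the actual routing code in nearcore:

```rust
// Assumed value for illustration; the real constant lives in the client code.
const TX_ROUTING_HEIGHT_HORIZON: u64 = 4;

fn target_heights(head_height: u64) -> Vec<u64> {
    // +2, +3, +4 within the horizon, plus one height further out (+8 in the
    // text above) as a safety margin.
    let mut heights: Vec<u64> =
        (2..=TX_ROUTING_HEIGHT_HORIZON).map(|d| head_height + d).collect();
    heights.push(head_height + 8);
    heights
}

fn main() {
    assert_eq!(target_heights(100), vec![102, 103, 104, 108]);
}
```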
Step 3: En-route to validator/producer
Great, the node knows to send (forward) the transaction to the validator, but how does the routing work? How do we know which peer is hosting a validator?
Each validator is regularly broadcasting (every `config.ttl_account_id_router` / 2 seconds == 30 minutes in production) a so-called `AnnounceAccount`, which is basically a pair of `(account_id, peer_id)`, to the whole network. This way each node knows which `peer_id` to send the message to.
Then it asks the routing table for the shortest path to the peer, and sends the `ForwardTx` message to the peer.
Step 4: Chunk producer
When a validator receives such a forwarded transaction, it double-checks that it is about to produce the block, and if so, it adds the transaction to the mempool (`TransactionPool`) for this shard, where it waits to be picked up when the chunk is produced.
What happens afterwards will be covered in future episodes/articles.
Additional notes:
Transaction being added multiple times
But such an approach means that we're forwarding the same transaction to multiple validators (currently 4) - so can it be added multiple times?
No. Remember that a transaction has a concrete hash which is used as a global identifier. If the validator sees that the transaction is present in the chain, it removes it from its local mempool.
Can transaction get lost?
Yes - they can and they do. Sometimes a node doesn't have a path to a given validator or it didn't receive an `AnnounceAccount` for it, so it doesn't know where to forward the message. And if this happens to all 4 validators that we try to send to, then the message can be silently dropped.
We're working on adding some monitoring to see how often this happens.
Transactions, Receipts and Chunk Surprises
We finished the previous article (Transaction routing) where a transaction was successfully added to the soon-to-be block producer's mempool.
In this article, we'll cover what happens next: how it is changed into a receipt and executed, potentially creating even more receipts in the process.
First, let's look at the "high-level view":
Transaction vs Receipt
As you can see from the image above:
Transactions are "external" communication - they come from the outside.
Receipts are used for "internal" communication (cross shard, cross contract) - they are created by the block/chunk producers.
Life of a Transaction
If we "zoom in", the chunk producer's work looks like this:
Step 1: Process Transaction into receipt
Once a chunk producer is ready to produce a chunk, it will fetch the transactions from its mempool, check that they are valid, and if so, prepare to process them into receipts.
Note: There are additional restrictions (e.g. making sure that we take them in the right order, that we don't take too many, etc.) - that you can see in nomicon's transaction page.
You can see this part in explorer:
Step 2: Sending receipt to the proper destination
Once we have a receipt, we have to send it to the proper destination - by adding it to the `outgoing_receipt` list, which will be forwarded to the chunk producers from the next block.
Note: There is a special case here - if the sender of the receipt is the same as the receiver, then the receipt will be added to the `local_receipts` queue and executed in the same block.
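Here is a tiny, hedged sketch of that routing decision; the enum and helper are illustrative stand-ins for the real queues:

```rust
// Illustrative stand-ins for the two queues described above; this is not
// the actual nearcore logic, just the routing idea.
#[derive(Debug, PartialEq)]
enum ReceiptQueue {
    LocalReceipts,    // sender == receiver: executed in the same block
    OutgoingReceipts, // otherwise: forwarded to the next block's chunk producer
}

fn route_receipt(sender: &str, receiver: &str) -> ReceiptQueue {
    if sender == receiver {
        ReceiptQueue::LocalReceipts
    } else {
        ReceiptQueue::OutgoingReceipts
    }
}

fn main() {
    assert_eq!(route_receipt("alice.near", "alice.near"), ReceiptQueue::LocalReceipts);
    assert_eq!(route_receipt("alice.near", "bob.near"), ReceiptQueue::OutgoingReceipts);
}
```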
Step 3: When an incoming receipt arrives
(Note: this happens in the "next" block)
When a chunk producer receives an incoming receipt, it will try to execute its actions (creating accounts, executing function calls etc).
Such actions might generate additional receipts (for example a contract might want to call other contracts). All these outputs are added to the outgoing receipt queue to be executed in the next block.
If the incoming receipt queue is too large to execute in the current chunk, the producer will put the remaining receipts onto the "delayed" queue.
Step 4: Profit
When all the "dependent" receipts are executed for a given transaction, we can consider the transaction to be successful.
[Advanced] But reality is more complex
Caution: In the section below, some things are simplified and do not match exactly how the current code works.
Let's quickly also check what's inside a Chunk:
```rust
pub struct ShardChunkV2 {
    pub chunk_hash: ChunkHash,
    pub header: ShardChunkHeader,
    pub transactions: Vec<SignedTransaction>,
    pub receipts: Vec<Receipt>, // outgoing receipts from 'previous' block
}
```
Yes, it is a little bit confusing that the receipts here are NOT the "incoming" ones for this chunk, but instead the "outgoing" ones from the previous block, i.e. all receipts from shard 0, block B are actually found in shard 0, block B+1. Why?!?!
This has to do with performance.
The steps usually followed for producing a block are as follows:
- Chunk producer executes the receipts and creates a chunk. It sends the chunk to other validators. Note that it's the execution/processing of the receipts that usually takes the most time.
- Validators receive the chunk and validate it before signing the chunk. Validation involves executing/processing of the receipts in the chunk.
- Once the next block chunk producer receives the validation (signature), only then can it start producing the next chunk.
Simple approach
First, let's imagine how the system would look if chunks contained the things that we'd expect:
- list of transactions
- list of incoming receipts
- list of outgoing receipts
- hash of the final state
This means that the chunk producer has to compute all this information first, before sending the chunk to other validators.
Once the other validators receive the chunk, they can start their own processing to verify those outgoing receipts/final state - and then do the signing. Only then can the next chunk producer start creating the next chunk.
While this approach does work, we can do it faster.
Faster approach
What if the chunk didn't contain the "output" state? This changes our "mental" model a little bit, as now when we're signing the chunk, we'd actually be verifying the previous chunk - but that's the topic for the next article (to be added).
For now, imagine if the chunk only had:
- a list of transactions
- a list of incoming receipts
In this case, the chunk producer could send the chunk a lot earlier, and validators (and chunk producer) could do their processing at the same time:
Now the last mystery: why do we have "outgoing" receipts from previous chunks rather than incoming ones to the current one?
This is yet another optimization. This way the chunk producer can send out the chunk a little bit earlier - without having to wait for all the other shards.
But thatâs a topic for another article (to be added).
Cross shard transactions - deep dive
In this article, we'll look deeper into how cross-shard transactions work, using the simple example of user `shard0` transferring money to user `shard1`.
These users are on separate shards (`shard0` is on shard 0 and `shard1` is on shard 1).
Imagine, we run the following command in the command line:
$ NEAR_ENV=local near send shard0 shard1 500
What happens under the hood? How is this transaction changed into receipts and processed by near?
From Explorer perspective
If you look at a simple token transfer in explorer (example), you can see that it is broken into three separate sections:
- convert transaction into receipt ( executed in block B )
- receipt that transfers tokens ( executed in block B+1 )
- receipt that refunds gas ( executed in block B+2 )
But under the hood, the situation is a little bit more complex, as there is actually one more receipt (that is created after converting the transaction). Let's take a deeper look.
Internal perspective (Transactions & Receipts)
One important thing to remember is that NEAR is sharded - so in all our designs, we have to assume that each account is on a separate shard. The fact that some of them happen to be colocated doesn't give any advantage.
Step 1 - Transaction
This is the part which we receive from the user (`SignedTransaction`) - it has 3 parts:
- signer (account + key) who signed the transaction
- receiver (in which account context should we execute this)
- payload - a.k.a Actions to execute.
As the first step, we want to change this transaction into a Receipt (a.k.a 'internal' message) - but before doing that, we must verify that:
- the message signature matches (that is - that this message was actually signed by this key)
- that this key is authorized to act on behalf of that account (so it is a full access key to this account - or a valid function key).
The last point above means that we MUST execute this (Transaction to Receipt) transition within the shard that the `signer` belongs to (as other shards don't know the state that belongs to the signer - so they don't know which keys it has).
So actually if we look inside chunk 0 (where `shard0` belongs) at block B, we'll see the transaction:
Chunk: Ok(
V2(
ShardChunkV2 {
chunk_hash: ChunkHash(
8mgtzxNxPeEKfvDcNdFisVq8TdeqpCcwfPMVk219zRfV,
),
header: V3(
ShardChunkHeaderV3 {
inner: V2(
ShardChunkHeaderInnerV2 {
prev_block_hash: CgTJ7FFwmawjffrMNsJ5XhvoxRtQPXdrtAjrQjG91gkQ,
prev_state_root: 99pXnYjQbKE7bEf277urcxzG3TaN79t2NgFJXU5NQVHv,
outcome_root: 11111111111111111111111111111111,
encoded_merkle_root: 67zdyWTvN7kB61EgTqecaNgU5MzJaCiRnstynerRbmct,
encoded_length: 187,
height_created: 1676,
shard_id: 0,
gas_used: 0,
gas_limit: 1000000000000000,
balance_burnt: 0,
outgoing_receipts_root: 8s41rye686T2ronWmFE38ji19vgeb6uPxjYMPt8y8pSV,
tx_root: HyS6YfQbfBRniVSbWRnxsxEZi9FtLqHwyzNivrF6aNAM,
validator_proposals: [],
},
),
height_included: 0,
signature: ed25519:uUvmvDV2cRVf1XW93wxDU8zkYqeKRmjpat4UUrHesJ81mmr27X43gFvFuoiJHWXz47czgX68eyBN38ejwL1qQTD,
hash: ChunkHash(
8mgtzxNxPeEKfvDcNdFisVq8TdeqpCcwfPMVk219zRfV,
),
},
),
transactions: [
SignedTransaction {
transaction: Transaction {
signer_id: AccountId(
"shard0",
),
public_key: ed25519:Ht8EqXGUnY8B8x7YvARE1LRMEpragRinqA6wy5xSyfj5,
nonce: 11,
receiver_id: AccountId(
"shard1",
),
block_hash: 6d5L1Vru2c4Cwzmbskm23WoUP4PKFxBHSP9AKNHbfwps,
actions: [
Transfer(
TransferAction {
deposit: 500000000000000000000000000,
},
),
],
},
signature: ed25519:63ssFeMyS2N1khzNFyDqiwSELFaUqMFtAkRwwwUgrPbd1DU5tYKxz9YL2sg1NiSjaA71aG8xSB7aLy5VdwgpvfjR,
hash: 6NSJFsTTEQB4EKNKoCmvB1nLuQy4wgSKD51rfXhmgjLm,
size: 114,
},
],
receipts: [],
},
),
)
Side note: When we're converting the transaction into a receipt, we also use this moment to deduct prepaid gas fees and transferred tokens from the 'signer' account. The details on how much gas is charged can be found at https://nomicon.io/RuntimeSpec/Fees/.
Step 2 - cross shard receipt
After the transaction was changed into a receipt, this receipt must now be sent to the shard where the `receiver` is (in our example `shard1` is on shard 1).
We can actually see this in the chunk of the next block:
Chunk: Ok(
V2(
ShardChunkV2 {
chunk_hash: ChunkHash(
DoF7yoCzyBSNzB8R7anWwx6vrimYqz9ZbEmok4eqHZ3m,
),
header: V3(
ShardChunkHeaderV3 {
inner: V2(
ShardChunkHeaderInnerV2 {
prev_block_hash: 82dKeRnE262qeVf31DXaxHvbYEugPUDvjGGiPkjm9Rbp,
prev_state_root: DpsigPFeVJDenQWVueGKyTLVYkQuQjeQ6e7bzNSC7JVN,
outcome_root: H34BZknAfWrPCcppcHSqbXwFvAiD9gknG8Vnrzhcc4w,
encoded_merkle_root: 3NDvQBrcRSAsWVPWkUTTrBomwdwEpHhJ9ofEGGaWsBv9,
encoded_length: 149,
height_created: 1677,
shard_id: 0,
gas_used: 223182562500,
gas_limit: 1000000000000000,
balance_burnt: 22318256250000000000,
outgoing_receipts_root: Co1UNMcKnuhXaHZz8ozMnSfgBKPqyTKLoC2oBtoSeKAy,
tx_root: 11111111111111111111111111111111,
validator_proposals: [],
},
),
height_included: 0,
signature: ed25519:32hozA7GMqNqJzscEWzYBXsTrJ9RDhW5Ly4sp7FXP1bmxoCsma8Usxry3cjvSuywzMYSD8HvGntVtJh34G2dKJpE,
hash: ChunkHash(
DoF7yoCzyBSNzB8R7anWwx6vrimYqz9ZbEmok4eqHZ3m,
),
},
),
transactions: [],
receipts: [
Receipt {
predecessor_id: AccountId(
"shard0",
),
receiver_id: AccountId(
"shard1",
),
receipt_id: 3EtEcg7QSc2CYzuv67i9xyZTyxBD3Dvx6X5yf2QgH83g,
receipt: Action(
ActionReceipt {
signer_id: AccountId(
"shard0",
),
signer_public_key: ed25519:Ht8EqXGUnY8B8x7YvARE1LRMEpragRinqA6wy5xSyfj5,
gas_price: 103000000,
output_data_receivers: [],
input_data_ids: [],
actions: [
Transfer(
TransferAction {
deposit: 500000000000000000000000000,
},
),
],
},
),
},
],
},
),
)
Side comment: notice that the receipt itself no longer has a `signer` field, but a `predecessor_id` one.
Such a receipt is sent to the destination shard (we'll explain this process in a separate article) where it can be executed.
Step 3 - Gas refund
When shard 1 processes the receipt above, it is then ready to refund the unused gas to the original account (`shard0`). So it also creates a receipt and puts it inside the chunk. This time it is in shard 1 (as that's where it was executed).
Chunk: Ok(
V2(
ShardChunkV2 {
chunk_hash: ChunkHash(
8sPHYmBFp7cfnXDAKdcATFYfh9UqjpAyqJSBKAngQQxL,
),
header: V3(
ShardChunkHeaderV3 {
inner: V2(
ShardChunkHeaderInnerV2 {
prev_block_hash: Fj7iu26Yy9t5e9k9n1fSSjh6ZoTafWyxcL2TgHHHskjd,
prev_state_root: 4y6VL9BoMJg92Z9a83iqKSfVUDGyaMaVU1RNvcBmvs8V,
outcome_root: 7V3xRUeWgQa7D9c8s5jTq4dwdRcyTuY4BENRmbWaHiS5,
encoded_merkle_root: BnCE9LZgnFEjhQv1fSYpxPNw56vpcLQW8zxNmoMS8H4u,
encoded_length: 149,
height_created: 1678,
shard_id: 1,
gas_used: 223182562500,
gas_limit: 1000000000000000,
balance_burnt: 22318256250000000000,
outgoing_receipts_root: HYjZzyTL5JBfe1Ar4C4qPKc5E6Vbo9xnLHBKLVAqsqG2,
tx_root: 11111111111111111111111111111111,
validator_proposals: [],
},
),
height_included: 0,
signature: ed25519:4FzcDw2ay2gAGosNpFdTyEwABJhhCwsi9g47uffi77N21EqEaamCg9p2tALbDt5fNeCXXoKxjWbHsZ1YezT2cL94,
hash: ChunkHash(
8sPHYmBFp7cfnXDAKdcATFYfh9UqjpAyqJSBKAngQQxL,
),
},
),
transactions: [],
receipts: [
Receipt {
predecessor_id: AccountId(
"system",
),
receiver_id: AccountId(
"shard0",
),
receipt_id: 6eei79WLYHGfv5RTaee4kCmzFx79fKsX71vzeMjCe6rL,
receipt: Action(
ActionReceipt {
signer_id: AccountId(
"shard0",
),
signer_public_key: ed25519:Ht8EqXGUnY8B8x7YvARE1LRMEpragRinqA6wy5xSyfj5,
gas_price: 0,
output_data_receivers: [],
input_data_ids: [],
actions: [
Transfer(
TransferAction {
deposit: 669547687500000000,
},
),
],
},
),
},
],
},
),
)
Such gas refund receipts are a little bit special: the predecessor_id is set to system, but the receiver is what we expect (the shard0 account).
Note: system is a special account that doesn't really belong to any shard.
As you can see in this example, the receipt was created within shard 1.
So putting it all together would look like this:
But wait - NEAR says that transfers happen within 2 blocks, yet here it took 3 blocks. What's going on?
The image above is a simplification, and reality is a little bit trickier - especially since the receipts in a given chunk are actually the receipts produced by running the PREVIOUS chunk of this shard.
We'll explain it more in the next section.
Advanced: What's actually going on?
As you might have read in Transactions And Receipts, the 'receipts' field in a chunk actually represents the 'outgoing' receipts from the previous block.
So our image should look more like this:
In this example, the black boxes are representing the 'processing' of the chunk, and red arrows are cross-shard communication.
So when we process Shard 0 from block 1676, we read the transaction, and output the receipt - which later becomes the input for shard 1 in block 1677.
But you might still be wondering - why didn't we add the Receipt (transfer) to the list of receipts of shard 0 in block 1676?
That's because the shards & blocks are set BEFORE we do any computation. So the more correct image would look like this:
Here you can clearly see that chunk processing (the black box) happens AFTER the chunk is set.
In this example, the blue arrows are showing the part where we persist the result (receipt) into next block's chunk.
In a future article, we'll discuss how the actual cross-shard communication works (red arrows) in the picture, and how we could guarantee that a given shard really gets all the red arrows, before it starts processing.
Gas
This page describes the technical details around gas during the lifecycle of a transaction(*) while giving an intuition for why things are the way they are from a technical perspective. For a more practical, user-oriented angle, please refer to the gas section in the official protocol documentation.
(*) For this page, a transaction shall refer to the set of all receipts recursively generated by a SignedTransaction. When referring to only the original transaction object, we write SignedTransaction.
The topic is split into several sections.
- Gas Flow
- Buying Gas: How are NEAR tokens converted to gas?
- Burning Gas: Who receives burnt tokens?
- Gas in Contract Calls: How is gas attached to calls?
- Contract Reward: How smart contracts earn a reward.
- Gas Price:
- Block-Level Gas Price: How the block-level gas price is determined.
- Pessimistic Gas Price: How worst-case gas pricing is estimated.
- Effective Gas Purchase Cost: The cost paid for a receipt.
- Tracking Gas: How the system keeps track of purchased gas during the transaction execution.
Gas Flow
On the highest level, gas is bought by the signer, burnt during execution, and contracts receive a part of the burnt gas as a reward. We will discuss each step in more detail.
Buying Gas for a Transaction
A signer pays all the gas required for a transaction upfront. However, there is
no explicit act of buying gas. Instead, the fee is subtracted directly in NEAR
tokens from the balance of the signer's account. If we ignore all the details
explained further down, the fee is calculated as gas amount * gas price.
The gas amount
is not a field of SignedTransaction
, nor is it something the
signer can choose. It is only a virtual field that is computed on-chain following
the protocol's rules.
The gas price
is a variable that may change during the execution of the
transaction. The way it is implemented today, a single transaction can be
charged a different gas price for different receipts.
Already we can see a fundamental problem: Gas is bought once at the beginning but the gas price may change during execution. To solve this incompatibility, the protocol calculates a pessimistic gas price for the initial purchase. Later on, the delta between real and pessimistic gas prices is refunded at the end of every receipt execution.
An alternative implementation would instead charge the gas at every receipt, instead of once at the start. However, remember that the execution may happen on a different shard than the signer account. Therefore we cannot access the signer's balance while executing.
Burning Gas
Buying gas immediately removes a part of the signer's tokens from the total supply. However, the equivalent value in gas still exists in the form of the receipt and the unused gas will be converted back to tokens as a refund.
The gas spent on execution on the other hand is burnt and removed from total supply forever. Unlike gas in other chains, none of it goes to validators. This is roughly equivalent to the base fee burning mechanism which Ethereum added in EIP-1559. But in Near Protocol, the entire fee is burnt because there is no priority fee that Ethereum pays out to validators.
The following diagram shows how gas flows through the execution of a transaction. The transaction consists of a function call performing a cross contract call, hence two function calls in sequence. (Note: This diagram is heavily simplified, more accurate diagrams are further down.)
Gas in Contract Calls
A function call has a fixed gas cost to be initiated. Then the execution itself
draws gas from the attached_gas
, sometimes also called prepaid_gas
, until it
reaches zero, at which point the function call aborts with a GasExceeded
error. No changes are persisted on chain.
(Note on naming: If you see prepaid_fee: Balance
in the nearcore code base,
this is NOT only the fee for prepaid_gas
. It also includes prepaid fees for
other gas costs. However, prepaid_gas: Gas
is used the same in the code base
as described in this document.)
Attaching gas to function calls is the primary way for end-users and contract developers to interact with gas. All other gas fees are implicitly computed and are hidden from the users except for the fact that the equivalent in tokens is removed from their account balance.
To attach gas, the signer sets the gas field of the function call action.
Wallets and CLI tools expose this to the users in different ways. Usually just
as a gas
field, which makes users believe this is the maximum gas the transaction will consume. That is not quite true: the maximum is the specified number plus the fixed base cost.
Contract developers also have to pick the attached gas values when their
contract calls another contract. They cannot buy additional gas, they have to
work with the unspent gas attached to the current call. They can check how much
gas is left by subtracting the used_gas()
from the prepaid_gas()
host
function results. But they cannot use all the available gas, since that would
prevent the current function call from executing to the end.
The gas attached to a function can be at most max_total_prepaid_gas
, which is
300 Tgas since the mainnet launch. Note that this limit is per
SignedTransaction
, not per function call. In other words, batched function
calls share this limit.
There is also a limit to how much a single call can burn, max_gas_burnt
, which
used to be 200 Tgas but has been increased to 300 Tgas in protocol version 52.
(Note: When attaching gas to an outgoing function call, this is not counted as
gas burnt.) However, given a call can never burn more than was attached anyway,
this second limit is obsolete with the current configuration where the two limits
are equal.
Since protocol version 53, with the stabilization of
NEP-264, contract
developers do not have to specify the absolute amount of gas to attach to calls.
promise_batch_action_function_call_weight allows specifying a ratio of unspent
gas that is computed after the current call has finished. This allows attaching
100% of unspent gas to a call. If there are multiple calls, this allows
attaching an equal fraction to each, or any other split as defined by the weight
per call.
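To make the weight mechanism concrete, here is a minimal sketch of the underlying arithmetic only - the function below is hypothetical and is neither the actual host function nor its exact rounding rules:

// Hypothetical sketch: distribute unspent gas across outgoing calls
// proportionally to the weight given to each call (rounding down).
fn distribute_unspent_gas(unspent_gas: u64, weights: &[u64]) -> Vec<u64> {
    let total: u128 = weights.iter().map(|w| *w as u128).sum();
    if total == 0 {
        return vec![0; weights.len()];
    }
    weights
        .iter()
        .map(|w| ((unspent_gas as u128 * *w as u128) / total) as u64)
        .collect()
}

fn main() {
    // e.g. 90 Tgas left over, split between three calls with weights 1:1:2
    let shares = distribute_unspent_gas(90_000_000_000_000, &[1, 1, 2]);
    assert_eq!(shares, vec![22_500_000_000_000, 22_500_000_000_000, 45_000_000_000_000]);
}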
Contract Reward
A rather unique property of Near Protocol is that a part of the gas fee goes to the contract owner. This "smart contract gets paid" model is pretty much the opposite design choice from the "smart contract pays" model that for example Cycles in the Internet Computer implement.
The idea is that it gives contract developers a source of income and hence an incentive to create useful contracts that are commonly used. But there are also downsides, such as when implementing a free meta-transaction relayer one has to be careful not to be susceptible to faucet-draining attacks where an attacker extracts funds from the relayer by making calls to a contract they own.
How much contracts receive from execution depends on two things.
- How much gas is burnt on the function call execution itself. That is, only the gas taken from the attached_gas of a function call is considered for contract rewards. The base fees paid for creating the receipt, including the action_function_call fee, are burnt 100%.
- The remainder of the burnt gas is multiplied by the runtime configuration parameter burnt_gas_reward which currently is at 30%.
During receipt execution, nearcore code tracks the gas_burnt_for_function_call separately from other burnt gas to enable this contract reward calculation.
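As a back-of-the-envelope sketch - the real calculation lives in the runtime and reads burnt_gas_reward from the runtime configuration - the reward can be thought of roughly like this:

// Illustrative only: compute the contract reward in tokens from the gas burnt
// by the wasm execution itself (receipt creation fees are excluded).
fn contract_reward(
    gas_burnt_for_function_call: u64,     // the "wasm fee" from the diagram
    gas_price: u128,                      // tokens per unit of gas
    burnt_gas_reward_ratio: (u128, u128), // e.g. (3, 10) for 30%
) -> u128 {
    let burnt_tokens = gas_burnt_for_function_call as u128 * gas_price;
    burnt_tokens * burnt_gas_reward_ratio.0 / burnt_gas_reward_ratio.1
}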
In the (still simplified) flow diagram, the contract reward looks like this.
For brevity, gas_burnt_for_function_call
in the diagram is denoted as wasm fee
.
Gas Price
Gas pricing is a surprisingly deep and complicated topic. Usually, we only think
about the value of the gas_price
field in the block header. However, to
understand the internals, this is not enough.
Block-Level Gas Price
gas_price
is a field in the block header. It determines how much it costs to
burn gas at the given block height. Confusingly, this is not the same price at
which gas is purchased.
(See Effective Gas Purchase Price.)
The price is measured in NEAR tokens per unit of gas. It dynamically changes in the range between 0.1 NEAR per Pgas and 2 NEAR per Pgas, based on demand. (1 Pgas = 1000 Tgas corresponds to a full chunk.)
The block producer has to set this field following the exact formula as defined by the protocol. Otherwise, the produced block is invalid.
Intuitively, the formula checks how much gas was used compared to the total capacity. If it exceeds 50%, the gas price increases exponentially within the limits. When the demand is below 50%, it decreases exponentially. In practice, it stays at the bottom most of the time.
Note that all shards share the same gas price. Hence, if one out of four shards is at 100% capacity, this will not cause the price to increase. The 50% capacity is calculated as an average across all shards.
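For intuition only, a heavily simplified sketch of such an adjustment could look like the following. The adjustment rate and the rounding here are assumptions for demonstration, not the protocol values - the exact formula is defined by the protocol:

// Rough illustration of the block-level gas price adjustment mechanism.
fn next_block_gas_price(prev_price: f64, gas_used: f64, total_gas_limit: f64) -> f64 {
    let adjustment_rate = 0.01; // assumed demonstration value, not a protocol constant
    let min_price = 0.1; // NEAR per Pgas, the lower bound mentioned above
    let max_price = 2.0; // NEAR per Pgas, the upper bound mentioned above
    let utilization = gas_used / total_gas_limit; // averaged across all shards
    // above 50% utilization the price grows, below 50% it shrinks
    let new_price = prev_price * (1.0 + adjustment_rate * (2.0 * utilization - 1.0));
    new_price.clamp(min_price, max_price)
}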
Going slightly off-topic, it should also be mentioned that chunk capacity is not
constant. Chunk producers can change it by 0.1% per chunk. The nearcore client
does not currently make use of this option, so it really is a nitpick only
relevant in theory. However, any client implementation such as nearcore must
compute the total capacity as the sum of gas limits stored in the chunk headers
to be compliant. Using a hard-coded 1000 Tgas * num_shards
would lead to
incorrect block header validation.
Pessimistic Gas Price
The pessimistic gas price calculation uses the fact that any transaction can
only have a limited depth in the generated receipt DAG. For most actions, the
depth is a constant 1 or 2. For function call actions, it is limited to a
hand-wavy attached_gas
/ min gas per function call
. (Note: attached_gas
is
a property of a single action and is only a part of the total gas costs of a
receipt.)
Once the maximum depth is known, the protocol assumes that the gas price will not change more than 3% per receipt. This is not a guarantee since receipts can be delayed for virtually unlimited blocks.
The final formula for the pessimistic gas price is the following.
pessimistic(current_gas_price, max_depth) = current_gas_price * 1.03^max_depth
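A minimal sketch of this formula, assuming plain integer arithmetic rather than the protocol's exact rounding:

// Illustrative: apply the 3% worst-case increase once per level of receipt depth.
fn pessimistic_gas_price(current_gas_price: u128, max_depth: u64) -> u128 {
    let mut price = current_gas_price;
    for _ in 0..max_depth {
        price = price * 103 / 100; // +3% per receipt in the chain
    }
    price
}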
This still is not the price at which gas is purchased. But we are very close.
Effective Gas Purchase Cost
When a transaction is converted to its root action receipt, the gas costs are calculated in two parts.
Part one contains all the gas which is burnt immediately. Namely, the send
costs for a receipt and all the actions it includes. This is charged at the
current block-level gas price.
Part two is everything else, from execution costs of actions that are statically
known such as CreateAccount
all the way to attached_gas
for function calls.
All of this is purchased at the same pessimistic gas price, even if some actions
inside might have a lower maximum call depth than others.
The deducted tokens are the sum of these two parts. If the account has
insufficient balance to pay for this pessimistic pricing, it will fail with a
NotEnoughBalance
error, with the required balance included in the error
message.
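As a simplified sketch, ignoring rounding and many details handled by the runtime, the deducted amount can be thought of as:

// Illustrative only: the two-part purchase described above.
fn purchase_cost(
    immediately_burnt_gas: u64, // part one: send costs for the receipt and its actions
    remaining_gas: u64,         // part two: exec costs plus attached gas
    block_gas_price: u128,
    pessimistic_gas_price: u128,
) -> u128 {
    immediately_burnt_gas as u128 * block_gas_price
        + remaining_gas as u128 * pessimistic_gas_price
}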
Inserting the pessimistic gas pricing into the flow diagram, we finally have a complete picture. Note how an additional refund receipt is required. Also, check out the updated formula for the effective purchase price at the top left and the resulting higher number.
Tracking Gas in Receipts
The previous section explained how gas is bought and what determines its price. This section details the tracking that enables correct refunds.
First, when a SignedTransaction
is converted to a receipt, the pessimistic gas
price is written to the receipt's gas_price
field.
Later on, when the receipt has been executed, a gas refund is created at the value of receipt.gas_burnt * (receipt.gas_price - block_header.gas_price).
Some gas is attached to outgoing receipts. We commonly refer to this as gas that was used but not yet burnt. The refund excludes this gas. But it includes the receipt send cost.
Finally, unspent gas is refunded at the full receipt.gas_price
. This refund is
merged with the refund for burnt gas of the same receipt outcome to reduce the
number of spawned system receipts. But it makes it a bit harder to interpret
refunds when backtracking for how much gas a specific refund receipt covers.
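Putting the two refund components together, an illustrative sketch (not the exact nearcore code) looks like this:

// Illustrative only: the merged gas refund for a single receipt outcome.
fn gas_refund_value(
    gas_burnt: u64,
    unspent_gas: u64,
    receipt_gas_price: u128, // pessimistic price written into the receipt
    block_gas_price: u128,   // actual block-level price at execution time
) -> u128 {
    // refund the price difference on the gas that was burnt...
    let price_delta_refund =
        gas_burnt as u128 * receipt_gas_price.saturating_sub(block_gas_price);
    // ...and refund all unspent gas at the full purchase price
    let unspent_refund = unspent_gas as u128 * receipt_gas_price;
    price_delta_refund + unspent_refund
}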
Receipt Congestion
Near Protocol executes transactions in multiple steps, or receipts. Once a transaction is accepted, the system has committed to finish all those receipts even if it does not know ahead of time how many receipts there will be or on which shards they will execute.
This naturally leads to the problem that if shards just keep accepting more transactions, we might accept workload at a higher rate than we can execute.
Cross-shard congestion as flow problem
For a quick formalized discussion on congestion, let us model the Near Protocol transaction execution as a flow network.
Each shard has a source that accepts new transactions and a sink for burning receipts. The flow is measured in gas. Edges to sinks have a capacity of 1000 Tgas. (Technically, it should be 1300 but let's keep it simple for this discussion.)
The edges between shards are not limited in this model. In reality, we are eventually limited by the receipt sizes and what we can send within a block time through the network links. But if we only look at that limit, we can send very many receipts with a lot of gas attached to them. Thus, the model considers it unlimited.
Okay, we have the capacities of the network modeled. Now let's look at how a receipt execution maps onto it.
Let's say a receipt starts at shard 1 with 300 Tgas. While executing, it burns 100 Tgas and creates an outgoing receipt with 200 Tgas to another shard. We can represent this in the flow network with 100 Tgas to the sink of shard 1 and 200 Tgas to shard 2.
Note: The graph includes the execution of the next block with the 200 Tgas to the sink of shard 2. This should be interpreted as if we continue sending the exact same workload on all shards every block. Then we reach this steady state where we continue to have these gas assignments per edge.
Now we can do some flow analysis. It is immediately obvious that the total outflow is limited to N * 1000 Tgas per block but the incoming flow is unlimited.
For a finite amount of time, we can accept more inflow than outflow, we just have to add buffers to store what we cannot execute, yet. But to stay within finite memory requirements, we need to fall back to a flow diagram where outflows are greater or equal to inflows within a finite time frame.
Next, we look at ideas one at a time before combining some of them into the cross-shard congestion design proposed in NEP-539.
Idea 1: Compute the minimum max-flow and stay below that limit
One approach to solve congestion would be to never allow more work into the system than we can execute.
But this is not ideal. Just consider this example where everybody tries to access a contract on the same shard.
In this workload, where everyone wants to use the capacity of the same shard, the max-flow of the system is essentially the 1000 Tgas that shard 3 can execute. No matter how many additional shards we add, this 1000 Tgas does not increase.
Consequently, if we want to limit inflow to be the same or lower than the
outflow, we cannot accept more than 1000 Tgas / NUM_SHARDS
of new transactions
per chunk.
So, can we just put a constant limit on sources that's 1000 Tgas / NUM_SHARDS
? Not
really, as this limit is hardly practical. It means we limit global throughput
to that of a single shard. Then why would we do sharding in the first place?
The sad thing is, there is no way around it in the most general case. A congestion control strategy that does not apply this limit to this workload will always have infinitely sized queues.
Of course, we won't give up. We are not limited to a constant capacity limit, we can instead adjust it dynamically. We simply have to find a strategy that detects such workload and eventually applies the required limit.
Most of these strategies can be gamed by malicious actors and probably that
means we eventually fall back to the minimum of 1000 Tgas / NUM_SHARDS
. But at
this stage our ambition isn't to have 100% utilization under all malicious
cases. We are instead trying to find a solution that can give 100% utilization
for normal operation and then falls back to 1000 Tgas / NUM_SHARDS
when it has
to, in order to prevent out-of-memory crashes.
Idea 2: Limit transactions when we use too much memory
What if we have no limit at the source until we notice we are above the memory threshold we are comfortable with? Then we can reduce the source capacity in steps, potentially down to 0, until buffers are getting emptier and we use less memory again.
If we do that, we can decide between either applying a global limit on all
sources (allow only 1000 Tgas / NUM_SHARDS
new transactions on all shards like
in idea 1) or applying the limit only to transactions that go to the shard with
the congestion problem.
The first choice is certainly safe. But it means that a single congested shard leads to all shards slowing down, even if they could keep working faster without ever sending receipts to the congested shard. This is a hit to utilization we want to avoid. So let's try the second way.
In that case we filter transactions by receiver and keep accepting transactions that go to non-congested shards. This would work fine if all transactions only had depth 1.
But receipts produced by an accepted transaction can produce more receipts to any other shard. Therefore, we might end up accepting more inflow that indirectly requires bandwidth on the congested shard.
Crucially, when accepting a transaction, we don't know ahead of time which shards will be affected by the full directed graph of receipts in a transaction. We only know the first step. For multi-hop transactions, there is no easy way out.
But it is worth mentioning, that in practice the single-hop function call is the most common case. And this case can be handled nicely by rejecting incoming transactions to congested shards.
Idea 3: Apply backpressure to stop all flows to a congested shard
On top of stopping transactions to congested shards, we can also stop receipts if they have a congested shard as the receiver. We simply put them in a buffer of the sending shard and keep them there until the congested shard has space again for the receipts.
The problem with this idea is that it leads to deadlocks where all receipts in the system are waiting in outgoing buffers but cannot make progress because the receiving shard already has too high memory usage.
Idea 4: Keep minimum incoming queue length to avoid deadlocks
This is the final idea we need. To avoid deadlocks, we ensure that we can always send receipts to a shard that does not have enough work in the delayed receipts queue already.
Basically, the backpressure limits from idea 3 are only applied to incoming receipts but not for the total size. This guarantees that in the congested scenario that previously caused a deadlock, we always have something in the incoming queue to work on, otherwise there wouldn't be backpressure at all.
We decided to measure the incoming congestion level using gas rather than bytes, because it is here to maximize utilization, not to minimize memory consumption. And utilization is best measured in gas. If we have a queue of 10_000 Tgas waiting, even if only 10% of that is burnt in this step of the transaction, we still have 1000 Tgas of useful work we can contribute to the total flow. Thus under the assumption that at least 10% of gas is being burnt, we have 100% utilization.
A limit in bytes would be better to argue how much memory we need exactly. But in some sense, the two are equivalent, as producing large receipts should cost a linear amount of gas. What exactly the conversion rate is, is rather complicated and warrants its own investigation with potential protocol changes to lower the ratio in the most extreme cases. And this is important regardless of how congestion is handled, given that network bandwidth is becoming more and more important as we add more shards. Issue #8214 tracks our effort on estimating what that cost should be and #9378 tracks our best progress on calculating what it is today.
Of course, we can increase the queue to have even better utility guarantees. But it comes at the cost of longer delays for every transaction or receipt that goes through a congested shard.
This strategy also preserves the backpressure property in the sense that all shards on a path from sources to sinks that contribute to congestion will eventually end up with full buffers. Combined with idea 2, eventually all transactions to those shards are rejected. All of this without affecting shards that are not on the critical path.
Putting it all together
The proposal in NEP-539 combines all ideas 2, 3, and 4.
We have a limit of how much memory we consider to be normal operation (for example 500 MB). Then we stop new transactions coming in to that shard but still allow more incoming transactions to other shards if those are not congested. That alone already solves all problems with single-hop transactions.
In the congested shard itself, we also keep accepting transactions to other shards. But we heavily reduce the gas allocated for new transactions, in order to have more capacity to work on finishing the waiting receipts. This is technically not necessary for any specific property, but it should make sense intuitively that this helps to reduce congestion quicker and therefore lead to a better user experience. This is why we added this feature. And our simulations also support this intuition.
Then we apply backpressure for multi-hop receipts and avoid deadlocks by only applying the backpressure when the receiving shard still has enough work queued up, so that holding receipts back cannot slow down global throughput.
Another design decision was to linearly interpolate the limits, as opposed to binary on and off states. This way, we don't have to be too precise in finding the right parameters, as the system should balance itself around a specific limit that works for each workload.
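For intuition only, linearly interpolating a limit based on a congestion level between 0 and 1 could look like this. The constants are made up for the example and are not the parameters chosen in NEP-539:

// Illustrative: the more congested the receiver, the less gas worth of new
// transactions we accept for it.
fn interpolated_new_tx_gas_limit(congestion_level: f64) -> u64 {
    const MAX_TX_GAS: u64 = 500_000_000_000_000; // 500 Tgas when not congested (made up)
    const MIN_TX_GAS: u64 = 20_000_000_000_000;  //  20 Tgas when fully congested (made up)
    let level = congestion_level.clamp(0.0, 1.0);
    MIN_TX_GAS + ((1.0 - level) * (MAX_TX_GAS - MIN_TX_GAS) as f64) as u64
}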
Meta Transactions
NEP-366 introduced the concept of meta transactions to Near Protocol. This feature allows users to execute transactions on NEAR without owning any gas or tokens. In order to enable this, users construct and sign transactions off-chain. A third party (the relayer) is used to cover the fees of submitting and executing the transaction.
The MVP for meta transactions is currently in the stabilization process. Naturally, the MVP has some limitations, which are discussed in separate sections below. Future iterations have the potential to make meta transactions more flexible.
Overview
Credits for the diagram go to the NEP authors Alexander Fadeev and Egor Uleyskiy.
The graphic shows an example use case for meta transactions. Alice owns an
amount of the fungible token $FT. She wants to transfer some to John. To do
that, she needs to call ft_transfer("john", 10)
on an account named FT
.
In technical terms, ownership of $FT is an entry in the FT
contract's storage
that tracks the balance for her account. Note that this is on the application
layer and thus not a part of Near Protocol itself. But FT
relies on the
protocol to verify that the ft_transfer
call actually comes from Alice. The
contract code checks that predecessor_id
is "Alice"
and if that is the case
then the call is legitimately from Alice, as only she could create such a
receipt according to the Near Protocol specification.
The problem is, Alice has no NEAR tokens. She only has a NEAR account that
someone else funded for her and she owns the private keys. She could create a
signed transaction that would make the ft_transfer("john", 10)
call. But
validator nodes will not accept it, because she does not have the necessary Near
token balance to purchase the gas.
With meta transactions, Alice can create a DelegateAction
, which is very
similar to a transaction. It also contains a list of actions to execute and a
single receiver for those actions. She signs the DelegateAction
and forwards
it (off-chain) to a relayer. The relayer wraps it in a transaction, of which the
relayer is the signer and therefore pays the gas costs. If the inner actions
have an attached token balance, this is also paid for by the relayer.
On chain, the SignedDelegateAction
inside the transaction is converted to an
action receipt with the same SignedDelegateAction
on the relayer's shard. The
receipt is forwarded to Alice's account, which unpacks the SignedDelegateAction and verifies that it is signed by Alice with a valid nonce, etc. If all checks are successful, a new action receipt with the inner actions
as body is sent to FT
. There, the ft_transfer
call finally executes.
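For orientation, here is a simplified sketch of the data a DelegateAction carries, based on NEP-366. The field types below are placeholders; see the NEP and the nearcore sources for the authoritative definition:

// Placeholder types so this sketch stands alone; nearcore defines the real ones.
type AccountId = String;
type PublicKey = Vec<u8>;
type Signature = Vec<u8>;
enum Action { Transfer { deposit: u128 } /* FunctionCall, ..., but never another delegate */ }

// Simplified sketch of a delegate action, based on NEP-366.
struct DelegateAction {
    sender_id: AccountId,   // Alice, who signed the delegate action off-chain
    receiver_id: AccountId, // the single receiver of the inner actions
    actions: Vec<Action>,   // the inner actions to execute on the receiver
    nonce: u64,             // protects against replaying the delegate action
    max_block_height: u64,  // the delegate action expires after this height
    public_key: PublicKey,  // the key the signature must correspond to
}

struct SignedDelegateAction {
    delegate_action: DelegateAction,
    signature: Signature,   // Alice's signature over the delegate action
}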
Relayer
Meta transactions only work with a relayer. This is an application layer
concept, implemented off-chain. Think of it as a server that accepts a
SignedDelegateAction
, does some checks on it, and eventually forwards it
inside a transaction to the blockchain network.
A relayer may choose to offer their service for free but that's not going to be financially viable long-term. But they could easily have the user pay using other means, outside of Near blockchain. And with some tricks, it can even be paid using fungible tokens on Near.
In the example visualized above, the payment is done using $FT. Together with
the transfer to John, Alice also adds an action to pay 0.1 $FT to the relayer.
The relayer checks the content of the SignedDelegateAction
and only processes
it if this payment is included as the first action. In this way, the relayer
will be paid in the same transaction as John.
Note that the payment to the relayer is still not guaranteed. It could be that Alice does not have sufficient $FT and the transfer fails. To mitigate, the relayer should check the $FT balance of Alice first.
Unfortunately, this still does not guarantee that the balance will be high enough once the meta transaction executes. The relayer could waste NEAR gas without compensation if Alice somehow reduces her $FT balance in just the right moment. Some level of trust between the relayer and its user is therefore required.
The vision here is that there will be mostly application-specific relayers. A general-purpose relayer is difficult to implement with just the MVP. See limitations below.
Limitation: Single receiver
A meta transaction, like a normal transaction, can only have one receiver. It's possible to chain additional receipts afterwards. But crucially, there is no atomicity guarantee and no roll-back mechanism.
For normal transactions, this has been widely accepted as a fact for how Near Protocol works. For meta transactions, there was a discussion around allowing multiple receivers with separate lists of actions per receiver. While this could be implemented, it would only create a false sense of atomicity. Since each receiver would require a separate action receipt, there is no atomicity, the same as with chains of receipts.
Unfortunately, this means the trick to compensate the relayer in the same meta
transaction as the serviced actions only works if both happen on the same
receiver. In the example, both happen on FT
and this case works well. But it
would not be possible to send $FT1 and pay the relayer in $FT2. Nor could one
deploy a contract code on Alice
and pay in $FT in one meta transaction. It
would require two separate meta transactions to do that. Due to timing problems,
this again requires some level of trust between the relayer and Alice.
A potential solution could involve linear dependencies between the action receipts spawned from a single meta transaction. Only if the first succeeds, will the second start executing, and so on. But this quickly gets too complicated for the MVP and is therefore left open for future improvements.
Constraints on the actions inside a meta transaction
A transaction is only allowed to contain one single delegate action. Nested delegate actions are disallowed and so are delegate actions next to each other in the same receipt.
Nested delegate actions have no known use case and they would be complicated to implement. Consequently, support for them was omitted.
For delegate actions beside each other, there was a bit of back and forth during the NEP-366 design phase. The potential use case here is essentially the same as having multiple receivers in a delegate action. Naturally, it runs into all the same complications (false sense of atomicity) and ends with the same conclusion: Omitted from the MVP and left open for future improvement.
Limitation: Accounts must be initialized
Any transaction, including meta transactions, must use NONCEs to avoid replay attacks. The NONCE must be chosen by Alice and compared to a NONCE stored on chain. This NONCE is stored on the access key information that gets initialized when creating an account.
Implicit accounts don't need to be initialized in order to receive NEAR tokens, or even $FT. This means users could own $FT but no NONCE is stored on chain for them. This is problematic because we want to enable this exact use case with meta transactions, but we have no NONCE to create a meta transaction.
For the MVP, the proposed solution, or work-around, is that the relayer will
have to initialize the account of Alice once if it does not exist. Note that
this cannot be done as part of the meta transaction. Instead, it will be a
separate transaction that executes first. Only then can Alice even create a
SignedDelegateAction
with a valid NONCE.
Once again, some trust is required. If Alice wanted to abuse the relayer's helpful service, she could ask the relayer to initialize her account. Afterwards, she does not sign a meta transaction, instead she deletes her account and cashes in the small token balance reserved for storage. If this attack is repeated, a significant amount of tokens could be stolen from the relayer.
One partial solution suggested here was to remove the storage staking cost from accounts. This means there is no financial incentive for Alice to delete her account. But it does not solve the problem that the relayer has to pay for the account creation and Alice can simply refuse to send a meta transaction afterwards. In particular, anyone creating an account would have financial incentive to let a relayer create it for them instead of paying out of their own pockets. This would still be better than Alice stealing tokens but fundamentally, there still needs to be some trust.
An alternative solution discussed is to do NONCE checks on the relayer's access key. This prevents replay attacks and allows implicit accounts to be used in meta transactions without even initializing them. The downside is that meta transactions share the same NONCE counter(s). That means, a meta transaction sent by Bob may invalidate a meta transaction signed by Alice that was created and sent to the relayer at the same time. Multiple access keys by the relayer and coordination between relayer and user could potentially alleviate this problem. But for the MVP, nothing along those lines has been approved.
Gas costs for meta transactions
Meta transactions challenge the traditional ways of charging gas for actions. To see why, let's first list the normal flow of gas, outside of meta transactions.
- Gas is purchased (by deducting NEAR from the transaction signer account),
when the transaction is converted into a receipt. The amount of gas is
implicitly defined by the content of the receipt. For function calls, the
caller decides explicitly how much gas is attached on top of the minimum
required amount. The NEAR token price per gas unit is dynamically adjusted on
the blockchain. In today's nearcore code base, this happens as part of verify_and_charge_transaction which gets called in process_transaction.
- For all actions listed inside the transaction, the SEND cost is burned immediately. Depending on the condition sender == receiver, one of two possible SEND costs is chosen. The EXEC cost is not burned, yet. But it is implicitly part of the transaction cost. The third and last part of the transaction cost is the gas attached to function calls. The attached gas is also called prepaid gas. (Not to be confused with total_prepaid_exec_fees which is the implicitly prepaid gas for EXEC action costs.)
- On the receiver shard, EXEC costs are burned before the execution of an action starts. Should the execution fail and abort the transaction, the remaining gas will be refunded to the signer of the transaction.
Ok, now adapt for meta transactions. Let's assume Alice uses a relayer to execute actions with Bob as the receiver.
- The relayer purchases the gas for all inner actions, plus the gas for the delegate action wrapping them.
- The cost of sending the inner actions and the delegate action from the relayer to Alice's shard will be burned immediately. The condition relayer == Alice determines which action SEND cost is taken (sir or not_sir). Let's call this SEND(1).
- On Alice's shard, the delegate action is executed, thus the EXEC gas cost for it is burned. Alice sends the inner actions to Bob's shard. Therefore, we burn the SEND fee again. This time based on Alice == Bob to figure out sir or not_sir. Let's call this SEND(2).
- On Bob's shard, we execute all inner actions and burn their EXEC cost.
Each of these steps should make sense and not be too surprising. But the
consequence is that the implicit costs paid at the relayer's shard are
SEND(1)
+ SEND(2)
+ EXEC
for all inner actions plus SEND(1)
+ EXEC
for
the delegate action. This might be surprising but hopefully with this
explanation it makes sense now!
Gas refunds in meta transactions
Gas refund receipts work exactly like for normal transactions. At every step, the difference between the pessimistic gas price and the actual gas price at that height is computed and refunded. At the end of the last step, additionally all remaining gas is also refunded at the original purchasing price. The gas refunds go to the signer of the original transaction, in this case the relayer. This is only fair, since the relayer also paid for it.
Balance refunds in meta transactions
Unlike gas refunds, the protocol sends balance refunds to the predecessor (a.k.a. sender) of the receipt. This makes sense, as we deposit the attached balance to the receiver, who has to explicitly reattach a new balance to new receipts they might spawn.
In the world of meta transactions, this assumption is also challenged. If an inner action requires an attached balance (for example a transfer action) then this balance is taken from the relayer.
The relayer can see what the cost will be before submitting the meta transaction
and agrees to pay for it, so nothing wrong so far. But what if the transaction
fails execution on Bob's shard? At this point, the predecessor is Alice
and
therefore she receives the token balance refunded, not the relayer. This is
something relayer implementations must be aware of since there is a financial
incentive for Alice to submit meta transactions that have high balances attached
but will fail on Bob's shard.
Function access keys in meta transactions
Assume Alice sends a meta transaction and signs with a function access key. How exactly are permissions applied in this case?
Function access keys can limit the allowance, the receiving contract, and the contract methods. The allowance limitation behaves slightly strangely with meta transactions.
But first, both the methods and the receiver will be checked as expected. That is, when the delegate action is unwrapped on Alice's shard, the access key is loaded from the DB and compared to the function call. If the receiver or method is not allowed, the function call action fails.
For allowance, however, there is no check. All costs have been covered by the relayer. Hence, even if the allowance of the key is insufficient to make the call directly, indirectly through meta transaction it will still work.
This behavior is in the spirit of allowance limiting how much financial resources the user can use from a given account. But if someone were to limit a function access key to one trivial action by setting a very small allowance, that is circumventable by going through a relayer. An interesting twist that comes with the addition of meta transactions.
Serialization: Borsh, Json, ProtoBuf
If you spent some time looking at NEAR code, you'll notice that we have different methods of serializing structures into strings. So in this article, we'll compare these different approaches, and explain how and where we're using them.
ProtocolSchema
All structs which need to be persisted or sent over the network must derive the ProtocolSchema trait:
// First, the schema checksums (hashes) are calculated at compile time and
// require `TypeId` for cross-navigation. However, it is a nightly feature,
// so we enable it manually by putting this in lib.rs:
#![cfg_attr(enable_const_type_id, feature(const_type_id))]

// Then, import the schema calculation functionality by putting this in Cargo.toml:
// near-schema-checker-lib.workspace = true
// [features]
// protocol_schema = [
//     "near-schema-checker-lib/protocol_schema",
//     ...the same feature in all dependent crates...
// ]

// Then, mark your struct with `#[derive(ProtocolSchema)]`:
use near_schema_checker_lib::ProtocolSchema;

#[derive(ProtocolSchema)]
pub struct BlockHeader {
    pub hash: CryptoHash,
    pub height: BlockHeight,
}

// Lastly, mark your crate in `tools/protocol-schema-check/Cargo.toml` as a
// dependency with the `protocol_schema` feature enabled.
This is done to protect structures from accidental changes that could corrupt the database or disrupt the protocol. A dedicated CI check is responsible for checking the consistency of the schema. See README for more details.
All these structures are likely to implement BorshSerialize and BorshDeserialize (see below).
Borsh
Borsh is our custom serializer (link) that we use mostly for things that have to be hashed.
The main feature of Borsh is that there are no two binary representations that deserialize into the same object.
You can read more on how Borsh serializes the data by looking at the Specification tab on borsh.io.
The biggest pitfall/risk of Borsh is that any change to the structure might cause previous data to no longer be parseable.
For example, inserting a new enum variant 'in the middle':
pub enum MyCar {
    Bmw,
    Ford,
}

If we change our enum to this:

pub enum MyCar {
    Bmw,
    Citroen,
    Ford, // !! WRONG - Ford objects cannot be deserialized anymore
}
This is especially tricky if we have conditional compilation:
pub enum MyCar {
    Bmw,
    #[cfg(feature = "french_cars")]
    Citroen,
    Ford,
}
In such a scenario, some of the objects created by binaries with this feature enabled will not be parseable by binaries without this feature.
Removing and adding fields to structures is also dangerous.
Basically - the only 'safe' thing that you can do with Borsh is to add a new enum variant at the end.
JSON
JSON doesn't need much introduction. We're using it for external APIs (jsonrpc) and configuration. It is a very popular, flexible and human-readable format.
Proto (Protocol Buffers)
We started using proto recently - and we plan to use it mostly for our network communication. Protocol buffers are strongly typed - they require you to create a .proto file, where you describe the contents of your message.
For example:
message HandshakeFailure {
// Reason for rejecting the Handshake.
Reason reason = 1;
// Data about the peer.
PeerInfo peer_info = 2;
// GenesisId of the NEAR chain that the peer belongs to.
GenesisId genesis_id = 3;
}
Afterwards, such a proto file is fed to the protoc 'compiler', which returns auto-generated code (in our case Rust code) that can be directly imported into your library.
The main benefit of protocol buffers is their backwards compatibility (as long as you adhere to the rules and donât reuse the same field ids).
Summary
So to recap what we've learned:
- JSON - mostly used for external APIs - look for serde::Serialize/Deserialize
- Proto - currently being developed to be used for network connections - objects have to be specified in a proto file.
- Borsh - for things that we hash (and currently also for all the things that we store on disk - but we might move to proto with this in the future). Look for BorshSerialize/BorshDeserialize
Questions
Why don't you use JSON for everything?
While this is a tempting option, JSON has a few drawbacks:
- size (json is self-describing, so all the field names etc are included every time)
- non-canonical: JSON doesn't specify strict ordering of the fields, so we'd have to impose additional restrictions/rules on that - otherwise the same 'conceptual' message would end up with different hashes.
Ok - so how about proto for everything?
There are a couple of risks related to using proto for things that have to be hashed. A serialized protocol buffer can contain additional data (for example, fields with tag ids that you're not using) and still parse successfully (that's how it achieves backward compatibility).
For example, in this proto:
message First {
string foo = 1;
string bar = 2;
}
message Second {
string foo = 1;
}
Every 'First' message will be successfully parsed as a 'Second' message - which could lead to some programmatic bugs.
Advanced section - RawTrieNode
There is one more place in the code where we use a 'custom' encoding:
RawTrieNodeWithSize defined in store/src/trie/raw_node.rs. While the format
uses Borsh derives and API, there is a difference in how branch children
([Option<CryptoHash>; 16]
) are encoded. Standard Borsh encoding would
encode Option<CryptoHash>
sixteen times. Instead, RawTrieNodeWithSize uses
a bitmap to indicate which elements are set resulting in a different layout.
Imagine a children vector like this:
[Some(0x11), None, Some(0x12), None, None, ...]
Here, we have children at indexes 0 and 2, which gives a bitmap of 101 (decimal 5).
Custom encoder:
// Number of children determined by the bitmask
[16 bits bitmask][32 bytes child][32 bytes child]
[5][0x11][0x12]
// Total size: 2 + 32 + 32 = 66 bytes
Borsh:
[8 bits - 0 or 1][32 bytes child][8 bits - 0 or 1][32 bytes child][8 bits - 0 or 1]...
[1][0x11][0][1][0x12][0][0]...
// Total size: 16 + 32 + 32 = 80 bytes
Code for encoding children is given in BorshSerialize implementation for ChildrenRef type and code for decoding in BorshDeserialize implementation for Children. All of that is in aforementioned store/src/trie/raw_node.rs file.
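For illustration, a minimal sketch of the bitmap-based layout described above could look like this (this is not the actual serializer in store/src/trie/raw_node.rs):

// Sketch: a 16-bit mask saying which children are present, followed by the
// hashes of the present children only.
fn encode_children(children: &[Option<[u8; 32]>; 16]) -> Vec<u8> {
    let mut bitmap: u16 = 0;
    for (i, child) in children.iter().enumerate() {
        if child.is_some() {
            bitmap |= 1 << i;
        }
    }
    let mut out = Vec::with_capacity(2 + 32 * bitmap.count_ones() as usize);
    out.extend_from_slice(&bitmap.to_le_bytes());
    for child in children.iter().flatten() {
        out.extend_from_slice(child);
    }
    out
}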
Proofs
"Don't trust, but verify" - let's talk about proofs
Was your transaction included?
How do you know that your transaction was actually included in the blockchain? Sure, you can 'simply' ask the RPC node, and it might say 'yes', but is it enough?
The other option would be to ask many nodes - hoping that at least one of them would be telling the truth. But what if that is not enough?
The final solution would be to run your own node - this way you'd check all the transactions yourself, and then you could be sure - but this can become quite an expensive endeavor - especially when many shards are involved.
But there is actually a better solution - one that doesn't require you to trust a single (or many) RPC nodes, and that allows you to verify, by yourself, that your transaction was actually executed.
Let's talk about proofs (merkelization):
Imagine you have 4 values that you'd like to store, in such a way that you can easily prove that a given value is present.
One way to do it, would be to create a binary tree, where each node would hold a hash:
- leaves would hold the hashes of the respective values.
- internal nodes would hold the hash of the 'concatenation of the hashes of their children'
- the top node would be called a root node (in this image it is the node n7)
With such a setup, you can prove that a given value exists in this tree, by providing a 'path' from the corresponding leaf to the root, and including all the siblings.
For example to prove that value v[1] exists, we have to provide all the nodes marked as green, with the information about which sibling (left or right) they are:
# information needed to verify that node v[1] is present in a tree
# with a given root (n7)
[(Left, n0), (Right, n6)]
# Verification
assert_eq!(root, hash(hash(n0, hash(v[1])), n6))
We use the technique above (called merkelization) in a couple of places in our protocol, but for today's article, I'd like to focus on receipts & outcome roots.
Merkelization, receipts and outcomes
In order to prove that a given receipt belongs to a given block, we will need to fetch some additional information.
As NEAR is sharded, the receipts actually belong to 'Chunks', not Blocks themselves, so the first step is to find the correct chunk and fetch its ChunkHeader.
ShardChunkHeaderV3 {
inner: V2(
ShardChunkHeaderInnerV2 {
prev_block_hash: `C9WnNCbNvkQvnS7jdpaSGrqGvgM7Wwk5nQvkNC9aZFBH`,
prev_state_root: `5uExpfRqAoZv2dpkdTxp1ZMcids1cVDCEYAQwAD58Yev`,
outcome_root: `DBM4ZsoDE4rH5N1AvCWRXFE9WW7kDKmvcpUjmUppZVdS`,
encoded_merkle_root: `2WavX3DLzMCnUaqfKPE17S1YhwMUntYhAUHLksevGGfM`,
encoded_length: 425,
height_created: 417,
shard_id: 0,
gas_used: 118427363779280,
gas_limit: 1000000000000000,
balance_burnt: 85084341232595000000000,
outgoing_receipts_root: `4VczEwV9rryiVSmFhxALw5nCe9gSohtRpxP2rskP3m1s`,
tx_root: `11111111111111111111111111111111`,
validator_proposals: [],
},
),
}
The field that we care about is called outcome_root
. This value represents the
root of the binary merkle tree, that is created based on all the receipts that
were processed in this chunk.
Note: You can notice that we also have a field here called
encoded_merkle_root
- this is another case where we use merkelization in our
chain - this field is a root of a tree that holds hashes of all the "partial
chunks" into which we split the chunk to be distributed over the network.
So, in order to verify that a given receipt/transaction was really included, we have to compute its hash (see details below), get the path to the root, and voila, we can confirm that it was really included.
But how do we get the siblings on the path to the root? This is actually something that RPC nodes do return in their responses.
If you ever looked closely at NEAR's tx-status response, you may have noticed a "proof" section there. For every receipt, you'd see something like this:
proof: [
{
direction: 'Right',
hash: '2wTFCh2phFfANicngrhMV7Po7nV7pr6gfjDfPJ2QVwCN'
},
{
direction: 'Right',
hash: '43ei4uFk8Big6Ce6LTQ8rotsMzh9tXZrjsrGTd6aa5o6'
},
{
direction: 'Left',
hash: '3fhptxeChNrxWWCg8woTWuzdS277u8cWC9TnVgFviu3n'
},
{
direction: 'Left',
hash: '7NTMqx5ydMkdYDFyNH9fxPNEkpgskgoW56Y8qLoVYZf7'
}
]
And the values in there are exactly the siblings (plus info on which side of the tree the sibling is), on the path to the root.
Note: the proof section doesn't contain the root itself and also doesn't include the hash of the receipt.
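Given the receipt hash and those (direction, sibling) pairs, the verification itself is just repeated hashing. Here is a minimal Rust sketch, assuming the sha2 crate for SHA-256 (nearcore ships its own merkle path helpers):

use sha2::{Digest, Sha256};

enum Direction { Left, Right }

// Fold the item hash with each sibling, putting the sibling on the side its
// direction indicates, then compare the result with the expected root.
fn verify_proof(item_hash: [u8; 32], proof: &[(Direction, [u8; 32])], root: [u8; 32]) -> bool {
    let mut hash = item_hash;
    for (direction, sibling) in proof {
        let mut hasher = Sha256::new();
        match direction {
            // the sibling sits on the right, our running hash on the left
            Direction::Right => { hasher.update(hash); hasher.update(sibling); }
            // the sibling sits on the left, our running hash on the right
            Direction::Left => { hasher.update(sibling); hasher.update(hash); }
        }
        hash = hasher.finalize().into();
    }
    hash == root
}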
[Advanced section]: Let's look at a concrete example
Imagine that we have the following receipt:
{
block_hash: '7FtuLHR3VSNhVTDJ8HmrzTffFWoWPAxBusipYa2UfrND',
id: '6bdKUtGbybhYEQ2hb2BFCTDMrtPBw8YDnFpANZHGt5im',
outcome: {
executor_id: 'node0',
gas_burnt: 223182562500,
logs: [],
metadata: { gas_profile: [], version: 1 },
receipt_ids: [],
status: { SuccessValue: '' },
tokens_burnt: '0'
},
proof: [
{
direction: 'Right',
hash: 'BWwZ4wHuzaUxdDSrhAEPjFQtDgwzb8K4zoNzfX9A3SkK'
},
{
direction: 'Left',
hash: 'Dpg4nQQwbkBZMmdNYcZiDPiihZPpsyviSTdDZgBRAn2z'
},
{
direction: 'Right',
hash: 'BruTLiGx8f71ufoMKzD4H4MbAvWGd3FLL5JoJS3XJS3c'
}
]
}
Remember that the outcomes of the execution will be added to the NEXT block, so let's find the next block hash, and the proper chunk.
(in this example, I've used the view-state chain from neard)
417 7FtuLHR3VSNhVTDJ8HmrzTffFWoWPAxBusipYa2UfrND | node0 | parent: 416 C9WnNCbNvkQvnS7jdpaSGrqGvgM7Wwk5nQvkNC9aZFBH | .... 0: E6pfD84bvHmEWgEAaA8USCn2X3XUJAbFfKLmYez8TgZ8 107 Tgas |1: Ch1zr9TECSjDVaCjupNogLcNfnt6fidtevvKGCx8c9aC 104 Tgas |2: 87CmpU6y7soLJGTVHNo4XDHyUdy5aj9Qqy4V7muF5LyF 0 Tgas |3: CtaPWEvtbV4pWem9Kr7Ex3gFMtPcKL4sxDdXD4Pc7wah 0 Tgas
418 J9WQV9iRJHG1shNwGaZYLEGwCEdTtCEEDUTHjboTLLmf | node0 | parent: 417 7FtuLHR3VSNhVTDJ8HmrzTffFWoWPAxBusipYa2UfrND | .... 0: 7APjALaoxc8ymqwHiozB5BS6mb3LjTgv4ofRkKx2hMZZ 0 Tgas |1: BoVf3mzDLLSvfvsZ2apPSAKjmqNEHz4MtPkmz9ajSUT6 0 Tgas |2: Auz4FzUCVgnM7RsQ2noXsHW8wuPPrFxZToyLaYq6froT 0 Tgas |3: 5ub8CZMQmzmZYQcJU76hDC3BsajJfryjyShxGF9rzpck 1 Tgas
I know that the receipt should belong to Shard 3 so let's fetch the chunk header:
$ neard view-state chunks --chunk-hash 5ub8CZMQmzmZYQcJU76hDC3BsajJfryjyShxGF9rzpck
ShardChunkHeaderV3 {
inner: V2(
ShardChunkHeaderInnerV2 {
prev_block_hash: `7FtuLHR3VSNhVTDJ8HmrzTffFWoWPAxBusipYa2UfrND`,
prev_state_root: `6rtfqVEXx5STLv5v4zwLVqAfq1aRAvLGXJzZPK84CPpa`,
outcome_root: `2sZ81kLj2cw5UHTjdTeMxmaWn2zFeyr5pFunxn6aGTNB`,
encoded_merkle_root: `6xxoqYzsgrudgaVRsTV29KvdTstNYVUxis55KNLg6XtX`,
encoded_length: 8,
height_created: 418,
shard_id: 3,
gas_used: 1115912812500,
gas_limit: 1000000000000000,
balance_burnt: 0,
outgoing_receipts_root: `8s41rye686T2ronWmFE38ji19vgeb6uPxjYMPt8y8pSV`,
tx_root: `11111111111111111111111111111111`,
validator_proposals: [],
},
),
height_included: 0,
signature: ed25519:492i57ZAPggqWEjuGcHQFZTh9tAKuQadMXLW7h5CoYBdMRnfY4g7A749YNXPfm6yXnJ3UaG1ahzcSePBGm74Uvz3,
hash: ChunkHash(
`5ub8CZMQmzmZYQcJU76hDC3BsajJfryjyShxGF9rzpck`,
),
}
So the outcome_root is 2sZ81kLj2cw5UHTjdTeMxmaWn2zFeyr5pFunxn6aGTNB - let's verify it then.
Our first step is to compute the hash of the receipt, which is equal to
hash(borsh([receipt_id, hash(borsh(receipt_payload))]))
import base58, hashlib, struct

# this is the hash of the borsh-serialized ExecutionOutcome struct.
# computing this, we leave as an exercise for the reader :-)
receipt_payload_hash = "7PeGiDjssz65GMCS2tYPHUm6jYDeBCzpuPRZPmLNKSy7"

receipt_hash = hashlib.sha256(
    struct.pack("<I", 2)
    + base58.b58decode("6bdKUtGbybhYEQ2hb2BFCTDMrtPBw8YDnFpANZHGt5im")
    + base58.b58decode(receipt_payload_hash)
).digest()
And then we can start reconstructing the tree:
def combine(a, b):
    # a and b are raw 32-byte hashes
    return hashlib.sha256(a + b).digest()

def node(b58_hash):
    # proof hashes come base58-encoded; decode them to raw bytes
    return base58.b58decode(b58_hash)

# one node example
# combine(receipt_hash, node("BWwZ4wHuzaUxdDSrhAEPjFQtDgwzb8K4zoNzfX9A3SkK"))
# whole tree
root = combine(combine(node("Dpg4nQQwbkBZMmdNYcZiDPiihZPpsyviSTdDZgBRAn2z"), combine(receipt_hash, node("BWwZ4wHuzaUxdDSrhAEPjFQtDgwzb8K4zoNzfX9A3SkK"))), node("BruTLiGx8f71ufoMKzD4H4MbAvWGd3FLL5JoJS3XJS3c"))
# result: base58.b58encode(root) == b"2sZ81kLj2cw5UHTjdTeMxmaWn2zFeyr5pFunxn6aGTNB"
And success - our result is matching the outcome root, so it means that our receipt was indeed processed by the blockchain.
Resharding V2
DISCLAIMER: This document describes Resharding V2 and it may not be up to date with recent changes to nearcore
.
Resharding
Resharding is the process in which the shard layout changes. The primary purpose of resharding is to keep the shards small so that a node meeting minimum hardware requirements can safely keep up with the network while tracking some set minimum number of shards.
Specification
The resharding is described in more detail in the following NEPs:
Shard layout
The shard layout determines the number of shards and the assignment of accounts to shards (as a single account cannot be split between shards).
There are two versions of the ShardLayout enum.
- v0 - maps an account to a shard by taking the hash of the account id modulo the number of shards
- v1 - maps an account to a shard by looking at a set of predefined boundary accounts and selecting the shard where the account fits, using alphabetical order
At the time of writing there are three pre-defined shard layouts but more can be added in the future.
- v0 - The first shard layout that contains only a single shard encompassing all the accounts.
- simple nightshade - Splits the accounts into 4 shards.
- simple nightshade v2 - Splits the accounts into 5 shards.
IMPORTANT: Using alphabetical order applies to the full account name, so a.near
could belong to
shard 0, while z.a.near
to shard 3.
Currently in mainnet & testnet, we use the fixed shard split (which is defined in get_simple_nightshade_layout
):
vec!["aurora", "aurora-0", "kkuuue2akv_1630967379.near"]
In the near future we are planning on switching to simple nightshade v2 (which is defined in get_simple_nightshade_layout_v2
)
vec!["aurora", "aurora-0", "kkuuue2akv_1630967379.near", "tge-lockup.sweat"]
Shard layout changes
Shard Layout is determined at epoch level in the AllEpochConfig based on the protocol version of the epoch.
The shard layout can change at the epoch boundary. Currently, in order to change the shard layout, it is necessary to manually determine the new shard layout and set it for the desired protocol version in the AllEpochConfig.
Deeper technical details
It all starts in preprocess_block - if the node sees that the block it is about to preprocess is the first block of the epoch (X+1), it calls get_state_sync_info, which is responsible for figuring out which shards will be needed in the next epoch (X+2).
This is the moment when the node can request new shards that it didn't track before (using StateSync) - and if it detects that the shard layout will change in the next epoch, it also invokes StateSync - but skips the download part (as it already has the data) - and starts from resharding.
StateSync in this phase would send the ReshardingRequest
to the SyncJobsActor
(you can think about the SyncJobsActor
as a background thread).
We use this background thread to perform the resharding: the goal is to split the one trie (that represents the state of the current shard) into multiple tries (one for each of the new shards).
In order to split a trie into children tries we use a snapshot of the flat storage. We iterate over all of the entries in the flat storage and we build the children tries by inserting each entry into the appropriate child trie.
Extracting of the account from the key happens in parse_account_id_from_raw_key
- and we do it for all types of data that we store in the trie (contract code, keys, account info etc) EXCEPT for Delayed receipts. Then, we figure out the shard that this account is going to belong to, and we add this key/value to that new trie.
This way, after going over all the key/values from the original trie, we end up with X new tries (one for each new shard).
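A heavily simplified sketch of that loop is shown below; the function signatures are hypothetical stand-ins, not the actual nearcore API (delayed receipts, discussed next, are handled separately):

```rust
use std::collections::BTreeMap;

/// Heavily simplified sketch of the splitting loop. Every flat-storage entry of
/// the parent shard is routed to exactly one child trie based on the account id
/// encoded in its key. All types and signatures here are illustrative.
fn split_parent_state(
    parent_flat_storage: &BTreeMap<Vec<u8>, Vec<u8>>,
    // Stand-in for `parse_account_id_from_raw_key` in the real code.
    parse_account_id: impl Fn(&[u8]) -> String,
    // Stand-in for the new shard layout's account-to-shard mapping.
    account_to_child: impl Fn(&str) -> usize,
    num_children: usize,
) -> Vec<BTreeMap<Vec<u8>, Vec<u8>>> {
    // One in-memory "trie" per child shard; the real code builds actual tries
    // and commits them in throttled batches.
    let mut children = vec![BTreeMap::new(); num_children];
    for (key, value) in parent_flat_storage {
        let account_id = parse_account_id(key.as_slice());
        let child = account_to_child(account_id.as_str());
        children[child].insert(key.clone(), value.clone());
    }
    children
}
```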
IMPORTANT: in the current code, we only support such 'splitting' (so a new shard can have just one parent).
Why are delayed receipts special?
For all the other columns there is no dependency between entries, but in the case of delayed receipts we are forming a 'queue'. We store information about the first index and the last index (in the DelayedReceiptIndices struct).
Then, when a receipt arrives, we add it under the 'DELAYED_RECEIPT + last_index' key (and increment last_index by 1).
That is why we cannot move this trie entry type in the same way as the others, where the account id is part of the key. Instead we do it by iterating over this queue and inserting each entry into the queue of the relevant child shard.
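A minimal sketch of that queue reassignment, with hypothetical stand-in types, could look like this:

```rust
/// Illustrative sketch only: reassigning the parent's delayed-receipt queue to
/// the children. The receipt type and helpers are hypothetical stand-ins.
struct DelayedQueue<R> {
    /// Conceptually the entries between the first and last index of the queue.
    receipts: Vec<R>,
}

fn split_delayed_receipts<R: Clone>(
    parent_queue: &DelayedQueue<R>,
    receiver_child_shard: impl Fn(&R) -> usize,
    num_children: usize,
) -> Vec<DelayedQueue<R>> {
    let mut children: Vec<DelayedQueue<R>> =
        (0..num_children).map(|_| DelayedQueue { receipts: Vec::new() }).collect();
    // Walk the parent queue in order and append each receipt to the tail of the
    // relevant child's queue, so ordering within each child is preserved.
    for receipt in &parent_queue.receipts {
        let child = receiver_child_shard(receipt);
        children[child].receipts.push(receipt.clone());
    }
    children
}
```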
Constraints
The state sync of the parent shard, the resharding and the catchup of the children shards must all complete within a single epoch.
Rollout
Flow
The resharding will be initiated by including it in a dedicated protocol version, released together with neard. Here is the expected flow of events:
- A new neard release is published and protocol version upgrade date is set to D, roughly a week from the release.
- All node operators upgrade their binaries to the newly released version within the given timeframe, ideally as soon as possible but no later than D.
- The protocol version upgrade voting takes place at D in an epoch E and nodes vote in favour of switching to the new protocol version in epoch E+2.
- The resharding begins at the beginning of epoch E+1.
- The network switches to the new shard layout in the first block of epoch E+2.
Monitoring
Resharding exposes a number of metrics and logs that allow for monitoring the resharding process as it is happening. Resharding requires manual recovery in case anything goes wrong and should be monitored in order to ensure smooth node operation.
- near_resharding_status is the primary metric that should be used for tracking the progress of resharding. It's tagged with a shard_uid label of the parent shard. It's set to corresponding ReshardingStatus enum and can take one of the following values
- 0 - Scheduled - resharding is scheduled and waiting to be executed.
- 1 - Building - resharding is running. Only one shard at a time can be in that state while the rest will be either finished or waiting in the Scheduled state.
- 2 - Finished - resharding is finished.
- -1 - Failed - resharding failed and manual recovery action is required. The node will operate as usual until the end of the epoch but will then stop being able to process blocks.
- near_resharding_batch_size and near_resharding_batch_count - those two metrics show how much data has been resharded. Both metrics should progress with the near_resharding_status as follows.
- While in the Scheduled state both metrics should remain 0.
- While in the Building state both metrics should be gradually increasing.
- While in the Finished state both metrics should remain at the same value.
- near_resharding_batch_prepare_time_bucket, near_resharding_batch_apply_time_bucket and near_resharding_batch_commit_time_bucket - those three metrics can be used to track the performance of resharding and fine tune throttling if needed. As a rule of thumb the combined time of prepare, apply and commit for a batch should remain at the 100ms-200ms level on average. Higher batch processing time may lead to disruptions in block processing, missing chunks and blocks.
Here are some example metric values when finished for different shards and networks. The duration column reflects the duration of the building phase. Those were captured in a production-like environment in November 2023; actual times at the time of resharding in production may be slightly higher.
mainnet | duration | batch count | batch size |
---|---|---|---|
total | 2h23min | ||
shard 0 | 32min | 12,510 | 6.6GB |
shard 1 | 30min | 12,203 | 6.1GB |
shard 2 | 26min | 10,619 | 6.0GB |
shard 3 | 55min | 21,070 | 11.5GB |
testnet | duration | batch count | batch size |
---|---|---|---|
total | 5h32min | ||
shard 0 | 21min | 10,259 | 10.9GB |
shard 1 | 18min | 7,034 | 3.5GB |
shard 2 | 2h31min | 75,529 | 75.6GB |
shard 3 | 2h22min | 63,621 | 49.2GB |
Here is an example of what that may look like in a grafana dashboard. Please keep in mind that the values and durations are not representative, as the sample data was captured in a testing environment with a different configuration.
Throttling
The resharding process can be quite resource intensive and affect the regular operation of a node. In order to mitigate that, as well as to limit any need for increasing the hardware specifications of nodes, throttling was added. Throttling slows down resharding so that it does not impact other node operations. Throttling can be configured by adjusting the resharding_config in the node config file.
- batch_size - controls the size of batches in which resharding moves data around. Setting a smaller batch size will slow down the resharding process and make it less resource-consuming.
- batch_delay - controls the delay between processing of batches. Setting a smaller batch delay will speed up the resharding process and make it more resource-consuming.
The remaining fields in the ReshardingConfig are only intended for testing purposes and should remain set to their default values.
The default configuration for ReshardingConfig should provide a good and safe setting for resharding in the production networks. There is no need for node operators to make any changes to it unless they observe issues.
The resharding config can be adjusted at runtime, without restarting the node. The config needs to be updated first and then a SIGHUP signal should be sent to the neard process. When it receives the signal, neard will update the config and print a log message showing which fields were changed. It's recommended to check the log to make sure the relevant config change was correctly picked up.
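To make the two throttling knobs described above concrete, here is a minimal sketch of a throttled batching loop, assuming hypothetical helper types; batch_size and batch_delay play the same roles as the fields of the same name in resharding_config:

```rust
use std::time::Duration;

/// Minimal sketch of a throttled batching loop, not the actual resharding code.
fn copy_with_throttling(
    entries: &[(Vec<u8>, Vec<u8>)],
    batch_size: usize,     // approximate bytes per commit
    batch_delay: Duration, // sleep between batches
    mut commit_batch: impl FnMut(&[(Vec<u8>, Vec<u8>)]),
) {
    let mut batch: Vec<(Vec<u8>, Vec<u8>)> = Vec::new();
    let mut batch_bytes = 0usize;
    for (key, value) in entries {
        batch_bytes += key.len() + value.len();
        batch.push((key.clone(), value.clone()));
        if batch_bytes >= batch_size {
            commit_batch(batch.as_slice());
            batch.clear();
            batch_bytes = 0;
            // A smaller delay speeds resharding up but increases resource pressure.
            std::thread::sleep(batch_delay);
        }
    }
    if !batch.is_empty() {
        commit_batch(batch.as_slice());
    }
}
```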
Future possibilities
Localize resharding to a single shard
Currently when resharding we need to move the data for all shards even if only a single shard is being split. That is because the storage key contains a version field that needs to be updated when the shard layout version changes.
This can be improved by changing how ShardUId works, e.g. by removing the version and instead using globally unique shard ids.
Dynamic resharding
The current implementation relies on having the shard layout determined offline and manually added to the node implementation.
Dynamic resharding would mean that the network itself can automatically determine that resharding is needed, decide what the new shard layout should be, and schedule the resharding.
Support different changes to shard layout
The current implementation only supports splitting a shard. In the future we can consider adding support for other operations such as merging two shards or moving an existing boundary account.
How neard will work
The documents in this chapter talk about the future of NEAR - what we're planning to improve and how.
(This also means that they can get out of date quickly :-).
If you have comments, suggestions or want to help us design and implement some of these things - please reach out on Zulip or GitHub.
This document is still a DRAFT.
This document covers our improvement plans for state sync and catchup. Before reading this doc, you should take a look at How sync works.
State sync is used in two situations:
- when your node is behind by more than 2 epochs (and it is not an archival node) - then rather than trying to apply blocks one by one (which can take hours), you 'give up', download the fresh state (a.k.a. state sync) and apply blocks from there.
- when you're a block (or chunk) producer - and in the upcoming epoch, you'll have to track a shard that you are not currently tracking.
In the past (and currently), state sync was mostly used in the first scenario (as all block & chunk producers had to track all the shards for security reasons - so they didn't actually have to do catchup at all).
As we progress towards phase 2 and keep increasing the number of shards, the catchup part becomes a lot more critical. When we're running a network with 100 shards, a single machine is simply not capable of tracking (a.k.a. applying all transactions of) all shards - so it will have to track just a subset. And it will have to change this subset almost every epoch (as the protocol rebalances the shard-to-producer assignment based on stakes).
This means that we have to do some larger changes to the state sync design, as requirements start to differ a lot:
- catchups are high priority (the validator MUST catch up within 1 epoch - otherwise it will not be able to produce blocks for the new shards in the next epoch - and therefore it will not earn rewards).
- a lot more catchups in progress (with lots of shards, basically every validator would have to catch up at least one shard at each epoch boundary) - this leads to a lot more potential traffic on the network
- malicious attacks & incentives - the state data can be large and can cause a lot of network traffic. At the same time it is quite critical (see point above), so we'll have to make sure that the nodes are incentivised to provide the state parts upon request.
- only a subset of peers will be available to request the state sync from (as not everyone from our peers will be tracking the shard that we're interested in).
Things that we're actively analysing
Performance of state sync on the receiver side
We're looking at the performance of state sync:
- how long it takes to create the parts,
- pro-actively creating the parts as soon as the epoch starts,
- creating them in parallel,
- allowing the user to ask for many parts at once,
- allowing the user to provide a bitmask of the parts that are required (therefore allowing the server to return only the ones that it has already cached).
Better performance on the requestor side
Currently the parts are applied only once all of them are downloaded - instead we should try to apply them in parallel - after each part is received.
When we receive a part, we should announce this information to our peers - so that they know that they can request it from us if they need it.
Ideas - not actively working on them yet
Better networking (a.k.a Tier 3)
Currently our networking code picks the peers to connect to at random (as most of them are tracking all the shards). With phase2 this will no longer be the case, so we should work on improvements to our peer-selection mechanism.
In general - we should make sure that we have direct connection to at least a few nodes that are tracking the same shards that we're tracking right now (or that we'll want to track in the near future).
Dedicated nodes optimized towards state sync responses
The idea is to create a set of nodes that would specialize in state sync responses (similar to how we have archival nodes today).
A sub-idea of this is to store such data on one of the cloud providers (AWS, GCP).
Sending deltas instead of full state syncs
In the case of catchup, the requesting node might have tracked that shard in the past. So we could consider sending just a delta of the state rather than the whole state.
While this helps with the amount of data being sent, it might require the sender to do a lot more work (as the data that it is about to send cannot be easily cached).
Malicious producers in phase 2 of sharding.
In this document, we'll compare the impact of the hypothetical malicious producer on the NEAR system (both in the current setup and how it will work when phase2 is implemented).
Current state (Phase 1)
Let's assume that a malicious chunk producer C1 has produced a bad chunk and sent it to the block producer at this height, B1.
The block producer IS going to add the chunk to the block (as we don't validate the chunks before adding to blocks - but only when signing the block - see Transactions and receipts - last section).
After this block is produced, it is sent to all the validators to get the signatures.
As currently all the validators are tracking all the shards - they will quickly notice that the chunk is invalid, so they will not sign the block.
Therefore the next block producer B2 is going to ignore B1's block, and select the block from B0 as a parent instead.
So TL;DR - a bad chunk would not be added to the chain.
Phase 2 and sharding
Unfortunately things get a lot more complicated, once we scale.
Let's assume the same setup as above (a single chunk producer C1
being
malicious). But this time, we have 100 shards - each validator is tracking just
a few (they cannot track all - as today - as they would have to run super
powerful machines with > 100 cores).
So in a similar scenario as above - C1 creates a malicious chunk and sends it to B1, which includes it in the block.
And here's where the complexity starts - as most of the validators will NOT
track the shard which C1
was producing - so they will still sign the block.
The validators that do track that shard will of course (assuming that they are non-malicious) refuse to sign. But overall, they will be a small minority - so the block is going to get enough signatures and be added to the chain.
Challenges, Slashing and Rollbacks
So we're in a pickle - as a malicious chunk was just added to the chain. And that's why we need to have mechanisms to automatically recover from such situations: Challenges, Slashing and Rollbacks.
Challenge
A challenge is a self-contained proof that something went wrong in the chunk processing. It must contain all the inputs (with their merkle proofs), the code that was executed, and the outputs (also with merkle proofs).
Such a challenge allows anyone (even nodes that don't track that shard or have any state) to verify the validity of the challenge.
When anyone notices that the current chain contains a wrong transition, they submit such a challenge to the next block producer, which can easily verify it and add it to the next block.
Then the validators do the verification themselves, and if successful, they sign the block.
When such a block is successfully signed, the protocol automatically slashes malicious nodes (more details below) and initiates the rollback to bring the state back to the state before the bad chunk (so in our case, back to the block produced by B0).
Slashing
Slashing is the process of taking away the part of the stake from validators that are considered malicious.
In the example above, we'll definitely need to slash the C1
- and potentially also any validators that were tracking that shard and did sign the bad block.
Things that we'll have to figure out in the future:
- how much do we slash? all of the stake? some part?
- what happens to the slashed stake? is it burned? does it go to some pool?
State rollbacks
// TODO: add
Problems with the current Phase 2 design
Is slashing painful enough?
In the example above, we'd successfully slash the C1
producer - but was it
enough?
Currently (with 4 shards) you need around 20k NEAR to become a chunk producer. If we increase the number of shards to 100, it would drop the minimum stake to around 1k NEAR.
In such a scenario, by sacrificing 1k NEAR, the malicious node can cause the system to roll back a couple of blocks (potentially having a bad impact on bridge contracts etc).
On the other side, you could be a non-malicious chunk producer with a corrupted database (or a nasty bug in the code) - and the effect would be the same - the chunk that you produced would be marked as malicious, and you'd lose your stake (which would be a super-scary event for any legitimate validator).
So the open question is - can we do something 'smarter' in the protocol to detect the case, where there is 'just a single' malicious (or buggy) chunk producer and avoid the expensive rollback?
Storage
This is our work-in-progress storage documentation. Things are raw and incomplete. You are encouraged to help improve it, in any capacity you can!
Flow
Here we present the flow of a single read or write request from the transaction runtime all the way to the OS. As you can see, there are many layers of read-caching and write-buffering involved.
Blue arrow means a call triggered by read.
Red arrow means a call triggered by write.
Black arrow means a non-trivial data dependency. For example:
- Nodes which are read on TrieStorage go to TrieRecorder to generate proof, so they are connected with black arrow.
- Memtrie lookup needs current state of accounting cache to compute costs. When query completes, accounting cache is updated with memtrie nodes. So they are connected with bidirectional black arrow.
Trie
We use a Merkle-Patricia Trie to store the blockchain state. The Trie is persistent, which means that the insertion of a new node actually leads to the creation of a new path to this node, and thus the root of the Trie after insertion will also be represented by a new object.
Here we describe its implementation details which are closely related to Runtime.
Main structures
Trie
Trie stores the state - accounts, contract codes, access keys, etc. Each state item corresponds to a unique trie key. You can read more about this structure on Wikipedia.
There are two ways to access trie - from memory and from disk. The first one is currently the main one, where only the loading stage requires disk, and the operations are fully done in memory. The latter one relies only on disk with several layers of caching. Here we describe the disk trie.
The disk trie is stored in RocksDB, which is persistent across node restarts. Trie communicates with the database using TrieStorage. On the database level, data is stored in key-value format in the DBCol::State column. There are two kinds of records:
- trie nodes, for which the key is constructed from the shard id and the RawTrieNodeWithSize hash, and the value is a RawTrieNodeWithSize serialized by a custom algorithm;
- values (encoded contract codes, postponed receipts, etc.), for which the key is constructed from the shard id and the hash of the value, which maps to the encoded value.
So, a value can be obtained from a TrieKey as follows:
- start from the hash of the RawTrieNodeWithSize corresponding to the root;
- descend to the needed node using nibbles from the TrieKey;
- extract the underlying RawTrieNode;
- if it is a Leaf or Branch, it should contain the hash of the value;
- get the value from storage by its hash and shard id.
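A compact sketch of this descent, using simplified stand-in types (the real RawTrieNode also has extension nodes and other details), might look like:

```rust
/// Simplified node model for illustration only.
enum Node {
    Leaf { key_nibbles: Vec<u8>, value_hash: [u8; 32] },
    Branch { children: [Option<[u8; 32]>; 16], value_hash: Option<[u8; 32]> },
}

/// Sketch of the lookup steps listed above. `get_node` stands in for
/// `TrieStorage::retrieve_raw_bytes` plus deserialization; `get_value` stands in
/// for fetching a value record from `DBCol::State` by its hash.
fn lookup(
    root_hash: [u8; 32],
    key_nibbles: &[u8],
    get_node: impl Fn([u8; 32]) -> Option<Node>,
    get_value: impl Fn([u8; 32]) -> Option<Vec<u8>>,
) -> Option<Vec<u8>> {
    let mut hash = root_hash;
    let mut remaining = key_nibbles;
    loop {
        match get_node(hash)? {
            Node::Leaf { key_nibbles, value_hash } => {
                // The leaf stores the tail of the key plus the hash of the value.
                return if key_nibbles.as_slice() == remaining {
                    get_value(value_hash)
                } else {
                    None
                };
            }
            Node::Branch { children, value_hash } => match remaining.split_first() {
                None => return value_hash.and_then(|h| get_value(h)),
                Some((nibble, rest)) => {
                    // Follow the child that corresponds to the next nibble.
                    hash = children[*nibble as usize]?;
                    remaining = rest;
                }
            },
        }
    }
}
```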
Note that Trie is almost never called directly from Runtime; modifications are made using TrieUpdate.
TrieUpdate
Provides a way to access storage and record changes to commit in the future. An update is prepared as follows:
- changes are made using the set and remove methods, which add them to the prospective field,
- the commit method moves prospective changes to committed,
- the finalize method prepares TrieChanges and state changes based on the committed field.
Prospective changes correspond to intermediate state updates, which can be discarded if the transaction is considered invalid (because of insufficient balance, invalidity, etc.). While they can't be applied yet, they must be cached this way if the updated keys are accessed again in the same transaction.
Committed changes are stored in memory across transactions and receipts. Similarly, they must be cached if the updated keys are accessed across transactions. They can be discarded only if the chunk is discarded.
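The following is a minimal, self-contained model of this prospective/committed flow; it mirrors the described behavior but is not the real TrieUpdate type:

```rust
use std::collections::BTreeMap;

/// Toy model of the `TrieUpdate` flow described above; the real type wraps a
/// `Trie` and produces `TrieChanges`, which is omitted here.
#[derive(Default)]
struct TrieUpdateSketch {
    prospective: BTreeMap<Vec<u8>, Option<Vec<u8>>>, // None = removal
    committed: BTreeMap<Vec<u8>, Option<Vec<u8>>>,
}

impl TrieUpdateSketch {
    fn set(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.prospective.insert(key, Some(value));
    }
    fn remove(&mut self, key: Vec<u8>) {
        self.prospective.insert(key, None);
    }
    /// Transaction/receipt succeeded: promote prospective changes to committed.
    fn commit(&mut self) {
        let staged = std::mem::take(&mut self.prospective);
        self.committed.extend(staged);
    }
    /// Transaction/receipt failed: throw away the prospective changes.
    fn rollback(&mut self) {
        self.prospective.clear();
    }
    /// End of chunk: in the real code this produces `TrieChanges` and the
    /// state-change log from the committed set.
    fn finalize(self) -> BTreeMap<Vec<u8>, Option<Vec<u8>>> {
        self.committed
    }
}
```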
Note that finalize
, Trie::insert
and Trie::update
do not update the
database storage. These functions only modify trie nodes in memory. Instead,
these functions prepare the TrieChanges
object, and Trie
is actually updated
when ShardTries::apply_insertions
is called, which puts new values to
DBCol::State
part of the key-value database.
TrieStorage
Stores all Trie nodes and allows getting serialized nodes by the TrieKey hash using the retrieve_raw_bytes method.
There are two major implementations of TrieStorage:
- TrieCachingStorage - caches all big values ever read by retrieve_raw_bytes.
- TrieMemoryPartialStorage - used for validating recorded partial storage.
Note that these storages use database keys, which are obtained from trie node hashes using the get_key_from_shard_id_and_hash method.
ShardTries
This is the main struct that is used to access all Tries. There's usually only a single instance of this and it contains stores and caches. We use this to gain access to the Trie
for a single shard by calling the get_trie_for_shard
or equivalent methods.
Each shard within ShardTries has its own cache and view_cache. The cache stores the most frequently accessed nodes and is usually used during block production. The view_cache is used to serve user requests to get data, which usually come in via the network. It is a good idea to have an independent cache for this, as we can have patterns in accessing user data that are independent of block production.
Primitives
TrieChanges
Stores the result of updating the Trie.
- old_root: the root before updating the Trie, i.e. before inserting new nodes and deleting old ones,
- new_root: the root after updating the Trie,
- insertions, deletions: vectors of TrieRefcountChange, describing all inserted and deleted nodes.
This way of updating the trie allows adding new nodes to storage and removing old ones separately. The former corresponds to saving a new block, the latter to garbage collection of old block data which is no longer needed.
TrieRefcountChange
Because we remove unused nodes during garbage collection, we need to track
the reference count (rc
) for each node. Another reason is that we can dedup
values. If the same contract is deployed 1000 times, we only store one contract
binary in storage and track its count.
This structure is used to update rc in the database:
- trie_node_or_value_hash - hash of the trie node or value, combined with the shard id to get the DB key,
- trie_node_or_value - serialized trie node or value,
- rc - change of the reference count.
Note that for all reference-counted records, the actual value stored in the DB is the concatenation of trie_node_or_value and rc. The reference count is updated using a custom merge operation, refcount_merge.
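A sketch of how such a merge could work is shown below; the exact byte widths and sign conventions are illustrative assumptions, not a specification of the real refcount_merge:

```rust
/// Sketch of the refcounted encoding: the stored value is the serialized node
/// (or value) followed by the reference count, here as a little-endian i64.
fn encode_refcounted(payload: &[u8], rc: i64) -> Vec<u8> {
    let mut out = payload.to_vec();
    out.extend_from_slice(&rc.to_le_bytes());
    out
}

/// Merging two records for the same key adds the reference counts; when the
/// count drops to zero the record can be deleted entirely.
fn merge_refcounted(existing: &[u8], update: &[u8]) -> Option<Vec<u8>> {
    let (payload, rc1) = split(existing);
    let (_, rc2) = split(update);
    let rc = rc1 + rc2;
    if rc == 0 { None } else { Some(encode_refcounted(payload, rc)) }
}

fn split(record: &[u8]) -> (&[u8], i64) {
    let (payload, rc_bytes) = record.split_at(record.len() - 8);
    (payload, i64::from_le_bytes(rc_bytes.try_into().unwrap()))
}
```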
On-Disk Database Format
We store the database in RocksDB. This document is an attempt to give hints about how to navigate it.
RocksDB
- The column families are defined in DBCol, defined in core/store/src/columns.rs
- The column families are seen on the RocksDB side as per the col_name function defined in core/store/src/db/rocksdb.rs
The Trie (col5)
- The trie is stored in the column family State, number 5
- In this family, each key is of the form ShardUId | CryptoHash where ShardUId: u64 and CryptoHash: [u8; 32]
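As a small illustration (assuming the ShardUId is serialized little-endian, which is a simplification of the real encoding), a State key could be assembled like this:

```rust
/// Illustrative sketch of how a `DBCol::State` key is assembled: the `ShardUId`
/// bytes followed by the 32-byte hash. Byte order is an assumption here.
fn state_db_key(shard_uid: u64, hash: [u8; 32]) -> Vec<u8> {
    let mut key = Vec::with_capacity(8 + 32);
    key.extend_from_slice(&shard_uid.to_le_bytes());
    key.extend_from_slice(&hash);
    key
}
```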
All Historical State Changes (col35)
- The state changes are stored in the column family StateChanges, number 35
- In this family, each key is of the form BlockHash | Column | AdditionalInfo where:
  - BlockHash: [u8; 32] is the block hash for this change
  - Column: u8 is defined near the top of core/primitives/src/trie_key.rs
  - AdditionalInfo depends on Column and can be found in the code for the TrieKey struct, in the same file as Column
Contract Deployments
- Contract deployments happen with Column = 0x01
- AdditionalInfo is the account id for which the contract is being deployed
- The key value contains the contract code alongside other pieces of data. It is possible to extract the contract code by removing everything until the wasm magic number, 0061736D01000000
- As such, it is possible to dump all the contracts that were ever deployed on-chain using this command on an archival node:

```bash
ldb --db=~/.near/data scan --column_family=col35 --hex | \
  grep -E '^0x.{64}01' | \
  sed 's/0061736D01000000/x/' | \
  sed 's/^.*x/0061736D01000000/' | \
  grep -v ' : '
```

(Note that the last grep is required because not every such value appears to contain contract code.) We should implement a feature in state-viewer that'd allow better visualization of this data, but in the meantime this seems to work.
High level overview
When mainnet launched, the neard client stored all the chain's state in a single
RocksDB column DBCol::State
. This column embeds the entire NEAR state
trie directly in the key-value database, using roughly
hash(borsh_encode(trie_node))
as the key to store a trie_node
. This gives a
content-addressed storage system that can easily self-verify.
Flat storage is a bit like a database index for the values stored in the trie. It stores a copy of the data in a more accessible way to speed up the lookup time.
Drastically oversimplified, flat storage uses a hashmap instead of a trie. This
reduces the lookup time from O(d)
to O(1)
where d
is the tree depth of the
value.
But the devil is in the detail. Below is a high-level summary of implementation challenges, which are the reasons why the above description is an oversimplification.
Time dimension of the state
The blockchain state is modified with every chunk. Neard must be able to travel back in time and resolve key lookups for older versions, as well as travel to alternative universes to resolve requests for chunks that belong to a different fork.
Using the full trie embedding in RocksDB, this is almost trivial. We only need to know the state root for a chunk and we can start traversing from the root to any key. As long as we do not delete (garbage collect) unused trie nodes, the data remains available. The overhead is minimal, too, since all the trie nodes that have not been changed are shared between tries.
Enter flat storage territory: A simple hashmap only stores a snapshot. When we write a new value to the same key, the old value is overwritten and no longer accessible. A solution to access different versions on each shard is required.
The minimal solution only tracks the final head and all forks building on top. A full implementation would also allow replaying older chunks and doing view calls. But even this minimal solution pulls in all complicated details regarding consensus and multiplies them with all the problems listed below.
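One way to picture this minimal solution is an on-disk snapshot at the flat head plus an in-memory delta per block above the head; the sketch below is an illustrative model, not the real FlatStorage API:

```rust
use std::collections::HashMap;

/// Illustrative model of "flat head + per-block deltas". `BlockHash` and the
/// delta layout are simplified stand-ins.
type BlockHash = [u8; 32];
type Key = Vec<u8>;
type Value = Vec<u8>;

struct FlatStateSketch {
    /// Snapshot at the flat head (on disk in a real implementation).
    head_values: HashMap<Key, Value>,
    /// For each block after the head: its parent and the changes it made.
    /// `None` in a delta means the key was deleted in that block.
    deltas: HashMap<BlockHash, (BlockHash, HashMap<Key, Option<Value>>)>,
    head: BlockHash,
}

impl FlatStateSketch {
    /// Look a key up as of `block`, walking deltas back towards the flat head,
    /// so forks and non-final blocks resolve to their own view of the state.
    fn get(&self, mut block: BlockHash, key: &[u8]) -> Option<Value> {
        while block != self.head {
            let (parent, changes) = self.deltas.get(&block)?;
            if let Some(change) = changes.get(key) {
                return change.clone(); // most recent change wins
            }
            block = *parent;
        }
        self.head_values.get(key).cloned()
    }
}
```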
Fear of data corruption
State and FlatState keep a copy of the same data and need to be in sync at all times. This is a source for errors which any implementation needs to test properly. Ideally, there are also tools to quickly compare the two and verify which is correct.
Note: The trie storage is verifiable by construction of the hashes. The flat state is not directly verifiable. But it is possible to reconstruct the full trie just from the leaf values and use that for verification.
Gas accounting requires a protocol change
The trie path we take in a trie lookup affects the gas costs, and hence the balance subtracted from the user. In other words, the lookup algorithm leaks into the protocol. Hence, we cannot switch between different ways of looking up state without a protocol change.
This makes things a whole lot more complicated. We have to do the data migration and prepare flat storage, while still using trie storage. Keeping flat storage up to date at this point is pure overhead. And then, at the epoch switch where the new protocol version begins, we have to start using the new storage in all clients simultaneously. Anyone that has not finished migration, yet, will fail to produce a chunk due to invalid gas results.
In an ideal future, we want to make gas costs independent of the position in the trie and then this would no longer be a problem.
Performance reality check
Going from O(d)
to O(1)
sounds great. But let's look at actual numbers.
A flat state lookup requires exactly 2 database requests. One for finding the
ValueRef
and one for dereferencing the value. (Dereferencing only happens if
the value is present and if enough gas was paid to cover reading the potentially
large value, otherwise we short-circuit.)
A trie lookup requires exactly d
trie node lookups where d
is the depth in
the trie, plus one more for dereferencing the final value.
Clearly, d + 1
is worse than 1 + 1
, right? Well, as it turns out, we can
cache the trie nodes with surprisingly high effectiveness. In mainnet
workloads (which were often optimized to work well with the trie shape) we
observe a 99% cache hit rate in many cases.
Combine that with the fact that a typical value for d is somewhere between 10 and 20. Then we may conclude that, in expectation, a trie lookup (d * 0.01 + 1 requests) requires fewer DB requests than a flat state lookup (1 + 1 requests).
In practice, however, flat state still has an edge over accessing trie storage directly. And that is due to two reasons.
- DB keys are in better order, leading to more cache hits in RocksDB's block cache.
- The flat state column is much smaller than the state column.
We observed a speedup of 100x and beyond for reading a single value from
DBCol::FlatState
compared to reading it from DBCol::State
. So that is
clearly a win. But one that is due to the data layout inside RocksDB, not due to
algorithmic improvements.
Updating the Merkle tree
When we update a value in the blockchain state, all ancestor nodes change their value due to the recursive nature of Merkle trees.
Updating flat state is easy enough but then we would not know the new state root. Annoyingly, it is rather important to have the state root in due time, as it is included in the chunk header.
To update the state root, we need to read all nodes between the root and all changed values. At which point, we are doing all the same reads we were trying to avoid in the first place.
This makes flat state for writes algorithmically useless, as we still have to do
O(d)
requests, no matter what. But there are potential benefits from data
locality, as flat state stores values sorted by a trie key instead of a
perfectly random hash.
For that, we would need a fast index not only for the state value but also for all intermediate trie nodes. At this point, we would be back to a full embedding of the trie in a key-value store / hashmap. Just changing the keys we use for the database.
Note that we can enjoy the benefits of read-only flat storage to improve read heavy contracts. But a read-modify-write pattern using this hybrid solution is strictly worse than the original implementation without flat storage.
Implementation status and future ideas
As of March 2023, we have implemented a read-only flat storage that only works for the frontier of non-final blocks and the final block itself. Archival calls and view calls still use the trie directly.
Things we have solved
We are fairly confident we have solved the time-travel issues, the data migration / protocol upgrade problems, and got a decent handle on avoiding accidental data corruption.
This improves the worst case (no cache hits) for reads dramatically. And it paves the way for further improvements, as we have now a good understanding of all these seemingly-simple but actually-really-hard problems.
Flat state for writes
How to use flat storage for writes is not fully designed, yet, but we have some rough ideas on how to do it. But we don't know the performance we should expect. Algorithmically, it can only get worse but the speedup on RocksDB we found with the read-only flat storage is promising. But one has to wonder if there are not also simpler ways to achieve better data locality in RocksDB.
Inlining values
We hope to get a jump in performance by avoiding dereferencing ValueRef
after
the flat state lookup. At least for small values we could store the value
itself also in flat state. (to be defined what small means)
This seems very promising because the value dereferencing still happens in the
much slower DBCol::State
. If we assume small values are the common case, we
would thus expect huge performance improvements for the average case.
It is not clear yet, if we can also optimize large lookups somehow. If not, we could at least charge them at a higher rate than we do today, to reflect the real DB cost better.
Code guide
Here we describe structures used for flat storage implementation.
FlatStorage
This is the main structure which owns information about ValueRefs for all keys from some fixed shard for some set of blocks. It is shared by multiple threads, so it is guarded by RwLock:
- Chain thread, because it sends queries like:
- "new block B was processed by chain" - supported by add_block
- "flat storage head can be moved forward to block B" - supported by update_flat_head
- Thread that applies a chunk, because it sends read queries "what is the ValueRef for key for block B"
- View client (not fully decided)
Requires ChainAccessForFlatStorage on creation because it needs to know the tree of blocks after the flat storage head, to support getting queries correctly.
FlatStorageManager
It holds all FlatStorages which NightshadeRuntime knows about and:
- provides views for flat storage for some fixed block - supported by new_flat_state_for_shard
- sets initial flat storage state for genesis block - set_flat_storage_for_genesis
- adds/removes/gets flat storage if we started/stopped tracking a shard or need to create a view - create_flat_storage_for_shard, etc.
FlatStorageChunkView
Interface for getting ValueRefs from flat storage for some shard for some fixed block, supported by get_ref method.
Other notes
Chain dependency
If storage is fully empty, then we need to create flat storage from scratch. FlatStorage is stored inside NightshadeRuntime, and it itself is stored inside Chain, so we need to create them in the same order and dependency hierarchy should be the same. But at the same time, we parse genesis file only during Chain creation. That's why FlatStorageManager has set_flat_storage_for_genesis method which is called during Chain creation.
Regular block processing vs. catchups
For these two usecases we have two different flows: the first one is handled in Chain.postprocess_block, the second one in Chain.block_catch_up_postprocess. Both, when the results of applying a chunk are ready, should call Chain.process_apply_chunk_result → RuntimeAdapter.get_flat_storage_for_shard → FlatStorage.add_block, and when the results of applying ALL processed/postprocessed chunks are ready, should call RuntimeAdapter.get_flat_storage_for_shard → FlatStorage.update_flat_head.
(because applying some chunk may result in error and we may need to exit there without updating flat head - ?)
This document describes how our network works. At this moment, it is known to be somewhat outdated, as we are in the process of refactoring the network protocol somewhat significantly.
1. Overview
Near Protocol uses its own implementation of a custom peer-to-peer network. Peers who join the network are represented by nodes and connections between them by edges.
The purpose of this document is to describe the inner workings of the near-network
package; and to be used as reference by future engineers to understand the network
code without any prior knowledge.
2. Code structure
near-network runs on top of the actor framework called Actix. The code structure is split between 4 actors: PeerManagerActor, PeerActor, RoutingTableActor, EdgeValidatorActor.
2.1 EdgeValidatorActor (currently called EdgeVerifierActor in the code)
EdgeValidatorActor runs on a separate thread. The purpose of this actor is to validate edges, where each edge represents a connection between two peers, and it's signed with a cryptographic signature of both parties. The process of edge validation involves verifying cryptographic signatures, which can be quite expensive, and therefore was moved to another thread.
Responsibilities:
- Validating edges by checking whether the cryptographic signatures match.
2.2 RoutingTableActor
RoutingTableActor maintains a view of the P2P network represented by a set of nodes and edges.
In case a message needs to be sent between two nodes, that can be done directly through a TCP connection. Otherwise, RoutingTableActor is responsible for finding the best path between them.
Responsibilities:
- Keeps the set of all edges of the P2P network, called the routing table.
- Connects to EdgeValidatorActor, and asks for edges to be validated when needed.
- Has logic related to exchanging edges between peers.
2.3 PeerActor
Whenever a new connection gets accepted, an instance of PeerActor
gets
created. Each PeerActor
keeps a physical TCP connection
to exactly one
peer.
Responsibilities:
- Maintaining physical connection.
- Reading messages from peers, decoding them, and then forwarding them to the right place.
- Encoding messages, sending them to peers on physical layer.
- Routing messages between
PeerManagerActor
and other peers.
2.4 PeerManagerActor
PeerManagerActor
is the main actor of near-network
crate. It acts as a
bridge connecting to the world outside, the other peers, and ClientActor
and
ClientViewActor
, which handle processing any operations on the chain.
PeerManagerActor
maintains information about p2p network via RoutingTableActor
,
and indirectly, through PeerActor
, connections to all nodes on the network.
All messages going to other nodes, or coming from other nodes will be routed
through this Actor
. PeerManagerActor
is responsible for accepting incoming
connections from the outside world and creating PeerActors
to manage them.
Responsibilities:
- Accepting new connections.
- Maintaining the list of PeerActors, creating and deleting them.
- Routing information about new edges between PeerActors and RoutingTableManager.
- Routing messages between ViewClient, ViewClientActor and PeerActors, and consequently other peers.
- Maintaining the RouteBack structure, which has information on how to send replies to messages.
3. Code flow - initialization
First, the PeerManagerActor
actor gets started. PeerManagerActor
opens the
TCP server, which listens to incoming connections. It starts the
RoutingTableActor
, which then starts the EdgeValidatorActor
. When
an incoming connection gets accepted, it starts a new PeerActor
on its own thread.
4. NetworkConfig
near-network reads configuration from NetworkConfig, which is a part of the client config.
Here is a list of features read from the config:
- boot_nodes - list of nodes to connect to on start.
- addr - listening address.
- max_num_peers - by default we connect to up to 40 peers; the current implementation supports up to 128.
5. Connecting to other peers.
Each peer maintains a list of known peers. They are stored in the database. If
the database is empty, the list of peers, called boot nodes, will be read from
the boot_nodes
option in the config. The peer to connect to is chosen at
random from a list of known nodes by the PeerManagerActor::sample_random_peer
method.
6. Edges & network - in code representation
P2P network
is represented by a list of peers
, where each peer
is
represented by a structure PeerId
, which is defined by the peer
's public key
PublicKey
, and a list of edges, where each edge is represented by the
structure Edge
.
Both are defined below.
6.1 PublicKey
We use two types of public keys:
- a 256 bit ED25519 public key.
- a 512 bit Secp256K1 public key.
Public keys are defined in the PublicKey enum, which consists of those two variants.
```rust
pub struct ED25519PublicKey(pub [u8; 32]);

pub struct Secp256K1PublicKey([u8; 64]);

pub enum PublicKey {
    ED25519(ED25519PublicKey),
    SECP256K1(Secp256K1PublicKey),
}
```
6.2 PeerId
Each peer
is uniquely defined by its PublicKey
, and represented by PeerId
struct.
```rust
pub struct PeerId(PublicKey);
```
6.3 Edge
Each edge is represented by the Edge structure. It contains the following:
- a pair of nodes represented by their public keys.
- nonce - a unique number representing the state of an edge, starting with 1. Odd numbers represent an active edge. Even numbers represent an edge in which one of the nodes has confirmed that the edge is removed.
- Signatures from both peers for active edges.
- Signature from one peer in case an edge got removed.
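A tiny sketch of the nonce convention described above (illustrative only):

```rust
/// Odd nonces mean the edge is active; even nonces mean that one of the nodes
/// has confirmed the removal of the edge.
fn edge_is_active(nonce: u64) -> bool {
    nonce % 2 == 1
}
```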
6.4 Graph representation
RoutingTableActor is responsible for storing and maintaining the set of all edges. They are kept in the edges_info data structure of the type HashSet<Edge>.
```rust
pub struct RoutingTableActor {
    /// Collection of edges representing P2P network.
    /// It's indexed by `Edge::key()` key and can be searched through by calling `get()` function
    /// with `(PeerId, PeerId)` as argument.
    pub edges_info: HashSet<Edge>,
    /// ...
}
```
7. Code flow - connecting to a peer - handshake
When PeerManagerActor
starts, it starts to listen to a specific port.
7.1 - Step 1 - monitor_peers_trigger
runs
PeerManager
checks if we need to connect to another peer by running the
PeerManager::is_outbound_bootstrap_needed
method. If true
we will try to
connect to a new node. Let's call the current node, node A
.
7.2 - Step 2 - choosing the node to connect to
Method PeerManager::sample_random_peer
will be called, and it returns node B
that we will try to connect to.
7.3 - Step 3 - OutboundTcpConnect
message
PeerManagerActor
will send itself a message OutboundTcpConnect
in order
to connect to node B
.
```rust
pub struct OutboundTcpConnect {
    /// Peer information of the outbound connection
    pub target_peer_info: PeerInfo,
}
```
7.4 - Step 4 - OutboundTcpConnect
message
On receiving the message the handle_msg_outbound_tcp_connect
method will be
called, which calls TcpStream::connect
to create a new connection.
7.5 - Step 5 - Connection gets established
Once the connection with the outgoing peer gets established, the try_connect_peer method will be called, and a new PeerActor will be created and started. Once the PeerActor starts, it will send a Handshake message to the outgoing node B over the TCP connection.
This message contains protocol_version
, node A
's metadata, as well as all
information necessary to create an Edge
.
```rust
pub struct Handshake {
    /// Current protocol version.
    pub(crate) protocol_version: u32,
    /// Oldest supported protocol version.
    pub(crate) oldest_supported_version: u32,
    /// Sender's peer id.
    pub(crate) sender_peer_id: PeerId,
    /// Receiver's peer id.
    pub(crate) target_peer_id: PeerId,
    /// Sender's listening addr.
    pub(crate) sender_listen_port: Option<u16>,
    /// Peer's chain information.
    pub(crate) sender_chain_info: PeerChainInfoV2,
    /// Represents new `edge`. Contains only `nonce` and `Signature` from the sender.
    pub(crate) partial_edge_info: PartialEdgeInfo,
}
```
7.6 - Step 6 - Handshake arrives at node B
Node B receives a Handshake message. Then it performs various validation checks. That includes:
- Checking the signature of the edge from the other peer.
- Whether the nonce of the edge sent matches.
- Checking whether the protocol version is above the minimum OLDEST_BACKWARD_COMPATIBLE_PROTOCOL_VERSION.
- The other node's view of the chain state.
If everything is successful, PeerActor
will send a RegisterPeer
message to
PeerManagerActor
. This message contains everything needed to add PeerActor
to the list of active connections in PeerManagerActor
.
Otherwise, PeerActor
will be stopped immediately or after some timeout.
```rust
pub struct RegisterPeer {
    pub(crate) actor: Addr<PeerActor>,
    pub(crate) peer_info: PeerInfo,
    pub(crate) peer_type: PeerType,
    pub(crate) chain_info: PeerChainInfoV2,
    // Edge information from this node.
    // If this is None it implies we are outbound connection, so we need to create our
    // EdgeInfo part and send it to the other peer.
    pub(crate) this_edge_info: Option<EdgeInfo>,
    // Edge information from other node.
    pub(crate) other_edge_info: EdgeInfo,
    // Protocol version of new peer. May be higher than ours.
    pub(crate) peer_protocol_version: ProtocolVersion,
}
```
7.7 - Step 7 - PeerManagerActor receives the RegisterPeer message - node B
In the handle_msg_consolidate method, the RegisterPeer message will be validated. If successful, the register_peer method will be called, which adds the PeerActor to the list of connected peers.
Each connected peer is represented in PeerManagerActor in the ActivePeer data structure.
```rust
/// Contains information relevant to an active peer.
struct ActivePeer { // will be renamed to `ConnectedPeer` see #5428
    addr: Addr<PeerActor>,
    full_peer_info: FullPeerInfo,
    /// Number of bytes we've received from the peer.
    received_bytes_per_sec: u64,
    /// Number of bytes we've sent to the peer.
    sent_bytes_per_sec: u64,
    /// Last time requested peers.
    last_time_peer_requested: Instant,
    /// Last time we received a message from this peer.
    last_time_received_message: Instant,
    /// Time where the connection was established.
    connection_established_time: Instant,
    /// Who started connection. Inbound (other) or Outbound (us).
    peer_type: PeerType,
}
```
7.8 - Step 8 - Exchange routing table part 1 - node B
At the end of the register_peer method, node B will perform a RoutingTableSync sync, sending the list of known edges representing a full graph and a list of known AnnounceAccount. Those will be covered later, in their dedicated sections (to be added).
message: PeerMessage::RoutingTableSync(SyncData::edge(new_edge)),
```rust
/// Contains metadata used for routing messages to particular `PeerId` or `AccountId`.
pub struct RoutingTableSync { // also known as `SyncData` (#5489)
    /// List of known edges from `RoutingTableActor::edges_info`.
    pub(crate) edges: Vec<Edge>,
    /// List of known `account_id` to `PeerId` mappings.
    /// Useful for `send_message_to_account` method, to route message to particular account.
    pub(crate) accounts: Vec<AnnounceAccount>,
}
```
7.9 - Step 9 - Exchange routing table part 2 - node A
Upon receiving a RoutingTableSync message, node A will reply with its own RoutingTableSync message.
7.10 - Step 10 - Exchange routing table part 2 - node B
Node B
will get the message from A
and update its routing table.
8. Adding new edges to routing tables
This section covers the process of adding new edges, received from another node, to the routing table. It consists of several steps covered below.
8.1 Step 1
PeerManagerActor
receives RoutingTableSync
message containing list of new
edges
to add. RoutingTableSync
contains list of edges of the P2P network.
This message is then forwarded to RoutingTableActor
.
8.2 Step 2
PeerManagerActor forwards those edges to RoutingTableActor inside of the ValidateEdgeList struct.
ValidateEdgeList contains:
- the list of edges to verify.
- the peer who sent us the edges.
8.3 Step 3
RoutingTableActor gets the ValidateEdgeList message. It filters out edges that have already been verified - those that are already in RoutingTableActor::edges_info.
Then, it updates edge_verifier_requests_in_progress to mark that edge verifications are in progress, and that edges shouldn't be pruned from the Routing Table (see section (to be added)).
Then, after removing already validated edges, the modified message is forwarded to EdgeValidatorActor.
8.4 Step 4
EdgeValidatorActor
goes through the list of all edges. It checks whether all edges
are valid (their cryptographic signatures match, etc.).
If any edge is not valid, the peer will be banned.
Edges that are validated are written to a concurrent queue
ValidateEdgeList::sender
. This queue is used to transfer edges from
EdgeValidatorActor
back to PeerManagerActor
.
8.5 Step 5
broadcast_validated_edges_trigger
runs, and gets validated edges from
EdgeVerifierActor
.
Every new edge will be broadcast to all connected peers.
And then, all validated edges received from EdgeVerifierActor
will be sent
again to RoutingTableActor
inside AddVerifiedEdges
.
8.6 Step 6
When RoutingTableActor receives RoutingTableMessages::AddVerifiedEdges, the add_verified_edges_to_routing_table method will be called. It will add the edges to the RoutingTableActor::edges_info struct, and mark the routing table as needing recalculation (see RoutingTableActor::needs_routing_table_recalculation).
9. Routing table computation
Routing table computation does a few things:
- For each peer B, it calculates the set of peers |C_b|, such that each peer in the set is on a shortest path to B.
- It removes unreachable edges from memory and stores them to disk.
- The distance is calculated as the minimum number of nodes on the path from a given node A to each other node on the network. That is, A has a distance of 0 to itself. Its neighbors will have a distance of 1. The neighbors of their neighbors will have a distance of 2, etc.
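The sketch below shows one way such a computation can be done with a plain BFS; it is illustrative only (PeerIds are reduced to integers) and is not the actual RoutingTableActor implementation:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// For every reachable peer, compute the set of our direct neighbors that lie on
/// some shortest path to it (the shape of `peer_forwarding`). Illustrative only.
fn recalculate_routing_table(
    me: u64,                     // PeerIds simplified to integers
    edges: &HashSet<(u64, u64)>, // active edges, treated as undirected
) -> HashMap<u64, Vec<u64>> {
    // Build adjacency lists from the edge set.
    let mut adj: HashMap<u64, Vec<u64>> = HashMap::new();
    for &(a, b) in edges {
        adj.entry(a).or_default().push(b);
        adj.entry(b).or_default().push(a);
    }

    let mut distance: HashMap<u64, u32> = HashMap::from([(me, 0)]);
    let mut first_hops: HashMap<u64, Vec<u64>> = HashMap::new();
    let mut queue = VecDeque::from([me]);

    while let Some(peer) = queue.pop_front() {
        let d = distance[&peer];
        for &next in adj.get(&peer).into_iter().flatten() {
            if !distance.contains_key(&next) {
                distance.insert(next, d + 1);
                // Record which of our direct neighbors this shortest path uses.
                let hops = if peer == me { vec![next] } else { first_hops[&peer].clone() };
                first_hops.insert(next, hops);
                queue.push_back(next);
            } else if distance[&next] == d + 1 && peer != me {
                // Another shortest path: merge in its first hops as well.
                let extra = first_hops[&peer].clone();
                let entry = first_hops.entry(next).or_default();
                for hop in extra {
                    if !entry.contains(&hop) {
                        entry.push(hop);
                    }
                }
            }
        }
    }
    first_hops
}
```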
9.1 Step 1
PeerManagerActor runs an update_routing_table_trigger every UPDATE_ROUTING_TABLE_INTERVAL seconds.
A RoutingTableMessages::RoutingTableUpdate message is sent to RoutingTableActor to request a routing table re-computation.
9.2 Step 2
RoutingTableActor receives the message, and then:
- calls the recalculate_routing_table method, which computes RoutingTableActor::peer_forwarding: HashMap<PeerId, Vec<PeerId>>. For each PeerId on the network, this gives a list of connected peers which are on the shortest path to the destination. It marks reachable peers in the peer_last_time_reachable struct.
- calls prune_edges, which removes from memory all the edges that were not reachable for at least 1 hour, based on the peer_last_time_reachable data structure. Those edges are then stored to disk.
9.3 Step 3
RoutingTableActor sends a RoutingTableUpdateResponse message back to PeerManagerActor.
PeerManagerActor keeps a local copy of edges_info, called local_edges_info, containing only the edges adjacent to the current node.
RoutingTableUpdateResponse contains:
- a list of local edges which PeerManagerActor should remove.
- peer_forwarding, which represents how to route messages in the P2P network.
- peers_to_ban, which represents a list of peers to ban for sending us edges which failed validation in EdgeVerifierActor.
9.4 Step 4
PeerManagerActor receives RoutingTableUpdateResponse and then:
- updates the local copy of peer_forwarding, used for routing messages.
- removes local_edges_to_remove from local_edges_info.
- bans peers who sent us invalid edges.
10. Message transportation layers.
This section describes different protocols of sending messages currently used in
Near
.
10.1 Messages between Actors.
Near is built on the Actix actor framework. Usually each actor runs on its own dedicated thread. Some, like PeerActor, have one thread per instance. Only messages implementing actix::Message can be sent between threads. Each actor has its own queue; processing of messages happens asynchronously.
We should not leak implementation details into the spec.
Actix messages can be found by looking for impl actix::Message.
10.2 Messages sent through TCP
Near is using borsh
serialization to exchange messages between nodes (See
borsh.io). We should be careful when making changes to
them. We have to maintain backward compatibility. Only messages implementing
BorshSerialize
, BorshDeserialize
can be sent. We also use borsh
for
database storage.
10.3 Messages sent/received through chain/jsonrpc
Near runs a json REST server
. (See actix_web::HttpServer
). All messages sent
and received must implement serde::Serialize
and serde::Deserialize
.
11. Code flow - routing a message
This is an example of a message that is sent between nodes: RawRoutedMessage.
Each of these messages has a target - that is either the account_id or the peer_id or a hash (which seems to be used only for route back...). If the target is an account - it will be converted using routing_table.account_owner to the peer.
Upon receiving the message, the PeerManagerActor will sign it and convert it into a RoutedMessage (which also has things like TTL etc.).
Then it will use the routing_table
, to find the route to the target peer (add
route_back
if needed) and then send the message over the network as
PeerMessage::Routed
. Details about routing table computations are covered in
section 8.
When a Peer receives this message (as PeerMessage::Routed), it will pass it to the PeerManager (as RoutedMessageFrom), which would then check if the message is for the current PeerActor (if yes, it would pass it to the client), and if not - it would pass it along the network.
All these messages are handled by receive_client_message in Peer (NetworkClientMessages) - and transferred to ClientActor in chain/client/src/client_actor.rs.
NetworkRequests
to PeerManager
actor trigger the RawRoutedMessage
for
messages that are meant to be sent to another peer
.
lib.rs (ShardsManager) has a network_adapter - coming from the client's network_adapter, which comes from ClientActor, which comes from the start_client call, which comes from start_with_config (that creates PeerManagerActor - which is passed as the target to network_recipient).
12. Database
12.1 Storage of deleted edges
Every time a group of peers becomes unreachable at the same time, we store the edges belonging to them in components. We remove all of those edges from memory, and save them to the database. If any of them were to become reachable again, we would re-add them. This is useful in case there is a network split, to recover edges if needed.
Each component is assigned a unique nonce, where the first one is assigned nonce 0. Each new component gets assigned a consecutive integer.
To store components, we have the following columns in the DB.
- DBCol::LastComponentNonce stores component_nonce: u64, which is the last used nonce.
- DBCol::ComponentEdges is a mapping from component_nonce to a list of edges.
- DBCol::PeerComponent is a mapping from peer_id to the last component nonce it belongs to.
12.2 Storage of account_id to peer_id mapping
ColAccountAnouncements -> stores a mapping from account_id to a tuple (account_id, peer_id, epoch_id, signature).
Gas Cost Parameters
Gas in NEAR Protocol solves two problems.
- To avoid spam, validator nodes only perform work if a user's tokens are burned. Tokens are automatically converted to gas using the current gas price.
- To synchronize shards, they must all produce chunks following a strict schedule of 1 second execution time. Gas is used to measure how heavy the workload of a transaction is, so that the number of transactions that fit in a block can be deterministically computed by all nodes.
In other words, each transaction costs a fixed amount of gas. This gas cost determines how much a user has to pay and how much time nearcore has to execute the transaction.
What happens if nearcore executes a transaction too slowly? Chunk production for the shard gets delayed, which delays block production for the entire blockchain, increasing latency and reducing throughput for everybody. If the chunk is really late, the block producer will decide not to include the chunk at all and insert an empty chunk. The chunk may be included in the next block.
By now, you probably wonder how we can know the time it takes to execute a transaction, given that validators use hardware of their choice. Getting these timings right is indeed a difficult problem. Or flipping the problem, assuming the timings are already known, then we must implement nearcore such that it guarantees to operate within the given time constraints. How we tackle this is the topic of this chapter.
If you want to learn more about Gas from a user perspective, Gas basic concepts, Gas advanced concepts, and the runtime fee specification are good places to dig deeper.
Hardware and Timing Assumptions
For timing to make sense at all, we must first define hardware constraints. The official hardware requirements for a validator are published on near-nodes.io/validator/hardware-validator. They may change over time but the main principle is that a moderately configured, cloud-hosted virtual machine suffices.
For our gas computation, we assume the minimum required hardware. Then we define 10^15 gas to be executed in at most 1s. We commonly use 1 Tgas (= 10^12 gas) in conversation, which corresponds to 1ms execution time.
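A tiny helper making that conversion explicit (the constant follows directly from the 10^15 gas per second definition above):

```rust
/// 10^15 gas corresponds to at most 1 second on the reference hardware,
/// i.e. 1 Tgas (10^12 gas) corresponds to roughly 1 ms.
const GAS_PER_SECOND: u64 = 1_000_000_000_000_000;

fn gas_to_expected_millis(gas: u64) -> f64 {
    gas as f64 / (GAS_PER_SECOND as f64 / 1000.0)
}

fn main() {
    // 1 Tgas should map to roughly 1 ms.
    assert!((gas_to_expected_millis(1_000_000_000_000) - 1.0).abs() < 1e-9);
}
```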
Obviously, this definition means that a validator running more powerful hardware will execute the transactions faster. That is perfectly okay, as far as the protocol is concerned we just need to make sure the chunk is available in time. If it is ready in even less time, no problem.
Less obviously, this means that even a minimally configured validator is often idle. Why is that? Well, the hardware must be prepared to execute chunks that are always full. But that is rarely the case, as the gas price increases exponentially when chunks are full, which would eventually cause traffic to back off.
Furthermore, the hardware has to be ready for transactions of all types, including transactions chosen by a malicious actor selecting only the most complex transactions. Those transactions can also be unbalanced in what bottlenecks they hit. For example, a chunk can be filled with transactions that fully utilize the CPU's floating point units. Or they could be using all the available disk IO bandwidth.
Because the minimum required hardware needs to meet the timing requirements for any of those scenarios, the typical, more balanced case is usually computed faster than the gas rule states.
Transaction Gas Cost Model
A transaction is essentially just a list of actions to be executed on the same
account. For example it could be CreateAccount
combined with
FunctionCall("hello_world")
.
The reference for available actions shows the conclusive list of possible actions. The protocol defines fixed fees for each of them. More details on actions fees follow below.
Fixed fees are an important design decision. It means that a given action will always cost the exact same amount of gas, no matter on what hardware it executes. But the content of the action can impact the cost, for example a DeployContract action's cost scales with the size of the contract code.
So, to be more precise, the protocol defines fixed gas cost parameters for each action, together with a formula to compute the gas cost for the action. All actions today either use a single fixed gas cost or they use a base cost and a linear scaling parameter. With one important exception, FunctionCall, which shall be discussed further below.
There is an entire section on Parameter Definitions that explains how to find the source of truth for parameter values in the nearcore repository, how they can be referenced in code, and what steps are necessary to add a new parameter.
Let us dwell a bit more on the linear scaling factors. The fact that contract deployment cost, which includes code compilation, scales linearly limits the compiler to use only algorithms of linear complexity. Either that, or the parameters must be set to match the 1ms = 1Tgas rule at the largest possible contract size. Today, we limit ourselves to linear-time algorithms in the compiler.
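To make such a base-plus-linear cost formula concrete, here is a hedged sketch; the struct, its fields, and the numbers are illustrative and do not reflect nearcore's actual code or parameter values:

```rust
// Hedged sketch: a base + per-byte action cost, as used conceptually for
// DeployContract. Names and values are illustrative only.
struct LinearFee {
    base: u64,     // gas charged regardless of input size
    per_byte: u64, // gas charged per byte of contract code
}

impl LinearFee {
    fn cost(&self, num_bytes: u64) -> u64 {
        self.base + self.per_byte * num_bytes
    }
}

fn main() {
    let deploy = LinearFee { base: 200_000_000_000, per_byte: 6_000_000 };
    // Cost of deploying a hypothetical 100 KiB contract.
    println!("{} gas", deploy.cost(100 * 1024));
}
```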
Likewise, an action that has no scaling parameters must only use constant time to execute. Taking the CreateAccount action as an example, with a cost of 0.1 Tgas, it has to execute within 0.1ms. Technically, the execution time depends ever so slightly on the account name length. But there is a fairly low upper limit on that length and it makes sense to absorb all the cost in the constant base cost.
This concept of picking parameters according to algorithmic complexity is key. If you understand this, you know how to think about gas as a nearcore developer. This should be enough background to understand what the estimator does.
The runtime parameter estimator is a separate binary within the nearcore repository. It contains benchmarking-like code used to validate existing parameter values against the 1ms = 1 Tgas rule. When implementing new features, code should be added there to estimate the safe values of the new parameters. This section is for you if you are adding new features such as a new pre-compiled method or other host functions.
Next up are more details on the specific costs that occur when executing NEAR transactions, which help to understand existing parameters and how they are organized.
Action Costs
Actions are executed in two steps. First, an action is verified and inserted into an action receipt, which is sent to the receiver of the action. The send fee is paid for this. It is charged either in fn process_transaction(..) if the action is part of a fresh transaction, or inside logic.rs through fn pay_action_base(..) if the action is generated by a function call. The send fee is meant to cover the cost to validate an action and transmit it over the network.
The second step is action execution. It is charged in fn apply_action(..). The execution cost has to cover everything required to apply the action to the blockchain's state.
These two steps are done on the same shard for local receipts. Local receipts are defined as those where the sender account is also the receiver, abbreviated as sir, which stands for "sender is receiver".
For remote receipts, which is any receipt where the sender and receiver accounts are different, we charge a different fee since sending between shards is extra work. Notably, we charge that extra work even if the accounts are on the same shard. In terms of gas costs, each account is conceptually its own shard. This makes dynamic resharding possible without user-observable impact.
When the send step is performed, the minimum required gas to start execution of that action is known. Thus, if the receipt does not have enough gas, it can be aborted instead of forwarding it. Here we have to introduce the concept of used gas.
gas_used is different from gas_burnt. The former includes the gas that needs to be reserved for the execution step, whereas the latter only includes the gas that has been burnt in the current chunk. The difference between the two is sometimes also called prepaid gas, as this amount of gas is paid for during the send step and is available in the execution step for free.
If execution fails, the prepaid cost that has not been burned will be refunded. But this is not the reason why it must burn on the receiver shard instead of the sender shard. The reason is that we want to properly compute the gas limits on the chunk that does the execution work.
In conclusion, each action parameter is split into three costs: send_sir, send_not_sir, and execution. Local receipts charge the first and last parameters, remote receipts charge the second and third. They should be estimated, defined, and charged separately. But the reality is that today almost all actions are estimated as a whole and the parameters are split 50/50 between send and execution cost, without discrimination on local vs remote receipts, i.e. the send_sir cost is the same as send_not_sir.
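The sketch below shows how these three parameters could be combined for local and remote receipts; the types are simplified stand-ins that mirror the description above, not nearcore's exact definitions:

```rust
// Hedged sketch of the three per-action cost parameters.
// `sir` = "sender is receiver", i.e. a local receipt.
struct ActionFee {
    send_sir: u64,
    send_not_sir: u64,
    execution: u64,
}

impl ActionFee {
    /// Gas burnt on the sender's side when the action is sent.
    fn send_fee(&self, sir: bool) -> u64 {
        if sir { self.send_sir } else { self.send_not_sir }
    }

    /// Gas burnt on the receiver's side when the action is applied.
    fn exec_fee(&self) -> u64 {
        self.execution
    }
}

fn main() {
    // Illustrative values following the 50/50 split mentioned above.
    let fee = ActionFee { send_sir: 50, send_not_sir: 50, execution: 50 };
    let local = fee.send_fee(true) + fee.exec_fee();
    let remote = fee.send_fee(false) + fee.exec_fee();
    assert_eq!(local, remote); // equal today because send_sir == send_not_sir
}
```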
The Gas Profile section goes into more details on how gas costs of a transaction are tracked in nearcore.
Dynamic Function Call Costs
Costs that occur while executing a function call on a deployed WASM app (a.k.a. smart contract) are charged only at the receiver. Thus, they have only one value to define them, in contrast to action costs.
The most fundamental dynamic gas cost is wasm_regular_op_cost. It is multiplied by the exact number of WASM operations executed. You can read about Gas Instrumentation if you are curious how we count WASM ops.
Currently, all operations are charged the same, although it could be more efficient to charge less for opcodes like i32.add compared to f64.sqrt.
The remaining dynamic costs are for work done during host function calls. Each host function charges a base cost. Either the general wasm_base cost, or a specific cost such as wasm_utf8_decoding_base, or sometimes both. New host function calls should define a separate base cost and not charge wasm_base.
Additional host-side costs can be scaled per input byte, such as wasm_sha256_byte, or costs related to moving data between host and guest, or any other cost that is specific to the host function. Each host function must clearly define what its costs are and how they depend on the input.
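As an illustration, charging a host function such as sha256 could look roughly like the sketch below. The wasm_sha256_base name is assumed by analogy with wasm_sha256_byte, and the function, the gas counter, and the values are simplified stand-ins rather than nearcore's API:

```rust
// Hedged sketch: base + per-byte cost for a host function call.
struct WasmCosts {
    wasm_sha256_base: u64, // charged once per call (illustrative value below)
    wasm_sha256_byte: u64, // charged per input byte (illustrative value below)
}

fn charge_sha256(costs: &WasmCosts, gas_left: &mut u64, input_len: u64) -> Result<(), &'static str> {
    let cost = costs.wasm_sha256_base + costs.wasm_sha256_byte * input_len;
    // Charge before doing the actual hashing work.
    *gas_left = gas_left.checked_sub(cost).ok_or("out of gas")?;
    Ok(())
}

fn main() {
    let costs = WasmCosts { wasm_sha256_base: 4_000_000_000, wasm_sha256_byte: 25_000_000 };
    let mut gas_left = 10_000_000_000_000; // 10 Tgas left in this example
    charge_sha256(&costs, &mut gas_left, 1024).unwrap();
    println!("gas left: {gas_left}");
}
```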
Non-gas parameters
Not all runtime parameters are directly related to gas costs. Here is a brief overview.
- Gas economics config: Defines the conversion rate when purchasing gas with NEAR tokens and how gas rewards are split.
- Storage usage config: Costs in tokens, not gas, for storing data on chain.
- Account creation config: Rules for account creation.
- Smart contract limits: Rules for WASM execution.
None of the above define any gas costs directly. But there can be interplay between those parameters and gas costs. For example, the limits on smart contracts change the assumptions for how slow a contract compilation could be, hence they affect the deploy action costs.
Parameter Definitions
Gas parameters are a subset of runtime parameters that are defined in runtime_configs/parameters.yaml. IMPORTANT: This is not the final list of parameters, it contains the base values which can be overwritten per protocol version. For example, 53.yaml changes several parameters starting from version 53. You can see the final list of parameters in runtime_configs/parameters.snap. This file is automatically updated whenever any of the parameters changes. To see all parameter values for a specific version, check out the list of JSON snapshots generated in this directory: parameters/src/snapshots.
Using Parameters in Code
As the introduction on this page already hints at it, parameter values are versioned. In other words, they can change if the protocol version changes. A nearcore binary has to support multiple versions and choose the correct parameter value at runtime.
To make this easy, there is RuntimeConfigStore. It contains a sparse map from protocol versions to complete runtime configurations (BTreeMap&lt;ProtocolVersion, Arc&lt;RuntimeConfig&gt;&gt;). The runtime then uses store.get_config(protocol_version) to access the runtime configuration for a specific version.
It is crucial to always use this runtime config store. Never hard-code parameter values. Never look them up in a different way.
In practice, this usually translates to a &RuntimeConfig argument for any function that depends on parameter values. This config object implicitly defines the protocol version. It should therefore not be cached. It should be read from the store once per chunk and then passed down to all functions that need it.
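A hedged sketch of this usage pattern follows, assuming the names from the text (RuntimeConfigStore, get_config); the stub types stand in for the real ones so the pattern is self-contained:

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

type ProtocolVersion = u32;

// Stub standing in for the real runtime configuration.
#[derive(Default)]
struct RuntimeConfig {/* parameter values live here */}

struct RuntimeConfigStore {
    // Sparse map: only versions where something changed have an entry.
    store: BTreeMap<ProtocolVersion, Arc<RuntimeConfig>>,
}

impl RuntimeConfigStore {
    /// Return the config for the latest version <= `protocol_version`.
    fn get_config(&self, protocol_version: ProtocolVersion) -> &Arc<RuntimeConfig> {
        self.store
            .range(..=protocol_version)
            .next_back()
            .map(|(_, config)| config)
            .expect("no config for this protocol version")
    }
}

// Read the config once per chunk and pass it down by reference.
fn apply_chunk(store: &RuntimeConfigStore, protocol_version: ProtocolVersion) {
    let config = store.get_config(protocol_version);
    process_receipts(config);
}

fn process_receipts(_config: &RuntimeConfig) {
    // All parameter lookups go through the config, never hard-coded values.
}

fn main() {
    let store = RuntimeConfigStore {
        store: BTreeMap::from([(0, Arc::new(RuntimeConfig::default()))]),
    };
    apply_chunk(&store, 57);
}
```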
How to Add a New Parameter
First and foremost, if you are feeling lost, open a topic in our Zulip chat (pagoda/contract-runtime). We are here to help.
Principles
Before adding anything, please review the basic principles for gas parameters.
- A parameter must correspond to a clearly defined workload.
- When the workload is scalable by a factor N that depends on user input, it will likely require a base parameter and a second parameter that is multiplied by N. (Example: N = number of bytes when reading a value from storage.)
- Charge gas before executing the workload.
- Parameters should be independent of specific implementation choices in nearcore.
- Ideally, contract developers can easily understand what the cost is simply by reading the name in a gas profile.
The section on Gas Profiles explains how to charge gas, please also consider that when defining a new parameter.
Necessary Code Changes
Adding the parameter in code involves several steps.
1. Define the parameter by adding it to the list in core/primitives/res/runtime_configs/parameters.yaml.
2. Update the Rust view of parameters by adding a variant to enum Parameter in core/primitives-core/src/parameter.rs. In the same file, update enum FeeParameter if you add an action cost or update ext_costs() if you add a cost inside function calls.
3. Update RuntimeConfig, the configuration used to reference parameters in code. Depending on the type of parameter, you will need to update RuntimeFeesConfig (for action costs) or ExtCostsConfig (for gas costs).
4. Update the list used for gas profiles. This is defined by enum Cost in core/primitives-core/src/profile.rs. You need to add a variant to either enum ActionCosts or enum ExtCost. Please also update fn index(), which maps each profile entry to a unique position in serialized gas profiles.
5. The parameter should be available to use in the code section you need it in. Now is a good time to ensure cargo check and cargo test --no-run pass. Most likely you have to update some testing code, such as ExtCostsConfig::test().
6. To merge your changes into nearcore, you will have to hide your parameter behind a feature flag. Add the feature to the Cargo.toml of each crate touched in steps 3 and 4 and hide the code behind #[cfg(feature = "protocol_feature_MY_NEW_FEATURE")]. Do not hide the code in step 2 so that non-nightly builds can still read parameters.yaml. Also, add your feature as a dependency on nightly in core/primitives/Cargo.toml to make sure it gets included when compiling for nightly. After that, check cargo check and cargo test --no-run with and without features=nightly.
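To illustrate steps 2 and 6, a new parameter might end up looking roughly like the sketch below; the variant name and the feature name are placeholders for illustration, not real parameters:

```rust
// Hedged sketch for steps 2 and 6 above. `MyNewFeatureBase` and the
// feature name are placeholders only.
#[allow(dead_code)]
enum Parameter {
    // ... existing variants ...
    MyNewFeatureBase,
}

// Code reading the new parameter stays behind the feature flag until the
// protocol feature is stabilized.
#[cfg(feature = "protocol_feature_MY_NEW_FEATURE")]
fn charge_my_new_feature_base() {
    // look up the new parameter via the runtime config here
}

fn main() {}
```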
What Gas Value Should the Parameter Have?
For a first draft, the exact gas value used in the parameter is not crucial. Make sure the right set of parameters exists and try to set a number that roughly makes sense. This should be enough to enable discussions on the NEP around the feasibility and usefulness of the proposed feature. If you are not sure, a good rule of thumb is 0.1 Tgas for each disk operation and at least 1 Tgas for each ms of CPU time. Then round it up generously.
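For example, under that rule of thumb, a hypothetical feature performing two disk operations and about half a millisecond of CPU work would get a first draft of roughly 0.7 Tgas, rounded up to 1 Tgas:

```rust
// First-draft estimate following the rule of thumb above
// (hypothetical workload: 2 disk operations, ~0.5 ms of CPU time).
fn main() {
    const TGAS: f64 = 1e12;
    let disk_ops = 2.0;
    let cpu_ms = 0.5;
    let draft_tgas = disk_ops * 0.1 + cpu_ms * 1.0;      // 0.7 Tgas
    let rounded_up = (draft_tgas.ceil() * TGAS) as u64;  // round up generously
    println!("draft parameter value: {rounded_up} gas");
}
```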
The value will have to be refined later. This is usually the last step after the implementation is complete and reviewed. Have a look at the section on estimating gas parameters in the book.
Gas Profile
What if you want to understand the exact gas spending of a smart contract call? It would be very complicated to predict exactly how much gas executing a piece of WASM code will require, including all host function calls and actions. An easier approach is to just run the code on testnet and see how much gas it burns. Gas profiles allow one to dig deeper and understand the breakdown of the gas costs per parameter.
Gas profiles are not very reliable, in that they are often incomplete and the details of how they are computed can change without a protocol version bump.
Example Transaction Gas Profile
You can query the gas profile of a transaction with NEAR CLI.
NEAR_ENV=mainnet near tx-status 8vYxsqYp5Kkfe8j9LsTqZRsEupNkAs1WvgcGcUE4MUUw \
--accountId app.nearcrowd.near \
--nodeUrl https://archival-rpc.mainnet.near.org # Allows to retrieve older transactions.
Transaction app.nearcrowd.near:8vYxsqYp5Kkfe8j9LsTqZRsEupNkAs1WvgcGcUE4MUUw
{
receipts_outcome: [
{
block_hash: '2UVQKpxH6PhEqiKr6zMggqux4hwMrqqjpsbKrJG3vFXW',
id: '14bwmJF21PXY9YWGYN1jpjF3BRuyCKzgVWfhXhZBKH4u',
outcome: {
executor_id: 'app.nearcrowd.near',
gas_burnt: 5302170867180,
logs: [],
metadata: {
gas_profile: [
{
cost: 'BASE',
cost_category: 'WASM_HOST_COST',
gas_used: '15091782327'
},
{
cost: 'CONTRACT_LOADING_BASE',
cost_category: 'WASM_HOST_COST',
gas_used: '35445963'
},
{
cost: 'CONTRACT_LOADING_BYTES',
cost_category: 'WASM_HOST_COST',
gas_used: '117474381750'
},
{
cost: 'READ_CACHED_TRIE_NODE',
cost_category: 'WASM_HOST_COST',
gas_used: '615600000000'
},
# ...
# skipping entries for presentation brevity
# ...
{
cost: 'WRITE_REGISTER_BASE',
cost_category: 'WASM_HOST_COST',
gas_used: '48713882262'
},
{
cost: 'WRITE_REGISTER_BYTE',
cost_category: 'WASM_HOST_COST',
gas_used: '4797573768'
}
],
version: 2
},
receipt_ids: [ '46Qsorkr6hy36ZzWmjPkjbgG28ko1iwz1NT25gvia51G' ],
status: { SuccessValue: 'ZmFsc2U=' },
tokens_burnt: '530217086718000000000'
},
proof: [ ... ]
},
{ ... }
],
status: { SuccessValue: 'ZmFsc2U=' },
transaction: { ... },
transaction_outcome: {
block_hash: '7MgTTVi3aMG9LiGV8ezrNvoorUwQ7TwkJ4Wkbk3Fq5Uq',
id: '8vYxsqYp5Kkfe8j9LsTqZRsEupNkAs1WvgcGcUE4MUUw',
outcome: {
executor_id: 'evgeniya.near',
gas_burnt: 2428068571644,
...
tokens_burnt: '242806857164400000000'
},
}
}
The gas profile is in receipts_outcome.outcome.metadata.gas_profile. It shows gas costs per parameter, with associated categories such as WASM_HOST_COST or ACTION_COST. In the example, all costs are of the former category, which is gas expended on smart contract execution. The latter is for gas spent on actions.
To be complete, the output above should also have a gas profile entry for the function call action. But currently this is not included since gas profiles only work properly on function call receipts. Improving this is planned, see nearcore#8261.
The tx-status query returns one gas profile for each receipt. The output above contains a single gas profile because the transaction only spawned one receipt. If there was a chain of cross-contract calls, there would be multiple profiles.
Besides receipts, also note the transaction_outcome in the output. It contains the gas cost for converting the transaction into a receipt. To calculate the full gas cost, add up the transaction cost with all receipt costs.
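Using the numbers from the example output above, the full cost of this transaction works out as follows:

```rust
// Worked example with the gas values from the output above.
fn main() {
    let transaction_outcome_gas: u64 = 2_428_068_571_644; // converting the tx into a receipt
    let receipt_gas: u64 = 5_302_170_867_180;             // gas_burnt of the single receipt
    assert_eq!(transaction_outcome_gas + receipt_gas, 7_730_239_438_824);
}
```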
The transaction outcome currently does not have a gas profile, it only shows the total gas spent converting the transaction. Arguably, it is not necessary to provide the gas profile since the costs only depend on the list of actions. With sufficient understanding of the protocol, one could reverse-engineer the exact breakdown simply by looking at the action list. But adding the profile would still make sense to make it easier to understand.
Gas Profile Versions
Depending on the version in receipts_outcome.outcome.metadata.version, you should expect a different format of the gas profile. Version 1 has no profile data at all. Version 2 has a detailed profile, but some parameters are conflated, so you cannot extract the exact gas spending in some cases. Version 3 will have the cost exactly per parameter.
Which version of the profile an RPC node returns depends on the version it had when it first processed the transaction. The profiles are stored in the database with one version and never updated. Therefore, older transactions will usually only have old profiles. However, one could replay the chain from genesis with a new nearcore client and generate the newest profile for all transactions in this way.
Note: Due to bugs, some nodes will claim they send version 1 but actually send version 2. (Did I mention that profiles are unreliable?)
How Gas Profiles are Created
The transaction runtime charges gas in various places around the code.
ActionResult keeps a summary of all costs for an action. The gas_burnt and gas_used fields track the total gas burnt and the gas reserved for spawned receipts. These two fields are crucial for the protocol to function correctly, as they are used to determine when execution runs out of gas.
Additionally, ActionResult also has a profile field which keeps a detailed breakdown of the gas spending per parameter. Profiles are not stored on chain, but RPC nodes and archival nodes keep them in their databases. This is mostly a debug tool and has no direct impact on the correct functioning of the protocol.
Charging Gas
Generally speaking, gas is charged right before the computation that it pays for is executed. It has to happen before, to avoid cheap resource exhaustion attacks. Imagine the user has only 1 gas unit left: if we started executing an expensive step anyway, we would waste a significant amount of computation on all validators without anyone paying for it.
When charging gas for an action, the ActionResult can be updated directly. But when charging WASM costs, it would be too slow to do a context switch each time. Therefore, a fast gas counter exists that can be updated from within the VM (see gas_counter.rs). At the end of a function call execution, the gas counter is read by the host and merged into the ActionResult.
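A hedged sketch of this pattern, with simplified stand-ins for the fast counter and ActionResult, might look like this:

```rust
// Hedged sketch: charge-before-execute plus the final merge step.
// Types are simplified stand-ins, not nearcore's exact definitions.
struct FastGasCounter {
    burnt: u64,
    limit: u64,
}

impl FastGasCounter {
    /// Charge gas *before* doing the work it pays for.
    fn burn(&mut self, amount: u64) -> Result<(), &'static str> {
        let new_burnt = self.burnt.checked_add(amount).ok_or("gas overflow")?;
        if new_burnt > self.limit {
            return Err("gas limit exceeded");
        }
        self.burnt = new_burnt;
        Ok(())
    }
}

#[derive(Default)]
struct ActionResult {
    gas_burnt: u64,
    // gas_used, profile, ... omitted
}

fn main() {
    let mut counter = FastGasCounter { burnt: 0, limit: 1_000_000 };
    counter.burn(500).unwrap(); // charge first, then do the work

    // At the end of the function call, the host merges the counter back.
    let mut result = ActionResult::default();
    result.gas_burnt += counter.burnt;
    assert_eq!(result.gas_burnt, 500);
}
```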
Runtime Parameter Estimator
The runtime parameter estimator is a byzantine benchmarking suite. Byzantine benchmarking is not a commonly used term but I feel it describes it quite well. It measures the performance assuming that up to a third of validators and all users collude to make the system as slow as possible.
This benchmarking suite is used to check that the gas parameters defined in the protocol are correct. Correct in this context means that a chunk filled with 1 Pgas (Peta gas) will take at most 1 second to be applied. Or, more generally, per 1 Tgas of execution, we spend no more than 1ms wall-clock time.
For now, nearcore timing is the only one that matters. Things will become more complicated once there are multiple client implementations. But knowing that nearcore can serve requests fast enough proves that it is possible to be at least as fast. However, we should be careful not to couple costs too tightly with the specific implementation of nearcore to allow for innovation in new clients.
The estimator code is part of the nearcore repository in the directory runtime/runtime-params-estimator.
For a practical guide on how to run the estimator, please take a look at Running the Estimator in the workflows chapter.
Code Structure
The estimator contains a binary and a library module. The main.rs contains the CLI arguments parsing code and logic to fill the test database.
The interesting code lives in lib.rs and its submodules. The comments at the top of that file provide a high-level overview of how estimations work. More details on specific estimations are available as comments on the enum variants of Cost in costs.rs. If you roughly understand the three files above, you already have a great overview of the estimator.
estimator_context.rs is another central file. A full estimation run creates a single EstimatorContext. Each estimation will use it to spawn a new Testbed with a fresh database that contains the same data as the setup in the estimator context.
Most estimations fill blocks with transactions to be executed and hand them to Testbed::measure_blocks. To allow for easy repetitions, the block is usually filled by an instance of the TransactionBuilder, which can be retrieved from a testbed.
But even filling blocks with transactions becomes repetitive since many parameters are estimated similarly. utils.rs has a collection of helpful functions that let you write estimations very quickly.
Estimation Metrics
The estimation code is generally not concerned with the metric used to estimate gas. We use let clock = GasCost::measure(); and clock.elapsed() to measure the cost in whatever metric has been specified in the CLI argument --metric. But when you run estimations, and especially when you want to interpret the results, you want to understand the metric used. Available metrics are time and icount.
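The measurement pattern just mentioned looks roughly like the sketch below, here simplified to wall-clock time; GasCost in this sketch is a stand-in for the estimator's type of the same name:

```rust
use std::time::{Duration, Instant};

// Simplified stand-in for the estimator's GasCost, time metric only.
struct GasCost {
    start: Instant,
}

impl GasCost {
    fn measure() -> Self {
        GasCost { start: Instant::now() }
    }

    fn elapsed(self) -> Duration {
        self.start.elapsed()
    }
}

fn main() {
    let clock = GasCost::measure();
    // ... the workload being estimated would run here ...
    let cost = clock.elapsed();
    println!("measured: {cost:?}");
}
```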
Starting with time, this is a simple wall-clock time measurement. At the end of the day, this is what counts in a validator setup. But unfortunately, this metric is very dependent on the specific hardware and whatever else is running on that hardware at the moment. Dynamic voltage and frequency scaling (DVFS) also plays a role here. To a certain degree, all these factors can be controlled. But it requires full control over a system (often not the case when running on cloud-hosted VMs) and manual labor to set it up.
The other supported metric, icount, is much more stable. It uses qemu to emulate an x86 CPU. We then insert a custom TCG plugin (counter.c) that counts the number of executed x86 instructions. It also intercepts system calls and counts the number of bytes seen in sys_read, sys_write and their variations. This gives an approximation of IO bytes, as seen on the interface between the operating system and nearcore. To convert to gas, we use three constants to multiply with instruction count, read bytes, and write bytes.
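Conceptually, the conversion is a weighted sum. The sketch below is illustrative only; the three constants are placeholders, not the calibration values actually used:

```rust
// Hedged sketch of the icount-to-gas conversion described above.
// The constants are placeholders for illustration.
fn icount_to_gas(instructions: u64, read_bytes: u64, write_bytes: u64) -> u64 {
    const GAS_PER_INSTRUCTION: u64 = 125_000;
    const GAS_PER_READ_BYTE: u64 = 20_000_000;
    const GAS_PER_WRITE_BYTE: u64 = 100_000_000;

    instructions * GAS_PER_INSTRUCTION
        + read_bytes * GAS_PER_READ_BYTE
        + write_bytes * GAS_PER_WRITE_BYTE
}

fn main() {
    println!("{} gas", icount_to_gas(1_716_653, 0, 0));
}
```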
We run qemu inside a container (using the Podman runtime) to make sure the qemu and qemu plugin versions match with the system libraries. Make sure to add --containerize when running with --metric icount.
The great thing about icount is that you can run it on different machines and it will always return the same result. It is not 100% deterministic, but it is very close, so it can usually detect code changes that degrade performance in major ways.
The problem with icount is how unrepresentative it is of real-life performance. First, x86 instructions are not all equally complex. Second, how many of them are executed per cycle depends on instruction-level pipelining, branch prediction, memory prefetching, and other such CPU features, which are just not captured by an emulator like qemu. Third, the time it takes to serve bytes in system calls depends less on the sum of all bytes and more on data locality and how well it can be cached in the OS page cache. But regardless of all these inaccuracies, it can still be useful for comparing different implementations that were both measured using icount.
From Estimations to Parameter Values
To calculate the final gas parameter values, there is more to be done than just running a single command. After all, these parameters are part of the protocol specification. They cannot be changed easily. And setting them to a wrong value can cause severe system instability.
Our current strategy is to run estimations with two different metrics and do so on standardized cloud hardware. The output is then sanity checked manually by several people. Based on that, the final gas parameter value is determined. Usually, it will be the higher output of the two metrics rounded up.
The PR #8031 to set the ed25519 verification gas parameters is a good example of what such an analysis and report could look like.
More details on the process will be added to this document in due time.
Overview
This chapter describes various development processes and best practices employed at nearcore.
Rust đŚ
This short chapter collects various useful general resources about the Rust programming language. If you are already familiar with Rust, skip this chapter. Otherwise, this chapter is for you!
Getting Help
The Rust community actively encourages beginners to ask questions, so take advantage of that!
We have a dedicated stream for Rust questions on our Zulip: Rust đŚ.
There's a general Rust forum at https://users.rust-lang.org.
For a more interactive chat, take a look at Discord: https://discord.com/invite/rust-lang.
Reference Material
Rust is very well documented. It's possible to learn the whole language and most of the idioms by just reading the official docs. Starting points are
- The Rust Book (any resemblance to "Guide to Nearcore Development" is purely coincidental)
- Standard Library API
Alternatives are:
- Programming Rust is an alternative book that moves a bit faster.
- Rust By Example is a great resource for learning by doing.
Rust has some great tooling, which is also documented:
- Cargo, the build system. Worth at least skimming through!
- For IDE support, see IntelliJ Rust if you like JetBrains products or rust-analyzer if you use any other editor (fun fact: NEAR was one of the sponsors of rust-analyzer!).
- Rustup manages versions of Rust itself. It's unobtrusive, so feel free to skip this.
Cheat Sheet
This one is in a category of its own; do check it out.
Language Mastery
- Rust for Rustaceans: the book to read after "The Book".
- Tokio docs explain asynchronous programming in Rust (async/await).
- Rust API Guidelines codify rules for idiomatic Rust APIs. Note that guidelines apply to semver surface of libraries, and most of the code in nearcore is not on the semver boundary. Still, a lot of insight there!
- Rustonomicon explains unsafe. (any resemblance to https://nomicon.io is purely coincidental)
Selected Blog Posts
A lot of finer knowledge is hidden away in various dusty corners of Web-2.0. Here are some favorites:
- https://docs.rs/dtolnay/latest/dtolnay/macro._02__reference_types.html
- https://limpet.net/mbrubeck/2019/02/07/rust-a-unique-perspective.html
- https://smallcultfollowing.com/babysteps/blog/2018/02/01/in-rust-ordinary-vectors-are-values/
- https://smallcultfollowing.com/babysteps/blog/2016/10/02/observational-equivalence-and-unsafe-code/
- https://matklad.github.io/2021/09/05/Rust100k.html
And on the easiest topic of error handling specifically:
- http://sled.rs/errors.html
- https://kazlauskas.me/entries/errors
- http://joeduffyblog.com/2016/02/07/the-error-model/
- https://blog.burntsushi.net/rust-error-handling/
Finally, as a dessert, the first rust slide deck: http://venge.net/graydon/talks/rust-2012.pdf.
Workflows
This chapter documents various ways you can run neard during development: running a local net, joining a test net, doing benchmarking and load testing.
Run a Node
This chapter focuses on the basics of running a node you've just built from source. It tries to explain how the thing works under the hood and pays relatively little attention to the various shortcuts we have.
Building the Node
Start with the following command:
$ cargo run --profile dev-release -p neard -- --help
This command builds neard and asks it to show --help. Building neard takes a while; take a look at the Fast Builds chapter to learn how to speed it up.
Let's dissect the command:
- cargo run asks Cargo, the package manager/build tool, to run our application. If you don't have cargo, install it via https://rustup.rs
- --profile dev-release is our custom profile to build a somewhat optimized version of the code. The default debug profile is faster to compile, but produces a node that is too slow to participate in a real network. The --release profile produces a fully optimized node, but that's very slow to compile. So dev-release is a sweet spot for us! However, never use it for actual production nodes.
- -p neard asks to build the neard package. We use cargo workspaces to organize our code. The neard package in the top-level /neard directory is the final binary that ties everything together.
- -- tells cargo to pass the rest of the arguments through to neard.
- --help instructs neard to list available CLI arguments and subcommands.
Note: Building neard might fail with an openssl or CC error. This means that you lack some non-Rust dependencies we use (openssl and rocksdb mainly). We currently don't have docs on how to install those, but (basically) you want to sudo apt install (or whichever distro/package manager you use) the missing bits.
Preparing Tiny Network
Typically, you want neard to connect to some network, like mainnet or testnet. We'll get there in time, but we'll start small. For the current chapter, we will run a network consisting of just a single node -- our own.
The first step there is creating the required configuration. Run the init command to create config files:
$ cargo run --profile dev-release -p neard -- init
INFO neard: version="trunk" build="1.1.0-3091-ga8964d200-modified" latest_protocol=57
INFO near: Using key ed25519:B41GMfqE2jWHVwrPLbD7YmjZxxeQE9WA9Ua2jffP5dVQ for test.near
INFO near: Using key ed25519:34d4aFJEmc2A96UXMa9kQCF8g2EfzZG9gCkBAPcsVZaz for node
INFO near: Generated node key, validator key, genesis file in ~/.near
As the log output says, we are just generating some things in ~/.near. Let's take a look:
$ ls ~/.near
config.json
genesis.json
node_key.json
validator_key.json
The most interesting file here is perhaps genesis.json -- it specifies the initial state of our blockchain. There are a bunch of hugely important fields there, which we'll ignore here. The part we'll look at is .records, which contains the actual initial data:
$ cat ~/.near/genesis.json | jq '.records'
[
{
"Account": {
"account_id": "test.near",
"account": {
"amount": "1000000000000000000000000000000000",
"locked": "50000000000000000000000000000000",
"code_hash": "11111111111111111111111111111111",
"storage_usage": 0,
"version": "V1"
}
}
},
{
"AccessKey": {
"account_id": "test.near",
"public_key": "ed25519:B41GMfqE2jWHVwrPLbD7YmjZxxeQE9WA9Ua2jffP5dVQ",
"access_key": {
"nonce": 0,
"permission": "FullAccess"
}
}
},
{
"Account": {
"account_id": "near",
"account": {
"amount": "1000000000000000000000000000000000",
"locked": "0",
"code_hash": "11111111111111111111111111111111",
"storage_usage": 0,
"version": "V1"
}
}
},
{
"AccessKey": {
"account_id": "near",
"public_key": "ed25519:546XB2oHhj7PzUKHiH9Xve3Ze5q1JiW2WTh6abXFED3c",
"access_key": {
"nonce": 0,
"permission": "FullAccess"
}
}
}
]
(I am using the jq utility here)
We see that we have two accounts here, and we also see their public keys (but not the private ones).
One of these accounts is a validator:
$ cat ~/.near/genesis.json | jq '.validators'
[
{
"account_id": "test.near",
"public_key": "ed25519:B41GMfqE2jWHVwrPLbD7YmjZxxeQE9WA9Ua2jffP5dVQ",
"amount": "50000000000000000000000000000000"
}
]
Now, if we
$ cat ~/.near/validator_key.json
we'll see
{
"account_id": "test.near",
"public_key": "ed25519:B41GMfqE2jWHVwrPLbD7YmjZxxeQE9WA9Ua2jffP5dVQ",
"secret_key": "ed25519:3x2dUQgBoEqNvKwPjfDE8zDVJgM8ysqb641PYHV28mGPu61WWv332p8keMDKHUEdf7GVBm4f6z4D1XRgBxnGPd7L"
}
That is, we have a secret key for the sole validator in our network, how convenient.
To recap, neard init without arguments creates a config for a new network that starts with a single validator, for which we have the keys.
You might be wondering what ~/.near/node_key.json is. That's not too important, but, in our network, there's no 1-1 correspondence between machines participating in the peer-to-peer network and accounts on the blockchain. So the node_key specifies the keypair we'll use when signing network packets. These packets internally will contain messages signed with the validator's key, and these internal messages will drive the evolution of the blockchain state.
Finally, ~/.near/config.json contains various configs for the node itself. These are configs that don't affect the rules guiding the evolution of the blockchain state, but rather things like timeouts, database settings and such.
The only field we'll look at is boot_nodes:
$ cat ~/.near/config.json | jq '.network.boot_nodes'
""
It's empty! The boot_nodes field specifies IPs of the initial nodes our node will try to connect to on startup. As we are looking into running a single-node network, we want to leave it empty. But, if you would like to connect to mainnet, you'd have to set this to some nodes from the mainnet you already know. You'd also have to ensure that you use the same genesis as the mainnet, though -- if the node tries to connect to a network with a different genesis, it is rejected.
Running the Network
Finally,
$ cargo run --profile dev-release -p neard -- run
INFO neard: version="trunk" build="1.1.0-3091-ga8964d200-modified" latest_protocol=57
INFO near: Creating a new RocksDB database path=/home/matklad/.near/data
INFO db: Created a new RocksDB instance. num_instances=1
INFO stats: # 0 4xecSHqTKx2q8JNQNapVEi5jxzewjxAnVFhMd4v5LqNh Validator | 1 validator 0 peers ⏠0 B/s ⏠0 B/s NaN bps 0 gas/s CPU: 0%, Mem: 50.8 MB
INFO near_chain::doomslug: ready to produce block @ 1, has enough approvals for 59.907Âľs, has enough chunks
INFO near_chain::doomslug: ready to produce block @ 2, has enough approvals for 40.732Âľs, has enough chunks
INFO near_chain::doomslug: ready to produce block @ 3, has enough approvals for 65.341Âľs, has enough chunks
INFO near_chain::doomslug: ready to produce block @ 4, has enough approvals for 51.916Âľs, has enough chunks
INFO near_chain::doomslug: ready to produce block @ 5, has enough approvals for 37.155Âľs, has enough chunks
...
đ it's alive!
So, what's going on here?
Our node is running a single-node network. As the network only has a single validator, and the node has the keys for that validator, the node can produce blocks by itself. Note the increasing @ 1, @ 2, ... numbers: that means that our chain is growing.
Let's stop the node with ^C and look around:
INFO near_chain::doomslug: ready to produce block @ 42, has enough approvals for 56.759Âľs, has enough chunks
^C WARN neard: SIGINT, stopping... this may take a few minutes.
INFO neard: Waiting for RocksDB to gracefully shutdown
INFO db: Waiting for remaining RocksDB instances to shut down num_instances=1
INFO db: All RocksDB instances shut down
$
The main change now is that we have a ~/.near/data directory, which holds the state of the network in various RocksDB tables:
$ ls ~/.near/data
000004.log
CURRENT
IDENTITY
LOCK
LOG
MANIFEST-000005
OPTIONS-000107
OPTIONS-000109
It doesn't matter what those are, "rocksdb stuff" is a fine level of understanding here. The important bit here is that the node remembers the state of the network, so, when we restart it, it continues from around the last block:
$ cargo run --profile dev-release -p neard -- run
INFO neard: version="trunk" build="1.1.0-3091-ga8964d200-modified" latest_protocol=57
INFO db: Created a new RocksDB instance. num_instances=1
INFO db: Dropped a RocksDB instance. num_instances=0
INFO near: Opening an existing RocksDB database path=/home/matklad/.near/data
INFO db: Created a new RocksDB instance. num_instances=1
INFO stats: # 5 Cfba39eH7cyNfKn9GoKTyRg8YrhoY1nQxQs66tLBYwRH Validator | 1 validator 0 peers ⏠0 B/s ⏠0 B/s NaN bps 0 gas/s CPU: 0%, Mem: 49.4 MB
INFO near_chain::doomslug: not ready to produce block @ 43, need to wait 366.58789ms, has enough approvals for 78.776Âľs
INFO near_chain::doomslug: not ready to produce block @ 43, need to wait 265.547148ms, has enough approvals for 101.119518ms
INFO near_chain::doomslug: not ready to produce block @ 43, need to wait 164.509153ms, has enough approvals for 202.157513ms
INFO near_chain::doomslug: not ready to produce block @ 43, need to wait 63.176926ms, has enough approvals for 303.48974ms
INFO near_chain::doomslug: ready to produce block @ 43, has enough approvals for 404.41498ms, does not have enough chunks
INFO near_chain::doomslug: ready to produce block @ 44, has enough approvals for 50.07Âľs, has enough chunks
INFO near_chain::doomslug: ready to produce block @ 45, has enough approvals for 45.093Âľs, has enough chunks
Interacting With the Node
Ok, now our node is running, let's poke it! The node exposes a JSON RPC interface which can be used to interact with the node itself (to, e.g., do a health check) or with the blockchain (to query information about the blockchain state or to submit a transaction).
$ http get http://localhost:3030/status
HTTP/1.1 200 OK
access-control-allow-credentials: true
access-control-expose-headers: accept-encoding, accept, connection, host, user-agent
content-length: 1010
content-type: application/json
date: Tue, 15 Nov 2022 13:58:13 GMT
vary: Origin, Access-Control-Request-Method, Access-Control-Request-Headers
{
"chain_id": "test-chain-rR8Ct",
"latest_protocol_version": 57,
"node_key": "ed25519:71QRP9qKcYRUYXTLNnrmRc1NZSdBaBo9nKZ88DK5USNf",
"node_public_key": "ed25519:5A5QHyLayA9zksJZGBzveTgBRecpsVS4ohuxujMAFLLa",
"protocol_version": 57,
"rpc_addr": "0.0.0.0:3030",
"sync_info": {
"earliest_block_hash": "6gJLCnThQENYFbnFQeqQvFvRsTS5w87bf3xf8WN1CMUX",
"earliest_block_height": 0,
"earliest_block_time": "2022-11-15T13:45:53.062613669Z",
"epoch_id": "6gJLCnThQENYFbnFQeqQvFvRsTS5w87bf3xf8WN1CMUX",
"epoch_start_height": 501,
"latest_block_hash": "9JC9o3rZrDLubNxVr91qMYvaDiumzwtQybj1ZZR9dhbK",
"latest_block_height": 952,
"latest_block_time": "2022-11-15T13:58:13.185721125Z",
"latest_state_root": "9kEYQtWczrdzKCCuFzPDX3Vtar1pFPXMdLU5HJyF8Ght",
"syncing": false
},
"uptime_sec": 570,
"validator_account_id": "test.near",
"validator_public_key": "ed25519:71QRP9qKcYRUYXTLNnrmRc1NZSdBaBo9nKZ88DK5USNf",
"validators": [
{
"account_id": "test.near",
"is_slashed": false
}
],
"version": {
"build": "1.1.0-3091-ga8964d200-modified",
"rustc_version": "1.65.0",
"version": "trunk"
}
}
(I am using HTTPie here)
Note how "latest_block_height": 952 corresponds to the @ 952 we see in the logs.
Let's query the blockchain state:
$ http post http://localhost:3030/ method=query jsonrpc=2.0 id=1 \
params:='{"request_type": "view_account", "finality": "final", "account_id": "test.near"}'
HTTP/1.1 200 OK
access-control-allow-credentials: true
access-control-expose-headers: content-length, accept, connection, user-agent, accept-encoding, content-type, host
content-length: 294
content-type: application/json
date: Tue, 15 Nov 2022 14:04:54 GMT
vary: Origin, Access-Control-Request-Method, Access-Control-Request-Headers
{
"id": "1",
"jsonrpc": "2.0",
"result": {
"amount": "1000000000000000000000000000000000",
"block_hash": "Hn4v5CpfWf141AJi166gdDK3e3khCxgfeDJ9dSXGpAVi",
"block_height": 1611,
"code_hash": "11111111111111111111111111111111",
"locked": "50003138579594550524246699058859",
"storage_paid_at": 0,
"storage_usage": 182
}
}
Note how we use an HTTP post method when we interact with the blockchain RPC. The full set of RPC endpoints is documented at https://docs.near.org/api/rpc/introduction
Sending Transactions
Transactions are submitted via RPC as well. Submitting a transaction manually with http is going to be cumbersome though -- transactions are borsh-encoded to bytes, then signed, then encoded in base64 for JSON.
So we will use the official NEAR CLI utility. Install it via npm:
$ npm install -g near-cli
$ near -h
Usage: near <command> [options]
Commands:
near create-account <accountId> create a new developer account
....
Note that, although you install near-cli, the name of the utility is near.
As a first step, let's redo, using near-cli, the view_account call we did earlier with raw httpie:
$ NEAR_ENV=local near state test.near
Loaded master account test.near key from ~/.near/validator_key.json with public key = ed25519:71QRP9qKcYRUYXTLNnrmRc1NZSdBaBo9nKZ88DK5USNf
Account test.near
{
amount: '1000000000000000000000000000000000',
block_hash: 'ESGN7H1kVLp566CTQ9zkBocooUFWNMhjKwqHg4uCh2Sg',
block_height: 2110,
code_hash: '11111111111111111111111111111111',
locked: '50005124762657986708532525400812',
storage_paid_at: 0,
storage_usage: 182,
formattedAmount: '1,000,000,000'
}
NEAR_ENV=local tells near-cli to use our local network rather than the mainnet.
Now, let's create a couple of accounts and send tokens between them:
$ NEAR_ENV=local near create-account alice.test.near --masterAccount test.near
NOTE: In most cases, when connected to network "local", masterAccount will end in ".node0"
Loaded master account test.near key from /home/matklad/.near/validator_key.json with public key = ed25519:71QRP9qKcYRUYXTLNnrmRc1NZSdBaBo9nKZ88DK5USNf
Saving key to 'undefined/local/alice.test.near.json'
Account alice.test.near for network "local" was created.
$ NEAR_ENV=local near create-account bob.test.near --masterAccount test.near
NOTE: In most cases, when connected to network "local", masterAccount will end in ".node0"
Loaded master account test.near key from /home/matklad/.near/validator_key.json with public key = ed25519:71QRP9qKcYRUYXTLNnrmRc1NZSdBaBo9nKZ88DK5USNf
Saving key to 'undefined/local/bob.test.near.json'
Account bob.test.near for network "local" was created.
$ NEAR_ENV=local near send alice.test.near bob.test.near 10
Sending 10 NEAR to bob.test.near from alice.test.near
Loaded master account test.near key from /home/matklad/.near/validator_key.json with public key = ed25519:71QRP9qKcYRUYXTLNnrmRc1NZSdBaBo9nKZ88DK5USNf
Transaction Id BBPndo6gR4X8pzoDK7UQfoUXp5J8WDxkf8Sq75tK5FFT
To see the transaction in the transaction explorer, please open this url in your browser
http://localhost:9001/transactions/BBPndo6gR4X8pzoDK7UQfoUXp5J8WDxkf8Sq75tK5FFT
Note: You can export the NEAR_ENV variable in your shell if you are planning to run multiple commands, to avoid repetition:
$ export NEAR_ENV=local
NEAR CLI printouts are not always the most useful or accurate, but this seems to work.
Note that near automatically creates keypairs and stores them in ~/.near-credentials:
$ ls ~/.near-credentials/local
alice.test.near.json
bob.test.near.json
To verify that this did work, and that near-cli didn't cheat us, let's query the state of the accounts manually:
$ http post http://localhost:3030/ method=query jsonrpc=2.0 id=1 \
params:='{"request_type": "view_account", "finality": "final", "account_id": "alice.test.near"}' \
| jq '.result.amount'
"89999955363487500000000000"
$ http post http://localhost:3030/ method=query jsonrpc=2.0 id=1 \
params:='{"request_type": "view_account", "finality": "final", "account_id": "bob.test.near"}' \
| jq '.result.amount'
"110000000000000000000000000"
Indeed, some amount of tokens was transferred from alice to bob, and then some amount of tokens was deducted to account for transaction fees.
Recap
Great! So we've learned how to run our very own single-node NEAR network using a binary we've built from source. The steps are:
- Create configs with cargo run --profile dev-release -p neard -- init
- Run the node with cargo run --profile dev-release -p neard -- run
- Poke the node with httpie, or:
- Install near-cli via npm install -g near-cli
- Submit transactions via NEAR_ENV=local near create-account ...
In the next chapter, we'll learn how to deploy a simple WASM contract.
Deploy a Contract
In this chapter, we'll learn how to build, deploy, and call a minimal smart contract on our local node.
Preparing Ground
Let's start with creating a fresh local network with an account to which we'll deploy a contract. You might want to re-read how to run a node to understand what's going on here:
$ cargo run --profile dev-release -p neard -- init
$ cargo run --profile dev-release -p neard -- run
$ NEAR_ENV=local near create-account alice.test.near --masterAccount test.near
As a sanity check, querying the state of the alice.test.near account should work:
$ NEAR_ENV=local near state alice.test.near
Loaded master account test.near key from /home/matklad/.near/validator_key.json with public key = ed25519:7tU4NtFozPWLotcfhbT9KfBbR3TJHPfKJeCri8Me6jU7
Account alice.test.near
{
amount: '100000000000000000000000000',
block_hash: 'EEMiLrk4ZiRzjNJXGdhWPJfKXey667YBnSRoJZicFGy9',
block_height: 24,
code_hash: '11111111111111111111111111111111',
locked: '0',
storage_paid_at: 0,
storage_usage: 182,
formattedAmount: '100'
}
Minimal Contract
NEAR contracts are WebAssembly blobs of bytes. To create a contract, a contract developer typically uses an SDK for some high-level programming language, such as JavaScript, which takes care of producing the right .wasm.
In this guide, we are interested in how things work under the hood, so we'll do everything manually, and implement a contract in Rust without any help from SDKs.
As we are looking for something simple, let's create a contract with a single "method", hello, which returns a "hello world" string. To "define a method", a wasm module should export a function. To "return a value", the contract needs to interact with the environment to say "hey, this is the value I am returning". Such "interactions" are carried out through host functions, which are quite a bit like syscalls in traditional operating systems.
The set of host functions that the contract can import is defined in imports.rs. In this particular case, we need the value_return function:
value_return<[value_len: u64, value_ptr: u64] -> []>
This means that the value_return function takes a pointer to a slice of bytes and the length of the slice, and returns nothing. If the contract calls this function, the slice is considered the result of the function.
To recap, we want to produce a .wasm file with roughly the following content:
(module
(import "env" "value_return" (func $value_return (param i64 i64)))
(func (export "hello") ... ))
Cargo Boilerplate
Armed with this knowledge, we can write Rust code to produce the required WASM. Before we start doing that, some amount of setup code is required.
Let's start with creating a new crate:
$ cargo new hello-near --lib
To compile to wasm, we also need to add the corresponding rustup target:
$ rustup target add wasm32-unknown-unknown
Then, we need to tell Cargo that the final artifact we want to get is a WebAssembly module.
This requires the following cryptic spell in Cargo.toml:
# hello-near/Cargo.toml
[lib]
crate-type = ["cdylib"]
Here, we ask Cargo to build a "C dynamic library". When compiling for wasm, that'll give us a .wasm module. This part is a bit confusing, sorry about that :(
Next, as we are aiming for minimalism here, we need to disable optional bits of the Rust runtime. Namely, we want to make our crate no_std (this means that we are not going to use the Rust standard library), set panic=abort as our panic strategy, and define a panic handler to abort execution.
# hello-near/Cargo.toml
[package]
name = "hello-near"
version = "0.1.0"
edition = "2021"
[lib]
crate-type = ["cdylib"]
[profile.release]
panic = "abort"
// hello-near/src/lib.rs
#![no_std]

#[panic_handler]
fn panic_handler(_info: &core::panic::PanicInfo) -> ! {
    core::arch::wasm32::unreachable()
}
At this point, we should be able to compile our code to wasm, and it should be fairly small. Let's do that:
$ cargo b -r --target wasm32-unknown-unknown
Compiling hello-near v0.1.0 (~/hello-near)
Finished release [optimized] target(s) in 0.24s
$ ls target/wasm32-unknown-unknown/release/hello_near.wasm
.rwxr-xr-x 106 matklad 15 Nov 15:34 target/wasm32-unknown-unknown/release/hello_near.wasm
106 bytes is pretty small! Let's see what's inside. For that, we'll use the wasm-tools suite of CLI utilities.
$ cargo install wasm-tools
$ wasm-tools print target/wasm32-unknown-unknown/release/hello_near.wasm
(module
(memory (;0;) 16)
(global $__stack_pointer (;0;) (mut i32) i32.const 1048576)
(global (;1;) i32 i32.const 1048576)
(global (;2;) i32 i32.const 1048576)
(export "memory" (memory 0))
(export "__data_end" (global 1))
(export "__heap_base" (global 2))
)
Rust Contract
Finally, let's implement an actual contract. We'll need an extern "C" block to declare the value_return import, and a #[unsafe(no_mangle)] extern "C" function to declare the hello export:
// hello-near/src/lib.rs
#![no_std]

extern "C" {
    fn value_return(len: u64, ptr: u64);
}

#[unsafe(no_mangle)]
pub extern "C" fn hello() {
    let msg = "hello world";
    unsafe { value_return(msg.len() as u64, msg.as_ptr() as u64) }
}

#[panic_handler]
fn panic_handler(_info: &core::panic::PanicInfo) -> ! {
    core::arch::wasm32::unreachable()
}
After building the contract, the output wasm shows us that it's roughly what we want:
$ cargo b -r --target wasm32-unknown-unknown
Compiling hello-near v0.1.0 (/home/matklad/hello-near)
Finished release [optimized] target(s) in 0.05s
$ wasm-tools print target/wasm32-unknown-unknown/release/hello_near.wasm
(module
(type (;0;) (func (param i64 i64)))
(type (;1;) (func))
(import "env" "value_return" (; <- Here's our import. ;)
(func $value_return (;0;) (type 0)))
(func $hello (;1;) (type 1)
i64.const 11
i32.const 1048576
i64.extend_i32_u
call $value_return
)
(memory (;0;) 17)
(global $__stack_pointer (;0;) (mut i32) i32.const 1048576)
(global (;1;) i32 i32.const 1048587)
(global (;2;) i32 i32.const 1048592)
(export "memory" (memory 0))
(export "hello" (func $hello)) (; <- And export! ;)
(export "__data_end" (global 1))
(export "__heap_base" (global 2))
(data $.rodata (;0;) (i32.const 1048576) "hello world")
)
Deploying the Contract
Now that we have the WASM, let's deploy it!
$ NEAR_ENV=local near deploy alice.test.near \
./target/wasm32-unknown-unknown/release/hello_near.wasm
Loaded master account test.near key from /home/matklad/.near/validator_key.json with public key = ed25519:ChLD1qYic3G9qKyzgFG3PifrJs49CDYeERGsG58yaSoL
Starting deployment. Account id: alice.test.near, node: http://127.0.0.1:3030, helper: http://localhost:3000, file: ./target/wasm32-unknown-unknown/release/hello_near.wasm
Transaction Id GDbTLUGeVaddhcdrQScVauYvgGXxSssEPGUSUVAhMWw8
To see the transaction in the transaction explorer, please open this url in your browser
http://localhost:9001/transactions/GDbTLUGeVaddhcdrQScVauYvgGXxSssEPGUSUVAhMWw8
Done deploying to alice.test.near
And, finally, let's call our contract:
$ NEAR_ENV=local near call alice.test.near hello --accountId alice.test.near
Scheduling a call: alice.test.near.hello()
Loaded master account test.near key from /home/matklad/.near/validator_key.json with public key = ed25519:ChLD1qYic3G9qKyzgFG3PifrJs49CDYeERGsG58yaSoL
Doing account.functionCall()
Transaction Id 9WMwmTf6pnFMtj1KBqjJtkKvdFXS4kt3DHnYRnbFpJ9e
To see the transaction in the transaction explorer, please open this url in your browser
http://localhost:9001/transactions/9WMwmTf6pnFMtj1KBqjJtkKvdFXS4kt3DHnYRnbFpJ9e
'hello world'
Note that we pass alice.test.near twice: the first time to specify which contract we are calling, the second time to determine who calls the contract. That is, the second account is the one that spends tokens. In the following example, bob spends NEAR to call the contract deployed to the alice account:
$ NEAR_ENV=local near call alice.test.near hello --accountId bob.test.near
Scheduling a call: alice.test.near.hello()
Loaded master account test.near key from /home/matklad/.near/validator_key.json with public key = ed25519:ChLD1qYic3G9qKyzgFG3PifrJs49CDYeERGsG58yaSoL
Doing account.functionCall()
Transaction Id 4vQKtP6zmcR4Xaebw8NLF6L5YS96gt5mCxc5BUqUcC41
To see the transaction in the transaction explorer, please open this url in your browser
http://localhost:9001/transactions/4vQKtP6zmcR4Xaebw8NLF6L5YS96gt5mCxc5BUqUcC41
'hello world'
Running the Estimator
This workflow describes how to run the gas estimator byzantine-benchmark suite. To learn about its background and purpose, refer to Runtime Parameter Estimator in the architecture chapter.
Type this in your console to quickly run estimations on a couple of action costs.
cargo run -p runtime-params-estimator --features required -- \
--accounts-num 20000 --additional-accounts-num 20000 \
--iters 3 --warmup-iters 1 --metric time \
--costs=ActionReceiptCreation,ActionTransfer,ActionCreateAccount,ActionFunctionCallBase
You should get an output like this.
[elapsed 00:00:17 remaining 00:00:00] Writing into storage ââââââââââââââââââââ 20000/20000
ActionReceiptCreation 4_499_673_502_000 gas [ 4.499674ms] (computed in 7.22s)
ActionTransfer 410_122_090_000 gas [ 410.122Âľs] (computed in 4.71s)
ActionCreateAccount 237_495_890_000 gas [ 237.496Âľs] (computed in 4.64s)
ActionFunctionCallBase 770_989_128_914 gas [ 770.989Âľs] (computed in 4.65s)
Finished in 40.11s, output saved to:
/home/you/near/nearcore/costs-2022-11-11T11:11:11Z-e40863c9b.txt
This shows how much gas a parameter should cost to satisfy the 1ms = 1Tgas rule. It also shows how much time that corresponds to and how long it took to compute each of the estimations.
Note that the above does not produce very accurate results and it can have high variance as well. It runs an unoptimized binary, the state is small, and the metric used is wall-clock time which is always prone to variance in hardware and can be affected by other processes currently running on your system.
Once your estimation code is ready, it is better to run it with a larger state and an optimized binary.
cargo run --release -p runtime-params-estimator --features required -- \
--accounts-num 20000 --additional-accounts-num 2000000 \
--iters 3 --warmup-iters 1 --metric time \
--costs=ActionReceiptCreation,ActionTransfer,ActionCreateAccount,ActionFunctionCallBase
You might also want to run a hardware-agnostic estimation using the following command. It uses podman and qemu under the hood, so it will be quite a bit slower. You will need to install podman to run this command.
cargo run --release -p runtime-params-estimator --features required -- \
--accounts-num 20000 --additional-accounts-num 2000000 \
--iters 3 --warmup-iters 1 --metric icount --containerize \
--costs=ActionReceiptCreation,ActionTransfer,ActionCreateAccount,ActionFunctionCallBase
Note how the output looks a bit different now. The i, r and w values show the instruction count, read IO bytes, and write IO bytes respectively. The IO byte count is known to be inaccurate.
+ /host/nearcore/runtime/runtime-params-estimator/emu-cost/counter_plugin/qemu-x86_64 -plugin file=/host/nearcore/runtime/runtime-params-estimator/emu-cost/counter_plugin/libcounter.so -cpu Haswell-v4 /host/nearcore/target/release/runtime-params-estimator --home /.near --accounts-num 20000 --iters 3 --warmup-iters 1 --metric icount --costs=ActionReceiptCreation,ActionTransfer,ActionCreateAccount,ActionFunctionCallBase --skip-build-test-contract --additional-accounts-num 0 --in-memory-db
ActionReceiptCreation 214_581_685_500 gas [ 1716653.48i 0.00r 0.00w] (computed in 6.11s)
ActionTransfer 21_528_212_916 gas [ 172225.70i 0.00r 0.00w] (computed in 4.71s)
ActionCreateAccount 26_608_336_250 gas [ 212866.69i 0.00r 0.00w] (computed in 4.67s)
ActionFunctionCallBase 12_193_364_898 gas [ 97546.92i 0.00r 0.00w] (computed in 2.39s)
Finished in 17.92s, output saved to:
/host/nearcore/costs-2022-11-01T16:27:36Z-e40863c9b.txt
The difference between the metrics is discussed in the Estimation Metrics chapter.
You should now be all set up for running estimations on your local machine. Also, check cargo run -p runtime-params-estimator --features required -- --help for the list of available options.
Running near localnet on 2 machines
Quick instructions on how to run a localnet on 2 separate machines.
Setup
- Machine1: "pc" - 192.168.0.1
- Machine2: "laptop" - 192.168.0.2
Run on both machines (make sure that they are using the same version of the code):
cargo build -p neard
Then on machine1 run the command below, which will generate the configurations:
./target/debug/neard --home ~/.near/localnet_multi localnet --shards 3 --v 2
This command has generated configuration for 3 shards and 2 validators (in directories ~/.near/localnet_multi/node0 and ~/.near/localnet_multi/node1).
Now copy the contents of the node1 directory to machine2:
rsync -r ~/.near/localnet_multi/node1 192.168.0.2:~/.near/localnet_multi/node1
Now open the config.json file on both machines (node0/config.json on machine1 and node1/config.json on machine2) and:
- for rpc->addr and network->addr:
- Change the address from 127.0.0.1 to 0.0.0.0 (this means that the port will be accessible from other computers)
- Remember the port numbers (they are generated randomly).
- Also write down the node0's node_key (it is probably: "ed25519:7PGseFbWxvYVgZ89K1uTJKYoKetWs7BJtbyXDzfbAcqX")
Running
On machine1:
./target/debug/neard --home ~/.near/localnet_multi/node0 run
On machine2:
./target/debug/neard --home ~/.near/localnet_multi/node1 run --boot-nodes ed25519:7PGseFbWxvYVgZ89K1uTJKYoKetWs7BJtbyXDzfbAcqX@192.168.0.1:37665
The boot node address should be the IP of machine1 plus the network addr port from node0/config.json.
And if everything goes well, the nodes should communicate and start producing blocks.
Troubleshooting
The debug mode is enabled by default, so you should be able to see what's going on by going to http://machine1:RPC_ADDR_PORT/debug
If the node keeps saying "waiting for peers"
See if you can reach machine1's debug page from machine2 (if not, there might be a firewall blocking the connection).
Make sure that you set the right ports (it should use node0's NETWORK port) and that you set the IP address there to 0.0.0.0.
Resetting the state
Simply stop both nodes, and remove the data subdirectory (~/.near/localnet_multi/node0/data and ~/.near/localnet_multi/node1/data). Then after restart, the nodes will start the blockchain from scratch.
IO tracing
When should I use IO traces?
IO traces can be used to identify slow receipts and to understand why they are slow, or to detect general inefficiencies in how we use the DB.
The results give you counts of DB requests and some useful statistics such as trie node cache hits. They will NOT give you time measurements; use Grafana to observe those.
The main use cases in the past were to estimate the performance of new storage features (prefetcher, flat state) and to find out why specific contracts produce slow receipts.
Setup
When compiling neard (or the parameter estimator) with the `io_trace` feature, the binary is instrumented with fine-grained tracking of database operations.
Aside: We don't enable it by default because we are afraid the overhead could be too much, since we store some information for very hot paths such as trie node cache hits. Although we haven't properly evaluated if it really is a performance problem.
This allows using the `--record-io-trace=/path/to/output.io_trace` CLI flag on neard. Run it in combination with the subcommands `neard run`, `neard view-state`, or `runtime-params-estimator` and it will record an IO trace. Make sure to provide the flag to `neard` itself, however, not to the subcommands. (See examples below.)
# Example command for normal node
# (Careful! This will quickly fill your disk if you let it run.)
cargo build --release -p neard --features=io_trace
target/release/neard \
--record-io-trace=/mnt/disks/some_disk_with_enough_space/my.io_trace \
run
# Example command for state viewer, applying a range of chunks in shard 0
cargo build --release -p neard --features=io_trace
target/release/neard \
--record-io-trace=75220100-75220101.s0.io_trace \
view-state apply-range --start-index 75220100 --end-index 75220101 \
--shard-id 0 sequential
# Example command for params estimator
cargo run --release -p runtime-params-estimator --features=required,io_trace \
-- --accounts-num 200000 --additional-accounts-num 200000 \
--iters 3 --warmup-iters 1 --metric time \
--record-io-trace=tmp.io \
--costs ActionReceiptCreation
IO trace content
Once you have collected an IO trace, you can inspect its content manually, or use existing tools to extract statistics. Let's start with the manual approach.
Simple example trace: Estimator
An estimator trace typically starts something like this:
commit
SET DbVersion "'VERSION'" size=2
commit
SET DbVersion "'KIND'" size=3
apply num_transactions=0 shard_cache_miss=7
GET State "AAAAAAAAAAB3I0MYevRcExi1ql5PSQX+fuObsPH30yswS7ytGPCgyw==" size=46
GET State "AAAAAAAAAACGDsmYvNoBGZnc8PzDKoF4F2Dvw3N6XoAlRrg8ezA8FA==" size=107
GET State "AAAAAAAAAAB3I0MYevRcExi1ql5PSQX+fuObsPH30yswS7ytGPCgyw==" size=46
GET State "AAAAAAAAAACGDsmYvNoBGZnc8PzDKoF4F2Dvw3N6XoAlRrg8ezA8FA==" size=107
GET State "AAAAAAAAAAB3I0MYevRcExi1ql5PSQX+fuObsPH30yswS7ytGPCgyw==" size=46
GET State "AAAAAAAAAACGDsmYvNoBGZnc8PzDKoF4F2Dvw3N6XoAlRrg8ezA8FA==" size=107
GET State "AAAAAAAAAAB3I0MYevRcExi1ql5PSQX+fuObsPH30yswS7ytGPCgyw==" size=46
...
Depending on the source, traces look a bit different at the start. But you should always see some setup at the beginning and per-chunk workload later on.
Indentation is used to display the call hierarchy. The `commit` keyword shows when a commit starts, and all `SET` and `UPDATE_RC` commands that follow with one level deeper indentation belong to that same database transaction commit.
Later, you see a group that starts with an `apply` header. It groups all IO requests that were performed for a call to `fn apply` that applies transactions and receipts of a chunk to the previous state root.
In the example, you see a list of `GET` requests that belong to that `apply`, each with the DB key used and the size of the value read. Further, you can read in the trace that this specific chunk had 0 transactions and that it cache-missed all 7 of the DB requests it performed to apply this empty chunk.
Example trace: Full mainnet node
Next let's look at an excerpt of an IO trace from a real node on mainnet.
...
GET State "AQAAAAMAAACm9DRx/dU8UFEfbumiRhDjbPjcyhE6CB1rv+8fnu81bw==" size=9
GET State "AQAAAAAAAACLFgzRCUR3inMDpkApdLxFTSxRvprJ51eMvh3WbJWe0A==" size=203
GET State "AQAAAAIAAACXlEo0t345S6PHsvX1BLaGw6NFDXYzeE+tlY2srjKv8w==" size=299
apply_transactions shard_id=3
process_state_update
apply num_transactions=3 shard_cache_hit=207
process_transaction tx_hash=C4itKVLP5gBoAPsEXyEbi67Gg5dvQVugdMjrWBBLprzB shard_cache_hit=57
GET FlatState "AGxlb25hcmRvX2RheS12aW5jaGlrLm5lYXI=" size=36
GET FlatState "Amxlb25hcmRvX2RheS12aW5jaGlrLm5lYXICANLByB1merOzxcGB1HI9/L60QvONzOE6ovF3hjYUbhA8" size=36
process_transaction tx_hash=GRSXC4QCBJHN4hmJiATAbFGt9g5PiksQDNNRaSk666WX shard_cache_miss=3 prefetch_pending=3 shard_cache_hit=35
GET FlatState "AnJlbGF5LmF1cm9yYQIA5iq407bcLgisCKxQQi47TByaFNe9FOgQg5y2gpU4lEM=" size=36
process_transaction tx_hash=6bDPeat12pGqA3KEyyg4tJ35kBtRCuFQ7HtCpWoxr8qx shard_cache_miss=2 prefetch_pending=1 shard_cache_hit=21 prefetch_hit=1
GET FlatState "AnJlbGF5LmF1cm9yYQIAyKT1vEHVesMEvbp2ICA33x6zxfmBJiLzHey0ZxauO1k=" size=36
process_receipt receipt_id=GRB3skohuShBvdGAoEoR3SdJJw7MwCxxscJHKLdPoYUC predecessor=1663adeba849fb7c26195678e1c5378278e5caa6325d4672246821d8e61bb160 receiver=token.sweat id=GRB3skohuShBvdGAoEoR3SdJJw7MwCxxscJHKLdPoYUC shard_cache_too_large=1 shard_cache_miss=1 shard_cache_hit=38
GET FlatState "AXRva2VuLnN3ZWF0" size=36
GET State "AQAAAAMAAADVYp4vtlIbDoVhji22CZOEaxVWVTJKASq3iMvpNEQVDQ==" size=206835
input
register_len
read_register
storage_read READ key='STATE' size=70 tn_db_reads=20 tn_mem_reads=0 shard_cache_hit=21
register_len
read_register
attached_deposit
predecessor_account_id
register_len
read_register
sha256
read_register
storage_read READ key=dAAxagYMOEb01+56sl9vOM0yHbZRPSaYSL3zBXIfCOi7ow== size=16 tn_db_reads=10 tn_mem_reads=19 shard_cache_hit=11
GET FlatState "CXRva2VuLnN3ZWF0LHQAMWoGDDhG9NfuerJfbzjNMh22UT0mmEi98wVyHwjou6M=" size=36
register_len
read_register
sha256
read_register
storage_write WRITE key=dAAxagYMOEb01+56sl9vOM0yHbZRPSaYSL3zBXIfCOi7ow== size=16 tn_db_reads=0 tn_mem_reads=30
...
Maybe that's a bit much. Let's break it down into pieces.
It starts with a few DB get requests that are outside of applying a chunk. It's quite common that we have these kinds of constant overhead requests that are independent of what's inside a chunk. If we see too many such requests, we should take a close look to see if we are wasting performance.
GET State "AQAAAAMAAACm9DRx/dU8UFEfbumiRhDjbPjcyhE6CB1rv+8fnu81bw==" size=9
GET State "AQAAAAAAAACLFgzRCUR3inMDpkApdLxFTSxRvprJ51eMvh3WbJWe0A==" size=203
GET State "AQAAAAIAAACXlEo0t345S6PHsvX1BLaGw6NFDXYzeE+tlY2srjKv8w==" size=299
Next let's look at `apply_transactions` but limit the depth of items to 3 levels.
apply_transactions shard_id=3
process_state_update
apply num_transactions=3 shard_cache_hit=207
process_transaction tx_hash=C4itKVLP5gBoAPsEXyEbi67Gg5dvQVugdMjrWBBLprzB shard_cache_hit=57
process_transaction tx_hash=GRSXC4QCBJHN4hmJiATAbFGt9g5PiksQDNNRaSk666WX shard_cache_miss=3 prefetch_pending=3 shard_cache_hit=35
process_transaction tx_hash=6bDPeat12pGqA3KEyyg4tJ35kBtRCuFQ7HtCpWoxr8qx shard_cache_miss=2 prefetch_pending=1 shard_cache_hit=21 prefetch_hit=1
process_receipt receipt_id=GRB3skohuShBvdGAoEoR3SdJJw7MwCxxscJHKLdPoYUC predecessor=1663adeba849fb7c26195678e1c5378278e5caa6325d4672246821d8e61bb160 receiver=token.sweat id=GRB3skohuShBvdGAoEoR3SdJJw7MwCxxscJHKLdPoYUC shard_cache_too_large=1 shard_cache_miss=1 shard_cache_hit=38
Here you can see that before we even get to `apply`, we go through `apply_transactions` and `process_state_update`. The excerpt does not show it but there are DB requests listed further below that belong to these levels but not to `apply`.
Inside `apply`, we see 3 transactions being converted to receipts as part of this chunk, and one already existing action receipt getting processed.
Cache hit statistics for each level are also displayed. For example, the first transaction has 57 read requests and all of them hit in the shard cache. For the second transaction, we miss the cache 3 times but the values were already in the process of being prefetched. This would be account data which we fetch in parallel for all transactions in the chunk.
Finally, there are several `process_receipt` groups, although the excerpt was cut to display only one. Here we see the receiving account (`receiver=token.sweat`) and the receipt ID to potentially look it up on an explorer, or dig deeper using state viewer commands.
Again, cache hit statistics are included. Here you can see one value missed the cache because it was too large. Usually that's a contract code. We do not include it in the shard cache because it would take up too much space.
Zooming in a bit further, let's look at the DB request at the start of the receipt.
GET FlatState "AXRva2VuLnN3ZWF0" size=36
GET State "AQAAAAMAAADVYp4vtlIbDoVhji22CZOEaxVWVTJKASq3iMvpNEQVDQ==" size=206835
input
register_len
read_register
storage_read READ key='STATE' size=70 tn_db_reads=20 tn_mem_reads=0 shard_cache_hit=21
register_len
read_register
`FlatState "AXRva2VuLnN3ZWF0"` reads the `ValueRef` of the contract code, which is 36 bytes in its serialized format. Then, the `ValueRef` is dereferenced to read the actual code, which happens to be 206kB in size. This happens in the `State` column because, at the time of writing, we still read the actual values from the trie, not from flat state.
What follows are host function calls performed by the SDK. It uses `input` to check the function call arguments and copies them from a register into WASM memory.
Then the SDK reads the serialized contract state from the hardcoded key `"STATE"`. Note that we charge 20 `tn_db_reads` for it, since we missed the accounting cache, but we hit everything in the shard cache. Thus, there are no DB requests. If there were DB requests for this `tn_db_reads`, you would see them listed.
The returned 70 bytes are again copied into WASM memory. Knowing the SDK code a little bit, we can guess that the data is then deserialized into the struct used by the contract for its root state. That's not visible on the trace though, as this happens completely inside the WASM VM.
Next we start executing the actual contract code, which again calls a bunch of host functions. Apparently the code starts by reading the attached deposit and the predecessor account id, presumably to perform some checks.
The `sha256` call here is used to shorten implicit account ids. (Link to code for comparison.)
Afterwards, a value with 16 bytes (a `u128`) is fetched from the trie state. Serving this required reading 30 trie nodes: 19 of them were cached in the accounting cache and were not charged the full gas cost, and the remaining 11 missed the accounting cache but hit the shard cache. Nothing needed to be fetched from DB because the Sweatcoin-specific prefetcher has already loaded everything into the shard cache.
Note: We see trie node requests despite flat state being used. This is because the trace was collected with a binary that performed a read on both the trie and flat state to do some correctness checks.
attached_deposit
predecessor_account_id
register_len
read_register
sha256
read_register
storage_read READ key=dAAxagYMOEb01+56sl9vOM0yHbZRPSaYSL3zBXIfCOi7ow== size=16 tn_db_reads=10 tn_mem_reads=19 shard_cache_hit=11
GET FlatState "CXRva2VuLnN3ZWF0LHQAMWoGDDhG9NfuerJfbzjNMh22UT0mmEi98wVyHwjou6M=" size=36
So that is how to read these traces and dig deep. But maybe you want aggregated statistics instead? Then please continue reading.
Evaluating an IO trace
When you collect an IO trace over an hour of mainnet traffic, it can quickly be above 1GB in uncompressed size. You might be able to sample a few receipts and eyeball them to get a feeling for what's going on. But you can't understand the whole trace without additional tooling.
The parameter estimator command `replay` can help with that. (See also this readme.)
Run the following command to see an overview of available commands.
# will print the help page for the IO trace replay command
cargo run --profile dev-release -p runtime-params-estimator -- \
replay --help
All commands aggregate the information of a trace. Either globally, per chunk, or per receipt. For example, below is the output that gives a list of RocksDB columns that were accessed and how many times, aggregated by chunk.
cargo run --profile dev-release -p runtime-params-estimator -- \
replay ./path/to/my.io_trace chunk-db-stats
apply_transactions shard_id=3 block=DajBgxTgV8NewTJBsR5sTgPhVZqaEv9xGAKVnCiMiDxV
GET 12 FlatState 4 State
apply_transactions shard_id=0 block=DajBgxTgV8NewTJBsR5sTgPhVZqaEv9xGAKVnCiMiDxV
GET 14 FlatState 8 State
apply_transactions shard_id=2 block=HTptJFZKGfmeWs7y229df6WjMQ3FGfhiqsmXnbL2tpz8
GET 2 FlatState
apply_transactions shard_id=3 block=HTptJFZKGfmeWs7y229df6WjMQ3FGfhiqsmXnbL2tpz8
GET 6 FlatState 2 State
apply_transactions shard_id=1 block=HTptJFZKGfmeWs7y229df6WjMQ3FGfhiqsmXnbL2tpz8
GET 50 FlatState 5 State
...
apply_transactions shard_id=3 block=AUcauGxisMqNmZu5Ln7LLu8Li31H1sYD7wgd7AP6nQZR
GET 17 FlatState 3 State
top-level:
GET 8854 Block 981 BlockHeader 16556 BlockHeight 59155 BlockInfo 2 BlockMerkleTree 330009 BlockMisc 1 BlockOrdinal 31924 BlockPerHeight 863 BlockRefCount 1609 BlocksToCatchup 1557 ChallengedBlocks 4 ChunkExtra 5135 ChunkHashesByHeight 128788 Chunks 35 EpochInfo 1 EpochStart 98361 FlatState 1150 HeaderHashesByHeight 8113 InvalidChunks 263 NextBlockHashes 22 OutgoingReceipts 131114 PartialChunks 1116 ProcessedBlockHeights 968698 State
SET 865 BlockHeight 1026 BlockMerkleTree 12428 BlockMisc 1636 BlockOrdinal 865 BlockPerHeight 865 BlockRefCount 3460 ChunkExtra 3446 ChunkHashesByHeight 339142 FlatState 3460 FlatStateDeltas 3460 FlatStateMisc 865 HeaderHashesByHeight 3460 IncomingReceipts 865 NextBlockHashes 3442 OutcomeIds 3442 OutgoingReceipts 863 ProcessedBlockHeights 340093 StateChanges 3460 TrieChanges
The output contains one `apply_transactions` for each chunk, with the block hash and the shard id. Then it prints one line for each DB operation observed (GET, SET, ...) together with a list of columns and an op count.
See the `top-level` output at the end? These are all the DB requests that could not be assigned to specific chunks. The way we currently count write operations (SET, UPDATE_RC), they are never assigned to a specific chunk and instead only show up in the top-level list. Clearly, there is some room for improvement here. So far we simply haven't worried about RocksDB write performance, so the tooling to debug write performance is naturally lacking.
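If `replay` does not offer the aggregation you need, the trace format is simple enough to post-process yourself. Below is a minimal, hypothetical Rust sketch (not part of nearcore's tooling; the file name and the exact line format are assumptions based on the excerpts above) that counts GET operations per DB column:
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    // "my.io_trace" is a placeholder; point this at your own trace file.
    let reader = BufReader::new(File::open("my.io_trace")?);
    let mut get_counts: HashMap<String, u64> = HashMap::new();

    for line in reader.lines() {
        let line = line?;
        // Lines look like: `GET FlatState "..." size=36` (op, column, key, size).
        let mut tokens = line.trim_start().split_whitespace();
        if tokens.next() == Some("GET") {
            if let Some(column) = tokens.next() {
                *get_counts.entry(column.to_string()).or_default() += 1;
            }
        }
    }

    for (column, count) in &get_counts {
        println!("{column}: {count} GETs");
    }
    Ok(())
}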
Profiling neard
Sampling performance profiling
It is a common task to need to look where `neard` is spending time. Outside of instrumentation we've also been successfully using sampling profilers to gain an intuition over how the code works and where it spends time. It is a very quick way to get some baseline understanding of the performance characteristics, but due to its probabilistic nature it is also not particularly precise when it comes to small details.
Linux's `perf` has been the tool of choice in most cases, although tools like Intel VTune could be used too. In order to use either, first prepare your system:
$ sudo sysctl kernel.perf_event_paranoid=0
$ sudo sysctl kernel.kptr_restrict=0
Beware that this gives access to certain kernel state and environment to the unprivileged user. Once the investigation is over, either set these properties back to the more restricted settings or, better yet, reboot.
Definitely do not run untrusted code after running these commands.
Then collect a profile as such:
$ perf record -e cpu-clock -F1000 -g --call-graph dwarf,65528 YOUR_COMMAND_HERE
# or attach to a running process:
$ perf record -e cpu-clock -F1000 -g --call-graph dwarf,65528 -p NEARD_PID
This command will use the CPU time clock to determine when to trigger a sampling process and will do such sampling roughly 1000 times (the `-F` argument) every CPU second.
Once terminated, this command will produce a profile file in the current working directory.
Although you can inspect the profile already with `perf report`, we've had a much better experience using Firefox Profiler as the viewer. While Firefox Profiler supports `perf` and many other data formats, for `perf` in particular a conversion step is necessary:
$ perf script -F +pid > mylittleprofile.script
Then, load this `mylittleprofile.script` file with the profiler.
Low overhead stack frame collection
The command above uses the `-g --call-graph dwarf,65528` parameter to instruct `perf` to collect a stack trace for each sample using DWARF unwinding metadata. This will work no matter how `neard` is built, but it is expensive and not super precise (e.g. it has trouble with JIT code). If you have the ability to build a profiling-tuned build of `neard`, you can use higher quality stack frame collection.
$ cargo build --release --config .cargo/config.profiling.toml -p neard
Then, replace the `--call-graph dwarf` with `--call-graph fp`:
$ perf record -e cpu-clock -F1000 -g --call-graph fp,65528 YOUR_COMMAND_HERE
Profiling with hardware counters
As mentioned earlier, a sampling profiler is probabilistic and the data it produces is only really suitable for a broad overview. Any attempt to analyze the performance of the code at the microarchitectural level (which you might want to do if investigating how to speed up a small but frequently invoked function) will be severely hampered by the low quality of the data.
For a long time now, CPUs have been able to expose information about how they operate on the code at a very fine-grained level: how many cycles have passed, how many instructions have been processed, how many branches have been taken, how many predictions were incorrect, how many cycles were spent waiting on memory accesses, and many more. These allow a much better look at how the code behaves.
Until recently, use of these detailed counters was still sampling based -- the CPU would produce some information at a certain cadence of these counters (e.g. every 1000 instructions or cycles), which still shares a fair number of the same downsides as sampling `cpu-clock`. In order to address this downside, recent CPUs from both Intel and AMD have implemented a record of recent branches taken -- the Last Branch Record, or LBR. This is available on reasonably recent Intel architectures as well as on AMD starting with Zen 4. With LBRs, profilers are able to gather information about the cycle counts between each branch, giving an accurate and precise evaluation of the performance at a basic block or function call level.
It all sounds really nice, so why are we not using these mechanisms all the time? That's because GCP VMs don't allow access to these counters! In order to access them the code has to be run on your own hardware, or on a VM instance that provides direct access to the hardware, such as the (quite expensive) `c3-highcpu-192-metal` type.
Once everything is set up, though, the following command can gather some interesting information for you.
$ perf record -e cycles:u -b -g --call-graph fp,65528 YOUR_COMMAND_HERE
Analyzing this data is, unfortunately, not as easy as throwing it at Firefox Profiler. I'm not aware of any way to inspect the data other than using `perf report`:
$ perf report -g --branch-history
$ perf report -g --branch-stack
You may also be able to gather some interesting results if you use `--call-graph lbr` and the relevant reporting options as well.
Memory usage profiling
`neard` is a pretty memory-intensive application with many allocations occurring constantly. Although Rust makes it pretty hard to introduce memory problems, it is still possible to leak memory or to inadvertently retain too much of it.
Unfortunately, "just" throwing a random profiler at neard does not work for many reasons. Valgrind, for example, introduces enough slowdown to significantly alter the behaviour of the run, not to mention that to run it successfully and without crashing it is necessary to comment out `neard`'s use of `jemalloc`, for yet another substantial slowdown.
So far the only tool that worked well out of the box was bytehound. Using it is quite straightforward, but it needs Linux and the ability to execute privileged commands.
First, check out and build the profiler (you will need Node.js and yarn available as well):
$ git clone git@github.com:koute/bytehound.git
$ cargo build --release -p bytehound-preload
$ cargo build --release -p bytehound-cli
You will also need a build of your `neard`. Once you have that, give it some ambient capabilities necessary for profiling:
$ sudo sysctl kernel.perf_event_paranoid=0
$ sudo setcap 'CAP_SYS_ADMIN+ep' /path/to/neard
$ sudo setcap 'CAP_SYS_ADMIN+ep' /path/to/libbytehound.so
And finally run the program with the profiler enabled (in this case the `neard run` command is used):
$ /lib64/ld-linux-x86-64.so.2 --preload /path/to/libbytehound.so /path/to/neard run
Viewing the profile
Do note that you will need about twice the amount of RAM as the size of the input file in order to load it successfully.
Once enough profiling data has been gathered, terminate the program. Use the `bytehound` CLI tool to operate on the profile. I recommend `bytehound server` over directly converting to e.g. the heaptrack format using other subcommands, as each invocation will read and parse the profile data from scratch, and this process can take quite some time. `server` parses the inputs once and makes conversions and other introspection available as interactive steps.
You can use the `server` interface to inspect the profile in one of a few ways, or to download a flamegraph or a heaptrack file. Heaptrack in particular provides some interesting additional visualizations and has the ability to show memory use over time from different allocation sources.
I personally found it a bit troublesome to figure out how to open the heaptrack file from the GUI. However, `heaptrack myexportedfile` worked perfectly. I recommend opening the file exactly this way.
Troubleshooting
No output file
- Set a higher profiler logging level. Verify that the profiler gets loaded at all. If you're not seeing any log messages, then something about your working environment is preventing the loader from including the profiler library.
- Try specifying an exact output file with e.g. environment variables that the profiler reads.
Crashes
If the profiled `neard` crashes in your tests, there are a couple of things you can try to get past it. First, make sure your binary has the necessary ambient capabilities (the `setcap` command above needs to be executed every time the binary is replaced!)
Another thing to try is disabling `jemalloc`. Comment out this code in `neard/src/main.rs`:
#[global_allocator]
static ALLOC: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
The other thing you can try is different profilers, different versions of the profilers or different options made available (in particular disabling the shadow stack in bytehound), although I don't have specific recommendations here.
We don't know what exactly it is about neard that leads to it crashing under the profiler as easily as it does. I have seen valgrind report that we have libraries that are deallocating with a wrong size class, so that might be the reason? Definitely look into this if you have time.
What to profile?
This section provides some ideas on programs you could consider profiling if you are not sure where to start.
First and foremost, you could go shotgun and profile a full running `neard` node that's operating on mainnet or testnet traffic. There are a couple of ways to set up such a node: search for my-own-mainnet or forknet based tooling.
From there, either attach to a running `neard run` process or stop the running one and start a new instance under the profiler.
This approach will give you a good overview of the entire system, but at the same time the information might be so dense that it is difficult to separate the signal from the noise. There are alternatives that isolate certain components of the runtime:
Runtime::apply
Profiling just `Runtime::apply` is going to include only the work done by the transaction runtime and the contract runtime. A smart use of the tools already present in the `neard` binary can achieve that today.
First, make sure all deltas in flat storage are applied and written:
neard view-state --readwrite apply-range --shard-id $SHARD_ID --storage flat sequential
You will need to do this for all shards you're interested in profiling. Then pick a block or a range of blocks you want to re-apply and set the flat head to the specified height:
neard flat-storage move-flat-head --shard-id 0 --version 0 back --blocks 17
Finally the following commands will apply the block or blocks from the height in various different ways. Experiment with the different modes and flags to find the best fit for your task. Don't forget to run these commands under the profiler :)
# Apply blocks from current flat head to the highest known block height in sequence
# using the memtrie storage (note that after this you will need to move the flat head again)
neard view-state --readwrite apply-range --shard-id 0 --storage memtrie sequential
# Same but with flat storage
neard view-state --readwrite apply-range --shard-id 0 --storage flat sequential
# Repeatedly apply a single block at the flat head using the memtrie storage.
# Will not modify the storage on the disk.
neard view-state apply-range --shard-id 0 --storage memtrie benchmark
# Same but with flat storage
neard view-state apply-range --shard-id 0 --storage flat benchmark
Working with OpenTelemetry Traces
`neard` is instrumented in a few different ways. From the code perspective we have two major ways of instrumenting code:
- Prometheus metrics -- by computing various metrics in code and exposing them via the `prometheus` crate.
- Execution tracing -- this shows up in the code as invocations of functionality provided by the `tracing` crate.
The focus of this document is to provide information on how to effectively work with the data collected by the execution tracing approach to instrumentation.
Gathering and Viewing the Traces
Tracing the execution of the code produces two distinct types of data: spans and events. These are then either exposed as logs (representing mostly the events) in the standard output of the `neard` process, or sent onwards to an opentelemetry collector.
When deciding how to instrument a specific part of the code, consider the following decision tree (a short sketch contrasting the options follows the list):
- Do I need execution timing information? If so, use a span; otherwise
- Do I need call stack information? If so, use a span; otherwise
- Do I need to preserve information about inputs or outputs to a specific section of the code? If so, use key-values on a pre-existing span or an event; otherwise
- Use an event if it represents information applicable to a single point of execution trace.
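As a rough sketch of what that means in code (the target, span, and field names here are made up, not taken from nearcore):
fn handle_chunk(shard_id: u64, num_receipts: usize, block_hash: &str) {
    // Timing / call-stack context wanted: create a span. Inputs worth keeping
    // can be attached as key-values, either up front or into a pre-declared field.
    let span = tracing::debug_span!(
        target: "chain",
        "handle_chunk",
        shard_id,
        num_receipts = tracing::field::Empty
    );
    let _guard = span.enter();
    span.record("num_receipts", num_receipts);

    // Information that applies to a single point of the execution: emit an event.
    tracing::debug!(target: "chain", block_hash, "chunk handled");
}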
As of writing (February 2024) our codebase uses spans somewhat sparsely and relies on events heavily to expose information about the execution of the code. This is largely a historical accident due to the fact that for a long time stdout logs were the only reasonable way to extract information out of the running executable.
Today we have more tools available to us. In production environments and environments replicating production (i.e. GCP environments such as mocknet) there's the ability to push this data to Grafana Loki (for events) and Tempo (for spans and events alike), so long as the amount of data is within reason. For that reason it is critical that the event and span levels are chosen appropriately and in consideration of the frequency of invocations. In local environments developers can use projects like Jaeger, or set up the Grafana stack if they wish to use a consistent interface.
It is still more straightforward to skip all the setup necessary for tracing and rely exclusively on logs, but doing so increases noise for the other developers and makes it ever so slightly harder to extract signal in the future. Keep this trade-off in mind.
Spans
We have a style guide section on the use of Spans, please make yourself familiar with it.
Every `tracing::debug_span!()` creates a new span, and usually it is attached to its parent automatically.
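For instance, in the hedged sketch below (function and span names are made up), the inner span becomes a child of the outer one simply because it is created while the outer span is entered:
fn process_block_demo() {
    let _outer = tracing::debug_span!(target: "chain", "process_block").entered();
    apply_chunk_demo();
}

fn apply_chunk_demo() {
    // Created while `process_block` is entered, so it is parented to it automatically.
    let _inner = tracing::debug_span!(target: "chain", "apply_chunk").entered();
}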
However, a few corner cases exist.
- `do_apply_chunks()` starts 4 sub-tasks in parallel and waits for their completion. To make it work, the parent span is passed explicitly to the sub-tasks.
- Messages to actix workers. If you do nothing, then the traces are limited to work done in a single actor. But that is very restrictive and not useful enough. To work around that, each actix message gets an `opentelemetry::Context` attached. That context represents the information about the parent span. This mechanism is the reason you see the annoying `.with_span_context()` function calls whenever you send a message to an actix Actor.
- Inter-process tracing is theoretically available, but I have never tested it. The plan was to test it as soon as the Canary images get updated, therefore it most likely doesn't work. Each `PeerMessage` is injected with `TraceContext` (1, 2) and the receiving node extracts that context, so all spans generated in handling that message should be parented to the trace from another node.
- Some spans are created using `info_span!()`, but they are few and mostly for the logs. Exporting only info-level spans doesn't give any useful tracing information in Grafana.
- `actix::Actor::handle()` deserves a special note. The design choice was to provide a macro that lets us easily annotate every implementation of `actix::Actor::handle()`. This macro sets the following span attributes:
  - `actor` to the name of the struct that implements actix::Actor
  - `handler` to the name of the message struct

  And it lets you provide more span attributes. In the example, ClientActor specifies `msg_type`, which in all cases is identical to `handler`.
Configuration
The Tracing documentation page in nearone's Outline documents the steps necessary to start moving the trace data from the node to Nearone's Grafana Cloud instance. Once you set up your nodes, you can use the explore page to verify that the traces are coming through.
If the traces are not coming through quite yet, consider using the ability to set the logging configuration at runtime. Create a `$NEARD_HOME/log_config.json` file with the following contents:
{ "opentelemetry": "info" }
Or optionally with the `rust_log` setting to reduce logging on stdout:
{ "opentelemetry": "info", "rust_log": "WARN" }
and invoke `sudo pkill -HUP neard`. Double check that the collector is running as well.
Good to know: You can modify the event/span/log targets you're interested in just like when setting the `RUST_LOG` environment variable, including target filters. If you're setting verbose levels, consider selecting specific targets you're interested in too. This will help to keep trace ingest costs down.
For more information about the dynamic settings refer to the `core/dyn-configs` code in the repository.
Local development
TODO: the setup is going to depend on whether one would like to use grafana stack or just jaeger or something else. We should document setting either of these up, including the otel collector and such for a full end-to-end setup. Success criteria: running integration tests should allow you to see the traces in your grafana/jaeger. This may require code changes as well.
Using the Grafana Stack here gives the benefit of all of the visualizations that are built-in. Any dashboards you build are also portable between the local environment and the Grafana Cloud instance. Jaeger may give a nicer interactive exploration ability. You can also set up both if you wish.
Visualization
Now that the data is arriving in the databases, it is time to visualize it to determine what you want to know about the node. The only general advice I have here is to check that the data source is indeed tempo or loki.
Explore
Initial exploration is best done with Grafana's Explore tool or some other mechanism to query and display individual traces.
The query builder available in Grafana makes the process quite straightforward to start with, but is also somewhat limited. The underlying TraceQL has many more features that are not available through the builder. For example, you can query data in somewhat of a relational manner, such as the query below, which selects only spans named `process_receipt` that take longer than 50ms when run as part of `new_chunk` processing for shard 3:
{ name="new_chunk" && span.shard_id = "3" } >> { name="process_receipt" && duration > 50ms }
Good to know: When querying, keep in mind the "Options" dropdown that allows you to specify the limit of results and the format in which these results are presented! In particular, the "Traces/Spans" toggle will affect the durations shown in the result table.
Once you click on a span of interest, Grafana will open a view with the trace that contains said span, where you can inspect both the overall trace and the properties of the span:
Dashboards
Once you have arrived at an interesting query, you may be inclined to create a dashboard that summarizes the data without having to dig into individual traces and spans.
As an example, the author was interested in checking the execution speed before and after a change in a component. To make the comparison visual, the span of interest was graphed using the histogram visualization in order to obtain the following result. In this graph the Y axis displays the number of occurrences of spans whose duration falls into the bucket on the X axis.
In general most of the panels work with tracing results directly, but some of the most interesting ones do not. It is necessary to experiment with certain options and settings to get Grafana panels to start showing data. Some notable examples:
- Time series -- a "Prepare time series" data transformation with "Multi-frame time series" has to be added;
- Histogram -- make sure to use the "spans" table format option;
- Heatmap -- set the "Calculate from data" option to "Yes";
- Bar chart -- works out of the box, but the x axis won't ever be readable.
You can also add a panel that shows all the trace events in a log-like representation using the log or table visualization.
Multiple nodes
One frequently asked question is whether Grafana lets you distinguish between nodes that export tracing information.
The answer is yes.
In addition to span attributes, each span has resource attributes. There you'll find properties like `node_id` which uniquely identify a node.
- `account_id` is the `account_id` from `validator_key.json`;
- `chain_id` is taken from `genesis.json`;
- `node_id` is the public key from `node_key.json`;
- `service.name` is `account_id` if that is available, otherwise it is `node_id`.
Code Style
This document specifies the code style to use in the nearcore repository. The primary goal here is to achieve consistency, maintain it over time, and cut down on the mental overhead related to style choices.
Right now, the `nearcore` codebase is not perfectly consistent, and the style acknowledges this. It guides newly written code and serves as a tie breaker for decisions. Rewriting existing code to conform 100% to the style is not a goal. Local consistency is more important: if new code is added to a specific file, it's more important to be consistent with the file rather than with this style guide.
This is a live document, which intentionally starts out minimal. When doing code reviews, consider whether some recurring advice you give could be moved into this document.
Formatting
Use `rustfmt` for minor code formatting decisions. This rule is enforced by CI.
Rationale: `rustfmt` style is almost always good enough, even if not always perfect. The amount of bikeshedding saved by `rustfmt` far outweighs any imperfections.
Idiomatic Rust
While the most important thing is to solve the problem at hand, we strive to implement the solution in idiomatic Rust, if possible. To learn what is considered idiomatic Rust, a good place to start is the Rust API guidelines (but keep in mind that `nearcore` is not a library with a public API, so not all advice applies literally).
When in doubt, ask questions in the Rust Zulip stream or during code review.
Rationale:
- Consistency: there's usually only one idiomatic solution amidst many non-idiomatic ones.
- Predictability: you can use the APIs without consulting documentation.
- Performance, ergonomics and correctness: language idioms usually reflect learned truths, which might not be immediately obvious.
Style
This section documents all micro-rules which are not otherwise enforced by `rustfmt`.
Avoid AsRef::as_ref
When you have some concrete type, prefer `.as_str`, `.as_bytes`, `.as_path` over generic `.as_ref`. Only use `.as_ref` when the type in question is a generic `T: AsRef<U>`.
// GOOD
fn log_validator(account_id: AccountId) {
    metric_for(account_id.as_str())
        .increment()
}

// BAD
fn log_validator(account_id: AccountId) {
    metric_for(account_id.as_ref())
        .increment()
}
Note that `Option::as_ref`, `Result::as_ref` are great, do use them!
Rationale: Readability and churn-resistance. There might be more than one `AsRef<U>` implementation for a given type (with different `U`s). If a new implementation is added, some of the `.as_ref()` calls might break. See also this issue.
Avoid references to `Copy`-types
Various generic APIs in Rust often return references to data (`&T`). When `T` is a small `Copy` type like `i32`, you end up with `&i32` while many APIs expect `i32`, so a dereference has to happen somewhere. Prefer dereferencing as early as possible, typically in a pattern:
// GOOD
fn compute(map: HashMap<&str, i32>) {
    if let Some(&value) = map.get("key") {
        process(value)
    }
}
fn process(value: i32) { ... }

// BAD
fn compute(map: HashMap<&str, i32>) {
    if let Some(value) = map.get("key") {
        process(*value)
    }
}
fn process(value: i32) { ... }
Rationale: If the value is used multiple times, dereferencing in the pattern saves keystrokes. If the value is used exactly once, we just want to be consistent. Additional benefit of early deref is reduced scope of borrow.
Note that for some big `Copy` types, notably `CryptoHash`, we sometimes use references for performance reasons. As a rule of thumb, `T` is considered big if `size_of::<T>() > 2 * size_of::<usize>()`.
Prefer for loops over `for_each` and `try_for_each` methods
Iterators offer `for_each` and `try_for_each` methods which allow executing a closure over all items of the iterator. This is similar to using a for loop but comes with various complications and may lead to less readable code. Prefer using a loop rather than those methods, for example:
// GOOD
for outcome_with_id in result? {
    *total_gas_burnt =
        safe_add_gas(*total_gas_burnt, outcome_with_id.outcome.gas_burnt)?;
    outcomes.push(outcome_with_id);
}

// BAD
result?.into_iter().try_for_each(
    |outcome_with_id: ExecutionOutcomeWithId| -> Result<(), RuntimeError> {
        *total_gas_burnt =
            safe_add_gas(*total_gas_burnt, outcome_with_id.outcome.gas_burnt)?;
        outcomes.push(outcome_with_id);
        Ok(())
    },
)?;
Rationale: The `for_each` and `try_for_each` methods don't play nice with `break` and `continue` statements, nor do they mesh well with async IO (since `.await` inside of the closure isn't possible). And while `try_for_each` allows for the use of the question mark operator, one may end up having to use it twice: once inside the closure and a second time outside the call to `try_for_each`. Furthermore, usage of the functions often introduces some minor syntax noise.
There are situations when those methods may lead to more readable code. A common example is long call chains. Even then, such code may evolve with the closure growing and leading to less readable code. If the advantages of using the methods aren't clear cut, it's usually better to err on the side of a more imperative style.
Lastly, anecdotally the methods (e.g. when used with `chain` or `flat_map`) may lead to faster code. This intuitively makes sense, but it's worth keeping in mind that compilers are pretty good at optimising and in practice may generate optimal code anyway. Furthermore, optimising code for readability may be more important (especially outside of the hot path) than small performance gains.
Prefer `to_string` to `format!("{}")`
Prefer calling the `to_string` method on an object rather than passing it through `format!("{}")` if all you're doing is converting it to a `String`.
// GOOD
let hash = block_hash.to_string();
let msg = format!("{}: failed to open", path.display());

// BAD
let hash = format!("{block_hash}");
let msg = path.display() + ": failed to open";
Rationale: `to_string` is shorter to type and also faster.
Import Granularity
Group imports by module, but not deeper:
// GOOD
use std::collections::{hash_map, BTreeSet};
use std::sync::Arc;

// BAD - nested groups.
use std::{
    collections::{hash_map, BTreeSet},
    sync::Arc,
};

// BAD - not grouped together.
use std::collections::BTreeSet;
use std::collections::hash_map;
use std::sync::Arc;
This corresponds to the `"rust-analyzer.imports.granularity.group": "module"` setting in rust-analyzer (docs).
Rationale: Consistency, matches existing practice.
Import Blocks
Do not separate imports into groups with blank lines. Write a single block of imports and rely on `rustfmt` to sort them.
// GOOD
use crate::types::KnownPeerState;
use borsh::BorshSerialize;
use near_primitives::utils::to_timestamp;
use near_store::{DBCol::Peers, Store};
use rand::seq::SliceRandom;
use std::collections::HashMap;
use std::net::SocketAddr;

// BAD -- several groups of imports
use std::collections::HashMap;
use std::net::SocketAddr;

use borsh::BorshSerialize;
use rand::seq::SliceRandom;

use near_primitives::utils::to_timestamp;
use near_store::{DBCol::Peers, Store};

use crate::types::KnownPeerState;
Rationale: Consistency, ease of automatic enforcement. Today, stable rustfmt can't split imports into groups automatically, and doing that manually consistently is a chore.
Derives
When deriving an implementation of a trait, specify a full path to the traits provided by the external libraries:
// GOOD
#[derive(Copy, Clone, serde::Serialize, thiserror::Error, strum::Display)]
struct Grapefruit;

// BAD
use serde::Serialize;
use thiserror::Error;
use strum::Display;

#[derive(Copy, Clone, Serialize, Error, Display)]
struct Banana;
As an exception to this rule, it is okay to use either style when the derived trait already includes the name of the library (as would be the case for `borsh::BorshSerialize`).
Rationale: Specifying a full path to the externally provided derivations here makes it straightforward to differentiate between the built-in derivations and those provided by the external crates. The surprise factor for derivations sharing a name with the standard library traits (`Display`) is reduced, and it also acts as a natural mechanism to tell apart names prone to collision (`Serialize`), all without needing to look up the list of imports.
Arithmetic integer operations
Use methods with an appropriate overflow handling over plain arithmetic operators (`+`, `-`, `*`, `/`, `%`) when dealing with integers.
// GOOD
a.wrapping_add(b);
c.saturating_sub(2);
d.widening_mul(3); // NB: unstable at the time of writing
e.overflowing_div(5);
f.checked_rem(7);
// BAD
a + b
c - 2
d * 3
e / 5
f % 7
If you're confident the arithmetic operation cannot fail, `x.checked_[add|sub|mul|div](y).expect("explanation why the operation is safe")` is a great alternative, as it neatly documents not just the infallibility, but also why that is the case.
This convention may be enforced by the `clippy::arithmetic_side_effects` and `clippy::integer_arithmetic` lints.
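For instance, a crate can opt into one of these lints with a crate-level attribute (a hedged sketch; whether this is enabled via attributes or clippy configuration in CI is a separate, project-level decision):
// In lib.rs or main.rs of the crate.
#![warn(clippy::arithmetic_side_effects)]

fn add_gas(total: u64, delta: u64) -> u64 {
    // `total + delta` would trip the lint; the checked variant does not.
    total.checked_add(delta).expect("gas counter overflow")
}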
Rationale: By default the outcome of an overflowing computation in Rust depends on a few factors, most notably the compilation flags used. The quick explanation is that in debug mode the computations may panic (cause side effects) if the result has overflowed, and when built with optimizations enabled, these computations will wrap-around instead.
For nearcore and neard we have opted to enable the panicking behaviour regardless of the optimization level. By doing this we hope to prevent accidental stabilization of protocol mis-features that depend on incorrect handling of these overflows, or similarly scary silent bugs.
The downside to this approach is that any such arithmetic operation may now cause a node to crash, much like indexing a vector with `a[idx]` may cause a crash when `idx` is out-of-bounds. Unlike with indexing, however, developers and reviewers are not used to treating integer arithmetic operations with due suspicion. Having to make a choice, and explicitly spell out how an overflow case ought to be handled, will result in code that is easier to review and understand, and a more resilient project overall.
Standard Naming
- Use `-` rather than `_` in crate names and in corresponding folder names.
- Avoid single-letter variable names, especially in long functions. Common `i`, `j` etc. loop variables are somewhat of an exception, but since Rust encourages the use of iterators those cases aren't that common anyway.
- Follow standard Rust naming patterns such as:
  - Don't use a `get_` prefix for getter methods. A getter method is one which returns (a reference to) a field of an object.
  - Use a `set_` prefix for setter methods. An exception are builder objects, which may use a different naming style.
  - Use an `into_` prefix for methods which consume `self` and a `to_` prefix for methods which don't.
- Use `get_block_header` rather than `get_header` for methods which return a block header.
- Don't use a `_by_hash` suffix for methods which look up chain objects (blocks, chunks, block headers etc.) by their hash (i.e. their primary identifier).
- Use `_by_height` and similar suffixes for methods which look up chain objects (blocks, chunks, block headers etc.) by their height or other property which is not their hash.
Rationale: Consistency.
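For illustration, a small sketch (the types and methods are made up) showing these patterns together:
struct BlockHeader { height: u64 }

struct Chain { head: BlockHeader }

impl Chain {
    // Getter: no `get_` prefix, returns (a reference to) a field.
    fn head(&self) -> &BlockHeader { &self.head }

    // Setter: `set_` prefix.
    fn set_head(&mut self, header: BlockHeader) { self.head = header; }

    // `to_` prefix: does not consume `self`.
    fn to_summary(&self) -> String { format!("head at height {}", self.head.height) }

    // `into_` prefix: consumes `self`.
    fn into_head(self) -> BlockHeader { self.head }
}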
Documentation
When writing documentation in `.md` files, wrap lines at approximately 80 columns.
<!-- GOOD -->
Manually reflowing paragraphs is tedious. Luckily, most editors have this
functionality built in or available via extensions. For example, in Emacs you
can use `fill-paragraph` (<kbd>M-q</kbd>), (neo)vim allows rewrapping with `gq`,
and VS Code has `stkb.rewrap` extension.
<!-- BAD -->
One sentence per-line is also occasionally used for technical writing.
We avoid that format though.
While convenient for editing, it may be poorly legible in unrendered form
<!-- BAD -->
Definitely don't use soft-wrapping. While markdown mostly ignores source level line breaks, relying on soft wrap makes the source completely unreadable, especially on modern wide displays.
Tracing
When emitting events and spans with `tracing`, prefer adding variable data via `tracing`'s field mechanism.
// GOOD
debug!(
    target: "client",
    validator_id = self.client.validator_signer.get().map(|vs| {
        tracing::field::display(vs.validator_id())
    }),
    %hash,
    "block.previous_hash" = %block.header().prev_hash(),
    "block.height" = block.header().height(),
    %peer_id,
    was_requested,
    "Received block",
);
The most apparent violation of this rule is when the event message utilizes any form of formatting, as seen in the following example:
// BAD
debug!(
    target: "client",
    "{:?} Received block {} <- {} at {} from {}, requested: {}",
    self.client.validator_signer.get().map(|vs| vs.validator_id()),
    hash,
    block.header().prev_hash(),
    block.header().height(),
    peer_id,
    was_requested
);
Always specify the `target` explicitly. A good default value to use is the crate name, or the module path (e.g. `chain::client`) so that events and spans common to a topic can be grouped together. This grouping can later be used for customizing which events to output.
Rationale: This makes the events structured -- one of the major value-add propositions of the tracing ecosystem. Structured events allow for immediately actionable data without additional post-processing, especially when using some of the more advanced tracing subscribers. Of particular interest would be those that output events as JSON, or those that publish data to distributed event collection systems such as opentelemetry. Maintaining this rule will also usually result in faster execution (when logs at the relevant level are enabled).
Spans
Use the spans to introduce context and grouping to and between events instead of manually adding such information as part of the events themselves. Most of the subscribers ingesting spans also provide a built-in timing facility, so prefer using spans for measuring the amount of time a section of code needs to execute.
Give spans simple names that make them both easy to trace back to code and easy to find in logs or other tools ingesting the span data. If a span begins at the top of a function, prefer giving it the name of that function; otherwise prefer a `snake_case` name.
When instrumenting asynchronous functions, the `#[tracing::instrument]` macro or `Future::instrument` is required. Using `Span::entered` or a similar method that is not aware of yield points will result in incorrect span data and could lead to difficult-to-troubleshoot issues such as stack overflows.
Always explicitly specify the `level`, `target`, and `skip_all` options and do not rely on the default values. `skip_all` avoids adding all function arguments as span fields, which can lead to recording potentially unnecessary and expensive information. Carefully consider which information needs recording and the cost of recording the information when using the `fields` option.
#[tracing::instrument(
    level = "trace",
    target = "network",
    "handle_sync_routing_table",
    skip_all
)]
async fn handle_sync_routing_table(
    clock: &time::Clock,
    network_state: &Arc<NetworkState>,
    conn: Arc<connection::Connection>,
    rtu: RoutingTableUpdate,
) {
    ...
}
In regular synchronous code it is fine to use the regular span API if you need to instrument portions of a function without affecting the code structure:
fn compile_and_serialize_wasmer(code: &[u8]) -> Result<wasmer::Module> {
    // Some code...
    {
        let _span = tracing::debug_span!(target: "vm", "compile_wasmer").entered();
        // ...
        // _span will be dropped when this scope ends, terminating the span created above.
        // You can also `drop` it manually, to end the span early with `drop(_span)`.
    }
    // Some more code...
}
Rationale: Much as with events, this makes the information provided by spans structured and contextual. This information can then be output to tooling in an industry standard format, and can be interpreted by an extensive ecosystem of `tracing` subscribers.
Event and span levels
The `INFO` level is enabled by default, use it for information useful for node operators. The `DEBUG` level is enabled on the canary nodes, use it for information useful in debugging testnet failures. The `TRACE` level is not generally enabled, use it for arbitrary debug output.
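A rough sketch of how these levels tend to be used (the target and messages are made up):
// Useful to node operators; enabled by default.
tracing::info!(target: "chain", "garbage collection finished");
// Useful when debugging testnet failures; enabled on canary nodes.
tracing::debug!(target: "chain", "header validation slower than expected");
// Arbitrary debug output; not generally enabled.
tracing::trace!(target: "chain", "entering block processing loop");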
Metrics
Consider adding metrics to new functionality. For example, how often each type of error was triggered, how often each message type was processed.
Rationale: Metrics are cheap to increment, and they often provide a significant insight into operation of the code, almost as much as logging. But unlike logging metrics don't incur a significant runtime cost.
Naming
Prefix all `nearcore` metrics with `near_`. Follow the Prometheus naming convention for new metrics.
Rationale: The `near_` prefix makes it trivial to separate metrics exported by `nearcore` from other metrics, such as metrics about the state of the machine that runs `neard`.
Performance
In most cases incrementing a metric is cheap enough never to give it a second thought. However accessing a metric with labels on a hot path needs to be done carefully.
If a label is based on an integer, use a faster way of converting an integer to the label, such as the `itoa` crate.
For hot code paths, re-use results of `with_label_values()` as much as possible.
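For example, a hedged sketch of that hot-path pattern using the `prometheus` crate (the metric and labels are made up): resolve the labels once and reuse the returned counter inside the loop.
use prometheus::{IntCounter, IntCounterVec};

fn count_transactions(batch_sizes: &[u64], messages_by_type: &IntCounterVec) {
    // One label lookup up front instead of one per iteration of the hot loop.
    let tx_counter: IntCounter = messages_by_type.with_label_values(&["transaction"]);
    for batch in batch_sizes {
        tx_counter.inc_by(*batch);
    }
}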
Rationale: We've encountered issues caused by the runtime costs of incrementing metrics before. Avoid runtime costs of incrementing metrics too often.
Documentation
This chapter describes nearcore's approach to documentation. There are three primary types of documentation to keep in mind:
- The NEAR Protocol Specification (source code) is the formal description of the NEAR protocol. The reference nearcore implementation and any other NEAR client implementations must follow this specification.
- User docs (source code) explain what NEAR is and how to participate in the network. In particular, they contain information pertinent to the users of NEAR: validators and smart contract developers.
- Documentation for nearcore developers (source code) is the book you are reading right now! The target audience here is the contributors to the main implementation of the NEAR protocol (nearcore).
Overview
The bulk of the internal docs is within this book. If you want to write some kind of document, add it here! The architecture and practices chapters are intended for somewhat up-to-date normative documents. The misc chapter holds everything else.
This book is not intended for user-facing documentation, so don't worry about proper English, typos, or beautiful diagrams -- just write stuff! It can easily be improved over time with pull requests. For docs, we use a lightweight review process and try to merge any improvement as quickly as possible. Rather than blocking a PR on some stylistic changes, just merge it and submit a follow-up.
Note the "edit" button at the top-right corner -- super useful for fixing any typos you spot!
In addition to the book, we also have some "inline" documentation in the code. For Rust, it is customary to have a per-crate `README.md` file and include it as a doc comment via `#![doc = include_str!("../README.md")]` in `lib.rs`. We don't require every item to be documented, but we certainly encourage documenting as much as possible. If you spend some time refactoring or fixing a function, consider adding a doc comment (`///`) to it as a drive-by improvement.
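As a sketch of what that looks like in practice (the function here is made up):
// lib.rs
#![doc = include_str!("../README.md")]

/// Returns true if `height` is strictly above the current chain tip.
///
/// This is the kind of short doc comment worth adding as a drive-by improvement.
pub fn is_above_tip(height: u64, tip_height: u64) -> bool {
    height > tip_height
}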
We currently don't render `rustdoc`, see #7836.
Book How To
We use mdBook to render a bunch of markdown files as a static website with a table of contents, search and themes. Full docs are here, but the basics are very simple.
To add a new page to the book:
- Add a
.md
file somewhere in the./docs
folder. - Add a link to that page to the
SUMMARY.md
. - Submit a PR (again, we promise to merge it without much ceremony).
The doc itself is in vanilla markdown.
To render documentation locally:
# Install mdBook
$ cargo install mdbook
$ mdbook serve --open ./docs
This will generate the book from the docs folder, open it in a browser and start a file watcher to rebuild the book every time the source files change.
Note that GitHub's default rendering mostly works just as well, so you don't need to go out of your way to preview your changes when drafting a page or reviewing pull requests to this book.
The book is deployed via the book GitHub Action workflow. This workflow runs mdBook and then deploys the result to GitHub Pages.
For internal docs, you often want to have pretty pictures. We don't currently have a recommended workflow, but here are some tips:
- Don't add binary media files to Git to avoid inflating repository size. Rather, upload images as comments to this super-secret issue #7821, and then link to the images as `![image](https://user-images.githubusercontent.com/1711539/195626792-7697129b-7f9c-4953-b939-0b9bcacaf72c.png)`. Use a single comment per page with multiple images.
- Google Docs is an OK way to create technical drawings, you can add a link to the doc with source to that secret issue as well.
- There's some momentum around using mermaid.js for diagramming, and there's an appropriate plugin for that. Consider if that's something you might want to use.
Tracking issues
nearcore
uses so-called "tracking issues" to coordinate larger pieces of work
(e.g. implementation of new NEPs). Such issues are tagged with the
C-tracking-issue
label.
The goal of tracking issues is to serve as a coordination point. They can help new contributors and other interested parties come up to speed with the current state of projects. As such, they should link to things like design docs, todo-lists of sub-issues, existing implementation PRs, etc.
One can further use tracking issues to:
- get a feeling for what's happening in nearcore by looking at the set of open tracking issues.
- find larger efforts to contribute to, as tracking issues usually contain up-for-grabs to-do lists.
- follow the progress of specific features by subscribing to the issue on GitHub.
If you are leading or participating in a larger effort, please create a tracking issue for your work.
Guidelines
- Tracking issues should be maintained in the nearcore repository. If a project is security-sensitive, its tracking issue should be maintained in the nearcore-private repository instead.
- The issues should be kept up-to-date. At a minimum, all new context should be added as comments, but preferably the original description should be edited to reflect the current status.
- The issues should contain links to all the relevant design documents which should also be kept up-to-date.
- The issues should link to any relevant NEP if applicable.
- The issues should contain a list of to-do tasks that should be kept up-to-date as new work items are discovered and other items are done. This helps others gauge progress and helps lower the barrier of entry for others to participate.
- The issues should contain links to relevant Zulip discussions. Prefer open forums like Zulip for discussions. When necessary, closed forums like video calls can also be used but care should be taken to document a summary of the discussions.
- For security-sensitive discussions, use the appropriate private Zulip streams.
This issue is a good example of how tracking issues should be maintained.
Background
The idea of tracking issues is also used to track project work in the Rust language. See this post for a rough description and these issues for how they are used in Rust.
Security Vulnerabilities
The intended audience of the information presented here is developers working on the implementation of NEAR.
Are you a security researcher? Please report security vulnerabilities to security@near.org.
As nearcore is open source, all of its issues and pull requests are publicly tracked on GitHub. However, from time to time a security-sensitive issue is discovered that cannot be tracked publicly. In those cases we still want the development process to stay as close to the usual one as possible. To enable this, below is the high-level process for working on security-sensitive issues.
-
There is a private fork of nearcore on GitHub. Access to this repository is restricted to the set of people who are trusted to work on and have knowledge about security-sensitive issues in nearcore.
This repository can be manually synced with the public nearcore repository using the following commands:
$ git remote add nearcore-public git@github.com:near/nearcore
$ git remote add nearcore-private git@github.com:near/nearcore-private
$ git fetch nearcore-public
$ git push nearcore-private nearcore-public/master:master
-
All security-sensitive issues must be created on the private nearcore repository. You must also assign one of the
[P-S0, P-S1]
labels to the issue to indicate the severity of the issue. The two criteria to use to help you judge the severity are the ease of carrying out the attack and the impact of the attack. An attack that is easy to do or can have a huge impact should have theP-S0
label andP-S1
otherwise. -
All security-sensitive pull requests should also be created on the private nearcore repository. Note that once a PR has been approved, it should not be merged into the private repository. Instead, it should be first merged into the public repository and then the private fork should be updated using the steps above.
-
Once work on a security issue is finished, it needs to be deployed to all the impacted networks. Please contact the node team for help with this.
Fast Builds
nearcore is implemented in Rust and is a fairly sizable project, so it takes a while to build. This chapter collects various tips to make the development process faster.
Optimizing build times is a bit of a black art, so please do benchmarks on your machine to verify that the improvements work for you. Making a typo while changing some configuration, which then silently prevents the build times from improving, is an extremely common failure mode!
Rust Perf Book contains a section on compilation times as well!
Release Builds and Link Time Optimization
cargo build --release
is obviously slower than cargo build
. We enable full LTO (link-time optimization), so our -r builds are very slow, use a lot of RAM, and don't fully utilize the available parallelism.
As debug builds are much too slow at runtime for many purposes, we have a custom
profile --profile dev-release
which is equivalent to -r
, except that the
time-consuming options such as LTO are disabled, and debug assertions are enabled.
Use --profile dev-release
for most local development, or when connecting a
locally built node to a network. Use -r
for production, or if you want to get
absolute performance numbers.
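For example, a typical local build with this profile looks like:

```bash
$ cargo build -p neard --profile dev-release
```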
Linker
By default, rustc
uses the default system linker, which tends to be quite
slow. Using lld
(LLVM linker) or mold
(very new, very fast linker) provides
big wins for many setups.
We don't know of an official source of truth for using alternative linkers; this comment is the usual reference.
Usually, adding
[build]
rustflags = ["-C", "link-arg=-fuse-ld=lld"]
to ~/.cargo/config
is the most convenient approach.
lld
itself can be installed with sudo apt install lld
(or the equivalent in
the distro/package manager of your choice).
Prebuilt RocksDB
By default, we compile RocksDB (a C++ project) from source during the neard
build. By linking to a prebuilt copy of RocksDB this work can be avoided
entirely. This is a huge win, especially if you clean the ./target
directory
frequently.
To use a prebuilt RocksDB, set the ROCKSDB_LIB_DIR
environment variable to
a location containing librocksdb.a
:
$ export ROCKSDB_LIB_DIR=/usr/lib/x86_64-linux-gnu
$ cargo build -p neard
Note that the system must provide a recent version of the library which, depending on which operating system you're using, may require installing packages from a testing branch. For example, on Debian it requires installing
librocksdb-dev
from the experimental
repository:
Note: this process will look different depending on which distro you are using. Please refer to the documentation of the package manager you are using.
echo 'deb http://ftp.debian.org/debian experimental main contrib non-free' |
sudo tee -a /etc/apt/sources.list
sudo apt update
sudo apt -t experimental install librocksdb-dev
ROCKSDB_LIB_DIR=/usr/lib/x86_64-linux-gnu
export ROCKSDB_LIB_DIR
Global Compilation Cache
By default, Rust compiles incrementally, with the incremental cache and
intermediate outputs stored in the project-local ./target
directory.
The sccache
utility can be used to share
these artifacts between machines or checkouts within the same machine. sccache
works by intercepting calls to rustc
and will fetch the cached outputs from
the global cache whenever possible. This tool can be set up as such:
$ cargo install sccache
$ export RUSTC_WRAPPER="sccache"
$ export SCCACHE_CACHE_SIZE="30G"
$ cargo build -p neard
Refer to the project's README for further configuration options.
IDEs Are Bad For Environment Handling
Generally, the knobs in this section are controlled either via global
configuration in ~/.cargo/config
or environment variables.
Environment variables are notoriously easy to lose, especially if you are working both from a command line and a graphical IDE. Double-check that the environment within which builds are executed is identical to avoid nasty failure modes such as full cache invalidation when switching from the CLI to an IDE or vice-versa.
direnv
sometimes can be used to conveniently manage
project-specific environment variables.
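For example, a minimal .envrc for a nearcore checkout, assuming you use the knobs from the sections above (adjust the RocksDB path to your system), might look like:

```bash
# .envrc -- loaded automatically by direnv when entering the project directory
export ROCKSDB_LIB_DIR=/usr/lib/x86_64-linux-gnu
export RUSTC_WRAPPER=sccache
export SCCACHE_CACHE_SIZE=30G
```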
General principles
- Every PR needs to have test coverage in place. Sending the code change and deferring tests for a future change is not acceptable.
- Tests need to either be sufficiently simple to follow or have good documentation to explain why certain actions are made and conditions are expected.
- When implementing a PR, make sure to run the new tests with the change disabled and confirm that they fail! It is extremely common to have tests that pass without the change that is being tested.
- The general rule of thumb for a reviewer is to first review the tests, and ensure that they can convince themselves that any code change that passes the tests must be correct. Only then should the code itself be reviewed.
- Make the assertions in the tests as specific as possible, but do not turn the tests into change-detectors of the concrete implementation (assert only properties which are required for correctness). For example, do not write assert!(result.is_err()); expect the specific error instead, as in the sketch below.
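A minimal sketch of the difference (the parser and error type below are invented for illustration; they are not nearcore APIs):

```rust
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingField(&'static str),
}

// Illustrative parser: expects input of the form "epoch_length=<number>".
fn parse_epoch_length(input: &str) -> Result<u64, ParseError> {
    input
        .strip_prefix("epoch_length=")
        .and_then(|v| v.parse().ok())
        .ok_or(ParseError::MissingField("epoch_length"))
}

#[test]
fn rejects_malformed_input() {
    let result = parse_epoch_length("not-a-config");
    // Too loose: this passes for *any* error, including unrelated ones.
    // assert!(result.is_err());

    // Better: assert the specific error that correctness actually requires.
    assert_eq!(result, Err(ParseError::MissingField("epoch_length")));
}
```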
Tests hierarchy
In the NEAR Reference Client we largely split tests into three categories:
- Relatively cheap sanity or fast fuzz tests: It includes all the
#[test]
Rust tests not decorated by features. Our repo is configured in such a way that all such tests are run on every PR and failing at least one of them is blocking the PR from being merged.
To run such tests locally run cargo nextest run --all
.
It requires the nextest harness, which can be installed by running cargo install cargo-nextest first.
- Expensive tests: This includes all the fuzzy tests that run many iterations,
as well as tests that spin up multiple nodes and run them until they reach a
certain condition. Such tests are decorated with
#[cfg(feature="expensive-tests")]
. It is not trivial to enable features that are not declared in the top-level crate, and thus the easiest way to run such tests is to enable all the features by passing--all-features
tocargo nextest run
, e.g:
cargo nextest run --package near-client -E 'test(=tests::cross_shard_tx::test_cross_shard_tx)' --all-features
- Python tests: We have an infrastructure to spin up nodes, both locally and
remotely, in python, and interact with them using RPC. The infrastructure and
the tests are located in the
pytest
folder. The infrastructure is relatively straightforward, see for exampleblock_production.py
here. See theTest infrastructure
section below for details.
Expensive and python tests are not part of CI, and are run by a custom nightly runner. The results of the latest runs are available here. Today, test runs launch approximately every 5-6 hours. For the latest results look at the second run, since the first one has some tests still scheduled to run.
Test infrastructure
Different levels of the reference implementation have different infrastructures available to test them.
Client
The Client is separated from the runtime via a RuntimeAdapter
trait.
In production, it uses NightshadeRuntime
which uses real runtime and epoch managers.
To test the client without instantiating runtime and epoch manager, we have a mock runtime
KeyValueRuntime
.
Most of the tests in the client work by setting up either a single node (via
setup_mock()
) or multiple nodes (via setup_mock_all_validators()
) and then
launching the nodes and waiting for a particular message to occur, with a
predefined timeout.
For the most basic example of using this infrastructure see produce_two_blocks
in
tests/process_blocks.rs
.
- The callback (Box::new(move |msg, _ctx, _| { ...) is what is executed whenever the client sends a message. The return value of the callback is sent back to the client, which allows for testing relatively complex scenarios. The tests generally expect a particular message to occur, in this case, the tests expect two blocks to be produced.
- System::current().stop(); is the way to stop the test and mark it as passed.
- near_network::test_utils::wait_or_panic(5000); is how the timeout for the test is set (in milliseconds).
For an example of a test that launches multiple nodes, see
chunks_produced_and_distributed_common
in
tests/chunks_management.rs.
The setup_mock_all_validators
function is the key piece of infrastructure here.
Runtime
Tests for Runtime are listed in tests/test_cases_runtime.rs.
To run a test, usually, a mock RuntimeNode
is created via create_runtime_node()
.
In its constructor, the Runtime
is created in the
get_runtime_and_trie_from_genesis
function.
Inside a test, an abstracted User
is used for sending specific actions to the
runtime client. The helper functions function_call
, deploy_contract
, etc.
eventually lead to the Runtime.apply
method call.
When setting account names while playing with transactions, use the default names
alice_account
, bob_account
, eve_dot_alice_account
, etc.
Network
Chain, Epoch Manager, Runtime and other low-level changes
When building new features in the chain
, epoch_manager
and network
crates,
make sure to keep new components sufficiently abstract so that they can be tested
without relying on other components.
For example, see tests for doomslug here, for network cache here, or for promises in runtime here.
Python tests
See this page for detailed coverage of how to write a python test.
We have a python library that allows one to create and run python tests.
To run python tests from the nearcore repo for the first time, do the following:
cd pytest
virtualenv .env --python=python3
. .env/bin/activate
pip install -r requirements.txt
python tests/sanity/block_production.py
This will create a python virtual environment, activate the environment, install
all the required packages specified in the requirements.txt
file and run the
tests/sanity/block_production.py
file. After the first time, we only need to
activate the environment and can then run the tests:
cd pytest
. .env/bin/activate
python tests/sanity/block_production.py
Use pytest/tests/sanity/block_production.py
as the basic example of starting a
cluster with multiple nodes, and doing RPC calls.
See pytest/tests/sanity/deploy_call_smart_contract.py
to see how contracts can
be deployed, or transactions called.
See pytest/tests/sanity/staking1.py
to see how staking transactions can be
issued.
See pytest/tests/sanity/state_sync.py
to see how to delay the launch of the
whole cluster by using init_cluster
instead of start_cluster
, and then
launching nodes manually.
Enabling adversarial behavior
To allow testing adversarial behavior, or generally, behaviors that a node should
not normally exercise, we have certain features in the code decorated with
#[cfg(feature="adversarial")]
. The binary is normally compiled with the feature disabled; when compiled with the feature enabled, it prints a noticeable warning on launch.
The nightly runner runs all the python tests against the binary compiled with the feature enabled, and thus the python tests can make the binary perform actions that it normally would not perform.
The actions can include lying about the known chain height, producing multiple blocks for the same height, or disabling doomslug.
See all the tests under pytest/tests/adversarial
for some examples.
Python Tests
To simplify writing integration tests for nearcore we have a python infrastructure that allows writing a large variety of tests that run small local clusters, remote clusters, or run against full-scale live deployments.
Such tests are written in python and not in Rust (in which nearcore itself, and most of the sanity and fuzz tests, are written) due to the availability of libraries to easily connect to remote nodes and orchestrate cloud instances.
Nearcore itself has several features guarded by a feature flag that allows the python tests to invoke behaviors otherwise impossible to be exercised by an honest actor.
Basics
The infrastructure is located in {nearcore}/pytest/lib
and the tests themselves
are in subdirectories of {nearcore}/pytest/tests
. To prepare the local machine to
run the tests you'd need python3 (python 3.7), and have several dependencies
installed, for which we recommend using virtualenv:
cd pytest
virtualenv .env --python=python3
. .env/bin/activate
pip install -r requirements.txt
The tests are expected to be run from the pytest
dir itself. For example, once
the virtualenv is configured:
cd pytest
. .env/bin/activate
python tests/sanity/block_production.py
This will run the most basic tests that spin up a small cluster locally and wait until it produces several blocks.
Compiling the client for tests
The local tests by default expect the binary to be in the default location for a
debug build ({nearcore}/target/debug
). Some tests might also expect
test-specific features guarded by a feature flag to be available. To compile the
binary with such features run:
cargo build -p neard --features=adversarial
The feature is called adversarial
to highlight that the many functions it enables,
outside of tests, would constitute malicious behavior. The node compiled with
such a flag will not start unless an environment variable ADVERSARY_CONSENT=1
is set and prints a noticeable warning when it starts, thus minimizing the chance
that an honest participant accidentally launches a node compiled with such
functionality.
You can change the way the tests run (locally or using Google Cloud), and where the local tests look for the binary by supplying a config file. For example, if you want to run tests against a release build, you can create a file with the following config:
{"local": true, "near_root": "../target/release/"}
and run the test with the following command:
NEAR_PYTEST_CONFIG=<path to config> python tests/sanity/block_production.py
Writing tests
We differentiate between "regular" tests, which spin up their own cluster, either locally or on the cloud, and "mocknet" tests, which run against an existing live deployment of NEAR.
In both cases, the test starts by importing the infrastructure and either starting or connecting to a cluster.
Starting a cluster
In the simplest case a regular test starts by starting a cluster. The cluster will run locally by default but can be spun up on the cloud by supplying the corresponding config.
import sys
sys.path.append('lib')
from cluster import start_cluster
nodes = start_cluster(4, 0, 4, None, [["epoch_length", 10], ["block_producer_kickout_threshold", 80]], {})
In the example above the first three parameters are num_validating_nodes, num_observers and num_shards. The fourth parameter is a config, which generally should be None, in which case the config is picked up from the environment variable as shown above.
start_cluster
will spin up num_validating_nodes
nodes that are block
producers (with pre-staked tokens), num_observers
non-validating nodes and
will configure the system to have num_shards
shards. The fifth argument
changes the genesis config. Each element is a list of some length n
where the
first n-1
elements are a path in the genesis JSON file, and the last element
is the value. You'd often want to significantly reduce the epoch length, so that
your test triggers epoch switches, and reduce the kick-out threshold since with
shorter epochs it is easier for a block producer to get kicked out.
The last parameter is a dictionary from the node ordinal to changes to their local config.
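For instance (the values and variable names below are purely illustrative), the genesis overrides and per-node config overrides could look like this:

```python
# Each genesis override is a path into genesis.json plus the new value.
genesis_config_changes = [
    ["epoch_length", 10],
    ["block_producer_kickout_threshold", 80],
    # Nested path: sets genesis["validators"][0]["amount"].
    ["validators", 0, "amount", "110000000000000000000000000000000"],
]

# Per-node config.json overrides, keyed by node ordinal.
client_config_changes = {
    0: {"tracked_shards": [0]},
}
```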
Note that start_cluster
spins up all the nodes right away. Some tests (e.g.
tests that test syncing) might want to configure the nodes but delay their
start. In such a case you will initialize the cluster by calling
init_cluster
and will run the nodes manually, for example, see
state_sync.py
Connecting to a mocknet
Nodes that run against a mocknet would connect to an existing cluster instead of running their own.
import sys
sys.path.append('lib')
from cluster import connect_to_mocknet
nodes, accounts = connect_to_mocknet(None)
The only parameter is a config, with None
meaning to use the config from the
environment variable. The config should have the following format:
{
"nodes": [
{"ip": "(some_ip)", "port": 3030},
{"ip": "(some_ip)", "port": 3030},
{"ip": "(some_ip)", "port": 3030},
{"ip": "(some_ip)", "port": 3030}
],
"accounts": [
{"account_id": "node1", "pk": "ed25519:<public key>", "sk": "edd25519:<secret key>"},
{"account_id": "node2", "pk": "ed25519:<public key>", "sk": "edd25519:<secret key>"}
]
}
Manipulating nodes
The nodes returned by start_cluster
and init_cluster
have certain
convenience functions. You can see the full interface in
{nearcore}/pytest/lib/cluster.py
.
start(boot_public_key, (boot_ip, boot_port))
starts the node. If both
arguments are None
, the node will start as a boot node (note that the concept
of a "boot node" is relatively vague in a decentralized system, and from the
perspective of the tests the only requirement is that the graph of "node A
booted from node B" is connected).
The particular way to get the boot_ip
and boot_port
when launching node1
with node2
being its boot node is the following:
node1.start(node2.node_key.pk, node2.addr())
kill()
shuts down the node by sending it SIGKILL
reset_data()
cleans up the data dir, which could be handy between the calls to
kill
and start
to see if a node can start from a clean state.
Nodes on the mocknet do not expose start
, kill
and reset_data
.
Issuing RPC calls
Nodes in both regular and mocknet tests expose an interface to issue RPC calls.
In the most generic case, one can just issue raw JSON RPC calls by calling the
json_rpc
method:
validator_info = nodes[0].json_rpc('validators', [<some block_hash>])
For the most popular calls, there are convenience functions:
- send_tx sends a signed transaction asynchronously
- send_tx_and_wait sends a signed transaction synchronously
- get_status returns the current status (the output of the /status endpoint), which contains e.g. the last block hash and height
- get_tx returns a transaction by the transaction hash and the recipient ID.
See all the methods in {nearcore}/pytest/lib/cluster.py
after the definition
of the json_rpc
method.
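For example, fetching the current head of a node via the convenience wrapper looks roughly like this (the /status field names follow the RPC response used in the next section; treat the exact keys as an assumption if your version differs):

```python
status = nodes[0].get_status()
latest_height = status['sync_info']['latest_block_height']
latest_hash = status['sync_info']['latest_block_hash']
```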
Signing and sending transactions
There are two ways to send a transaction. A synchronous way (send_tx_and_wait
)
sends a tx and blocks the test execution until either the TX is finished, or the
timeout is hit. An asynchronous way (send_tx
+ get_tx
) sends a TX and then
verifies its result later. Here's an end-to-end example of sending a
transaction:
# the tx needs to include one of the recent hashes
last_block_hash = nodes[0].get_status()['sync_info']['latest_block_hash']
last_block_hash_decoded = base58.b58decode(last_block_hash.encode('utf8'))
# sign the actual transaction
# `fr` and `to` in this case are instances of class `Key`.
# In mocknet tests the list of `Key`s for all the accounts is returned by `connect_to_mocknet`
# In regular tests each node is associated with a single account, and its key is stored in the
# `signer_key` field (e.g. `nodes[0].signer_key`)
# `15` in the example below is the nonce. Nonces need to increase for consecutive transactions
# for the same sender account.
tx = sign_payment_tx(fr, to.account_id, 100, 15, last_block_hash_decoded)
# Sending the transaction synchronously. `10` is the timeout in seconds. If after 10 seconds the
# outcome is not ready, throws an exception
if want_sync:
outcome = nodes[0].send_tx_and_wait(tx, 10)
# Sending the transaction asynchronously.
if want_async:
tx_hash = nodes[from_ordinal % len(nodes)].send_tx(tx)['result']
# and then sometime later fetch the result...
resp = nodes[0].get_tx(tx_hash, to.account_id, timeout=1)
# and see if the tx has finished
finished = 'result' in resp and 'receipts_outcome' in resp['result'] and len(resp['result']['receipts_outcome']) > 0
See rpc_tx_forwarding.py for an example of signing and submitting a transaction.
Adversarial behavior
Some tests need certain nodes in the cluster to exercise behavior that cannot be invoked by an honest node. For such tests, we provide functionality that is protected by an "adversarial" feature flag.
It's an advanced feature, and more thorough documentation is a TODO. Most of the
tests that depend on the feature flag enabled are under
{nearcore}/pytest/tests/adversarial
, refer to them for how such features can
be used. Search for code in the nearcore
codebase guarded by the "adversarial"
feature flag for an example of how such features are added and exposed.
Interfering with the network
We have a library that allows running a proxy in front of each node that would
intercept all the messages between nodes, deserialize them in python and run a
handler on each one. The handler can then either let the message pass (return True
), drop it (return False
) or replace it (return <new message>
).
This technique can be used to both interfere with the network (by dropping or
replacing messages), and to inspect messages that flow through the network
without interfering with them. For the latter, note that the handler for each node
runs in a separate Process
, and thus you need to use multiprocessing
primitives if you want the handlers to exchange information with the main test
process, or between each other.
See the tests that match tests/sanity/proxy_*.py
for examples.
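As a rough sketch of the idea (the concrete handler interface lives in the proxy library under pytest/lib, so treat the exact signature and message fields below as assumptions), a handler that drops all block messages while counting them could look like:

```python
from multiprocessing import Value

# Shared counter, so the main test process can observe what the per-node
# handler processes did (each handler runs in its own Process).
dropped_blocks = Value('i', 0)

def handler(msg, sender_ordinal, receiver_ordinal):
    # Drop every block message and count it; pass everything else unchanged.
    if msg.enum == 'Block':
        with dropped_blocks.get_lock():
            dropped_blocks.value += 1
        return False  # drop the message
    return True       # let the message through unmodified
```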
Contributing tests
We always welcome new tests, especially python tests that use the above infrastructure. We have a list of test requests here, but also welcome any other tests that test aspects of the network we haven't thought about.
Cheat sheet/overview of testing utils
This page covers the different testing utils/libraries that we have for easier unit testing in Rust.
Basics
CryptoHash
To create a new crypto hash:
#![allow(unused)] fn main() { "ADns6sqVyFLRZbSMCGdzUiUPaDDtjTmKCWzR8HxWsfDU".parse().unwrap(); }
Account
Also, prefer doing parse + unwrap:
let alice: AccountId = "alice.near".parse().unwrap();
Signatures
In-memory signer (generates the key based on a seed). There is a slight preference to use a seed that matches the account name.
The example below creates a signer for account 'test' using 'test' as the seed.
let signer: InMemoryValidatorSigner = create_test_signer("test");
Block
Use TestBlockBuilder
to create the block that you need. This class allows you to set custom values for most of the fields.
let test_block = test_utils::TestBlockBuilder::new(prev, signer).height(33).build();
Store
Use the in-memory test store in tests:
let store = create_test_store();
EpochManager
See usages of MockEpochManager. Note that this is deprecated. Try to use EpochManager itself wherever possible.
Runtime
You can use the KeyValueRuntime (instead of the Nightshade one):
KeyValueRuntime::new(store, &epoch_manager);
Chain
No fakes or mocks.
Client
TestEnv - for testing multiple clients (without network):
TestEnvBuilder::new(genesis).client(vec!["aa"]).validators(..).epoch_managers(..).build();
Network
PeerManager
To create a PeerManager handler:
let pm = peer_manager::testonly::start(...).await;
To connect to others:
pm.connect_to(&pm2.peer_info).await;
Events handling
To wait/handle a given event (as a lot of network code is running in an async fashion):
pm.events.recv_util(|event| match event {...}).await;
End to End
chain, runtime, signer
In chain/chain/src/test_utils.rs:
// Creates 1-validator (test): chain, KVRuntime and a signer
let (chain, runtime, signer) = setup();
block, client actor, view client
In chain/client/src/test_utils.rs
let (block, client, view_client) = setup(MANY_FIELDS);
Test Coverage
In order to focus the testing effort where it is most needed, we have a few ways we track test coverage.
Codecov
The main one is Codecov. Coverage is visible on this webpage, and displays the total coverage, including unit and integration tests. Codecov is especially interesting for its PR comments. The PR comments, in particular, can easily show which diff lines are being tested and which are not.
However, sometimes Codecov gives too rough estimates, and this is where artifact results come in.
Artifact Results
We also push artifacts, as a result of each CI run. You can access them here:
- Click "Details" on one of the CI actions run on your PR (literally any one of the actions is fine, you can also access CI actions runs on any CI)
- Click "Summary" on the top left of the opening page
- Scroll to the bottom of the page
- In the "Artifacts" section, just above the "Summary" section, there is a
coverage-html
link (there is alsocoverage-lcov
for people who use eg. the coverage gutters vscode integration) - Downloading it will give you a zip file with the interesting files.
In there, you can find:
- Two -diff files, that contain code coverage for the diff of your PR, to easily see exactly which lines are covered and which are not
- Two -full folders, that contain code coverage for the whole repository
- Each of these exists in one unit- variant, that only contains the unit tests, and one integration- variant, that contains all the tests we currently have
To check that your PR is properly tested, or if you want better-quality coverage than what codecov "requires," have a look at unit-diff: we agreed that we want unit tests to be able to detect most bugs, because debugging failing integration tests is much more troublesome.
To find a place that would deserve adding more tests, look at one of the
-full
directories on master, pick one not-well-tested file, and add (ideally
unit) tests for the lines that are missing.
The presentation is unfortunately less easy to access than codecov, and less eye-catching. On the other hand, it should be more precise. In particular, the -full variants show region-based coverage: they can tell you that e.g. the ? branch is not covered properly by highlighting it in red.
One caveat to be aware of: the -full
variants do not highlight covered lines
in green, they just highlight non-covered lines in red.
Protocol Upgrade
This document describes the entire cycle of how a protocol upgrade is done, from the initial PR to the final release. It is important for everyone who contributes to the development of the protocol and its client(s) to understand this process.
Background
At NEAR, we use the term protocol version to mean the version of the blockchain protocol, which is separate from the version of some specific client (such as nearcore), since the protocol version defines the protocol rather than some specific implementation of the protocol. More concretely, for each epoch, there is a corresponding protocol version that is agreed upon by validators through a voting mechanism. Our upgrade scheme dictates that protocol version X is backward compatible with protocol version X-1 so that nodes in the network can seamlessly upgrade to the new protocol. However, there is no guarantee that protocol version X is backward compatible with protocol version X-2.
Despite the upgrade mechanism, rolling out a protocol change can be scary, especially if the change is invasive. For those changes, we may want to have several months of testing before we are confident that the change itself works and that it doesn't break other parts of the system.
Protocol version voting and upgrade
When a new neard version, containing a new protocol version, is released, all node maintainers need to upgrade their binary. That typically means stopping neard, downloading or compiling the new neard binary and restarting neard. However the protocol version of the whole network is not immediately bumped to the new protocol version. Instead a process called voting takes place and determines if and when the protocol version upgrade will take place.
Voting is a fully automated process in which all block producers across the network vote in support of or against upgrading the protocol version. The voting happens in the last block of every epoch. Upgraded nodes will begin voting in favour of the new protocol version after a predetermined date. The voting date is configured by the release owner like this. Once at least 80% of the stake votes in favour of the protocol change in the last block of epoch X, the protocol version will be upgraded in the first block of epoch X+2.
For mainnet releases, the release on github typically happens on a Monday or Tuesday, the voting typically happens a week later and the protocol version upgrade happens 1-2 epochs after the voting. This gives the node maintainers enough time to upgrade their neard nodes. The node maintainers can upgrade their nodes at any time between the release and the voting but it is recommended to upgrade soon after the release. This is to accommodate for any database migrations or miscellaneous delays.
Starting a neard node with protocol version voting in the future in a network that is already operating at that protocol version is supported as well. This is useful in the scenario where there is a mainnet security release where mainnet has not yet voted or upgraded to the new version. That same binary with protocol voting date in the future can be released in testnet even though it has already upgraded to the new protocol version.
Nightly Protocol features
To make protocol upgrades more robust, we introduce the concept of a nightly
protocol version together with the protocol feature flags to allow easy testing
of the cutting-edge protocol changes without jeopardizing the stability of the
codebase overall. The use of the nightly and nightly_protocol for new features
is mandatory while the use of dedicated rust features for new protocol features
is optional and only recommended when necessary. Adding rust features leads to
conditional compilation, which is generally not developer friendly. In the Cargo.toml file of the crates we have in nearcore, we introduce the Rust compile-time features
nightly_protocol
and nightly
:
nightly_protocol = []
nightly = [
"nightly_protocol",
...
]
where nightly_protocol
is a marker feature that indicates that we are on
nightly protocol whereas nightly
is a collection of new protocol features
which also implies nightly_protocol
.
When it is not necessary to use a rust feature for the new protocol feature, the Cargo.toml file remains unchanged.
When it is necessary to use a rust feature for the new protocol feature, it can be added to the Cargo.toml, to the nightly features. For example, when we introduce EVM as a new protocol change, suppose the current protocol version is 40, then we would do the following change in Cargo.toml:
nightly_protocol = []
nightly = [
"nightly_protocol",
"protocol_features_evm",
...
]
In core/primitives/src/version.rs, we would change the protocol version by:
#[cfg(feature = "nightly_protocol")]
pub const PROTOCOL_VERSION: u32 = 100;
#[cfg(not(feature = "nightly_protocol"))]
pub const PROTOCOL_VERSION: u32 = 40;
This way the stable versions remain unaffected after the change. Note that nightly protocol version intentionally starts at a much higher number to make the distinction between the stable protocol and nightly protocol clearer.
To determine whether a protocol feature is enabled, we do the following:
- We maintain a ProtocolFeature enum where each variant corresponds to some protocol feature. For nightly protocol features, the variant may optionally be gated by the corresponding rust compile-time feature.
- We implement a function protocol_version to return, for each variant, the corresponding protocol version in which the feature is enabled.
- When we need to decide whether to use the new feature based on the protocol version of the current network, we can simply compare it to the protocol version of the feature. To make this simpler, we also introduced a macro checked_feature.
For more details, please refer to core/primitives/src/version.rs.
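To make the mechanism concrete, here is a minimal standalone sketch of version-based gating (the feature names and version numbers are made up; the real definitions live in core/primitives/src/version.rs):

```rust
#[derive(Clone, Copy)]
enum ProtocolFeature {
    LowerStorageCost,
    Evm,
}

impl ProtocolFeature {
    // The protocol version at which the feature becomes enabled.
    fn protocol_version(self) -> u32 {
        match self {
            ProtocolFeature::LowerStorageCost => 40,
            // Nightly-only features map to versions in the nightly range.
            ProtocolFeature::Evm => 100,
        }
    }
}

// Runtime check: compare the protocol version of the current epoch against
// the version in which the feature was introduced.
fn is_enabled(feature: ProtocolFeature, current_protocol_version: u32) -> bool {
    current_protocol_version >= feature.protocol_version()
}

fn main() {
    assert!(is_enabled(ProtocolFeature::LowerStorageCost, 40));
    assert!(!is_enabled(ProtocolFeature::Evm, 40));
}
```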
Feature Gating
It is worth mentioning that there are two types of checks related to protocol features:
- Runtime checks that compare the protocol version of the current epoch and the protocol version of the feature. Those runtime checks must be used for both stable and nightly features.
- Compile time checks that check if the rust feature corresponding with the protocol feature is enabled. This check is optional and can only be used for nightly features.
Testing
Nightly protocol features allow us to enable the most bleeding-edge code in some testing environments. We can choose to enable all nightly protocol features by
cargo build -p neard --release --features nightly
or enable some specific protocol feature by
cargo build -p neard --release --features nightly_protocol,<protocol_feature>
In practice, we have all nightly protocol features enabled for Nayduck tests and on betanet, which is updated daily.
Feature Stabilization
New protocol features are introduced first as nightly features and when the
author of the feature thinks that the feature is ready to be stabilized, they
should submit a pull request to stabilize the feature using
this template.
In this pull request, they should do the feature gating, increase the
PROTOCOL_VERSION
constant (if it hasn't been increased since the last
release), and change the protocol_version
implementation to map the
stabilized features to the new protocol version.
A feature stabilization request must be approved by at least two nearcore code owners. Unless it is a security-related fix, a protocol feature cannot be included in any release until at least one week after its stabilization. This is to ensure that feature implementation and stabilization are not rushed.
This document describes the advanced network options that you can configure by modifying the "network" section of your "config.json" file:
{
// ...
"network": {
// ...
"public_addrs": [],
"allow_private_ip_in_public_addrs": false,
"experimental": {
"inbound_disabled": false,
"connect_only_to_boot_nodes": false,
"skip_sending_tombstones_seconds": 0,
"tier1_enable_inbound": true,
"tier1_enable_outbound": false,
"tier1_connect_interval": {
"secs": 60,
"nanos": 0
},
"tier1_new_connections_per_attempt": 50
}
},
// ...
}
TIER1 network
Participants of the BFT consensus (block & chunk producers) now can establish direct (aka TIER1) connections between each other, which will optimize the communication latency and minimize the number of dropped chunks. If you are a validator, you can enable TIER1 connections by setting the following fields in the config:
- public_addrs
  - this is a list of the public addresses (in the format "<node public key>@<IP>:<port>") of trusted nodes, which are willing to route messages to your node
  - this list will be broadcasted to the network so that other validator nodes can connect to your node.
  - if your node has a static public IP, set public_addrs to a list with a single entry with the public key and address of your node, for example: "public_addrs": ["ed25519:86EtEy7epneKyrcJwSWP7zsisTkfDRH5CFVszt4qiQYw@31.192.22.209:24567"].
  - if your node doesn't have a public IP (for example, it is hidden behind a NAT), set public_addrs to a list (<=10 entries) of proxy nodes that you trust (arbitrary nodes with static public IPs).
  - support for nodes with dynamic public IPs is not implemented yet.
- experimental.tier1_enable_outbound
  - makes your node actively try to establish outbound TIER1 connections (recommended) once it learns about the public addresses of other validator nodes. If disabled, your node won't try to establish outbound TIER1 connections, but it still may accept incoming TIER1 connections from other nodes.
  - currently false by default, but will be changed to true by default in the future
- experimental.tier1_enable_inbound
  - makes your node accept inbound TIER1 connections from other validator nodes.
  - disable both tier1_enable_inbound and tier1_enable_outbound if you want to opt-out from the TIER1 communication entirely
  - disable tier1_enable_inbound if you are not a validator AND you don't want your node to act as a proxy for validators.
  - true by default
Starting a test chain from state taken from mainnet or testnet
Purpose
For testing purposes, it is often desirable to start a test chain with
a starting state that looks like mainnet or testnet. This is usually
done for the purpose of testing changes to neard itself, but it's also
possible to do this if you're a contract developer and want to see
what a change to your contract would look like on top of the current
mainnet state. At the end of the process described here, you'll have
a set of genesis records that can be used to start your own test chain,
that'll be like any other test chain like the ones generated by the
neard localnet
command, except with account balances and data taken
from mainnet.
How-to
The first step is to obtain an RPC node home directory for the chain
you'd like to spoon. So if you want to use mainnet state, you can
follow the instructions
here
to obtain a recent snapshot of a mainnet node's home directory. Once
you have your node's home directory set up, run the following
state-viewer
command to generate a dump of the chain's state:
$ neard --home $NEAR_HOME_DIRECTORY view-state dump-state --stream
This command will take a while (possibly many hours) to run. But at the
end you should have genesis.json
and records.json
files in
$NEAR_HOME_DIRECTORY/output
. This records file represents all of the
chain's current state, and is what we'll use to start our chain.
From here, we need to make some changes to the genesis.json
that was
generated in $NEAR_HOME_DIRECTORY/output
. To see why, note that the
validators field of this genesis file lists all the current mainnet
validators and their keys. So that means if we were to try and start a
test chain from the generated genesis and records files as-is, it
would work, but our node would expect the current mainnet validators
to be producing blocks and chunks (which they definitely won't be!
Because we're the only ones who know or care about this new test
chain).
So we need to select a new list of validators to start off our
chain. Suppose that we want our chain to have two validators,
validator0.near
and validator1.near
. Let's make a new directory
where we'll be storing intermediate files during this process:
$ mkdir ~/test-chain-scratch
then using your favorite editor, lay out the validators you want in
the test chain as a JSON list in the same format as the validators
field in genesis.json
, maybe in the file
~/test-chain-scratch/validators.json
[
{
"account_id": "validator0.near",
"public_key": "ed25519:GRAFkrqEkJAbdbWUgc6fDnNpCTE83C3pzdJpjAHkMEhq",
"amount": "100000000000000000000000000000000"
},
{
"account_id": "validator1.near",
"public_key": "ed25519:5FxQQTC9mk5kLAhTF9ffDMTXiyYrDXyGYskgz46kHMdd",
"amount": "100000000000000000000000000000000"
}
]
These validator keys should be keys you've already generated. So for the rest of this document, we'll assume you've run:
$ neard --home ~/near-test-chain/validator0 init --account-id validator0.near
$ neard --home ~/near-test-chain/validator1 init --account-id validator1.near
This is also a good time to think about what extra accounts you might
want in your test chain. Since all accounts in the test chain will
have the same keys as they do on mainnet, you'll only have access to
the accounts that you have access to on mainnet. If you want to add an
account with a large balance to properly test things out, you can
write them out in a file as a JSON list of state records (in the same
format as they appear in records.json
). For example, you could put
the following in ~/test-chain-scratch/extra-records.json
:
[
{
"Account": {
"account_id": "my-test-account.near",
"account": {
"amount": "10000000000000000000000000000000000",
"locked": "0",
"code_hash": "11111111111111111111111111111111",
"storage_usage": 182,
"version": "V1"
}
}
},
{
"AccessKey": {
"account_id": "my-test-account.near",
"public_key": "ed25519:Eo9W44tRMwcYcoua11yM7Xfr1DjgR4EWQFM3RU27MEX8",
"access_key": {
"nonce": 0,
"permission": "FullAccess"
}
}
}
]
You'll want to include an access key here, otherwise you won't be able to do anything with the account. Note that here you can also add access keys for any mainnet account you want, so you'll be able to control it in the test chain.
Now to make these changes to the genesis and records files, you can
use the neard amend-genesis
command like so:
# mkdir ~/near-test-chain/
$ neard amend-genesis --genesis-file-in $NEAR_HOME_DIRECTORY/output/genesis.json --records-file-in $NEAR_HOME_DIRECTORY/output/records.json --validators ~/test-chain-scratch/validators.json --extra-records ~/test-chain-scratch/extra-records.json --chain-id $TEST_CHAIN_ID --records-file-out ~/near-test-chain/records.json --genesis-file-out ~/near-test-chain/genesis.json
Starting the network
After running the previous steps you should have the files
genesis.json
and records.json
in ~/near-test-chain/
. Assuming
you've started it with the two validators validator0.near
and
validator1.near
as described above, you'll want to run at least two
nodes, one for each of these validator accounts. If you're working
with multiple computers or VMs that can connect to each other over the
internet, you'll be able to run your test network over the internet as
is done with the "real" networks (mainnet, testnet, etc.). But for now
let's assume that you want to run this on only one machine.
So assuming you've initialized home directories for each of the
validators with the init
command described above, you'll want to
copy the records and genesis files generated in the previous step to
each of these:
$ cp ~/near-test-chain/records.json ~/near-test-chain/validator0
$ cp ~/near-test-chain/genesis.json ~/near-test-chain/validator0
$ cp ~/near-test-chain/records.json ~/near-test-chain/validator1
$ cp ~/near-test-chain/genesis.json ~/near-test-chain/validator1
Now we'll need to make a few config changes to each of
~/near-test-chain/validator0/config.json
and
~/near-test-chain/validator1/config.json
:
changes to ~/near-test-chain/validator0/config.json
:
{
"genesis_records_file": "records.json",
"rpc": {
"addr": "0.0.0.0:3030"
},
"network": {
"addr": "0.0.0.0:24567",
"boot_nodes": "ed25519:Dk4A7NPBYFPwKWouiSUoyZ15igbLSrcPEJqUqDX4grb7@127.0.0.1:24568",
"skip_sync_wait": false,
},
"consensus": {
"min_num_peers": 1
},
"tracked_shards": [0],
}
changes to ~/near-test-chain/validator1/config.json
:
{
"genesis_records_file": "records.json",
"rpc": {
"addr": "0.0.0.0:3031"
},
"network": {
"addr": "0.0.0.0:24568",
"boot_nodes": "ed25519:6aR4xVQedQ7Z9URrASgwBY8bedpaYzgH8u5NqEHp2hBv@127.0.0.1:24567",
"skip_sync_wait": false,
},
"consensus": {
"min_num_peers": 1
},
"tracked_shards": [0],
}
Here we make sure to have each node listen on different ports, while
telling each about the other via network.boot_nodes
. In this
boot_nodes
string, we set the public key not to the validator key,
but to whatever key is present in the node_key.json
file you got
when you initialized the home directory. So for validator0
's config,
we set its boot node to validator1
's node key, followed by the
address of the socket it should be listening on. We also want to drop
the minimum required number of peers, since we're just running a small
test network locally. We set skip_sync_wait
to false
, because
otherwise we get strange behavior that will often make your network
stall.
After making these changes, you can try running one neard process for each of your validators:
$ neard --home ~/near-test-chain/validator0 run
$ neard --home ~/near-test-chain/validator1 run
Now these nodes will begin by taking the records laid out in
records.json
and turning them into a genesis state. At the time of
this writing, using the latest nearcore version from the master
branch, this will take a couple hours. But your validators should
begin producing blocks after that's done.
Overview
This chapter holds various assorted bits of docs. If you want to document something, but don't know where to put it, put it here!
Crate Versioning and Publishing
While all the crates in the workspace are directly unversioned (v0.0.0
), they
all share a unified variable version in the workspace manifest.
This keeps versions consistent across the workspace and informs their versions
at the moment of publishing.
We also have CI infrastructure set up to automate the publishing process to
crates.io. So, on every merge to master, if there's a version change, it is
automatically applied to all the crates in the workspace and it attempts to
publish the new versions of all non-private crates. All crates that should be
exempt from this process should be marked private
. That is, they should have
the publish = false
specification in their package manifest.
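For instance (the crate name is invented), a crate excluded from publishing carries something like this in its Cargo.toml:

```toml
[package]
name = "some-internal-tool"   # hypothetical crate name
version = "0.0.0"             # crates are directly unversioned, as noted above
publish = false               # exempt from the automated crates.io publishing
```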
This process is managed by cargo-workspaces, with a bit of magic sprinkled on top.
Issue Labels
Issue labels are of the following format <type>-<content>
where <type>
is a
capital letter indicating the type of the label and <content>
is a hyphened
phrase indicating what this label is about. For example, in the label C-bug
,
C
means category and bug
means that the label is about bugs. Common types
include C
, which means category, A
, which means area and T
, which means team.
An issue can have multiple labels including which area it touches, which team
should be responsible for the issue, and so on. Each issue should have at least
one label attached to it after it is triaged and the label could be a general
one, such as C-enhancement
or C-bug
.
Experimental: Dump State to External Storage
Purpose
State Sync is being reworked.
A new version is available for experimental use. This version gets state parts from external storage. The following kinds of external storage are supported:
- Local filesystem
- Google Cloud Storage
- Amazon S3
A new version of decentralized state sync is work in progress.
How-to
neard release 1.36.0-rc.1
adds an experimental option to sync state from
external storage.
See the how-to for configuring your node to State Sync from External Storage.
In case you would like to manage your own dumps of State, keep reading.
Google Cloud Storage
To enable Google Cloud Storage as your external storage, add this to your
config.json
file:
"state_sync": {
"dump": {
"location": {
"GCS": {
"bucket": "my-gcs-bucket",
}
}
}
}
And run your node with an environment variable SERVICE_ACCOUNT
or
GOOGLE_APPLICATION_CREDENTIALS
pointing to the credentials JSON file:
SERVICE_ACCOUNT=/path/to/file ./neard run
Amazon S3
To enable Amazon S3 as your external storage, add this to your config.json
file:
"state_sync": {
"dump": {
"location": {
"S3": {
"bucket": "my-aws-bucket",
"region": "my-aws-region"
}
}
}
}
And run your node with environment variables AWS_ACCESS_KEY_ID
and
AWS_SECRET_ACCESS_KEY
:
AWS_ACCESS_KEY_ID="MY_ACCESS_KEY" AWS_SECRET_ACCESS_KEY="MY_AWS_SECRET_ACCESS_KEY" ./neard run
Dump to a local filesystem
Add this to your config.json
file to dump state of every epoch to local
filesystem:
"state_sync": {
"dump": {
"location": {
"Filesystem": {
"root_dir": "/tmp/state-dump"
}
}
}
}
In this case you don't need any extra environment variables. Simply run your node:
./neard run
Archival node - Recovery of missing data
Incident description
In early 2024 there were a few incidents affecting archival node storage. As a result of these incidents, archival nodes might have lost some data from January to March.
The practical effect of this issue is that requests querying the state of an account may fail, returning instead an internal server error.
Check if my node has been impacted
The simplest way to check whether or not a node has suffered data loss is to run one of the following queries:
replace <RPC_URL> with the correct URL (example: http://localhost:3030)
curl -X POST <RPC_URL> \
-H "Content-Type: application/json" \
-d '
{ "id": "dontcare", "jsonrpc": "2.0", "method": "query", "params": { "account_id": "b001b461c65aca5968a0afab3302a5387d128178c99ff5b2592796963407560a", "block_id": 109913260, "request_type": "view_account" } }'
curl -X POST <RPC_URL> \
-H "Content-Type: application/json" \
-d '
{ "id": "dontcare", "jsonrpc": "2.0", "method": "query", "params": { "account_id": "token2.near", "block_id": 114580308, "request_type": "view_account" } }'
curl -X POST <RPC_URL> \
-H "Content-Type: application/json" \
-d '
{ "id": "dontcare", "jsonrpc": "2.0", "method": "query", "params": { "account_id": "timpanic.tg", "block_id": 115185110, "request_type": "view_account" } }'
curl -X POST <RPC_URL> \
-H "Content-Type: application/json" \
-d '
{ "id": "dontcare", "jsonrpc": "2.0", "method": "query", "params": { "account_id": "01.near", "block_id": 115514400, "request_type": "view_account" } }'
If for any of the above requests you get an error of the kind MissingTrieValue
it means that the node is affected by the issue.
Remediation steps
Option A (recommended): download a new DB snapshot
- Stop neard
- Delete the existing hot and cold databases. Example assuming default configuration: rm -rf ~/.near/hot-data && rm -rf ~/.near/cold-data
- Download an up-to-date snapshot, following this guide: Storage snapshots
Option B: manually run recovery commands
Follow the instructions below if, for any reason, you prefer to perform the manual recovery steps on your node instead of downloading a new snapshot.
Requirements:
- neard must be stopped while recovering data
- cold storage must be mounted on an SSD disk or better
- The config resharding_config.batch_delay must be set to 0.
After the recovery is finished the configuration changes can be undone and a hard disk drive can be used to mount the cold
storage.
Important considerations:
- The recovery procedure will, most likely, take more than one week
- Since neard must be stopped in order to execute the commands, the node won't be functional during this period of time
- The node must catch up to the chain's head after completing the data recovery; this could take days as well
- During catch-up the node can answer RPC queries. However, the chain head is still at the point where the node was stopped; for this reason recent blocks won't be available immediately
We published a reference recovery script in the nearcore
repository. Your neard
setup might be different, so the advice is to thoroughly check the script before running it. For completeness, here we include the set of commands to run:
neard view-state -t cold --readwrite apply-range --start-index 109913254 --end-index 110050000 --shard-id 2 --storage trie-free --save-state cold sequential
RUST_LOG=debug neard database resharding --height 114580307 --shard-id 0 --restore
RUST_LOG=debug neard database resharding --height 114580307 --shard-id 1 --restore
RUST_LOG=debug neard database resharding --height 114580307 --shard-id 2 --restore
RUST_LOG=debug neard database resharding --height 114580307 --shard-id 3 --restore
RUST_LOG=debug neard database resharding --height 115185107 --shard-id 0 --restore
RUST_LOG=debug neard database resharding --height 115185107 --shard-id 1 --restore
RUST_LOG=debug neard database resharding --height 115185107 --shard-id 2 --restore
RUST_LOG=debug neard database resharding --height 115185107 --shard-id 3 --restore
RUST_LOG=debug neard database resharding --height 115185107 --shard-id 4 --restore
Verify if remediation has been successful
Run the queries specified in the section: Check if my node has been impacted. All of them should return a successful response now.