Profiling neard
Sampling performance profiling
It is a common task to need to look where neard
is spending time. Outside of instrumentation
we've also been successfully using sampling profilers to gain an intuition over how the code works
and where it spends time. It is a very quick way to get some baseline understanding of the
performance characteristics, but due to its probabilistic nature it is also not particularly precise
when it comes to small details.
Linux's perf
has been a tool of choice in most cases, although tools like Intel VTune could be
used too. In order to use either, first prepare your system:
sudo sysctl kernel.perf_event_paranoid=0
sudo sysctl kernel.kptr_restrict=0
Beware that this gives access to certain kernel state and environment to the unprivileged user. Once investigation is over either set these properties back to the more restricted settings or, better yet, reboot.
Definitely do not run untrusted code after running these commands.
Then collect a profile as such:
$ perf record -e cpu-clock -F1000 -g --call-graph dwarf,65528 YOUR_COMMAND_HERE
# or attach to a running process:
$ perf record -e cpu-clock -F1000 -g --call-graph dwarf,65528 -p NEARD_PID
This command will use the CPU time clock to determine when to trigger a sampling process and will
do such sampling roughly 1000 times (the -F
argument) every CPU second.
Once terminated, this command will produce a profile file in the current working directory.
Although you can inspect the profile already with perf report
, we've had much better experience
with using Firefox Profiler as the viewer. Although Firefox
Profiler supports perf
and many other different data formats, for perf
in particular a
conversion step is necessary:
perf script -F +pid > mylittleprofile.script
Then, load this mylittleprofile.script
file with the profiler.
Low overhead stack frame collection
The command above uses -g --call-graph dwarf,65528
parameter to instruct perf
to collect
stack trace for each sample using DWARF unwinding metadata. This will work no matter how neard
is
built, but is expensive and not super precise (e.g. it has trouble with JIT code.) If you have an
ability to build a profiling-tuned build of neard
, you can use higher quality stack frame
collection.
cargo build --release --config .cargo/config.profiling.toml -p neard
Then, replace the --call-graph dwarf
with --call-graph fp
:
perf record -e cpu-clock -F1000 -g --call-graph fp,65528 YOUR_COMMAND_HERE
Profiling with hardware counters
As mentioned earlier, sampling profiler is probabilistic and the data it produces is only really suitable for a broad overview. Any attempt to analyze the performance of the code at the microarchitectural level (which you might want to do if investigating how to speed up a small but frequently invoked function) will be severely hampered by the low quality of data.
For a long time now, CPUs are able to expose information about how it operates on the code at a very fine grained level: how many cycles have passed, how many instructions have been processed, how many branches have been taken, how many predictions were incorrect, how many cycles were spent waiting of a memory accesses and many more. These allow a much better look at how the code behaves.
Until recently, use of these detailed counters was still sampling based -- the CPU would produce
some information at a certain cadence of these counters (e.g. every 1000 instructions or cycles)
which still shares a fair number of the same downsides as sampling cpu-clock
. In order to address
this downside, recent CPUs from both Intel and AMD have implemented a list of recent branches taken
-- Last Branch Record or LBR. This is available on reasonably
recent Intel architectures as well as starting with Zen 4 on the side of AMD. With LBRs profilers
are able to gather information about the cycle counts between each branch, giving an accurate and
precise evaluation of the performance at a basic block
or function call level.
It all sounds really nice, so why are we not using these mechanisms all the time? That's because
GCP VMs don't allow access to these counters! In order to access them the code has to be run on
your own hardware, or a VM instance that provides direct access to the hardware, such as the (quite
expensive) c3-highcpu-192-metal
type.
Once everything is set up, though, the following command can gather some interesting information for you.
perf record -e cycles:u -b -g --call-graph fp,65528 YOUR_COMMAND_HERE
Analyzing this data is, unfortunately, not as easy as chucking it away to Firefox Profiler. I'm not
aware of any other ways to inspect the data other than using perf report
:
perf report -g --branch-history
perf report -g --branch-stack
You may also be able to gather some interesting results if you use --call-graph lbr
and the
relevant reporting options as well.
Memory usage profiling
neard
is a pretty memory-intensive application with many allocations occurring constantly.
Although Rust makes it pretty hard to introduce memory problems, it is still possible to leak
memory or to inadvertently retain too much of it.
Unfortunately, “just” throwing a random profiler at neard does not work for many reasons. Valgrind
for example is introducing enough slowdown to significantly alter the behavior of the run, not to
mention that to run it successfully and without crashing it will be necessary to comment out
neard
’s use of jemalloc
for yet another substantial slowdown.
So far the only tool that worked out well out of the box was
bytehound
. Using it is quite straightforward, but needs
Linux, and ability to execute privileged commands.
First, checkout and build the profiler (you will need to have nodejs yarn
thing available as
well):
git clone git@github.com:koute/bytehound.git
cargo build --release -p bytehound-preload
cargo build --release -p bytehound-cli
You will also need a build of your neard
, once you have that, give it some ambient capabilities
necessary for profiling:
sudo sysctl kernel.perf_event_paranoid=0
sudo setcap 'CAP_SYS_ADMIN+ep' /path/to/neard
sudo setcap 'CAP_SYS_ADMIN+ep' /path/to/libbytehound.so
And finally run the program with the profiler enabled (in this case neard run
command is used):
/lib64/ld-linux-x86-64.so.2 --preload /path/to/libbytehound.so /path/to/neard run
Viewing the profile
Do note that you will need about twice the amount of RAM as the size of the input file in order to load it successfully.
Once enough profiling data has been gathered, terminate the program. Use the bytehound
CLI tool
to operate on the profile. I recommend bytehound server
over directly converting to e.g. heaptrack
format using other subcommands as each invocation will read and parse the profile data from
scratch. This process can take quite some time. server
parses the inputs once and makes
conversions and other introspection available as interactive steps.
You can use server
interface to inspect the profile in one of the few ways, download a flamegraph
or a heaptrack file. Heaptrack in particular provides some interesting additional visualizations
and has an ability to show memory use over time from different allocation sources.
I personally found it a bit troublesome to figure out how to open the heaptrack file from the GUI.
However, heaptrack myexportedfile
worked perfectly. I recommend opening the file exactly this way.
Troubleshooting
No output file
- Set a higher profiler logging level. Verify that the profiler gets loaded at all. If you're not seeing any log messages, then something about your working environment is preventing the loader from including the profiler library.
- Try specifying an exact output file with e.g. environment variables that the profiler reads.
Crashes
If the profiled neard
crashes in your tests, there are a couple of things you can try to get past
it. First, make sure your binary has the necessary ambient capabilities (setcap
command above
needs to be executed every time binary is replaced!)
Another thing to try is disabling jemalloc
. Comment out this code in neard/src/main.rs
:
#![allow(unused)] fn main() { #[global_allocator] static ALLOC: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc; }
The other thing you can try is different profilers, different versions of the profilers or different options made available (in particular disabling the shadow stack in bytehound), although I don't have specific recommendations here.
We don't know what exactly it is about neard that leads to it crashing under the profiler as easily as it does. I have seen valgrind reporting that we have libraries that are deallocating with a wrong size class, so that might be the reason? Do definitely look into this if you have time.
What to profile?
This section provides some ideas on programs you could consider profiling if you are not sure where to start.
First and foremost you could go shotgun and profile a full running neard
node that's operating on
mainnet or testnet traffic. There are a couple ways to set up such a node: search for
my-own-mainnet or a forknet based
tooling.
From there either attach to a running neard run
process or stop the running one and start a new
instance under the profiler.
This approach will give you a good overview of the entire system, but at the same time the information might be so dense, it might be difficult to derive any signal from the noise. There are alternatives that isolate certain components of the runtime:
Runtime::apply
Profiling just the Runtime::apply
is going to include the work done by transaction runtime and
the contract runtime only. A smart use of the tools already present in the neard
binary can
achieve that today.
First, make sure all deltas in flat storage are applied and written:
neard view-state --readwrite apply-range --shard-id $SHARD_ID --storage flat sequential
You will need to do this for all shards you're interested in profiling. Then pick a block or a range of blocks you want to re-apply and set the flat head to the specified height:
neard flat-storage move-flat-head --shard-id 0 --version 0 back --blocks 17
Finally the following commands will apply the block or blocks from the height in various different ways. Experiment with the different modes and flags to find the best fit for your task. Don't forget to run these commands under the profiler :)
# Apply blocks from current flat head to the highest known block height in sequence
# using the memtrie storage (note that after this you will need to move the flat head again)
neard view-state --readwrite apply-range --shard-id 0 --storage memtrie sequential
# Same but with flat storage
neard view-state --readwrite apply-range --shard-id 0 --storage flat sequential
# Repeatedly apply a single block at the flat head using the memtrie storage.
# Will not modify the storage on the disk.
neard view-state apply-range --shard-id 0 --storage memtrie benchmark
# Same but with flat storage
neard view-state apply-range --shard-id 0 --storage flat benchmark