Profiling neard

Sampling performance profiling

A common task is figuring out where neard is spending its time. Outside of instrumentation, we've also successfully used sampling profilers to gain an intuition of how the code works and where it spends time. Sampling is a very quick way to get a baseline understanding of the performance characteristics, but due to its probabilistic nature it is not particularly precise when it comes to small details.

Linux's perf has been the tool of choice in most cases, although tools like Intel VTune can be used too. In order to use either, first prepare your system:

$ sudo sysctl kernel.perf_event_paranoid=0
$ sudo sysctl kernel.kptr_restrict=0

Beware that this gives unprivileged users access to certain kernel state and environment. Once the investigation is over, either set these properties back to their more restricted settings or, better yet, reboot.

Definitely do not run untrusted code after running these commands.
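
If you would rather restore the settings manually than reboot, something along these lines will do. The defaults vary between distributions, so note down the original values before changing them:

$ sudo sysctl kernel.perf_event_paranoid=2
$ sudo sysctl kernel.kptr_restrict=1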

Then collect a profile as such:

$ perf record -e cpu-clock -F1000 -g --call-graph dwarf,65528 YOUR_COMMAND_HERE
# or attach to a running process:
$ perf record -e cpu-clock -F1000 -g --call-graph dwarf,65528 -p NEARD_PID

This command uses the CPU time clock to determine when to trigger sampling, and takes roughly 1000 samples (the -F argument) every second of CPU time.

Once terminated, this command will produce a perf.data file in the current working directory. You can inspect the profile with perf report, but we've had a much better experience using Firefox Profiler (profiler.firefox.com) as the viewer. Firefox Profiler supports perf among many other data formats, but for perf in particular a conversion step is necessary:

$ perf script -F +pid > mylittleprofile.script

Then, load this mylittleprofile.script file with the profiler.

Low overhead stack frame collection

The command above uses the -g --call-graph dwarf,65528 parameters to instruct perf to collect a stack trace for each sample using DWARF unwinding metadata. This works no matter how neard is built, but it is expensive and not terribly precise (e.g. it has trouble with JIT code). If you are able to produce a profiling-tuned build of neard, you can use a higher quality stack frame collection mechanism.

$ cargo build --release --config .cargo/config.profiling.toml -p neard
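
For reference, a profiling-tuned configuration of this sort typically forces frame pointers and retains debug information in the release build. The following is a hypothetical sketch of the idea, not the actual contents of .cargo/config.profiling.toml in the repository:

# hypothetical sketch of a profiling-tuned cargo config
[build]
rustflags = ["-Cforce-frame-pointers=true"]  # let perf unwind via frame pointers

[profile.release]
debug = true  # keep debug info so samples can be symbolized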

Then, replace --call-graph dwarf,65528 with --call-graph fp:

$ perf record -e cpu-clock -F1000 -g --call-graph fp YOUR_COMMAND_HERE

Profiling with hardware counters

As mentioned earlier, a sampling profiler is probabilistic and the data it produces is only really suitable for a broad overview. Any attempt to analyze the performance of the code at the microarchitectural level (which you might want to do when investigating how to speed up a small but frequently invoked function) will be severely hampered by the low quality of the data.

CPUs have long been able to expose information about how they operate on the code at a very fine-grained level: how many cycles have passed, how many instructions have been processed, how many branches have been taken, how many predictions were incorrect, how many cycles were spent waiting on memory accesses, and much more. These counters allow a much better look at how the code behaves.
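
As a quick illustration, perf stat reads a number of these counters in aggregate, with no sampling involved (the events listed here are common portable ones; perf list shows everything your CPU supports):

$ perf stat -e cycles,instructions,branches,branch-misses,cache-misses YOUR_COMMAND_HERE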

Until recently, use of these detailed counters was still sampling based -- the CPU would emit information at a certain cadence of these counters (e.g. every 1000 instructions or cycles), which shares a fair number of downsides with sampling cpu-clock. To address this, recent CPUs from both Intel and AMD have implemented a record of the most recent branches taken -- the Last Branch Record, or LBR. This is available on reasonably recent Intel architectures, and starting with Zen 4 on the AMD side. With LBRs, profilers are able to gather cycle counts between branches, giving an accurate and precise evaluation of performance at the basic block or function call level.

It all sounds really nice, so why are we not using these mechanisms all the time? Because GCP VMs don't allow access to these counters! To access them, the code has to run on your own hardware or on a VM instance that provides direct access to the hardware, such as the (quite expensive) c3-highcpu-192-metal type.

Once everything is set up, though, the following command can gather some interesting information for you.

$ perf record -e cycles:u -b -g --call-graph fp YOUR_COMMAND_HERE

Analyzing this data is, unfortunately, not as easy as throwing it at Firefox Profiler. I'm not aware of any way to inspect the data other than perf report:

$ perf report -g --branch-history
$ perf report -g --branch-stack

You may also be able to gather some interesting results with --call-graph lbr and the relevant reporting options.
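
For example (assuming hardware and a perf version where LBR-based call graphs are supported):

$ perf record -e cycles:u --call-graph lbr YOUR_COMMAND_HERE
$ perf report -g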

Memory usage profiling

neard is a pretty memory-intensive application with many allocations occurring constantly. Although Rust makes it pretty hard to introduce memory problems, it is still possible to leak memory or to inadvertently retain too much of it.

Unfortunately, “just” throwing a random profiler at neard does not work for many reasons. Valgrind, for example, introduces enough slowdown to significantly alter the behaviour of the run; on top of that, running it successfully and without crashes requires commenting out neard's use of jemalloc, which slows things down substantially yet again.

So far the only tool that worked well out of the box is bytehound. Using it is quite straightforward, but it requires Linux and the ability to execute privileged commands.

First, check out and build the profiler (you will also need nodejs and yarn available):

$ git clone git@github.com:koute/bytehound.git
$ cd bytehound
$ cargo build --release -p bytehound-preload
$ cargo build --release -p bytehound-cli

You will also need a build of neard. Once you have that, give it the ambient capabilities necessary for profiling:

$ sudo sysctl kernel.perf_event_paranoid=0
$ sudo setcap 'CAP_SYS_ADMIN+ep' /path/to/neard
$ sudo setcap 'CAP_SYS_ADMIN+ep' /path/to/libbytehound.so

And finally, run the program with the profiler enabled (in this case the neard run command is used):

$ /lib64/ld-linux-x86-64.so.2 --preload /path/to/libbytehound.so /path/to/neard run

Viewing the profile

Once enough profiling data has been gathered, terminate the program. Then use the bytehound CLI tool to operate on the profile. Do note that you will need about twice as much RAM as the size of the input file in order to load it successfully.

I recommend bytehound server over directly converting to e.g. the heaptrack format with the other subcommands, as each invocation reads and parses the profile data from scratch, and this can take quite some time. server parses the input once and then makes conversions and other introspection available as interactive steps.
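
For example, assuming the profiler wrote its default memory-profiling_*.dat output into the current working directory:

$ /path/to/bytehound server memory-profiling_*.dat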

You can use the server interface to inspect the profile in a few different ways, or to download a flamegraph or a heaptrack file. Heaptrack in particular provides some interesting additional visualizations and is able to show memory use over time, broken down by allocation source.

I personally found it a bit troublesome to figure out how to open the exported file from the heaptrack GUI. However, passing the file on the command line worked perfectly, and I recommend opening it exactly this way:
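
$ heaptrack myexportedfile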

Troubleshooting

No output file

  1. Set a higher profiler logging level and verify that the profiler gets loaded at all. If you're not seeing any log messages, something about your working environment is preventing the loader from including the profiler library.
  2. Try specifying an exact output file via the environment variables that the profiler reads (see the sketch below).
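
With bytehound, both of these knobs are environment variables. The variable names below are from bytehound's documentation; double-check them against the version you built:

$ MEMORY_PROFILER_LOG=debug \
  MEMORY_PROFILER_OUTPUT=/tmp/neard-profile.dat \
  /lib64/ld-linux-x86-64.so.2 --preload /path/to/libbytehound.so /path/to/neard run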

Crashes

If the profiled neard crashes in your tests, there are a couple of things you can try to get past it. First, make sure your binary has the necessary ambient capabilities (the setcap command above needs to be executed every time the binary is replaced!)
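
You can verify that the capabilities are still attached with getcap (the exact output format differs between libcap versions):

$ getcap /path/to/neard
/path/to/neard cap_sys_admin=ep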

Another thing to try is disabling jemalloc. Comment out this code in neard/src/main.rs:

#[global_allocator]
static ALLOC: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

You can also try different profilers, different versions of the profilers, or the different options they expose (in particular, disabling the shadow stack in bytehound), although I don't have specific recommendations here.

We don't know what exactly it is about neard that makes it crash under the profiler as easily as it does. I have seen valgrind report libraries that deallocate with a wrong size class, so that might be the reason. Definitely look into this if you have time.