Performance matters in any computer program. It is part of what keeps users with your software. Imagine if your software took minutes to start even on a powerful machine, or slowed down visibly while doing important work. Either case would reflect badly on your application. The operating system kernel is even more performance critical, because if it lags, the whole system lags. It’s the developer’s responsibility to write code that performs as well as possible.
To write programs that perform well, we need to know which part of the program is the bottleneck. That way, we can focus our optimization efforts on that region of the code. There are many tools that help you as a developer profile your program and better understand which part of your code needs attention. This article discusses one of those tools, which helps you profile programs on Linux.

Introducing perf

The perf command in Linux gives you access to various tools integrated into the Linux kernel. These tools can help you collect and analyze the performance data about your program or system. The perf profiler is fast, lightweight, and precise.
The perf command requires the perf package to be installed on your distro. On a distribution that uses dnf, such as Fedora, you can install it with this command:
sudo dnf install perf
Once you have the required package installed, fire up your terminal and execute the perf command. You’ll see output similar to below:
[Screenshot: output of the perf command]
The perf command provides many subcommands you can use to profile your code. Let’s go through some of the ones you are likely to use most often.

Listing the events

perf list
The list command shows the list of events which can be traced by the perf command. The output will look something like below:
[Screenshot: output of perf list]
There are a lot of events that can be traced via the perf command. Broadly, these are either software events, such as context switches or page faults, or hardware events that originate from the processor itself, such as L1 cache misses or the number of clock cycles.
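Because the full list is long, it can help to narrow it down. As a small sketch, perf list accepts an event class such as hw, sw, cache, or tracepoint as an argument. The wrapper name list_events below is hypothetical and exists only to reject a mistyped class before perf is called.

```shell
# list_events is a hypothetical helper name, not part of perf.
# It accepts one event class and passes it on to `perf list`.
list_events() {
    case "$1" in
        hw|sw|cache|tracepoint)
            perf list "$1"
            ;;
        *)
            echo "usage: list_events hw|sw|cache|tracepoint" >&2
            return 2
            ;;
    esac
}

# Example: show only software events such as context-switches and page-faults.
# list_events sw
```

The exact events shown depend on your kernel and processor, so the output differs from machine to machine.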

Counting events with perf stat

The perf stat command can be used to count the events related to a particular process or command. Let’s look at a sample of this usage by running the following command:
perf stat -B dd if=/dev/urandom of=/dev/null count=50k
The output of the command lists the counters associated with different types of events that occurred during the execution of the above command.
To get the system wide statistics for all the cores on your system, run the following command:
perf stat -a
The command collects event statistics until you press Ctrl+C, and then reports them.
The stat command gives you the option to select only specific events for reporting. To select the events, run the stat command with the -e option followed by a comma-separated list of events to report. For example:
perf stat -e cycles,page-faults -a
This command provides statistics about the events named cycles and page-faults for all the cores on the system.
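The same -e option also works when counting events for a single command instead of the whole system. The sketch below assumes two hypothetical helpers, join_events and stat_events, which are not part of perf. The first builds the comma-separated list that -e expects, and the second runs a command under perf stat with a fixed set of events.

```shell
# join_events is a hypothetical helper: it joins its arguments with commas,
# which is the format the -e option expects.
join_events() {
    local IFS=,
    printf '%s\n' "$*"
}

# stat_events COMMAND [ARGS...] counts a fixed set of events for one command.
# The event names are assumptions; confirm them with `perf list` on your machine.
stat_events() {
    perf stat -e "$(join_events cycles page-faults context-switches)" -- "$@"
}

# Example, reusing the dd workload from earlier in this section:
# stat_events dd if=/dev/urandom of=/dev/null count=50k
```

Keeping the event list in one place makes it easier to compare runs, since every measurement then counts the same events.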
To get the stats for a specific process, run the following command, where PID is the process ID of the process for which you want performance statistics:
perf stat -p PID

Sampling with perf record

The perf record command samples event data and writes it to a file. By default, the command operates in per-thread mode and records data for the cycles event. To collect system-wide data for all the CPUs instead, execute the following command:
perf record -a
The record command collects samples until you press Ctrl+C. That data is stored in a file named perf.data by default. To store the data in some other file, pass the name of the file to the command using the -o option. To see the recorded data, run the following command:
perf report
This command produces output similar to the following:
[Screenshot: perf report output for the recorded data]
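To keep several recordings side by side, the -o option of perf record pairs with the -i option of perf report, which reads a named file instead of perf.data. The following is a minimal sketch under a few assumptions: record_to and show_report are hypothetical helper names, the recording is bounded by a sleep command rather than by Ctrl+C, and --stdio is used to print the report as plain text.

```shell
# record_to FILE [SECONDS]: hypothetical helper, not part of perf.
# Samples all CPUs for SECONDS (default 5) and writes the samples to FILE.
record_to() {
    perf record -a -o "$1" -- sleep "${2:-5}"
}

# show_report FILE: hypothetical helper that prints the report for FILE
# as plain text instead of opening the interactive view.
show_report() {
    if [ ! -r "$1" ]; then
        echo "cannot read '$1'" >&2
        return 1
    fi
    perf report -i "$1" --stdio
}

# Example:
# record_to before.data 5
# show_report before.data
```

System-wide recording usually needs elevated privileges, so these helpers may have to be run with sudo.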

The report contains four columns, each with its own meaning:
  1. Overhead: the percentage of overall samples collected in the corresponding function
  2. Command: the command to which the samples belong
  3. Shared object: the name of the image from where the samples came
  4. Symbol: the symbol name which constitutes the sample, and the privilege level at which the sample was taken. There are 5 privilege levels: [.] user level, [k] kernel level, [g] guest kernel level (virtualization), [u] guest OS userspace, and [H] hypervisor.
The report shows you the most costly functions. You can then focus on optimizing these functions further.
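The privilege-level markers are easy to forget when scanning a long report. As a purely illustrative sketch, the hypothetical function privilege_level below maps each marker from the list above to its description.

```shell
# privilege_level MARKER: hypothetical lookup helper, not part of perf.
# Prints the meaning of a privilege-level marker from the Symbol column.
privilege_level() {
    case "$1" in
        '[.]') echo "user level" ;;
        '[k]') echo "kernel level" ;;
        '[g]') echo "guest kernel level" ;;
        '[u]') echo "guest OS userspace" ;;
        '[H]') echo "hypervisor" ;;
        *)     echo "unknown marker" >&2; return 1 ;;
    esac
}

# Example: keep only the kernel-level lines of a plain-text report.
# perf report --stdio | grep -F '[k]'
```

The grep example treats the marker as a fixed string, because square brackets would otherwise be read as a character class.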

Finding code to optimize

For example, let’s examine a firefox process, and sample the data for it. In this example, the firefox process is running as PID 2750.
[Screenshot: perf record on the firefox process]
Executing the perf report command produces a screen like this, listing the various symbols in decreasing order of their overhead:
[Screenshot: perf report output for the firefox process]
With this data, we can identify the functions that generate the highest overhead in our code. Now we can start optimizing them.
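The workflow in this section can be sketched as a small script. It assumes that pgrep is available to look up the process ID by name, and that attaching with the -p option for a fixed time is an acceptable way to collect samples. The helper name record_by_name and the output file name are hypothetical.

```shell
# record_by_name NAME [SECONDS]: hypothetical helper, not part of perf.
# Looks up the oldest process whose name is exactly NAME, samples it for
# SECONDS (default 10), and writes the samples to perf-NAME.data.
record_by_name() {
    local name="$1" seconds="${2:-10}" pid
    pid="$(pgrep -o -x "$name" 2>/dev/null || true)"
    if [ -z "$pid" ]; then
        echo "no running process named '$name'" >&2
        return 1
    fi
    perf record -p "$pid" -o "perf-$name.data" -- sleep "$seconds"
}

# Example, for the firefox process discussed above:
# record_by_name firefox 10
# perf report -i perf-firefox.data
```

Bounding the recording with sleep makes runs repeatable, which helps when comparing a profile taken before an optimization with one taken after it.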
This has been a brief introduction to using perf to profile programs and the system for performance. The perf command has many other options that let you run benchmarks on the system as well as annotate code. For further information on the perf command, visit the Perf wiki.