Performance-Profiling Swift on Linux: Getting Started
Learn how to profile Server-Side Swift with perf on Linux. You’ll discover the basic principles of profiling and how to view events, call-graph-traces and perform basic analysis. By kelvin ma.
Sign up/Sign in
With a free Kodeco account you can download source code, track your progress, bookmark, personalise your learner profile and more!
Create accountAlready a member of Kodeco? Sign in
Sign up/Sign in
With a free Kodeco account you can download source code, track your progress, bookmark, personalise your learner profile and more!
Create accountAlready a member of Kodeco? Sign in
Sign up/Sign in
With a free Kodeco account you can download source code, track your progress, bookmark, personalise your learner profile and more!
Create accountAlready a member of Kodeco? Sign in
Contents
Performance-Profiling Swift on Linux: Getting Started
35 mins
- Getting Started
- The Data File
- The Trading Application
- The Supporting Code
- Running the Application
- Measuring Performance
- Profiling With Perf
- Loading Samples With Perf Script
- Getting Task Information
- Getting Timeline Information
- Getting Landmark Information
- Working With Perf Events
- Collecting Call Graph Traces
- Tracing With Frame Pointers
- Viewing Call Graph Traces
- Working With Call Graph Traces
- Configuring Perf for Swift
- Enabling DWARF Metadata
- Demangling Swift Function Names
- Putting It All Together
- Counting Function Names
- Observing Changes in Performance
- Where to Go From Here?
Measuring Performance
Before jumping into perf
itself, let’s clarify what kind of performance you’re optimizing for. Broadly speaking, there are two categories of performance:
- Latency is the amount of time that passes between the moment you need to compute a result and the moment you receive that result.
- Resource utilization is the amount of processor cycles a computation consumes.
The difference between the two is important because you measure them with different tools. Timers are good at measuring latency. However, timers aren’t great at measuring resource utilization because they keep ticking when a different task is running. If your application await
s a lot, timing won’t tell you much about its resource utilization.
Samplers are an alternative to timers. Samplers interrupt your application randomly and record what code is executing at that instant. Samplers are good at measuring resource utilization because they collect samples only while your application is running.
You can sample an application manually by running it many times and randomly interrupting it to see what stack trace it halts on. The perf
tool automates this process by calling into a CPU component called a performance monitoring unit (PMU). A PMU can sample your application thousands of times per run.
The crypto-bot
application is compute-constrained, not network-constrained. This means it has a resource utilization problem, so you’ll need to use a sampler like perf
to optimize it.
Profiling With Perf
The Docker image you built already comes with perf
installed. To test the perf
installation, sample the sleep
command:
perf record sleep 1
You should see something like this:
[ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.014 MB perf.data (8 samples) ]
Since sleep
spends most of its time sleeping, perf
only recorded a handful of samples from it.
Try sampling crypto-bot
by running the following commands:
perf record .build/release/crypto-bot
This will result in something similar to:
FTM-PERP: 7094 update(s)
...
MEDIA-PERP: 64 update(s)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.210 MB perf.data (5099 samples) ]
Above the perf record
footer, you’ll see that perf
reprinted all the normal output from crypto-bot
as it was running.
If you look closer, you’ll notice perf
collected many more samples from crypto-bot
than it did from sleep
, and it saved them to a file called perf.data. This file is called a performance profile. The perf
tool has another command — script
— that you can use to load samples from a profile.
Loading Samples With Perf Script
Try running perf script
without any arguments:
perf script
This command is interactive. Press space or arrow keys to navigate and Q to exit. If you’re running Docker natively on Linux you’ll see something like:
perf 16 7056.026690: 161 cycles:ppp: ffffffffa1c735bb native_write_msr ([kernel.kallsyms]) perf 16 7056.026696: 1031 cycles:ppp: ffffffffa1c735b4 native_write_msr ([kernel.kallsyms]) perf 16 7056.026702: 6650 cycles:ppp: ffffffffa1c0d345 intel_tfa_pmu_enable_all ([kernel.kallsyms]) crypto-bot 16 7056.026712: 41385 cycles:ppp: ffffffffa208f6d5 apparmor_bprm_committing_creds ([kernel.kallsyms]) crypto-bot 16 7056.026763: 155508 cycles:ppp: ffffffffa1e58909 unmap_region ([kernel.kallsyms]) crypto-bot 16 7056.027434: 233299 cycles:ppp: ffffffffa2296747 _extract_crng ([kernel.kallsyms]) ...
If you’re running Docker on macOS you’ll see something like:
crypto-bot 857 338493.651057: 250000 cpu-clock: ffff800010c7030c [unknown] ([kernel.kallsyms]) crypto-bot 857 338493.651306: 250000 cpu-clock: ffff838a6ce4 _dl_relocate_object (/usr/lib64/ld-2.26.so) crypto-bot 857 338493.651558: 250000 cpu-clock: ffff800010643a10 [unknown] ([kernel.kallsyms]) crypto-bot 857 338493.651807: 250000 cpu-clock: ffff838a6cd8 _dl_relocate_object (/usr/lib64/ld-2.26.so) crypto-bot 857 338493.652056: 250000 cpu-clock: ffff838a5490 _dl_lookup_symbol_x (/usr/lib64/ld-2.26.so) crypto-bot 857 338493.652305: 250000 cpu-clock: ffff838a4c6c do_lookup_x (/usr/lib64/ld-2.26.so) crypto-bot 857 338493.652556: 250000 cpu-clock: ffff838a5490 _dl_lookup_symbol_x (/usr/lib64/ld-2.26.so) crypto-bot 857 338493.652806: 250000 cpu-clock: ffff838a4c2c do_lookup_x (/usr/lib64/ld-2.26.so) crypto-bot 857 338493.653056: 250000 cpu-clock: ffff838b3b10 strcmp (/usr/lib64/ld-2.26.so) crypto-bot 857 338493.653305: 250000 cpu-clock: ffff838a5490 _dl_lookup_symbol_x (/usr/lib64/ld-2.26.so) crypto-bot 857 338493.653556: 250000 cpu-clock: ffff800010643a7c [unknown] ([kernel.kallsyms]) crypto-bot 857 338493.653806: 250000 cpu-clock: ffff838b3b24 strcmp (/usr/lib64/ld-2.26.so) crypto-bot 857 338493.654056: 250000 cpu-clock: ffff838b3ab0 strcmp (/usr/lib64/ld-2.26.so) crypto-bot 857 338493.654306: 250000 cpu-clock: ffff82ff4808 malloc_consolidate (/usr/lib64/libc-2.26.so) crypto-bot 857 338493.654556: 250000 cpu-clock: ffff800010c7030c [unknown] ([kernel.kallsyms]) ...
This output contains a human-readable description of each sample in perf.data. By default, perf record
names its output perf.data, so by default, perf script
looks for a file with this name.
perf script
output can run wide in the terminal. Some of the following output snippets use ellipses (...
) to represent elided output.
You can name the profile something else with the --output
option, or -o
for short. Try generating a profile named profile.perf by rerunning perf record
with the following options:
perf record -o profile.perf .build/release/crypto-bot
To load it in perf script
, use the --input
option, or -i
for short:
perf script -i profile.perf
Each line represents a single sample, so if perf record
recorded 5099 samples, perf script
emits 5099 lines of output. You can customize the output with the --fields
option, or -F
for short.
Getting Task Information
The default perf script
output is complex, so try running perf script
again with the -F comm
option:
perf script -i profile.perf -F comm
The “comm
” keyword stands for “command;” accordingly you will see perf script
print the command associated with each sample:
perf perf crypto-bot crypto-bot
You should see many samples associated with crypto-bot
and a smaller number of samples associated with perf
itself.
You can get the thread group and thread identifier for each sample by enabling the pid
and tid
fields. Pass multiple fields to perf script
as a comma-separated list:
perf script -i profile.perf -F comm,pid,tid
You’ll see a new column of output containing two numbers. The first number is the thread group identifier; the second number is the thread identifier. The perf script
tool always prints the fields in the same order, no matter what order you entered them in.
perf 542/542 perf 542/542 crypto-bot 542/542 crypto-bot 542/542
Because crypto-bot
doesn’t use concurrency, every sample comes from the same thread. If you examined a performance profile from a concurrent application, you would see samples with many thread identifiers, but the same thread group identifier.
Getting Timeline Information
You can get the time stamp for each sample by enabling the time
field. Try adding time
to the fields list:
perf script -i profile.perf -F comm,pid,tid,time
perf 542/542 7040.081164: perf 542/542 7040.081168: crypto-bot 542/542 7040.081176: crypto-bot 542/542 7040.081243:
The timestamps represent seconds elapsed since the system booted up.
Performance profiles can store many kinds of events. To get the event type for each sample, enable the event
field:
perf script -i profile.perf -F comm,pid,tid,time,event
You’ll see a new column containing the event type for each sample. If you have access to hardware events, you’ll see:
perf 542/542 7040.081164: cycles:ppp: perf 542/542 7040.081168: cycles:ppp: crypto-bot 542/542 7040.081176: cycles:ppp: crypto-bot 542/542 7040.081243: cycles:ppp:
All samples in profile.perf came from the cycles:ppp
event. The name means “(very) precise CPU cycles”. This event plugs into a cycle counter in the PMU, and triggers a sample at every multiple of some interval.
If you’re running in an environment where the Docker container doesn’t have access to hardware events (like on macOS), you’ll see:
crypto-bot 861/861 364407.450503: cpu-clock: crypto-bot 861/861 364407.450641: cpu-clock: crypto-bot 861/861 364407.450901: cpu-clock: crypto-bot 861/861 364407.452287: cpu-clock: crypto-bot 861/861 364407.452632: cpu-clock:
In this instance, all events came from the cpu-clock
event. This event is triggered by a timer perf creates and triggers a sample at every multiple of some interval.
You can view the sampling intervals by enabling the period
field:
perf script -i profile.perf -F comm,pid,tid,time,event,period
Note that perf script
formats the period
column before the event
name. On macOS hosts, you’ll see a regular time interval:
crypto-bot 861/861 364407.450503: 250000 cpu-clock: crypto-bot 861/861 364407.450641: 250000 cpu-clock: crypto-bot 861/861 364407.450901: 250000 cpu-clock: crypto-bot 861/861 364407.452287: 250000 cpu-clock:
On Linux, you’ll see a different period:
perf 542/542 7040.081164: 1012 cycles:ppp: perf 542/542 7040.081168: 8668 cycles:ppp: crypto-bot 542/542 7040.081176: 72998 cycles:ppp: crypto-bot 542/542 7040.081243: 358563 cycles:ppp:
The sampling interval varies across samples because the kernel tunes it as your application runs. If you scroll down far enough, you should see the kernel eventually herds it into an equilibrium.
crypto-bot 542/542 7040.086985: 325029 cycles:ppp: crypto-bot 542/542 7040.087241: 322303 cycles:ppp: crypto-bot 542/542 7040.087495: 321438 cycles:ppp: