Performance-Profiling Swift on Linux: Getting Started
Learn how to profile Server-Side Swift with perf on Linux. You’ll discover the basic principles of profiling and how to view events, call-graph-traces and perform basic analysis. By kelvin ma.
Sign up/Sign in
With a free Kodeco account you can download source code, track your progress, bookmark, personalise your learner profile and more!
Create accountAlready a member of Kodeco? Sign in
Sign up/Sign in
With a free Kodeco account you can download source code, track your progress, bookmark, personalise your learner profile and more!
Create accountAlready a member of Kodeco? Sign in
Sign up/Sign in
With a free Kodeco account you can download source code, track your progress, bookmark, personalise your learner profile and more!
Create accountAlready a member of Kodeco? Sign in
Contents
Performance-Profiling Swift on Linux: Getting Started
35 mins
- Getting Started
- The Data File
- The Trading Application
- The Supporting Code
- Running the Application
- Measuring Performance
- Profiling With Perf
- Loading Samples With Perf Script
- Getting Task Information
- Getting Timeline Information
- Getting Landmark Information
- Working With Perf Events
- Collecting Call Graph Traces
- Tracing With Frame Pointers
- Viewing Call Graph Traces
- Working With Call Graph Traces
- Configuring Perf for Swift
- Enabling DWARF Metadata
- Demangling Swift Function Names
- Putting It All Together
- Counting Function Names
- Observing Changes in Performance
- Where to Go From Here?
Getting Landmark Information
Sampling works by taking snapshots of the CPU’s instruction pointer (IP). You can view the instruction pointers in the ip
field:
perf script -i profile.perf -F comm,pid,tid,time,event,period,ip
perf 542/542 7040.081164: 1012 cycles:ppp: ffffffff878735b4 perf 542/542 7040.081168: 8668 cycles:ppp: ffffffff884016f2 crypto-bot 542/542 7040.081176: 72998 cycles:ppp: ffffffff87ad38f0 crypto-bot 542/542 7040.081243: 358563 cycles:ppp: ffffffff88277427
By itself, this field isn’t too useful. To get the string names of the symbols associated with the instruction pointers, enable the sym
field. To get their distances from the instruction pointers, enable the symoff
field, which stands for symbol offsets. Add sym
and symoff
to the list of fields:
perf script -i profile.perf -F comm,pid,tid,time,event,period,ip,sym,symoff
... 1012 cycles:ppp: ffffffff878735b4 native_write_msr+0x4 ... 8668 cycles:ppp: ffffffff884016f2 nmi_restore+0x25 ... 72998 cycles:ppp: ffffffff87ad38f0 install_exec_creds+0x0 ... 358563 cycles:ppp: ffffffff88277427 clear_page_erms+0x7
You can also view the binaries that the instructions originated from in the dso
field. The name stands for dynamic shared object. Try adding it to the list of fields:
perf script -i profile.perf -F comm,pid,tid,time,event,period,ip,sym,symoff,dso
You’ll see the file path to the binary in parentheses after the symbol name and offset, or [kernel.kallsyms]
if the symbol lives in the kernel.
... cycles:ppp: ffffffff878735b4 native_write_msr+0x4 ([kernel.kallsyms]) ... cycles:ppp: ffffffff884016f2 nmi_restore+0x25 ([kernel.kallsyms]) ... cycles:ppp: ffffffff87ad38f0 install_exec_creds+0x0 ([kernel.kallsyms]) ... cycles:ppp: ffffffff88277427 clear_page_erms+0x7 ([kernel.kallsyms])
Scroll down the list of samples and look for binaries you recognize. You should easily find:
- /usr/lib/swift/linux/libswiftCore.so: This is the Swift standard library.
-
/crypto-bot/.build/
/release/crypto-bot : This is thecrypto-bot
executable itself.
Working With Perf Events
It’s possible to capture events besides the CPU cycles
. Although it’s only a subset of the events most PMUs support, perf record
can also capture:
-
instructions: Similar to
cycles
, except it samples on a counter that tracks instructions instead of clock cycles. The two can give varying results because some instructions take multiple cycles to complete. - branches: Samples on a counter that tracks branch instructions. Branch instructions are critical because they are very expensive if incorrectly predicted.
- branch-misses: Samples on a counter that tracks branch predictor misses.
- cache-misses: Samples on a counter that tracks CPU cache misses.
In a normal optimization workflow, you would use the general cycles
event to determine which parts of your code are slow, and then you’d use the more specialized events to figure out why those areas are slow.
You can get the full list of events your PMU supports with the perf list
command:
perf list
List of pre-defined events (to be used in -e):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
...
alignment-faults [Software event]
bpf-output [Software event]
...
Your output might be different. Besides the name of the event, perf list
tells you what kind of event (hardware, software, etc.) it is.
To record an event besides cycles
, pass its name to perf record
’s --event
option, or its shortened form -e
. To record multiple events at once, pass their names as a comma-separated list. For example, to sample branch misses and cache misses, pass the following options:
perf record -o branch-and-cache-misses.perf -e branch-misses,cache-misses .build/release/crypto-bot
Try viewing the output in perf script
:
perf script -i branch-and-cache-misses.perf -F comm,pid,tid,time,event,period,ip,sym,symoff,dso
You will see events of types branch-misses
and cache-misses
in place of cycles:ppp
.
... 1 branch-misses: ffffffff879fcae1 perf_iterate_ctx+0x1 ([kernel.kallsyms]) ... 1 cache-misses: ffffffff879fcae1 perf_iterate_ctx+0x1 ([kernel.kallsyms]) ... 31 branch-misses: ffffffff878f0a61 __update_load_avg_cfs_rq+0x1a1 ([kern... ... 84 cache-misses: ffffffff878f0a61 __update_load_avg_cfs_rq+0x1a1 ([kern...
Collecting Call Graph Traces
So far, you’ve learned how to reconstruct a timeline of crypto-bot
’s execution and read off some basic information about which function frame was at the top of the stack at each point in time. To get a better idea of what’s happening in an application you don’t only want to know what it’s executing at a moment, but also what sequence of calls it took to get there. This information is called a call graph trace. It’s like a stack dump, except there’s one for every sample in the recording.
Tracing With Frame Pointers
One way to capture a call graph trace is to follow the frame pointer (FP). In Swift and many other compiled languages, frame pointers form a linked list so debuggers and profilers can follow to reconstruct a call graph. Rerun perf record
again, this time with the --call-graph fp
option:
perf record -o profile.perf --call-graph fp .build/release/crypto-bot
You’ll see perf
write more than twice as much data to disk than it did last time.
... MEDIA-PERP: 64 update(s) [ perf record: Woken up 3 times to write data ] [ perf record: Captured and wrote 0.630 MB profile.perf (5121 samples) ]
Call graph traces take up a lot of space, which is why perf
doesn’t record them by default.
-fomit-frame-pointer
, you can’t generate call graph traces for it.
Viewing Call Graph Traces
Try viewing the regenerated profile in perf script
:
perf script -i profile.perf -F comm,pid,tid,time,event,period,ip,sym,symoff,dso
In place of the ip
, sym
, symoff
and dso
fields, perf script
now produces multiple rows of these fields for each sample in the profile. Because each sample now spans multiple lines, perf script
now separates samples with empty newlines.
perf 14/14 8358.627472: 1 cycles:ppp: 7fff8e810345 intel_tfa_pmu_enable_all+0x800071803035 ([kernel.kallsyms]) 7fff8e80a598 x86_pmu_enable+0x800071803118 ([kernel.kallsyms]) ... 7fff8e806fd7 do_syscall_64+0x800071803057 ([kernel.kallsyms]) 7fff8f40308c entry_SYSCALL_64_after_hwframe+0x800071803044 ([kernel.k...
The first row is the same as the ip
, sym
, symoff
and dso
fields in the profile without call graph information. The subsequent rows list the function frames that were below it when perf
recorded the sample.
The perf record
tool supports other tracing methods besides frame pointer tracing, including:
- dwarf: Uses compiler-generated metadata in the binaries in place of frame pointers. This generates more detailed information, but also increases profiling overhead.
- lbr: Uses the last branch table in the CPU in place of frame pointers. Only Intel CPUs support this tracing method.
Working With Call Graph Traces
Call graph traces contain a lot of information, not all of which is relevant to your optimization problem.
Often, the startup phase of your application is different from its running phase. You can exclude the startup phase from the recordings by setting a delay with perf record
’s --delay
option, or -D
for short. You express the delay in milliseconds. For example, to make perf record
wait 500 milliseconds before recording:
perf record --delay=500 --call-graph fp -o profile-delayed.perf .build/release/crypto-bot
You should see perf
record fewer samples to disk than it did without the delay.
... [ perf record: Woken up 2 times to write data ] [ perf record: Captured and wrote 0.336 MB profile-delayed.perf (2827 samples) ]
For more flexible time-slicing, you can attach a recorder to an already-running process with the --pid
option:
.build/release/crypto-bot & perf record --call-graph fp -o profile-attached.perf --pid $(pidof crypto-bot)
pidof crypto-bot
doesn’t return anything.
You can also limit the number of samples perf
records overall with the --freq
option, or -F
for short. The name stands for “sampling frequency.” On average, perf
will attempt to collect the number of samples you specify every second. Try setting it to 100
samples per second:
perf record --freq 100 --call-graph fp -o profile-lite.perf .build/release/crypto-bot
You will see perf
recorded only a fraction of the samples it did before.
[ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.027 MB profile-fixed.perf (114 samples) ]