from sys cpu to source code, with perf and flamegraphs

Sep 14th, 2014

Some exploration with perf_events, the performance counters subsystem in Linux, and FlameGraph, the code path profile visualization tool by Brendan Gregg.

The problem

I was trying to explain horrible scaling when running multiple independent tinker molecular modeling processes in parallel. Run alone, a micro-benchmark took under a second; run in parallel, weak scaling efficiency was roughly ~20% — this while well under the host’s core count and memory capacity, where I’d expect something closer to 100% efficiency. Furthermore, it was only a problem on a certain architecture — Intel Nehalem/Westmere, but not Harpertown or AMD Abu Dhabi.

Initial look — high sys cpu, but what?

I took a basic first peek at what’s going on, using top and other procps utils. All processes were at ~100% cpu for the duration, and it was all sys cpu.

So next I took a look at the syscalls, using strace. Running through stracestats, syscalls were only a tiny fraction of the wall time, and represented a constant time for each proc, independent of parallelism. So not the issue.

Then what’s all that sys cpu doing?

Perf to the rescue

This was my first time digging deep into perf, i.e. beyond perf top. It’s amazing — wow should I have added this to my toolbelt sooner. So comprehensive — for profiling, it’s a sun compared to the strace, gdb, etc. lamplights.

I actually went down this route because of the highly-communicative flamegraph plots I’d seen circulating. Flamegraph makes perf even more fun. So I’m following in the footsteps of Brendan Gregg — he wrote flamegraph, wrote the best intro to perf I know of, and has since followed up with a hot talk at LinuxCon.

Back to the problem at hand — the program runs fine solo, and super slow in parallel; what’s the difference? Perf gave the answer, and flamegraph made it easy to see; here’s how:

Run perf (as root) to start profiling the whole system:

basic full system profile

$ perf record -a -F 999 -g -o perf.out

Run the problematic code, analyze (as me), then kill the perf.

Clone flamegraph and put it in the PATH.

Generate a flamegraph from the perf output (grepping out only the “analyze” lines that I are relevant):

creating a flamegraph

$ perf script -i perf.out | stackcollapse-perf.pl > perf.out.folded
$ cat perf.out.folded | grep analyze | flamegraph.pl --title "..." --cp > flamegraph.svg

The results for running solo and for running four in parallel are below. Horizontal widths are time spent in each call (not a time series); going up vertically is going deeper into the call stack. Widths are normalized to total run time (the solo plot represents just under a second; the parallel one 18 seconds).

running solo

running parallel

Aha, page faults! Memory management. Specifically during the initprm_ function call. There is a huge stack of page_fault on top of initprm_ (and kvdw_) in the parallel case compared to the solo case.

Okay, so simple /usr/bin/time had actually already pointed out the huge minor page_fault counts in the parallel case, but this better communicates the story, and it’s only the beginning…

Finding the responsible source code

Make sure analyze is compiled w/ -ggdb and -fno-omit-frame-pointer, and turn down optimization, too (-O0) so things don’t get confusing. Then perf can combine the recorded profile and the program’s source:

annotating source code with profile info

$ perf annotate -i perf.out --symbol=initprm_ --stdio

header … annotation …

That left-hand column is % cpu time for each line of source code, and the corresponding assembly! There were two similar loops later in the output, too. So the issue is initialization of a bunch of 3D double-precision matrices in initprm.f.

Fixing the problem?

Here’s where the story unravels, unfortunately.

The dimension traversal looks good (Fortran uses column-major order)… but maybe it’s a problem jumping from each 8MB matrix to the next? Nope, re-ordering and initializing each separately didn’t get rid of the page faults, only moved them.

Given the slowdown only happened on older Nehalem/Westmere nodes with three memory channels, and the slowdown kicked in particularly bad when running four or more in parallel, that probably has something to do with it…

Or maybe there’s just something about memory that this already shows and I don’t understand (anyone?).

But in the meantime, the hosts that demonstrated the problem have been upgraded (distro point release and 2.6.32-358.11.1 –> 2.6.32-431.17.1 kernel upgrade), and the problem has gone away!

Since the need to actually solve this has dried up, it remains an unsolved mystery.

Summary

Perf is awesome, and flamegraphs make it even more fun. I quickly went from a black box of sys cpu usage, to lines of source code responsible for it. Though this example problem disappeared before I finished a complete analysis, perf was spot-on showing where to look. The exploration was worthwhile, and this is exactly where I’ll pick up next time I hit such a situation.

how to save jobs when you kill storage

May 4th, 2014

It’s a common predicament for admins of dynamic systems that also run long jobs — you need to take down some storage, but there are processes using it that you really don’t want to kill. Here’s a way to save those jobs by live-migrating them to new storage. In short, it’s another case in which gdb let’s you do the sysadmin equivalent of changing the tires while driving down the road.

Do the following for each process id PID involved in each job:

1: Attach with the GNU debugger

attach to the process

$ gdb -p PID

This will also pause the job.

2: Note all the open files

Now list all open files of the process:

list open files

$ lsof -p PID

Search this for the filesystem path that you’re decommissioning, and note the file descriptor (FD column) of every file that needs to move, including the current working directory (FD=cwd), if applicable.

3: Flush data

Back in the gdb session, flush all the file descriptors so that no in-core data is lost; for each numeric file descriptor FILE_DESCRIPTOR run:

flush in-core data

(gdb) call fsync(FILE_DESCRIPTOR)

Flush OS file system buffers, too; in a shell, run:

$ sync

4: Copy the files

Create a directory for the process on the new storage, and copy all affected files from the old storage to the new storage. Note their paths.

5: Change the process’s current working directory

If the process’s current working directory is on the affected storage, change it to the new storage; in the gdb session, run:

change current working directory

(gdb) call chdir("/PATH/TO/NEW/CURRENT/WORKING/DIRECTORY")

6: Change each file descriptor to point to its new file

For each pair of numeric file descriptor FILE_DESCRIPTOR and new storage path /PATH/TO/NEW/FILE, run the following in the gdb session:

swap open files

#note the file descriptor
set var $oldfd = FILE_DESCRIPTOR
#get the access mode and flags of the original file
call fcntl($oldfd, 3)
#open the new file, with those flags
call open("/PATH/TO/NEW/FILE", $)
#note the new file's file descriptor
set var $newfd = $
#get the offset in the original file
call lseek($oldfd, 0, 1)
#set the offset to the identical position in the new file
call lseek($newfd, $, 0)
#copy the new file descriptor to the old one
call dup2($newfd, $oldfd)
#close the new file descriptor
call close($newfd)

Some constants are used above:

The 3 in fcntl($oldfd, 3) is F_GETFL, from /usr/include/bits/fcntl.h
The 1 in lseek($oldfd, 0, 1) is SEEK_CUR, from /usr/include/fcntl.h
The 0 in lseek($newfd, $, 0) is SEEK_SET, from /usr/include/fcntl.h

7: Check

Now when you look at the open files:

$ lsof -p PID

you should see all files paths have changed from the old storage to the new storage.

That’s it!

You can quit gdb, which will resume the process:

(gdb) quit

Notes

The job could have absolute paths stored in memory, and after resuming may try to open files on the old storage. If you can, setup up a symbolic link or some other trick to redirect it to the new storage.

If you’re keeping the same path (e.g. upgrading storage or just taking a downtime), you can just switch the job to a temporary filesystem, swap out the primary storage, and re-switch back the storage to the original location. Then there are no concerns about the process trying to use files in the “old” path, since it will be the same.

Beware of processes that have open network connections and could hit TCP timeouts if the switch takes too long.