It’s a common predicament for admins of dynamic systems that also run long jobs: you need to take down some storage, but there are processes using it that you really don’t want to kill. Here’s a way to save those jobs by live-migrating them to new storage. In short, it’s another case in which gdb lets you do the sysadmin equivalent of changing the tires while driving down the road.
Do the following for each process id PID involved in each job:
1: Attach with the GNU debugger
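For example, assuming PID is the numeric process id of the job:

    gdb -p PID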
This will also pause the job.
2: Note all the open files
Now list all open files of the process:
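A tool like lsof works well here; for example:

    lsof -p PID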
Search this for the filesystem path that you’re decommissioning, and note the file descriptor (FD column) of every file that needs to move, including the current working directory (FD=cwd), if applicable.
3: Flush data
Back in the gdb session, flush all the file descriptors so that no in-core data is lost; for each numeric file descriptor FILE_DESCRIPTOR, run:
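One way to do this is to call fsync(2) on the descriptor; for example (if gdb complains about an unknown return type, cast the call, e.g. call (int) fsync(FILE_DESCRIPTOR)):

    call fsync(FILE_DESCRIPTOR)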
Flush OS file system buffers, too; in a shell, run:
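For example:

    sync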
4: Copy the files
Create a directory for the process on the new storage, and copy all affected files from the old storage to the new storage. Note their paths.
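For example, with hypothetical paths and file names (/oldfs for the storage being retired, /newfs for its replacement, output.dat for one of the job’s files):

    mkdir -p /newfs/job-PID
    cp -a /oldfs/output.dat /newfs/job-PID/output.dat

cp -a preserves ownership, permissions and timestamps along with the data.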
5: Change the process’s current working directory
If the process’s current working directory is on the affected storage, change it to the new storage; in the gdb session, run:
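One way is to call chdir(2) from the gdb prompt; for example, with /PATH/TO/NEW/DIRECTORY standing in for the new location:

    call chdir("/PATH/TO/NEW/DIRECTORY")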
6: Change each file descriptor to point to its new file
For each pair of numeric file descriptor FILE_DESCRIPTOR and new storage path /PATH/TO/NEW/FILE, run the following in the gdb session:
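A sketch of the descriptor swap, using only the calls referenced in the constants listed below (fcntl(2), open(2), lseek(2), dup2(2), close(2)); $flags is just a convenience variable introduced here, and if gdb complains about unknown return types, cast the calls (e.g. call (int) dup2($newfd, $oldfd)):

    set var $oldfd = FILE_DESCRIPTOR
    # read the old descriptor's open flags (3 is F_GETFL)
    set var $flags = fcntl($oldfd, 3)
    # open the copy on the new storage with the same flags
    set var $newfd = open("/PATH/TO/NEW/FILE", $flags)
    # get the current offset of the old descriptor (1 is SEEK_CUR)
    call lseek($oldfd, 0, 1)
    # move the new descriptor to that offset; $ is the value just returned (0 is SEEK_SET)
    call lseek($newfd, $, 0)
    # make the old descriptor number refer to the new file
    call dup2($newfd, $oldfd)
    # close the temporary descriptor
    call close($newfd)

After this, the process keeps its original descriptor number, but it now refers to the copy on the new storage.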
Some constants are used above:
- The 3 in fcntl($oldfd, 3) is F_GETFL, from /usr/include/bits/fcntl.h
- The 1 in lseek($oldfd, 0, 1) is SEEK_CUR, from /usr/include/fcntl.h
- The 0 in lseek($newfd, $, 0) is SEEK_SET, from /usr/include/fcntl.h
7: Check
Now when you look at the open files:
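For example, with lsof again:

    lsof -p PID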
you should see that all file paths have changed from the old storage to the new storage.
That’s it!
You can quit gdb, which will resume the process:
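For example:

    quit

gdb will ask you to confirm that it should detach from the running process; you can also run detach explicitly before quitting.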
Notes
The job could have absolute paths stored in memory, and after resuming it may try to open files on the old storage. If you can, set up a symbolic link or some other trick to redirect it to the new storage.
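For example, once the old filesystem is unmounted (paths hypothetical):

    ln -s /newfs /oldfs

Any absolute path the job remembers under /oldfs will then resolve to the new storage.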
If you’re keeping the same path (e.g. upgrading storage or just taking a downtime), you can just switch the job to a temporary filesystem, swap out the primary storage, and then switch the job back to the original location. Then there are no concerns about the process trying to use files in the “old” path, since it will be the same.
Beware of processes that have open network connections and could hit TCP timeouts if the switch takes too long.