Mac OS X Sierra and Ruby debugging: An unhappy marriage
In my new role I've found myself having to debug Ruby native gems again. Debugging C code is normally a pretty mundane task involving lots of poking around with gdb
, and one I've done many times before in a previous life. However, my employer's chosen development machines are MacBooks and I recently updated mine to OS X 10.12.2 (Sierra). With this new Mac OS version things started to unravel pretty quickly.
Developing on Mac OS X for deployment on Linux is usually quite transparent thanks to Homebrew
. Typically, the steps to debug a native Ruby gem look like this:
- Build
cruby
from source in debug mode (not strictly necessary but I find it helpful when learning about Ruby's APIs since you get info about args, locals, etc):
brew install openssl libyaml libffi
CXXFLAGS="-O0 -g3" CFLAGS="-O0 -g3" ./configure --disable-install-doc --with-openssl-dir="$(brew --prefix openssl)" --prefix=/Users/Stefan/bin/ruby
make -j install
- Build your gem in debug mode too (this can be done by editing
extconf.rb
and settingoptflags
ordebugflags
appropriately, see here for more helpful tips):
CONFIG['optflags'] = '-O0 -g3'
Followed by building: rake clean && rake build
- Start your test under the debugger:
gdb --args /Users/Stefan/bin/ruby/bin/ruby -I"lib" -I"/Users/Stefan/bin/gems/gems/rake-10.5.0/lib" "/Users/Stefan/bin/gems/gems/rake-10.5.0/lib/rake/rake_test_loader.rb" "test/test_my_gem.rb"
Unfortunately it turns out the OS X Sierra has multiple known issue with gdb
and Homebrew
currently cherry-picks an upstream patch submitted on November 9th, 2016 to even have gdb
start up. I went through the dance, including codesigning (codesign -fs gdb_codesign `which gdb`
). I was then met with:
warning: unhandled dyld version (15)
The usual info shared
command did not work and my gem was not being picked up. Some investigation showed that Apple has updated dyld to version 15, which is not yet supported by gdb
. I bumped the DYLD_VERSION_MAX
as had been done for version 14 and built my own gdb from the latest source:
CXXFLAGS="-Wno-c++11-narrowing -O0 -g3" CFLAGS="-Wno-c++11-narrowing -O0 -g3" ./configure --enable-debug --disable-build-warnings --prefix=/Users/Stefan/Desktop/gdb-patched-bin
This time I got a bit further:
Cannot insert breakpoint -1
Cannot access memory at address 0xe85d
Investigation showed that 0xe85d
is the offset of the gdb_image_notifier
function in /usr/lib/dyld
. Both lldb
and gdb
transparently set a breakpoint on this function in order to intercept dynamic libraries being loaded. After some poking around for a readelf
equivalent for Mac OS' Mach-O format I came upon otool
and borrowed another laptop running Mac OS X 10.11.6 (El Capitan) for comparison purposes:
Mac OS X 10.11.6 (El Capitan):
otool -l /usr/lib/dyld
Mach header
magic cputype cpusubtype caps filetype ncmds sizeofcmds flags
0xfeedfacf 16777223 3 0x00 7 14 1696 0x00000085
Load command 0
cmd LC_SEGMENT_64
cmdsize 552
segname __TEXT
vmaddr 0x00007fff5fc00000
vmsize 0x0000000000038000
fileoff 0
filesize 229376
maxprot 0x00000007
initprot 0x00000005
nsects 6
flags 0x0
Mac OS X 10.12.2 (Sierra):
otool -l /usr/lib/dyld
/usr/lib/dyld:
Mach header
magic cputype cpusubtype caps filetype ncmds sizeofcmds flags
0xfeedfacf 16777223 3 0x00 7 14 1696 0x00000085
Load command 0
cmd LC_SEGMENT_64
cmdsize 552
segname __TEXT
vmaddr 0x0000000000000000
vmsize 0x000000000003e000
fileoff 0
filesize 253952
maxprot 0x00000007
initprot 0x00000005
nsects 6
flags 0x0
It became apparent that vmaddr
has changed and that gdb
is now attempting to set breakpoints at a raw offset instead of the final relocated value. This theory is corroborated by the output of the Mac /proc/[pid]/maps
equivalent vmmap
:
Process: gdb [91820]
Path: /Users/Stefan/Desktop/gdb-patched-bin/bin/gdb
Load Address: 0x100000000
...
==== Non-writable regions for process 91820
REGION TYPE START - END [ VSIZE] PRT/MAX SHRMOD REGION DETAIL
__TEXT 0000000100000000-000000010064e000 [ 6456K] r-x/rwx SM=COW /Users/Stefan/Desktop/gdb-patched-bin/bin/gdb
__LINKEDIT 000000010077a000-0000000100a91000 [ 3164K] r--/rwx SM=COW /Users/Stefan/Desktop/gdb-patched-bin/bin/gdb
__TEXT 0000000100a91000-0000000100acf000 [ 248K] r-x/rwx SM=COW /usr/lib/dyld
__LINKEDIT 0000000100b06000-0000000100b1c000 [ 88K] r--/rwx SM=COW /usr/lib/dyld
...
Some digging indicates this change my have to do with ASLR:
Slide can be computed by the difference between running __TEXT vmaddr and the one read from disk.
I opened a bug against gdb
and hacked up a crappy fix. While gdb
now starts up properly and info shared
behaves as expected I was still unable to set breakpoints in my gem.
Error in re-setting breakpoint 1: Cannot access memory at address 0x3fe0
I added some more debug printf lines to my gdb
and found that some object files seem to have the wrong memory addresses embedded in their DWARF2
debug info at DW_AT_low_pc
:
Good example: Base addr 0x100001060 for object file: /Users/Stefan/src/ruby/main.o
Bad example: Base addr 0x3fe0 for object file: /Users/Stefan/src/my_gem/tmp/x86_64-darwin16/my_gem/2.3.3/my_gem.o
This finding indicates that there may be something more fundamentally wrong about relocations on Sierra and gdb 7.12
. I decided to give lldb
a try and found that all my forked children in the test would fail instantly. I enabled core dumps with ulimit -c unlimited
and found the the following:
lldb `which ruby` -c /cores/core.83458
(lldb) target create "/Users/Stefan/bin/ruby/bin/ruby" --core "/cores/core.83458"
(lldb) bt
* thread #1: tid = 0x0000, 0x000000010040885e dyld`gdb_image_notifier(dyld_image_mode, unsigned int, dyld_image_info const*) + 1, stop reason = signal SIGSTOP
* frame #0: 0x000000010040885e dyld`gdb_image_notifier(dyld_image_mode, unsigned int, dyld_image_info const*) + 1
(lldb) register read
General Purpose Registers:
rax = 0x0000000100438000 dyld`dyld_all_image_infos
rbx = 0x0000000000000001
rcx = 0x0000000066a314de
rdx = 0x00007fff5fbf07c0
rdi = 0x0000000000000000
rsi = 0x0000000000000001
rbp = 0x00007fff5fbf05b0
rsp = 0x00007fff5fbf0588
r8 = 0x000000000000000f
r9 = 0x0000000000000000
r10 = 0x000000000000002c
r11 = 0x0000000000000206
r12 = 0x00007fff5fbf07c0
r13 = 0x0000000000000014
r14 = 0x00007fff5fbf07c0
r15 = 0x00001ed1d63963b8
rip = 0x000000010040885e dyld`gdb_image_notifier(dyld_image_mode, unsigned int, dyld_image_info const*) + 1
rflags = 0x0000000000000246
cs = 0x000000000000002b
fs = 0x0000000000000000
gs = 0x0000000000000000
I tried gdb
as well (with the semi-broken patch I made) and got a similar core dump. (In fact worse, since the rip
for gdb
is 0x0
and Ruby's own signal handler would attempt to walk backwards to find a stack trace and would proceed to SIGSEGV
- I'm still not sure why that happened, but gdb
seemed to die in the same place and with the same SIGSTOP
as lldb
).
A bit of searching came upon an interesting post:
lldb sets a breakpoint on gdb_image_notifier() in order to catch libraries being loaded ... apparently the debugger did not attach to the new process but incorrectly left its breakpoint active there. When the breakpoint is hit again without the debugger waiting to handle it, the process crashes.
Disassembling some instructions in the core file shows a very suspicious INT 3 which would definitely cause the crash outside a debugger:
(lldb) disassemble --start-address 0x000000010040885e-0x1 -c 100
dyld`gdb_image_notifier:
0x10040885d <+0>: int3
-> 0x10040885e <+1>: movq %rsp, %rbp
...
So while this sounds like a plausible theory, I'm not sure how this mechanism would ever work since lldb
does not support follow-fork-mode child
and in fact gdb
only supports it on Linux:
On some systems, gdb provides support for debugging programs that create additional processes using the fork or vfork functions. On gnu/Linux platforms, this feature is supported with kernel version 2.5.46 and later.
Unfortunately this situation is completely broken and leaves us in a really bad state. cruby
(like cpython
) is a runtime that relies on a single global lock for executing bytecodes in its interpreter loop. While this design vastly simplifies the VM implementation, it also means the concurrency model is process rather than thread oriented (contrast with the JVM). fork
is an integral part of any reasonably complex Ruby program, and not being able to debug forked children easily on Mac OS disqualifies it as a serious development platform for Ruby users.
I believe there are currently multiple compounding issues:
dyld 15
and Mac OS 10.12+ have modified their relocation behaviour andgdb
in its current state cannot debug shared libraries on it anymore. This is a huge problem that I partially overcame with a shitty patch.- For reasons I do not understand at this time the
DWARF2
debug info for some objects also contains offsets whichgdb
does not relocate appropriately, so even with the patch you may not be able to debug shared libraries properly unless you do more hacks to offset the bad locations appropriately. - To my knowledge no native debugger supports
follow-fork-mode child
andset detach-on-fork on
on Mac OS and the alternatives are pretty lacklustre. - Both
lldb
andgdb
seem to crash indyld
's notification mechanism when children are forked. I do not recall this behaviour happening in El Capitan (though I don't have easy access to a machine running it to check). I do not yet have a minimal test case, and a trivial "Hello World" stylefork
+require
program did not exhibit this behaviour.
At this point I've gone way too far down the tooling rabbit hole and will instead set up a Virtualbox VM with Linux so I can use the tooling I know and expect to work. I'll be monitoring the upstream bug report for any comments from gdb
maintainers.
Should you have any ideas please do leave a comment below. Thanks for reading!