Mac OS X Sierra and Ruby debugging: An unhappy marriage

In my new role I've found myself having to debug Ruby native gems again. Debugging C code is normally a pretty mundane task involving lots of poking around with gdb, and one I've done many times before in a previous life. However, my employer's chosen development machines are MacBooks and I recently updated mine to OS X 10.12.2 (Sierra). With this new Mac OS version things started to unravel pretty quickly.

Developing on Mac OS X for deployment on Linux is usually quite transparent thanks to Homebrew. Typically, the steps to debug a native Ruby gem look like this:

  • Build cruby from source in debug mode (not strictly necessary but I find it helpful when learning about Ruby's APIs since you get info about args, locals, etc):
brew install openssl libyaml libffi  
CXXFLAGS="-O0 -g3" CFLAGS="-O0 -g3" ./configure --disable-install-doc --with-openssl-dir="$(brew --prefix openssl)" --prefix=/Users/Stefan/bin/ruby  
make -j install  
  • Build your gem in debug mode too (this can be done by editing extconf.rb and setting optflags or debugflags appropriately, see here for more helpful tips):
CONFIG['optflags'] = '-O0 -g3'  

Followed by building: rake clean && rake build

  • Start your test under the debugger:
gdb --args /Users/Stefan/bin/ruby/bin/ruby -I"lib" -I"/Users/Stefan/bin/gems/gems/rake-10.5.0/lib" "/Users/Stefan/bin/gems/gems/rake-10.5.0/lib/rake/rake_test_loader.rb" "test/test_my_gem.rb"  

Unfortunately it turns out the OS X Sierra has multiple known issue with gdb and Homebrew currently cherry-picks an upstream patch submitted on November 9th, 2016 to even have gdb start up. I went through the dance, including codesigning (codesign -fs gdb_codesign `which gdb`). I was then met with:

warning: unhandled dyld version (15)  

The usual info shared command did not work and my gem was not being picked up. Some investigation showed that Apple has updated dyld to version 15, which is not yet supported by gdb. I bumped the DYLD_VERSION_MAX as had been done for version 14 and built my own gdb from the latest source:

CXXFLAGS="-Wno-c++11-narrowing -O0 -g3" CFLAGS="-Wno-c++11-narrowing -O0 -g3" ./configure --enable-debug --disable-build-warnings --prefix=/Users/Stefan/Desktop/gdb-patched-bin  

This time I got a bit further:

Cannot insert breakpoint -1  
Cannot access memory at address 0xe85d  

Investigation showed that 0xe85d is the offset of the gdb_image_notifier function in /usr/lib/dyld. Both lldb and gdb transparently set a breakpoint on this function in order to intercept dynamic libraries being loaded. After some poking around for a readelf equivalent for Mac OS' Mach-O format I came upon otool and borrowed another laptop running Mac OS X 10.11.6 (El Capitan) for comparison purposes:

Mac OS X 10.11.6 (El Capitan):

otool -l /usr/lib/dyld  
Mach header  
      magic cputype cpusubtype  caps    filetype ncmds sizeofcmds      flags
 0xfeedfacf 16777223          3  0x00           7    14       1696 0x00000085
Load command 0  
      cmd LC_SEGMENT_64
  cmdsize 552
  segname __TEXT
   vmaddr 0x00007fff5fc00000
   vmsize 0x0000000000038000
  fileoff 0
 filesize 229376
  maxprot 0x00000007
 initprot 0x00000005
   nsects 6
    flags 0x0

Mac OS X 10.12.2 (Sierra):

otool -l /usr/lib/dyld  
/usr/lib/dyld:
Mach header  
      magic cputype cpusubtype  caps    filetype ncmds sizeofcmds      flags
 0xfeedfacf 16777223          3  0x00           7    14       1696 0x00000085
Load command 0  
      cmd LC_SEGMENT_64
  cmdsize 552
  segname __TEXT
   vmaddr 0x0000000000000000
   vmsize 0x000000000003e000
  fileoff 0
 filesize 253952
  maxprot 0x00000007
 initprot 0x00000005
   nsects 6
    flags 0x0

It became apparent that vmaddr has changed and that gdb is now attempting to set breakpoints at a raw offset instead of the final relocated value. This theory is corroborated by the output of the Mac /proc/[pid]/maps equivalent vmmap:

Process:         gdb [91820]  
Path:            /Users/Stefan/Desktop/gdb-patched-bin/bin/gdb  
Load Address:    0x100000000  
...
==== Non-writable regions for process 91820
REGION TYPE                      START - END             [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL  
__TEXT                 0000000100000000-000000010064e000 [ 6456K] r-x/rwx SM=COW  /Users/Stefan/Desktop/gdb-patched-bin/bin/gdb  
__LINKEDIT             000000010077a000-0000000100a91000 [ 3164K] r--/rwx SM=COW  /Users/Stefan/Desktop/gdb-patched-bin/bin/gdb  
__TEXT                 0000000100a91000-0000000100acf000 [  248K] r-x/rwx SM=COW  /usr/lib/dyld  
__LINKEDIT             0000000100b06000-0000000100b1c000 [   88K] r--/rwx SM=COW  /usr/lib/dyld  
...

Some digging indicates this change my have to do with ASLR:

Slide can be computed by the difference between running __TEXT vmaddr and the one read from disk.

I opened a bug against gdb and hacked up a crappy fix. While gdb now starts up properly and info shared behaves as expected I was still unable to set breakpoints in my gem.

Error in re-setting breakpoint 1: Cannot access memory at address 0x3fe0  

I added some more debug printf lines to my gdb and found that some object files seem to have the wrong memory addresses embedded in their DWARF2 debug info at DW_AT_low_pc:

Good example: Base addr 0x100001060 for object file: /Users/Stefan/src/ruby/main.o
Bad example: Base addr 0x3fe0 for object file: /Users/Stefan/src/my_gem/tmp/x86_64-darwin16/my_gem/2.3.3/my_gem.o

This finding indicates that there may be something more fundamentally wrong about relocations on Sierra and gdb 7.12. I decided to give lldb a try and found that all my forked children in the test would fail instantly. I enabled core dumps with ulimit -c unlimited and found the the following:

lldb `which ruby` -c  /cores/core.83458  
(lldb) target create "/Users/Stefan/bin/ruby/bin/ruby" --core "/cores/core.83458"
(lldb) bt
* thread #1: tid = 0x0000, 0x000000010040885e dyld`gdb_image_notifier(dyld_image_mode, unsigned int, dyld_image_info const*) + 1, stop reason = signal SIGSTOP
  * frame #0: 0x000000010040885e dyld`gdb_image_notifier(dyld_image_mode, unsigned int, dyld_image_info const*) + 1
(lldb) register read
General Purpose Registers:  
       rax = 0x0000000100438000  dyld`dyld_all_image_infos
       rbx = 0x0000000000000001
       rcx = 0x0000000066a314de
       rdx = 0x00007fff5fbf07c0
       rdi = 0x0000000000000000
       rsi = 0x0000000000000001
       rbp = 0x00007fff5fbf05b0
       rsp = 0x00007fff5fbf0588
        r8 = 0x000000000000000f
        r9 = 0x0000000000000000
       r10 = 0x000000000000002c
       r11 = 0x0000000000000206
       r12 = 0x00007fff5fbf07c0
       r13 = 0x0000000000000014
       r14 = 0x00007fff5fbf07c0
       r15 = 0x00001ed1d63963b8
       rip = 0x000000010040885e  dyld`gdb_image_notifier(dyld_image_mode, unsigned int, dyld_image_info const*) + 1
    rflags = 0x0000000000000246
        cs = 0x000000000000002b
        fs = 0x0000000000000000
        gs = 0x0000000000000000

I tried gdb as well (with the semi-broken patch I made) and got a similar core dump. (In fact worse, since the rip for gdb is 0x0 and Ruby's own signal handler would attempt to walk backwards to find a stack trace and would proceed to SIGSEGV - I'm still not sure why that happened, but gdb seemed to die in the same place and with the same SIGSTOP as lldb).

A bit of searching came upon an interesting post:

lldb sets a breakpoint on gdb_image_notifier() in order to catch libraries being loaded ... apparently the debugger did not attach to the new process but incorrectly left its breakpoint active there. When the breakpoint is hit again without the debugger waiting to handle it, the process crashes.

Disassembling some instructions in the core file shows a very suspicious INT 3 which would definitely cause the crash outside a debugger:

(lldb) disassemble --start-address 0x000000010040885e-0x1 -c 100
dyld`gdb_image_notifier:  
    0x10040885d <+0>:   int3
->  0x10040885e <+1>:   movq   %rsp, %rbp
...

So while this sounds like a plausible theory, I'm not sure how this mechanism would ever work since lldb does not support follow-fork-mode child and in fact gdb only supports it on Linux:

On some systems, gdb provides support for debugging programs that create additional processes using the fork or vfork functions. On gnu/Linux platforms, this feature is supported with kernel version 2.5.46 and later.

Unfortunately this situation is completely broken and leaves us in a really bad state. cruby (like cpython) is a runtime that relies on a single global lock for executing bytecodes in its interpreter loop. While this design vastly simplifies the VM implementation, it also means the concurrency model is process rather than thread oriented (contrast with the JVM). fork is an integral part of any reasonably complex Ruby program, and not being able to debug forked children easily on Mac OS disqualifies it as a serious development platform for Ruby users.

I believe there are currently multiple compounding issues:

  1. dyld 15 and Mac OS 10.12+ have modified their relocation behaviour and gdb in its current state cannot debug shared libraries on it anymore. This is a huge problem that I partially overcame with a shitty patch.
  2. For reasons I do not understand at this time the DWARF2 debug info for some objects also contains offsets which gdb does not relocate appropriately, so even with the patch you may not be able to debug shared libraries properly unless you do more hacks to offset the bad locations appropriately.
  3. To my knowledge no native debugger supports follow-fork-mode child and set detach-on-fork on on Mac OS and the alternatives are pretty lacklustre.
  4. Both lldb and gdb seem to crash in dyld's notification mechanism when children are forked. I do not recall this behaviour happening in El Capitan (though I don't have easy access to a machine running it to check). I do not yet have a minimal test case, and a trivial "Hello World" style fork + require program did not exhibit this behaviour.

At this point I've gone way too far down the tooling rabbit hole and will instead set up a Virtualbox VM with Linux so I can use the tooling I know and expect to work. I'll be monitoring the upstream bug report for any comments from gdb maintainers.

Should you have any ideas please do leave a comment below. Thanks for reading!