Inside a Hello World executable on OS X

Oct 26 2018

This post gives a fairly thorough breakdown of the contents of a "Hello World" executable on OS X 10.13.3 (High Sierra). The source used to generate the executable is as follows:

#include <stdio.h>

int main()
{
    printf("Hello World!\n");
    return 0;
}

You might find useful information here if you

are curious about how an executable is structured on a modern *nix OS,
need to manipulate object files in the Mach-O format, or
are interested in the inner workings of dynamic linking on OS X.

Official documentation for the Mach-O object file format is sparse, and much of the unofficial documentation available – while still very valuable – is out of date in crucial respects. For example, Z. Liu's minimal Mach-O executable doesn't work on recent OS X versions. Aidan Steele's useful guide leaves out some crucial features of modern Mach-O executables.

Background knowledge

I assume that you're familiar with basic concepts from low-level programming (pointers, memory addresses, registers, the stack, etc.). No detailed knowledge of x86-64 assembly is required. However, it would be helpful to have a rough idea of what the MOV, JMP, CALL and LEA instructions do.

The dynamic linker makes extensive use of LEB128 encoding. Briefly, LEB128 encodes integer values of arbitrary size as variable-length sequences of bytes. Every byte except the last has its most significant bit set. The integer encoded is given by concatenating the lowest 7 bits of each byte, least significant group first (little-endian).
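
A minimal sketch of an unsigned LEB128 decoder in C may make the encoding more concrete. The function name `read_uleb128` and its signature are illustrative and not taken from any Apple header.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one unsigned LEB128 value starting at p (at most `len` bytes).
 * Stores the value in *out and returns the number of bytes consumed,
 * or 0 if the input is truncated or does not fit in 64 bits. */
size_t read_uleb128(const uint8_t *p, size_t len, uint64_t *out)
{
    uint64_t value = 0;
    unsigned shift = 0;

    for (size_t i = 0; i < len; i++) {
        if (shift >= 64)
            return 0;                       /* too long for uint64_t */
        value |= (uint64_t)(p[i] & 0x7f) << shift;
        if ((p[i] & 0x80) == 0) {           /* high bit clear: last byte */
            *out = value;
            return i + 1;
        }
        shift += 7;
    }
    return 0;                               /* ran out of input */
}
```

For example, the two bytes E0 1E decode to 0x60 + (0x1E << 7) = 3936.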

OS X on x86-64 uses the System V calling conventions. The details of these conventions are not relevant here, but it would be helpful to skim Wikipedia's description. The important things to bear in mind are that (i) not all arguments to a function are passed on the stack and that (ii) there is a convention for determining which arguments go in which registers.

Mach-O executables generated by the standard Xcode tools have a zero page to ensure that dereferencing of a null pointer is trapped by the OS.

Useful command line tools

OS X has the following tools for dumping the contents of Mach-O files:

List segments/sections:

    otool -l a.out

Show dyld opcodes (location of dyldinfo varies with Xcode version):

    /Library/Developer/CommandLineTools/usr/bin/dyldinfo -opcodes a.out
    /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/dyldinfo -opcodes a.out

jtool is a cross-platform alternative:

http://www.newosxbook.com/tools/jtool.html

RIP-relative addressing modes in x86-64

x86-64 has a number of instruction-pointer-relative addressing modes. Roughly speaking, any instruction that takes an address operand can also take a memory address specified as a signed 32-bit offset from the value of RIP, the instruction pointer register. The addition of RIP-relative addressing reduces the cost of position-independent code in terms of code size and performance.

As an example, take the following jmp instruction. The address to jump to is stored in a given location in memory. The address of this memory location is specified not in absolute terms, but relative to the value that RIP has following decoding of the jmp instruction. As the offset specified is 0x61, and the jmp instruction itself occupies 6 bytes, the target is the address stored at the address of the first byte of the jmp instruction plus 0x67.

jmp    QWORD PTR [rip+0x61]  # jump to address of this instruction + 0x67

On OS X, or any other modern general purpose operating system, an executable never knows at which address it is going to be loaded, or at which addresses any shared libraries it makes use of are going to be loaded. Any absolute addresses contained in the executable must therefore be translated prior to execution. The use of RIP-relative addressing ensures that relatively few of these relocations need to be performed by the dynamic linker. For a more general introduction to linkers and the concept of relocations, I recommend Ian Lance Taylor's series of blog posts (post 6 in particular).

The structure of the executable

A Mach-O file consists of:

A Mach64 header.
A sequence of load commands, some of which specify the size and location of code and data segments.
Data for the segments described in the load commands.

Segments can be split into multiple sections. This document does not cover the format of the header or the load commands. This information is collected in Aidan Steele's guide.
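
Although the header and load command formats are not covered here, the overall "header followed by back-to-back load commands" layout can be sketched in a few lines of C. The structs below are local mirrors of `mach_header_64` and `load_command` from `<mach-o/loader.h>`, so the sketch compiles on any platform. The function name `count_load_commands` is hypothetical, and a little-endian host is assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MH_MAGIC_64_VALUE 0xfeedfacfu   /* value of MH_MAGIC_64 */

/* Local mirrors of mach_header_64 and load_command. */
struct header64 {
    uint32_t magic, cputype, cpusubtype, filetype;
    uint32_t ncmds, sizeofcmds, flags, reserved;
};
struct load_cmd { uint32_t cmd, cmdsize; };

/* Walk the load commands of a 64-bit Mach-O image held in buf.
 * Returns how many commands have type `wanted`, or -1 if the image
 * is malformed. */
int count_load_commands(const uint8_t *buf, size_t len, uint32_t wanted)
{
    struct header64 h;
    if (len < sizeof h) return -1;
    memcpy(&h, buf, sizeof h);
    if (h.magic != MH_MAGIC_64_VALUE) return -1;

    size_t off = sizeof h;              /* commands start after the header */
    int found = 0;
    for (uint32_t i = 0; i < h.ncmds; i++) {
        struct load_cmd lc;
        if (len - off < sizeof lc) return -1;
        memcpy(&lc, buf + off, sizeof lc);
        if (lc.cmdsize < sizeof lc || lc.cmdsize > len - off) return -1;
        if (lc.cmd == wanted) found++;
        off += lc.cmdsize;              /* commands are laid out back to back */
    }
    return found;
}
```

Passing 0x19 (the value of LC_SEGMENT_64) as `wanted` counts the segment load commands.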

If you're on OS X, the MachOView utility is useful for browsing the structure of Mach-O files.

The overall layout of our example executable is as follows.

Mach64 Header

No surprises here. This specifies the architecture, the number of load commands, and the size of the load commands. Details here.

__PAGEZERO segment load command

This command loads a segment that occupies the first 4GB of the process's memory, but that takes up no space in the executable file. The essential purpose of the __PAGEZERO segment is to ensure that null pointer dereferences are trapped. This is achieved by ensuring that no protection rights are assigned to the segment — it is neither readable, writable nor executable. See this StackOverflow question for discussion of why the (virtual) size of this segment is so large.

__TEXT segment load command

This command loads the text segment, which is split into multiple sections.

__text section

The __text section contains the code for the main function.

__stubs and __stub_helper sections

The code in the __stubs and __stub_helper sections is crucially involved in calls to dynamically linked functions. Dynamic linking is covered in more detail in the subsequent section ‘Lazy vs. non-lazy symbol binding’.

__cstring section

This section contains zero-terminated C string constants. In our executable there is one such constant: "Hello World!\n".

__unwind_info section

From the man page for the unwinddump utility:

When a C++ (or x86_64 Objective-C) exception is thrown, the runtime must unwind the stack looking for some function to catch the exception. Traditionally, the unwind information is stored in the __TEXT/__eh_frame section of each executable as Dwarf CFI (call frame information). Beginning in Mac OS X 10.6, the unwind information is also encoded in the __TEXT/__unwind_info section using a two-level lookup table of compact unwind encodings.

The comment on __TEXT/__eh_frame is somewhat out of date, as the current Xcode tools omit this section. This appears to be a relatively recent change (see e.g. this blog post and this LLVM mailing list post from 2014).

__DATA segment load command

In a more complex executable, this segment contains actual program data. In our executable, it contains only lazy and non-lazy symbol pointers (in the sections __la_symbol_ptr and __nl_symbol_ptr respectively). These symbol pointers are involved in calling dynamically linked functions. Details on how this works are in the section ‘Lazy vs. non-lazy symbol binding’.

__LINKEDIT segment load command

This segment contains data interpreted by the dynamic linker. Its internal structure is specified in an additional DYLD_INFO_ONLY load command.

DYLD_INFO_ONLY load command

This load command specifies the internal structure of the __LINKEDIT segment. In particular, it gives the offset and size of

some bytecode interpreted by OS X's dynamic linker, and
the symbol export trie.

SYMTAB segment load command

This command loads the SYMTAB segment. This segment contains the symbol table, which is a list of nlist_64 structures. The SYMTAB segment is included in modern executables largely for legacy reasons, and the executable will in fact run successfully with it removed. The string table referenced by this load command is, however, still used.

DYSYMTAB load command

This load command specifies the offset of the indirect symbol table. The indirect symbol table is a list of indices into the symbol table. The command also carries the following fields, which categorize symbols by specifying ranges of the symbol table:

uint32_t ilocalsym;  /* index to local symbols */
uint32_t nlocalsym;  /* number of local symbols */
uint32_t iextdefsym; /* index to externally defined symbols */
uint32_t nextdefsym; /* number of externally defined symbols */
uint32_t iundefsym;  /* index to undefined symbols */
uint32_t nundefsym;  /* number of undefined symbols */

Our executable has no local symbols, so ilocalsym and nlocalsym are 0. It defines two external symbols (__mh_execute_header and _main), so iextdefsym is 0 and nextdefsym is 2. There are two referenced external symbols (dyld_stub_binder and _printf), so nundefsym is 2 and iundefsym is 2 (because the first two entries in the symbol table are __mh_execute_header and _main). The role of dyld_stub_binder is discussed in more detail in the section ‘Lazy vs. non-lazy symbol binding’.
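
The six range fields can be read as three index intervals. The following sketch classifies a symbol table index accordingly. The struct `sym_ranges` and the function `classify_symbol` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* The six range fields of the DYSYMTAB load command. */
struct sym_ranges {
    uint32_t ilocalsym, nlocalsym;
    uint32_t iextdefsym, nextdefsym;
    uint32_t iundefsym, nundefsym;
};

/* Returns 'L' (local), 'D' (externally defined), 'U' (undefined),
 * or '?' if the index falls outside all three ranges. */
char classify_symbol(const struct sym_ranges *r, uint32_t index)
{
    if (index >= r->ilocalsym && index - r->ilocalsym < r->nlocalsym) return 'L';
    if (index >= r->iextdefsym && index - r->iextdefsym < r->nextdefsym) return 'D';
    if (index >= r->iundefsym && index - r->iundefsym < r->nundefsym) return 'U';
    return '?';
}
```

With the values from our executable (0, 0, 0, 2, 2, 2), indices 0 and 1 are classified as externally defined and indices 2 and 3 as undefined.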

LOAD_DYLINKER load command

This is a very simple load command that just specifies the location of the dynamic linker: /usr/lib/dyld.

UUID load command

This specifies a unique identifier for the executable.

VERSION_MIN_MACOSX load command

This load command specifies the minimum version of OS X compatible with the executable (10.13.0).

SOURCE_VERSION load command

This load command specifies the version of the source code used to generate the executable. In our executable, this has the default value of 0.0.

MAIN load command

This load command gives the offset of the _main function in the file (3936). In our executable, _main is at the beginning of the __text section of the __TEXT segment.

LOAD_DYLIB load command

There is one LOAD_DYLIB load command for every library to which the executable is dynamically linked. In our executable, the only such library is libc, /usr/lib/libSystem.B.dylib.

FUNCTION_STARTS load command

This load command gives the offset and size of the function starts segment. Mark Rowe explains on Stack Overflow that this segment is

... used by tools that need to symbolicate addresses in crash logs, samples, spindumps, etc. to determine if a given address falls inside a function. It could also be useful to debuggers to help them more quickly find the bounds of the function that a given address is within.
The data within this section is formatted as a zero-terminated sequence of DWARF-style ULEB128 values. The first value is the offset from the start of the __TEXT segment to the start of the first function. The remaining values are offsets to the start of the next function from the previous function.
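
The delta encoding described in the quote can be decoded roughly as follows. This is a sketch with minimal validation, and `decode_function_starts` is a hypothetical name. The function starts data of our executable is not shown in this post; if its only function is `_main` at offset 3936, the data would presumably be the bytes E0 1E 00.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a function-starts blob: zero-terminated ULEB128 deltas. Writes up
 * to `max` offsets (relative to the start of __TEXT) into out and returns
 * how many functions were found. */
size_t decode_function_starts(const uint8_t *p, size_t len,
                              uint64_t *out, size_t max)
{
    uint64_t offset = 0;
    size_t i = 0, count = 0;

    while (i < len) {
        uint64_t delta = 0;
        unsigned shift = 0;
        uint8_t byte;
        do {                                   /* one ULEB128 value */
            byte = p[i++];
            if (shift < 64)
                delta |= (uint64_t)(byte & 0x7f) << shift;
            shift += 7;
        } while ((byte & 0x80) && i < len);

        if (delta == 0)
            break;                             /* terminator */
        offset += delta;                       /* deltas accumulate */
        if (count < max)
            out[count] = offset;
        count++;
    }
    return count;
}
```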

DATA_IN_CODE load command

This load command specifies the offset and size of a segment which records the locations of certain pieces of data that are inlined in the __TEXT segment. This segment is empty in our example executable. When present, the format appears to be simply a list of data_in_code_entry structs. See the LLVM source and the entry for this struct in the LLVM docs.

Following the Mach-O header, the contents of the executable are as follows:

__TEXT segment
- __text section
- __stubs section
- __stub_helper section
- __cstring section
- __unwind_info section
__DATA segment
- __nl_symbol_ptr section
- __la_symbol_ptr section
Dynamic loader info segment
- rebase info
- binding info
- lazy binding info
- export info
Function starts segment
Symbol table
Dynamic symbol table
String table

Lazy vs. non-lazy symbol binding

Our hello world executable is dynamically linked against libc. By default, dynamically bound symbols like printf are bound lazily. That is, printf is not bound when the executable is loaded, but only when the first call to printf is made.

The basic concept of how this works is simple. For each dynamically bound symbol the executable stores a function pointer. This function pointer initially points to a ‘stub’. The stub calls the dynamic linker and asks it to look up the address of the relevant function. The function pointer is then overwritten with the function's address. As a result, subsequent function calls proceed directly.

Things are slightly more complex than this in practice because the stub is split into a stub proper and a stub helper. The stub proper always consists of a single jmp instruction. Initially, this jump targets the stub helper. The stub helper then calls the dynamic linker.

In fact, things are more complex still because the stub helper is itself decomposed into a stub helper and a ‘stub binding helper’. The reason for this decomposition is that the dyld_stub_binder function, which is called by each stub helper, requires two arguments. One of these arguments is different for each dynamically bound symbol; the other is the same. The stub binding helper pushes the constant argument onto the stack. The stub helper pushes the varying argument onto the stack and then jumps to the stub binding helper. Unlike a regular C function, dyld_stub_binder does not follow the System V calling conventions and takes both of its arguments on the stack.

Initial state:

The lazy symbol pointer points to the stub helper.

First function call:

The call to printf compiles down to a call to the associated stub. The arguments to printf are moved into registers prior to this call. The stub itself has no arguments.
The stub jumps to the address stored in the lazy symbol pointer, thereby entering the stub helper.
The stub helper calls the dynamic linker. The dynamic linker overwrites the lazy symbol pointer with the address of the printf function itself. All arguments to dyld_stub_binder are passed on the stack, so the arguments to printf aren't clobbered.
The dynamic linker jumps to printf.

Subsequent function calls:

The stub jumps to the function at the address stored in the lazy symbol pointer, which now points to printf itself.
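
The sequence above can be imitated in plain C with a function pointer that rewrites itself on first use. This is only an analogy: the real mechanism uses hand-written assembly and dyld, and every name below (`greet`, `greet_lazy_ptr`, `greet_stub_helper`, `real_greet`, `bind_count`) is invented for the illustration.

```c
#include <stdio.h>

typedef int (*greet_fn)(const char *);

static int bind_count = 0;                 /* how often the "linker" ran */

static int real_greet(const char *name)    /* stands in for printf */
{
    return printf("Hello %s!\n", name);
}

static int greet_stub_helper(const char *name);

/* The lazy symbol pointer: initially aims at the stub helper. */
static greet_fn greet_lazy_ptr = greet_stub_helper;

/* Stands in for the stub helper plus dyld_stub_binder: resolve the symbol,
 * overwrite the lazy pointer, then continue into the real function. */
static int greet_stub_helper(const char *name)
{
    bind_count++;
    greet_lazy_ptr = real_greet;
    return real_greet(name);
}

/* The stub: always an indirect jump through the lazy pointer. */
int greet(const char *name)
{
    return greet_lazy_ptr(name);
}
```

Calling `greet` twice increments `bind_count` only once, because the second call goes straight through the overwritten pointer.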

The lazy symbol pointer section

The lazy symbol pointer section is 8 bytes long, and thus contains one lazy symbol pointer. This is what we might expect given that our executable calls a single library function, printf.

A0 0F 00 00 01 00 00 00

The most significant non-zero byte is present because page zero occupies the first 4GB of the executable's address space. The corresponding file offset is therefore 0x0FA0 = 4000. The __stub_helper section starts at 3984=0xf90. As we will see shortly, the first stub helper is 16 bytes long, and 0xf90+16 = 4000. Thus, the lazy symbol pointer points to the second stub helper. This is because, as mentioned in the previous section, the first stub helper is a special one (the stub binding helper) that every other stub helper jumps to, rather than the helper for a specific function.
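
The arithmetic above can be checked with a short C sketch. It assumes the pointer's eight bytes are A0 0F 00 00 01 00 00 00 (the final byte being 00) and that __TEXT is mapped at 0x100000000 from file offset 0, as in this executable. The function names are illustrative.

```c
#include <stdint.h>

/* Read a 64-bit little-endian value, byte by byte (host-independent). */
uint64_t read_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Convert a virtual address in __TEXT to a file offset, assuming that
 * __TEXT is mapped at vmaddr 0x100000000 from file offset 0. */
uint64_t text_vmaddr_to_fileoff(uint64_t vmaddr)
{
    return vmaddr - 0x100000000ULL;
}
```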

The __stub_helper section starts at 0xf90 and has size 0x1a=26. It disassembles as follows:

0:  4c 8d 1d 71 00 00 00    lea    r11,[rip+0x71]        # 0x78
7:  41 53                   push   r11
9:  ff 25 61 00 00 00       jmp    QWORD PTR [rip+0x61]  # 0x70
f:  90                      nop
10: 68 00 00 00 00          push   0x0
15: e9 e6 ff ff ff          jmp    rip-0x1a

The nop instruction pads the first stub helper to 16 bytes.

The first three instructions are the stub for dyld_stub_binding_helper, which is different in form from the subsequent stubs. The address jumped to in the third instruction is the address in the memory location at 0x100000f90 + 0x61 + 9 + 6 (where 6 is the size of the jmp instruction itself). The resulting address is 0x100001000, which corresponds to offset 0x1000=4096 in the file. This is the start of the __nl_symbol_ptr section. Thus, the address jumped to is the address pointed to by the first non-lazy symbol pointer. The __nl_symbol_ptr section is zeroed out in the file, but when the executable is loaded, the relevant entry is non-lazily set to point to dyld_stub_binder.

The value loaded into R11 in the snippet above is the address of the ImageLoader cache (see dyld_stub_binder.s). In a regular executable, where the only non-lazily loaded symbol is dyld_stub_binder, and the __nl_symbol_ptr section is 16 bytes long, the address of the ImageLoader cache is the starting address of __nl_symbol_ptr plus 8. I don't know exactly what the ImageLoader cache is, or how the internals of this work.

The last two instructions in the listing above form the sole ordinary stub helper in this executable. In a larger executable, there would be a long sequence of stub helpers like this. Each ordinary stub helper pushes a dyld bytecode offset onto the stack and then jumps to dyld_stub_binding_helper (which in turn calls dyld_stub_binder). There is no padding between ordinary stub helpers.

__stubs starts at 0xf8a, has a length of 6 bytes and disassembles as follows:

ff 25 80 00 00 00       jmp    QWORD PTR [rip+0x80]        # 0x86

This is the stub for printf. The operand to jmp is the memory address (specified via a RIP-relative offset) of the first (and in our executable, only) lazy pointer in __la_symbol_ptr. It's important not to get confused into thinking that this is a jump specified via a simple relative offset. Rather, it is a jump to the memory location stored in the relevant lazy symbol pointer. It is the location of the lazy symbol pointer that is specified in a RIP-relative manner.
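
To make the distinction concrete, the following sketch decodes the six-byte `jmp QWORD PTR [rip+disp32]` encoding (FF 25 followed by a little-endian displacement) and computes the address of the pointer slot that the jump reads. The function name `decode_rip_jmp` is hypothetical. The instruction addresses used in the examples are the file offsets given above plus 0x100000000.

```c
#include <stdint.h>

/* If the 6 bytes at `code` encode `jmp QWORD PTR [rip+disp32]` (FF 25
 * followed by a little-endian disp32), store the address of the pointer
 * slot it reads in *slot and return 1. Otherwise return 0. */
int decode_rip_jmp(const uint8_t *code, uint64_t insn_addr, uint64_t *slot)
{
    if (code[0] != 0xff || code[1] != 0x25)
        return 0;
    uint32_t raw = (uint32_t)code[2] | (uint32_t)code[3] << 8 |
                   (uint32_t)code[4] << 16 | (uint32_t)code[5] << 24;
    int32_t disp = (int32_t)raw;            /* displacement is signed */
    /* RIP points at the next instruction, hence the + 6. */
    *slot = insn_addr + 6 + (uint64_t)(int64_t)disp;
    return 1;
}
```

For the printf stub at 0x100000f8a this yields the slot 0x100001010, which is 0x10 bytes into the data segment. For the jmp at 0x100000f99 in the first stub helper it yields 0x100001000, the start of __nl_symbol_ptr.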

The string table

The string table is simply a sequence of null-terminated strings. In our Hello World executable it starts at 8376=0x20b8:

20 00  _  _  m  h  _  e  x  e  c  u  t  e  _  h  e  a
 d  e  r 00  _  m  a  i  n 00  _  p  r  i  n  t  f 00
 d  y  l  d  _  s  t  u  b  _  b  i  n  d  e  r 00

Because zero offsets into the string table have a special meaning, the first entry is a dummy. The convention appears to be to use the string " " for this purpose.
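
Looking up a name is then just pointer arithmetic on the table. The sketch below embeds the bytes shown above. The names `strtab` and `symbol_name` are illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The string table bytes shown above, written as adjacent literals so
 * that each \0 escape stays separate from the characters that follow. */
static const char strtab[] =
    "\x20" "\0" "__mh_execute_header" "\0" "_main" "\0" "_printf" "\0"
    "dyld_stub_binder";          /* the literal's own terminator ends it */

/* Return the name stored at offset n_strx, or NULL if the offset is out
 * of range or the name is not terminated inside the table. */
const char *symbol_name(const char *table, size_t size, uint32_t n_strx)
{
    if (n_strx >= size)
        return NULL;
    if (memchr(table + n_strx, '\0', size - n_strx) == NULL)
        return NULL;
    return table + n_strx;
}
```

Offsets 2, 22, 28 and 36 return the four symbol names in order.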

The symbol table

The symbol table is a list of nlist_64 structs:

// Size: 16 bytes
struct nlist_64 {
    union { uint32_t n_strx; } n_un;
    uint8_t n_type;
    uint8_t n_sect;
    uint16_t n_desc;
    uint64_t n_value;
};

In our executable the symbol table has offset 8296=0x2068 and contains 4 symbols, thus 64 bytes. Its contents are as follows:

(one nlist_64 per line)
02 00 00 00 0F 01 10 00 00 00 00 00 01 00 00 00
16 00 00 00 0F 01 00 00 60 0F 00 00 01 00 00 00
1C 00 00 00 01 00 00 01 00 00 00 00 00 00 00 00
24 00 00 00 01 00 00 01 00 00 00 00 00 00 00 00

See Aidan Steele's guide for more details on these fields.

The first nlist_64:

n_strx  = 2            the index of the string "__mh_execute_header" in the
                       string table
n_type  = 0x0F         N_SECT | N_EXT (N_SECT means that n_sect gives the
                       section number in this file where the symbol is
                       defined)
n_sect  = 1
n_desc  = 0x0010       REFERENCED_DYNAMICALLY
n_value = 0x100000000  the address of the symbol (this = size of page zero
                       in the case of __mh_execute_header)

The second nlist_64:

n_strx  = 22           the index of the string "_main" in the string table
n_type  = 0x0F         N_SECT | N_EXT
n_sect  = 1
n_desc  = 0x0000
n_value = 0x100000F60  the beginning of the __text section.

The third nlist_64:

n_strx  = 28           the index of the string "_printf" in the string table
n_type  = 0x01         N_EXT (symbol not defined in this file)
n_sect  = 0            dummy value (because N_SECT not set in n_type)
n_desc  = 0x0100       library ordinal 1 in the high byte (bytes 00 01, little-endian)
n_value = 0            dummy value (because not defined in this file)

The fourth nlist_64:

n_strx  = 36           the index of the string "dyld_stub_binder" in the
                       string table
n_type  = 0x01         N_EXT (symbol not defined in this file)
n_sect  = 0            dummy value (because N_SECT not set in n_type)
n_desc  = 0x0100       library ordinal 1 in the high byte (bytes 00 01, little-endian)
n_value = 0            dummy value (because not defined in this file)
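
As a sketch of how these entries might be read programmatically, the following C code copies one nlist_64 out of a raw symbol table and tests its type bits. The constants mirror N_EXT, N_TYPE and N_SECT from `<mach-o/nlist.h>`. The helper names `read_symbol` and `is_defined_external` are hypothetical, and a little-endian host is assumed.

```c
#include <stdint.h>
#include <string.h>

struct nlist_64 {
    union { uint32_t n_strx; } n_un;
    uint8_t  n_type;
    uint8_t  n_sect;
    uint16_t n_desc;
    uint64_t n_value;
};

#define N_EXT  0x01   /* external symbol */
#define N_TYPE 0x0e   /* mask for the type bits */
#define N_SECT 0x0e   /* defined in section number n_sect */

/* Copy entry `index` out of a raw symbol table. memcpy avoids unaligned
 * access; the host byte order must match the file's (little-endian). */
struct nlist_64 read_symbol(const uint8_t *symtab, uint32_t index)
{
    struct nlist_64 sym;
    memcpy(&sym, symtab + (size_t)index * sizeof sym, sizeof sym);
    return sym;
}

/* True if the symbol is external and defined in a section of this file. */
int is_defined_external(const struct nlist_64 *sym)
{
    return (sym->n_type & N_EXT) && (sym->n_type & N_TYPE) == N_SECT;
}
```

Applied to the entries for `_main` and `_printf`, the first is reported as defined in this file and the second is not.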

The indirect symbol table

The indirect symbol table is a sequence of 32-bit values. Each value is an index into the symbol table. The purpose of the indirect symbol table is to record which symbol is associated with each

stub,
non-lazy symbol pointer, and
lazy symbol pointer.

The indices in a given section of the indirect symbol table are in the same order as the stubs / non-lazy symbol pointers / lazy symbol pointers. So for example, to find the symbol associated with the second lazy symbol pointer, we

add 1 (the zero-based position of the second pointer) to the specified offset into the indirect symbol table,
look up the index at this offset into the indirect symbol table, then
go to the entry in the symbol table at the resulting index.

In our executable, the indirect symbol table starts at 8360=0x20a8 and has a length of 4 * sizeof(uint32) = 16 bytes. Its contents are as follows:

02 00 00 00 | 03 00 00 00 | 00 00 00 40 | 02 00 00 00

These values have the following interpretations:

Index into indirect    Index into symtab
0                      2                  --> _printf
1                      3                  --> dyld_stub_binder
2                      0x40000000         --> INDIRECT_SYMBOL_ABS (no symbol)
3                      2                  --> _printf

Offsets into the indirect symbol table are as follows:

__stubs            0
__nl_symbol_ptr    1
__la_symbol_ptr    3
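
The lookup can be sketched in C as follows. The two flag values are the INDIRECT_SYMBOL_LOCAL and INDIRECT_SYMBOL_ABS constants from `<mach-o/loader.h>`, and in a real file each section's starting index comes from the `reserved1` field of its section header. The function name `symtab_index_for_slot` is hypothetical.

```c
#include <stdint.h>

#define INDIRECT_SYMBOL_LOCAL 0x80000000u
#define INDIRECT_SYMBOL_ABS   0x40000000u

/* Map slot `i` of a section (stub or symbol pointer number i) to an index
 * into the symbol table. `section_start` is the section's starting index
 * in the indirect symbol table. Returns -1 for out-of-range slots and for
 * entries that carry one of the special flags instead of a symbol index. */
int64_t symtab_index_for_slot(const uint32_t *indirect, uint32_t count,
                              uint32_t section_start, uint32_t i)
{
    uint32_t pos = section_start + i;
    if (pos < section_start || pos >= count)
        return -1;
    if (indirect[pos] & (INDIRECT_SYMBOL_LOCAL | INDIRECT_SYMBOL_ABS))
        return -1;
    return indirect[pos];
}
```

With the table from our executable, slot 0 of __la_symbol_ptr (starting index 3) maps to symbol 2, `_printf`, and slot 0 of __nl_symbol_ptr (starting index 1) maps to symbol 3, `dyld_stub_binder`.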

Dynamic linker commands

The dynamic linker is called via dyld_stub_binder. The arguments of this function do not directly specify which symbols to bind. Instead, dyld_stub_binder is given an offset into a special bytecode segment within the executable that is interpreted by the dynamic linker.

The code for the dynamic linker is split into four sections:

rebase info
binding info
lazy binding info
export info

We can disassemble the dynamic linker info using dyldinfo:

dyldinfo -opcodes a.out

This gives the following result:

rebase opcodes:
0x0000 REBASE_OPCODE_SET_TYPE_IMM(1)
0x0001 REBASE_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB(2, 0x00000010)
0x0003 REBASE_OPCODE_DO_REBASE_IMM_TIMES(1)
0x0004 REBASE_OPCODE_DONE()
binding opcodes:
0x0000 BIND_OPCODE_SET_DYLIB_ORDINAL_IMM(1)
0x0001 BIND_OPCODE_SET_SYMBOL_TRAILING_FLAGS_IMM(0x00, dyld_stub_binder)
0x0013 BIND_OPCODE_SET_TYPE_IMM(1)
0x0014 BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB(0x02, 0x00000000)
0x0016 BIND_OPCODE_DO_BIND()
0x0017 BIND_OPCODE_DONE
no compressed weak binding info
lazy binding opcodes:
0x0000 BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB(0x02, 0x00000010)
0x0002 BIND_OPCODE_SET_DYLIB_ORDINAL_IMM(1)
0x0003 BIND_OPCODE_SET_SYMBOL_TRAILING_FLAGS_IMM(0x00, _printf)
0x000C BIND_OPCODE_DO_BIND()
0x000D BIND_OPCODE_DONE
0x000E BIND_OPCODE_DONE
0x000F BIND_OPCODE_DONE

Each opcode is a single byte. The most significant four bits identify the opcode. Some opcodes allow an immediate value to be stored in the least significant 4 bits. For example, REBASE_OPCODE_SET_TYPE_IMM(1) is encoded as 0x10 | 0x01. Other opcodes can have immediate values following them. These immediate values are typically either LEB-encoded integer values or zero-terminated strings.

It's easy enough to find the encoding of each opcode by googling and/or looking at the headers, so I won't list them here.

A lazy symbol pointer starts out pointing to the address of a stub helper. The address of this helper changes following relocation of the program. Thus, each lazy symbol pointer must be rebased when the program is loaded. We have a single lazy symbol pointer (for _printf), so the rebase opcodes section contains a single REBASE_OPCODE command. This command specifies the index of the load command for the data segment (counting from zero) and an offset of 0x10 into this segment – the start of the __la_symbol_ptr section. Setting the type to 1 specifies that the entity being rebased is a pointer. REBASE_OPCODE_DO_REBASE_IMM_TIMES is used to rebase a contiguous sequence of pointers using a single command. Thus, if our program called three functions in libc rather than one, REBASE_OPCODE_DO_REBASE_IMM_TIMES(1) would become REBASE_OPCODE_DO_REBASE_IMM_TIMES(3).

The binding opcodes section contains the command to non-lazily bind dyld_stub_binder. BIND_OPCODE_SET_DYLIB_ORDINAL_IMM(1) takes as its argument the index of /usr/lib/libSystem.B.dylib. The index is 1 because this library is loaded by the first LC_LOAD_DYLIB load command in the file. Setting the type to 1 specifies that the symbol is a pointer. 0x02 is the index of the load command for the data segment (counting from zero). The offset of zero specifies the beginning of the first section of the data segment, __nl_symbol_ptr. Thus, the effect of this command is to set the pointer in __nl_symbol_ptr[0] to point to dyld_stub_binder.

The lazy binding opcodes section binds the lazy symbol pointers. In the case of our example executable, the only lazy symbol pointer is the one for _printf. The segment index is 0x02 (the data segment), and the offset is 0x10 because the two 8-byte non-lazy symbol pointers occupy the first 16 bytes of the data segment.
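
A minimal interpreter for one lazy binding entry might look like the sketch below. It starts at the offset that the stub helper pushes (0 for `_printf`) and stops at DO_BIND. The opcode values are those defined in `<mach-o/loader.h>`. The struct `lazy_bind` and the function `read_lazy_bind` are hypothetical, and only the opcodes that appear in the listing are handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Opcode values from <mach-o/loader.h>. */
#define BIND_OPCODE_DONE                          0x00
#define BIND_OPCODE_SET_DYLIB_ORDINAL_IMM         0x10
#define BIND_OPCODE_SET_SYMBOL_TRAILING_FLAGS_IMM 0x40
#define BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB   0x70
#define BIND_OPCODE_DO_BIND                       0x90

struct lazy_bind {
    const char *symbol;       /* points into the opcode stream */
    uint8_t  dylib_ordinal;
    uint8_t  segment;
    uint64_t segment_offset;
};

/* Interpret one lazy binding entry starting at byte `start`.
 * Returns 1 once DO_BIND is reached, 0 on anything unexpected. */
int read_lazy_bind(const uint8_t *ops, size_t len, size_t start,
                   struct lazy_bind *out)
{
    size_t i = start;
    out->symbol = NULL;
    out->dylib_ordinal = 0;
    out->segment = 0;
    out->segment_offset = 0;
    while (i < len) {
        uint8_t opcode = ops[i] & 0xF0, imm = ops[i] & 0x0F;
        i++;
        switch (opcode) {
        case BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB: {
            uint64_t value = 0;
            unsigned shift = 0;
            uint8_t byte;
            do {                                  /* ULEB128 operand */
                if (i >= len || shift >= 64) return 0;
                byte = ops[i++];
                value |= (uint64_t)(byte & 0x7f) << shift;
                shift += 7;
            } while (byte & 0x80);
            out->segment = imm;
            out->segment_offset = value;
            break;
        }
        case BIND_OPCODE_SET_DYLIB_ORDINAL_IMM:
            out->dylib_ordinal = imm;
            break;
        case BIND_OPCODE_SET_SYMBOL_TRAILING_FLAGS_IMM:
            out->symbol = (const char *)&ops[i];
            while (i < len && ops[i] != 0) i++;   /* skip the name */
            if (i >= len) return 0;
            i++;                                  /* and its terminator */
            break;
        case BIND_OPCODE_DO_BIND:
            return out->symbol != NULL;
        default:                                  /* includes DONE */
            return 0;
        }
    }
    return 0;
}
```

Encoding the listing with these opcode values gives the bytes 72 10 11 40 "_printf" 00 90 00 00 00, whose opcode offsets (0x0, 0x2, 0x3, 0xC, 0xD) match those printed by dyldinfo.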

Note that BIND_OPCODE_DONE is zero. The last two BIND_OPCODE_DONE opcodes in the listing are just padding.

The export trie

The export trie is primarily of interest for dylibs rather than for executables. Nonetheless, our executable does export two symbols: __mh_execute_header and _main. The export trie stores the names of all exported symbols together with various associated properties. The headers give the following description:

The symbols exported by a dylib are encoded in a trie. This is a compact representation that factors out common prefixes. It also reduces LINKEDIT pages in RAM because it encodes all information (name, address, flags) in one small, contiguous range. The export area is a stream of nodes. The first node sequentially is the start node for the trie.
Nodes for a symbol start with a uleb128 that is the length of the exported symbol information for the string so far. If there is no exported symbol, the node starts with a zero byte. If there is exported info, it follows the length.
First is a uleb128 containing flags. Normally, it is followed by a uleb128 encoded offset which is location of the content named by the symbol from the mach_header for the image. If the flags is EXPORT_SYMBOL_FLAGS_REEXPORT, then following the flags is a uleb128 encoded library ordinal, then a zero terminated UTF8 string. If the string is zero length, then the symbol is re-export from the specified dylib with the same name. If the flags is EXPORT_SYMBOL_FLAGS_STUB_AND_RESOLVER, then following the flags is two uleb128s: the stub offset and the resolver offset. The stub is used by non-lazy pointers. The resolver is used by lazy pointers and must be called to get the actual address to use.
After the optional exported symbol information is a byte of how many edges (0-255) that this node has leaving it, followed by each edge. Each edge is a zero terminated UTF8 of the addition chars in the symbol, followed by a uleb128 offset for the node that edge points to.

There is also a good description of the export trie on the following page (under the "Export Trie" heading):

http://www.m4b.io/reverse/engineering/mach/binaries/2015/03/29/mach-binaries.html

Some flag values:

EXPORT_SYMBOL_FLAGS_REEXPORT          = 8
EXPORT_SYMBOL_FLAGS_STUB_AND_RESOLVER = 16

The trie data from our executable is as follows:

               byte 5
               |
00 01  _ 00 05 00 02  _  m  h  _  e  x  e  c  u
 t  e  _  h  e  a  d  e  r 00 21  m  a  i  n 00
25 02 00 00 00 03 00 E0 1E 00 00 00 00 00 00 00
   |           |
   byte 33     byte 37

We can sort of see already from this that the overall structure of the trie is as follows:

                   o
                   |
                   |
                  '_'   BRANCH 1
                   |
                   |
                   o
                 /   \
      BRANCH 2  /     \  BRANCH 3
               /       \
'_mh_execute_header'   'main'
            /            \
           o              o

The two symbols encoded in the trie are __mh_execute_header and _main.

Byte(s)	Encoded value	Interpretation
0	0	No terminal string info here. Root node.
1	01	Number of branches leaving this node.
2-3	`_\0`	Label of branch 1 (see diagram above).
4	05	Offset from start of trie to beginning of next node.
5	0	No terminal string info here.
6	2	Number of branches leaving this node.
7-25	`_mh_execute_header\0`	Label of branch 2 (see diagram above)
26	33	Offset from start of trie to beginning of next node.
27-31	`main\0`	Label of branch 3 (see diagram above).
32	0x25=37	Offset from start of trie to beginning of next node.
33	2	Length of terminal string info.
34	0	Symbol export flags.
35	0	Offset of symbol `__mh_execute_header` in file.
36	0	Number of branches leaving this node.
37	3	Length of terminal string info.
38	0	Symbol export flags.
39-40	3936 (LEB)	Offset of symbol `_main` in file.
41	0	Number of branches leaving this node.
42-47	--	Padding.

The export trie tells us that __mh_execute_header starts at the beginning of the file while _main starts at byte 3936. It makes sense that __mh_execute_header starts at the beginning of the file, since this symbol is made available so that programs can inspect their Mach-O headers. The value of 3936 for _main also makes sense as this is the offset of the __text section of the __TEXT segment.
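
The walk through the table can be automated. The following is a sketch of a lookup routine, not dyld's actual implementation: it handles only regular exports (a flags value followed by a single offset) and ignores empty edge labels. The names `uleb` and `trie_lookup` are illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a ULEB128 at *pos, advancing *pos. Returns 0 on truncated input. */
static int uleb(const uint8_t *t, size_t len, size_t *pos, uint64_t *out)
{
    uint64_t v = 0;
    for (unsigned shift = 0; *pos < len && shift < 64; shift += 7) {
        uint8_t b = t[(*pos)++];
        v |= (uint64_t)(b & 0x7f) << shift;
        if (!(b & 0x80)) { *out = v; return 1; }
    }
    return 0;
}

/* Look up `name` in an export trie. On success stores the symbol's offset
 * from the Mach-O header in *offset and returns 1. */
int trie_lookup(const uint8_t *t, size_t len, const char *name, uint64_t *offset)
{
    size_t node = 0;
    for (;;) {
        size_t pos = node;
        uint64_t info_len, flags;
        if (!uleb(t, len, &pos, &info_len) || info_len > len - pos) return 0;
        if (*name == '\0') {                  /* whole name consumed */
            if (info_len == 0) return 0;      /* not a terminal node */
            return uleb(t, len, &pos, &flags) && uleb(t, len, &pos, offset);
        }
        pos += info_len;                      /* skip any export info */
        if (pos >= len) return 0;
        unsigned edges = t[pos++];
        int matched = 0;
        for (unsigned e = 0; e < edges && !matched; e++) {
            const uint8_t *end = memchr(t + pos, 0, len - pos);
            if (!end) return 0;
            const char *label = (const char *)(t + pos);
            size_t label_len = (size_t)(end - (t + pos));
            uint64_t child;
            pos += label_len + 1;             /* label and terminator */
            if (!uleb(t, len, &pos, &child)) return 0;
            if (label_len > 0 && strncmp(name, label, label_len) == 0) {
                if (child >= len) return 0;
                name += label_len;            /* consume matched prefix */
                node = (size_t)child;
                matched = 1;
            }
        }
        if (!matched) return 0;
    }
}
```

On the 48 bytes above, looking up `_main` follows branch 1 and then branch 3 and yields 3936, while `__mh_execute_header` follows branches 1 and 2 and yields 0.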

There isn't any interesting use of the symbol export flags byte in our executable. This byte can be used to encode the following info:

Kinds (least significant two bits in flags byte):

    0   Regular symbol
    1   Thread local symbol
    2   Absolute symbol

Other flags (bits 2 to 4 in flags byte):

    0   Regular
    4   Weak definition (EXPORT_SYMBOL_FLAGS_WEAK_DEFINITION)
    8   Reexport
    16  A ‘stub’ with a uleb128 stub offset followed by a
        uleb128 resolver offset. Not to be confused with
        stubs in the sense above. Don't know what this is
        exactly.

Export trie generation

The use of ULEB encoding for node offsets makes it surprisingly difficult to generate the export trie. Non-terminal nodes in the trie reference other nodes via their offsets in the encoded byte stream. The number of bytes used to encode an offset varies depending on the size of the offset value. Increasing the number of bytes occupied by an encoded offset has a knock-on effect on the values of other offsets, which in turn affects the number of bytes required to encode these offsets.

The following is a sketch of the export trie generation algorithm used by the standard tools (see makeTrie in MachOTrie.hpp). First, calculate the size of a node on the assumption that the offsets of each of its children occupy a single byte. If one of the offsets can't fit in a byte, then increase the size of this offset. Update as necessary the offset values of the node's other children and the offset values of its descendants' children. These updates may cause the encoded size of some of the offsets to increase. Repeat the cycle until the encoded size of all offsets stabilizes.

Pseudocode:

Each trie node has the fields
  SIZE (integer),
  MAX_DISP_SIZE (integer).

Initial value of SIZE for each node =
  encoded size of the node excluding any child offsets

Initial value of MAX_DISP_SIZE for each node =
  1

START:
  Set OFFSET := 0
  Visit each node N of the trie in pre-order:
    OFFSET += N.SIZE

    If the node has as-yet unvisited children:
      ULEN := uleb encoded length of OFFSET
      If ULEN > N.MAX_DISP_SIZE:
        N.SIZE += ULEN - N.MAX_DISP_SIZE
        N.MAX_DISP_SIZE = ULEN
        GOTO START
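
The same fixed-point idea can be written as runnable C. Note that this is a simplified variant rather than a transcription of the pseudocode or of makeTrie: instead of tracking MAX_DISP_SIZE and restarting, it recomputes every offset on each pass and stops when a pass changes nothing. The struct `trie_node`, the `MAX_CHILDREN` limit and the function names are assumptions made for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CHILDREN 4   /* arbitrary limit for this sketch */

struct trie_node {
    size_t base_size;                 /* encoded size without child offsets */
    size_t nchildren;
    size_t child[MAX_CHILDREN];       /* indices into the node array */
    size_t offset;                    /* result: position in the byte stream */
};

/* Number of bytes needed to encode v as ULEB128. */
static size_t uleb_size(uint64_t v)
{
    size_t n = 1;
    while (v >= 0x80) { v >>= 7; n++; }
    return n;
}

/* Assign offsets to nodes stored in emission (pre-order) order. Repeats
 * the layout pass until no offset changes. Returns the total size. */
size_t layout_trie(struct trie_node *nodes, size_t count)
{
    size_t total;
    int changed;

    for (size_t i = 0; i < count; i++)
        nodes[i].offset = 0;
    do {
        changed = 0;
        total = 0;
        for (size_t i = 0; i < count; i++) {
            if (nodes[i].offset != total) {
                nodes[i].offset = total;
                changed = 1;
            }
            total += nodes[i].base_size;
            /* Each child offset costs as many bytes as its ULEB128 form. */
            for (size_t c = 0; c < nodes[i].nchildren; c++)
                total += uleb_size(nodes[nodes[i].child[c]].offset);
        }
    } while (changed);
    return total;
}
```

Feeding in the four nodes of our trie (base sizes 4, 26, 4 and 5) reproduces the offsets 0, 5, 33 and 37 seen in the dump, for a total of 42 bytes before padding. Offsets can only grow from one pass to the next, so the loop terminates.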

What is a minimal Mach-O executable?

On recent OS X versions, a viable Mach-O executable must be almost as complex as the executable for a Hello World C program produced by the standard Xcode tools. However, I have verified that the following load commands (and associated segments where applicable) can be removed without rendering the object file unexecutable on OS X 10.13.4:

LC_VERSION_MIN_MACOSX
LC_SOURCE_VERSION
LC_DATA_IN_CODE
LC_FUNCTION_STARTS
LC_UUID
LC_SYMTAB
The __cstring and __unwind_info sections of the __TEXT segment.
- The __cstring section can't be removed from our example executable because it contains the string constant passed to printf. But in general, a Mach-O file without a __cstring section will still execute.

Useful resources

[1] A comprehensive but dated reference for the Mach-O file format by Aidan Steele. Based on a PDF released by Apple in 2009.
[2] A twenty part essay on linkers by Ian Lance Taylor. No Mach-O specific content, but a wealth of useful information on linking and object file formats in general.
[3] Stack Overflow question with discussion of requirements for a minimal Mach-O executable.
[4] Useful information on the internal workings of the OS X dynamic linker and its bytecode format (from Jonathan Levin's New OS X Book).
[5] A detailed but dated description of a minimal Mach-O executable by Mike Ash.
[6][7] A two-part comparison of ELF and Mach-O by Joe Damato. Contains some useful info about how dynamic linking works on OS X.
[8] Facebook's ‘fishhook’ utility. The readme has an extremely useful diagram showing the interaction of the various different symbol-table structures in a Mach-O executable.
[9] Some useful discussion of stubs on Stack Exchange.
[10] Info on the global offset table in Apple's official docs.
[11] Slides for a presentation on the structure of Mach-O files by Anthony Shoumikhin. Covers calls to dynamically bound functions in detail.