computer talk by a systems hacker and ruby-core developer

© 2014. All rights reserved.

Ruby 2.1: RGenGC

Ruby 2.1 adds a "restricted" generational collector, with minor mark phases that dramatically reduce the cost of GC.

Let's take a look at the evolution of Ruby's GC.

Ruby 1.8: simple mark and sweep

Classic mark and sweep implementation. The entire world is stopped during both phases.

  1. Traverse object graph from roots and mark live objects, using a bit inside the object structure (FL_MARK).
  2. Iterate over all heap slots and add unmarked slots to the freelist.

Ruby 1.9.3: lazy sweep

@nari3 adds LazySweepGC, reducing GC pauses to just the mark phase. The heap is swept incrementally as object slots are required.

Ruby 2.0: bitmaps for COW-safety

@nari3 adds Bitmap Marking GC, helping Unix systems share memory across child processes. The mark phase is also rewritten to be non-recursive.

Although the memory savings from bitmaps were only modest, the patch freed up a bit (FL_MARK became FL_WB_PROTECTED later) and laid the groundwork for a generational collector.

Ruby 2.1: oldgen and minor marking

@ko1 designs RGenGC, a generational collector that can be implemented incrementally and supports C-extensions.

Objects on the heap are divided into two categories:

Only protected objects can be promoted to oldgen.
(This is the "restricted" in RGenGC.)

Unprotected objects cannot be promoted, but if referenced from oldgen they are added to a remembered set. Minor marks are much faster because they only have to traverse references from the remembered set.

Heap Layout

Ruby objects live on the ruby heap, which is split up into pages. Each page is 16KB and holds ~408 object slots.

Every RVALUE slot on this page occupies 40 bytes. Strings shorter than 23 bytes and Arrays with less than four elements can be embedded directly within this 40 byte slot.

GC::INTERNAL_CONSTANTS[:RVALUE_SIZE]    #=>  40 bytes per object slot (on x84_64)
GC::INTERNAL_CONSTANTS[:HEAP_OBJ_LIMIT] #=> 408 slots per heap page

Pages in the heap reside in one of two places: eden or tomb. Eden contains pages that have one or more live objects. The tomb contains empty pages with no objects. The sum of these pages represent the capacity of the ruby heap:

GC.stat(:heap_used) ==
  GC.stat(:heap_eden_page_length) + GC.stat(:heap_tomb_page_length)

During lazy sweep eden pages are swept one at a time. Each page provides up to 408 object slots for re-use. By the time the lazy sweep completes, all unmarked slots in eden have been replaced by new objects.

Empty pages found during sweep are moved to the tomb. This reduces fragmentation in eden (by filling in sparse pages first), and allows the tomb to be grown or shrunk before it is used.

Once eden runs out of slots, the empty pages in tomb are moved back to eden to add space. This happens incrementally, one page at a time. When the tomb runs out of pages, a mark is triggered and the cycle begins anew.

Major vs Minor GC

As an example, let's look at github.com's large rails app.

First, we'll measure how many long-lived objects are created after the app is booted:

# preload controllers/models and other code

# measure heap stats after a major mark and full sweep
GC.start # same as GC.start(full_mark: true, immediate_sweep: true)

# three ways to measure live slots
count = ObjectSpace.count_objects
count[:TOTAL] - count[:FREE]        #=> 565121

GC.stat(:heap_live_slot)            #=> 565121

GC.stat(:total_allocated_object) -
GC.stat(:total_freed_object)        #=> 565121

Of these ~565k long-lived bootup objects, ~95% are promoted to oldgen:

s = GC.stat
100.0 *              s[:old_object] / s[:heap_live_slot]  #=> 94.90
100.0 * s[:remembered_shady_object] / s[:heap_live_slot]  #=>  1.88

This means only ~5% of the heap needs to be traversed on minor marks, via references from the ~2% of objects that are remembered.

As expected, this makes minor GC pauses much much shorter: only 7ms in our app compared to 58ms for a major mark.

time{ GC.start(full_mark:  true, immediate_sweep: false) }  #=> 0.058
time{ GC.start(full_mark: false, immediate_sweep: false) }  #=> 0.007

The majority of mark phases during code execution will use a minor mark and finish very quickly. However, over time the size of the remembered set and oldgen can grow. If either of these double in size, a major mark is used to reset them.

The limits used to trigger a major mark can be monitored via GC.stat:

>> GC.stat.values_at(:remembered_shady_object, :old_object)
=> [10647, 536785]

>> GC.stat.values_at(:remembered_shady_object_limit, :old_object_limit)
=> [21284, 1073030]

The frequency of major vs minor marks can also be monitored. For example, you might graph "major GCs per request", "minor GCs per request" or "minor GCs per major GC".

GC.stat(:count) == GC.stat(:minor_gc_count) + GC.stat(:major_gc_count)

Tuning Variables

In our app above, we use the following GC settings:


malloc() Limits

So Ruby objects occupy 40 bytes each, inside pages on the eden heap.

When an object needs even more space, it allocates memory on the regular process heap (via a ruby_xmalloc() wrapper). For instance when a string outgrows 23 bytes, it allocates a separate, larger buffer for itself. The additional memory used by this string (or any object) can be measured with ObjectSpace.memsize_of(o) in objspace.so.

Internally the VM keeps track of malloc_increase, the number of bytes that have been allocated but not yet freed. This is effectively the memory growth of the process. When more than 16MB is added, a GC is forced even if free slots are still available. The limit starts at 16MB, but adapts to the memory usage patterns in your code.

The initial/max values and dynamic growth factor can also be controlled via environment variables:

Similarly, the memory growth associated with oldgen is tracked separately in oldmalloc_increase. When this limit is tripped, a major GC is forced. These limits can be tuned as well:

Both malloc increase and limit values can be monitored via GC.stat:

>> GC.stat.values_at(:malloc_increase, :malloc_limit)
=> [14224, 64000000]

>> GC.stat.values_at(:oldmalloc_increase, :oldmalloc_limit)
=> [20464, 64000000]

In our app we've increased the initial limit to 64MB, to reduce GC activity during boot and when memory usage peaks.

export RUBY_GC_MALLOC_LIMIT=64000000

GC Events

And finally, ruby 2.1 ships with new tracepoints that can be used to monitor the GC at runtime. These are available from C, via rb_tracepoint_new():

C-extensions using these events can also take advantage of rb_gc_stat() and rb_gc_latest_gc_info(), which provide safe access to GC.stat and GC.latest_gc_info.

Ruby 2.2 and beyond

With the introduction of RGenGC, Ruby 2.1 includes a significant upgrade to ruby's GC. Seven millisecond minor marks and a 95% oldgen promotion rate are remarkable achievements, especially considering not one of our C-extensions had to be modified. Hats off to @ko1!

Ruby 2.2 will expand the GC algorithm from two generations to three. (In fact, 2.1 already includes a RGENGC_THREEGEN compile flag to enable the third generation). @ko1 also plans to implement an incremental mark phase, which would remove the need for long major GC pauses.