Applying classic memory allocation strategies to C++ containers (v0.1)

Ion Gaztañaga <igaztanaga@gmail.com>

Introduction

I'm sure many C++ programmers have wondered where good old realloc fits in C++. And that's a good question. Could we improve std::vector performance using memory expansion mechanisms to avoid too many copies? But std::vector is not the only container that could benefit from classic memory allocation strategies: what if we could take advantage of the insertion of multiple elements in std::list using a burst allocation mechanism that amortizes costs (mutex locks, searches...) that can't be amortized when using single-node allocation strategies? Some space-saving strategies also come to mind for node containers: can we build a memory allocator that offers the size-saving benefits of a simple segregated storage strategy (see Boost.Pool) without suffering oversized node pools when most of the nodes have been deallocated?

This article will explain how some classic improvement strategies have been applied to C++ allocators. Many of these mechanisms are already used for shared-memory allocators and containers (see Boost.Interprocess), where size restrictions are tighter than with heap allocators and other strategies (e.g. per-thread pools) are not applicable. So we'll be using Boost.Interprocess containers to measure the speedup that we can obtain when using those classic strategies in C++.

Obviously, these C++ allocators need some underlying general-purpose allocator, since new and delete don't offer expansion and burst capabilities. In this article we'll use a modified Doug Lea Malloc (aka DLmalloc) memory allocator with new basic functions that offer the necessary underlying machinery to implement all these classic strategies. DLmalloc is known to be very size and speed efficient, and it is used as the basis of many malloc implementations, including multithreaded allocators built on top of DLmalloc (see ptmalloc2, ptmalloc3 or nedmalloc). So improvements obtained by the classic strategies applied to C++ will measure real-world speedups, even when the allocator is very efficient.

I've developed some additional classes, like STL allocators, pools and tests. I've put all this code under the Boost Software License and respected Boost guidelines, so the developed code can be considered a Boost-style library called "allocplus" (sorry, I couldn't find a better name). The code lives in the boost::allocplus namespace (in short, namespace ba), but this namespace is NOT reserved for this library and this library is NOT an official Boost library. It's just that, since it depends on several Boost libraries, and Boost libraries are widely used, it was easier to integrate the code as if it were a Boost library and reuse some Boost facilities like Boost.Build.

Chapter 1: In-place expansion for vector

What we tried: Existing proposals for in-place expansion

There have been a couple of C++ proposals to add in-place expansion mechanisms to standard allocators (N1953: Upgrading the Interface of Allocators using API Versioning, written by Howard Hinnant, and N2045: Improving STL allocators, written by me based on Hinnant's proposal). Unfortunately they didn't receive much interest in the committee, but there is hope that we could get some of this functionality for C++1x if the C++ community shows some interest. Let's review a bit what these proposals said:

Hinnant's proposal adds some new functionalities to C++ allocators:

  1. The ability to know the size of a memory buffer allocated with the allocator.
  2. The ability to expand a previously obtained memory buffer (similar to realloc but with no memory copy)
  3. The ability to shrink a previously obtained memory buffer (similar to realloc but with no memory copy)
  4. Three size concepts when allocating memory: preferred size, limit size and received size.

The first functionality is interesting to avoid wasting some precious bytes the memory allocator adds at the end of the buffer for alignment reasons or because of its own memory allocation algorithm. As an example, DLmalloc on 32 bit systems needs a minimum of 16 bytes for every allocation (4 bytes are payload and 12 bytes are available for the user): if we try to initialize std::string with a 4 character narrow string ("abcd") we waste 8 bytes, because we have no way to know that there is room in our newly allocated buffer that could be used for longer strings without going through the allocator again. Other allocators waste even more memory. In 64 bit systems the minimum size (and thus, the waste) is bigger: 32 bytes (24 bytes available for the user) for DLmalloc.

The second functionality is the classical use case of realloc in C: instead of allocating a new buffer, the programmer tries to expand the current one. If the expansion succeeds, there is no need to copy data to the new buffer (in C++, that means avoiding potentially costly copy constructors). The third one is exactly the inverse: a vector that has grown considerably now holds only a few elements, vector.capacity() is much bigger than vector.size(), and we'd like to return memory to the system. Many programmers use the shrink-to-fit trick to return memory to the system, but this trick needs a possibly throwing copy, whereas a shrink-to-fit operation implemented by the allocator would not need any copying at all.
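For reference, this is the classic shrink-to-fit trick in user code; it works, but the copy construction of the temporary touches every element and may throw:

   #include <vector>

   std::vector<int> v;
   //... v grows a lot and later shrinks ...
   std::vector<int>(v).swap(v);  //copy-construct a right-sized temporary and swap it in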

The last functionality is really interesting because a vector usually increases its capacity using a growth factor (typically 1.5 or 2.0) to avoid too many reallocations. That is, when we insert new objects in a vector and vector.capacity() == vector.size(), the vector allocates much more memory than the size needed to insert a single new object. When allocating new memory to insert a new object, the limit size is size() + 1, the preferred size is size()*growth_factor and the received size is the real size of the obtained buffer. When inserting a new element in a vector, it is usually better to expand the memory to hold the new element than to allocate a new, bigger buffer.

N2045 makes some additions:

  1. The ability to expand the current buffer not only forward, but also backwards (forward expansion, backwards expansion or both).
  2. Offers the possibility to allocate or expand the buffer in a single call to avoid excessive locking.
  3. Separates array allocation and node allocation, similarly to operator new and operator new[].
  4. Centralizes all the expansion, shrinking and allocation functionality in a single function.
  5. Pushes limit size and preferred size concepts also to new allocations and shrink to fit operations.

The first addition is useful for embedded systems and other size-limited memories, like fixed-size shared memory. If a vector can only be expanded forward, then it can't reuse preceding free memory if that preceding memory is not big enough (that is, 1.5 or 2.0 times bigger than the current buffer). With backwards expansion, this limitation disappears and backwards expansion can be combined with forward expansion to optimize memory usage.

The vector has to copy elements backwards, so backwards expansion needs exactly the same number of copies as a new allocation. However, backwards expansion can often be implemented very efficiently in memory algorithms that have fast access to the previous and next memory buffers. Backwards expansion also has some downsides:

  1. The code handling data copy for a backwards expanded buffer is not trivial (the new buffer holds both already constructed objects and raw memory).
  2. In order to keep old objects aligned (and thus reusable for move assignment), the underlying C function implementing backwards expansion must guarantee that the distance between the new beginning of the backwards expanded buffer and the old beginning is a multiple of the size of the objects already constructed in the buffer (and of course, a multiple of the alignment required by the memory allocation algorithm). A buffer holding a type with sizeof(Type) == 12 can't be expanded backwards 16 bytes, because the distance between the old beginning and the new beginning is not a multiple of 12, and thus the old object would be partially overwritten and couldn't be safely move-assigned.

The second addition packs into a single call several attempts to expand or allocate memory, because usually a vector will first try to expand memory and, if this fails, allocate a new buffer. Since checking for expansion capabilities is quite fast compared to mutex locking and other checks performed by the allocator, both operations can be combined in a single call.

The third addition (separating node and array allocation) is based on the fact that, with the addition of expansion functions, node allocators based on classic segregated storage pools can't offer expansion capabilities. This might not sound like a big problem for pure node containers like std::list, but some hybrid containers like std::unordered_map wouldn't be able to use both segregated storage (for nodes) and expansion features (for the bucket array).

Furthermore, some allocators (see N2554 The Scoped Allocator Model (Rev 2)) can be propagated through containers, so having an allocator that can use in-place expansion capabilities and node-pooling capabilities is very interesting. And this can only be achieved if the allocator uses different functions for single element and array operations.

The last two additions are just optimizations to reduce locking and to minimize the number of allocation calls.

Hinnant's proposal was implemented in the Freescale (then Metrowerks) standard library (MSL) with great success. My approach was implemented in Boost.Interprocess. This article is an evolution of my last proposal, applying these techniques to a widely used heap allocator.

New functions for expansion capabilities

As mentioned, N2045 proposed an advanced new function for allocators, called allocation_command, that offers the possibility to request buffer expansion (combined or not with allocation if the expansion fails) and shrinking. This is the explanation of the function:
   enum allocation_type
   {
      //Bitwise OR (|) combinable values
      allocate_new        = ...,
      expand_fwd          = ...,
      expand_bwd          = ...,
      shrink_in_place     = ...,
      try_shrink_in_place = ...,
      nothrow_allocation  = ...
   };


   template<class T>
   std::pair<T *, bool> allocation_command
      ( allocation_type command,    std::size_t limit_size
      , std::size_t preferred_size, std::size_t &received_size, T *reuse_ptr = 0);

Preconditions for this function:

Effects of this function:

Throws an exception if two conditions are met:

This function returns:

Notes:

Obviously, we need an underlying C function that offers these possibilities. I've added the following function to the modified dlmalloc library:


typedef struct dl_command_ret_impl
{
   void *first;
   int   second;
}dl_command_ret_t;

dl_command_ret_t dl_allocation_command
   ( allocation_type command   , size_t sizeof_object
   , size_t limit_objects      , size_t preferred_objects
   , size_t *received_objects  , void *reuse_ptr );

This function does the same job as allocation_command, so it takes the same options plus an additional parameter indicating the size of the object (especially important for backwards expansion). This function does not throw exceptions, because C has no exceptions, but apart from these two differences its use is exactly the same as allocation_command.
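To make the interface concrete, here is a hedged sketch (not the actual Boost.Interprocess vector code; the variable names and the 1.5 growth factor are illustrative) of how a vector-like container might grow its buffer during push_back with a Version 2 allocator:

   //T is the element type, "a" the Version 2 allocator, old_buffer/old_size the current state
   std::size_t received = 0;
   std::pair<T *, bool> ret = a.allocation_command
      ( expand_fwd | allocate_new     //try in-place forward expansion, else allocate
      , old_size + 1                  //limit size: the minimum acceptable capacity
      , old_size + old_size/2         //preferred size: ~1.5 growth factor
      , received                      //out parameter: real size of the obtained buffer
      , old_buffer);                  //buffer we would like to expand
   if(ret.second){
      //Assumption: "second" is true when the old buffer could be expanded in place,
      //so we just construct the new element at the end; no element copies are needed
   }
   else{
      //New buffer at ret.first with capacity "received": move/copy old elements into it
   }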

A configurable allocator for testing

To compare apples to apples, we can't just compare the standard allocator against an allocator based on DLmalloc with expansion capabilities. For this purpose we've developed an allocator (called, surprisingly, ba::allocator). Here's the declaration:


namespace boost {
namespace allocplus {

template<class T, unsigned Version = 2, unsigned int AllocationDisableMask = 0>
class allocator;

} //namespace allocplus {
} //namespace boost {

The first template parameter is the type of the objects to be allocated. The second is the version of the allocator (see N1953 for details). If Version is 1, the allocator is a standard conforming classic allocator around dlmalloc and dlfree with no additional features. If Version is 2, the allocator has expansion capabilities and other features like specifying limit and preferred sizes, obtaining the real size of an allocated buffer, separate node and array allocations, etc.

The third parameter is a mask that can be used to disable some capabilities. For example, specifying expand_bwd | expand_fwd disables forward and backwards expansion, so the allocator will only provide new buffer allocations.

Testing forward expansion in vector

Let's first present the class that will be used in our performance tests. It's a simple wrapper over an int, with a copy constructor that performs work equivalent to most move constructors in C++0x. Of course, performance gains would be bigger with heavy objects that are copyable but not movable, but I think this class is fair enough.


class MyInt
{
   int int_;

   public:
   MyInt(int i = 0) : int_(i){}

   MyInt(const MyInt &other)
      :  int_(other.int_)
   {}

   MyInt & operator=(const MyInt &other)
   {
      int_ = other.int_;
      return *this;
   }
};

Now we'll push back an element one by one in a vector, and measure the time using Boost.DateTime:

   unsigned int numalloc = 0, numexpand = 0, capacity = 0;

   ptime tini = microsec_clock::universal_time();

   for(unsigned int r = 0; r != num_iterations; ++r){
      bi::vector<MyInt, IntAllocator> v;
      v.reset_alloc_stats();   //Reset allocation statistics
      for(unsigned int e = 0; e != num_elements; ++e)
         v.push_back(e);
      numalloc  += v.num_alloc;   numexpand += v.num_expand_fwd;
      capacity = static_cast<unsigned int>(v.capacity());
   }

   ptime tend = microsec_clock::universal_time();
   //...

We'll test 4 different allocators:

  1. StdAllocator: equivalent to std::allocator<int>, the standard allocator that comes with the C++ compiler
  2. AllocatorPlusV1: equivalent to ba::allocator<int, 1>, a STL conformant allocator around dlmalloc and dlfree.
  3. AllocatorPlusV2Mask: equivalent to ba::allocator<int, 2, expand_bwd | expand_fwd>, an allocator that can specify a limit size and a preferred size and obtain the real size of a buffer, but with expansion capabilities disabled.
  4. AllocatorPlusV2: equivalent to ba::allocator<int, 2>, a fully equipped improved allocator. (Note: There is no need to disable backwards expansions because forward expansion has always preference over new allocations and backwards expansion).

and we'll use several [num_iterations, num_elements] combinations: [10000, 10000], [100000, 1000], [1000000, 100] and [10000000, 10]

For each allocator we'll show the following information:

Now let's see the results of the test in a Windows XP machine (AMD 1.47 Ghz) using Visual C++ 2003:

As we see in the picture, DLmalloc is faster than the standard allocator provided by MSVC 7.1 (StdAllocator vs. the AllocatorPlus allocators). We also see that V2 allocators make fewer calls to allocator functions (for size=10, 4 calls vs. 6), because using the received_size parameter of allocation_command the container can take advantage of the real size of the newly allocated small buffer.

For big sizes (size=10000), thanks to the expansion capabilities of AllocatorPlusV2, it is twice as fast as AllocatorPlusV2Mask and 2.5 times faster than the standard conforming, DLmalloc based AllocatorPlusV1. As we can see in the third graph, only AllocatorPlusV2 performs in-place expansion, whereas the rest only allocate new buffers.

Now we'll execute the same test in a Linux (Suse 10.1) virtual machine (VMWare 6) using gcc-4.1 compiler and will check push_back times:

We see that the standard allocator is faster than our V1 (a standard conforming DLmalloc wrapper) by a small margin for some sizes and slightly slower for bigger sizes. That's because glibc malloc is based on ptmalloc2, which is itself based on DLmalloc but uses per-thread pools, which minimizes mutex locks.

Like in Windows, the expand-in-place capable allocator is twice as fast as the V1 allocator, confirming that forward expansion offers a considerable speedup. This speedup will be noticeable in initialization phases when vectors are filled with many back insertions. The speedup will be even more noticeable if the vector elements have no move constructors (e.g. third-party classes that can't be updated with move semantics).

Testing backwards expansion in vector

Historically, backwards expansion has not been widely used in allocation strategies, basically because it does not offer the speed improvements offered by forward expansion for usual vector implementations. As with a new allocation, we need to move old objects (an implementation might avoid data moving if it manages uninitialized data at the beginning of the memory buffer, but that's not usual). Backwards expansion, however, can reduce the peak memory used by the vector (while allocating new memory, the vector has to keep the old buffer until it copies the data to the new one), effectively reusing free memory placed just before the buffer to be expanded.

As we've said before, the backwards expansion algorithm has to calculate the LCM (least common multiple) of the memory allocation alignment (8 bytes in DLmalloc) and the size of the object. For very efficient algorithms like DLmalloc, I've found that calculating the LCM using the traditional Euclidean algorithm was a big enough overhead to make backwards expansion considerably slower than a new allocation. It's necessary to take advantage of the power-of-two memory alignment and implement optimized LCM calculations for object sizes that are also a power of two, or a multiple of 2, 4 or 8 bytes. Fortunately, these optimizations made backwards expansion slightly faster than a new allocation, making backwards expansion better than new allocation in every aspect.
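As a hedged illustration (this is not the code used in the modified DLmalloc), when the alignment is a power of two the LCM can be computed with a couple of bit operations instead of a Euclidean loop:

   #include <cstddef>

   //Assumes "alignment" is a power of two (8 in DLmalloc). gcd(alignment, object_size)
   //is the lowest set bit of object_size, capped at "alignment", so the LCM needs no
   //iterative GCD computation at all.
   std::size_t fast_lcm(std::size_t alignment, std::size_t object_size)
   {
      std::size_t lowest_bit = object_size & (std::size_t(0) - object_size);
      std::size_t gcd        = lowest_bit < alignment ? lowest_bit : alignment;
      return (alignment / gcd) * object_size;
   }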

To test backwards expansion we'll execute the same push_back test using the following 4 allocators:

  1. StdAllocator: equivalent to std::allocator<int>, the standard allocator that comes with the C++ compiler
  2. AllocatorPlusV1: equivalent to ba::allocator<int, 1>, a STL conformant allocator around dlmalloc and dlfree.
  3. AllocatorPlusV2Mask: equivalent to ba::allocator<int, 2, expand_bwd | expand_fwd>, an allocator that can specify a limit size and a preferred size and obtain the real size of a buffer, but with expansion capabilities disabled.
  4. AllocatorPlusV2: equivalent to ba::allocator<int, 2, expand_fwd>, a fully equipped improved allocator with forward expansion disabled (forward expansion has priority over backwards expansion, so we need to disable it to measure only backwards expansion).

Now let's see the results of the test in a Windows XP machine (AMD 1.47 Ghz) using Visual C++ 2003:

And now in a Linux (Suse 10.1) virtual machine (VMWare 6) using gcc-4.1 compiler:

We see that for most cases backwards expansion (AllocatorPlusV2) is slightly faster than new allocation (AllocatorPlusV1) and new allocation with received_size capabilities (AllocatorPlusV2Mask).

Testing shrink to fit in vector

Shrink-to-fit is a widely used idiom that can be improved with the shrink_in_place option provided by the allocation_command function. To test this functionality we'll use the following code:

   const std::size_t Step = 5;

   ptime tini = microsec_clock::universal_time();

   typedef boost::interprocess::detail::integral_constant
      <unsigned, boost::interprocess::detail::version<Allocator>::value> alloc_version;

   for(unsigned int r = 0; r != num_iterations; ++r){
      //Create a vector with num_elements size
      bi::vector<MyInt, IntAllocator> v(num_elements);
      v.reset_alloc_stats();
      //Empty the vector erasing the last Step elements and calling shrink_to_fit()
      for(unsigned int e = num_elements; e != 0; e -= Step){
         v.erase(v.end() - Step, v.end());
         v.shrink_to_fit();
      }
   }

   ptime tend = microsec_clock::universal_time();

That is, we'll initialize the vector with an initial size and will erase the last Step elements (5 in this case), and shrink to fit the capacity of the vector. We'll repeat this operation for different [num_iterations, num_elements] combinations: [100, 10000], [1000, 200] and [10000, 500]

To test shrink to fit we'll execute the test using the following 3 allocators:

  1. StdAllocator: equivalent to std::allocator<int>, the standard allocator that comes with the C++ compiler
  2. AllocatorPlusV1: equivalent to ba::allocator<int, 1>, a STL conformant allocator around dlmalloc and dlfree.
  3. AllocatorPlusV2: equivalent to ba::allocator<int, 2, expand_fwd>, a fully equipped improved allocator with shrink-to-fit capability.

Now let's see the results of the test in a Windows XP machine (AMD 1.47 Ghz) using Visual C++ 2003:

As the initial vector size grows, AllocatorPlusV2 shines (up to 90 times faster!) because a shrink-to-fit operation is a no-throw operation that can have amortized constant-time complexity, due to the fact that an allocator usually implements shrink to fit with a deallocation: the allocator divides the old buffer into two independent buffers and deallocates the second one. On the other hand, vectors with allocators with no shrink-to-fit functionality need to allocate a new buffer, move the elements to the new one (moving elements is linear in the size of the vector) and deallocate the old buffer.

Now we'll execute the same test in a Linux (Suse 10.1) virtual machine (VMWare 6) using the gcc-4.1 compiler and check shrink-to-fit times:

Results are similar to the Windows results, with AllocatorPlusV2 shining more and more as the initial vector size grows.

Chapter 2: Burst allocation for node containers

Burst allocation and deallocation introduction

Many times, operations can be performed faster if they are packed in groups, because some costs are related not to the number of operations but to the number of groups. An example of such operations is the templated insert(position, Iterator begin, Iterator end) operation in vectors. If Iterator is a forward iterator, most implementations calculate the distance between begin and end so that they know the size of the buffer needed to hold all those elements. With this information, they perform only one allocation to insert the whole range.

Other containers can't take advantage of this possibility because they are node containers, meaning that each element is obtained with a call to the allocator that allocates a single element. Since each element has its own independent memory, these containers offer stronger iterator validity guarantees, and pointers and references to inserted objects are not easily invalidated. On the other hand, inserting N elements in those containers requires N calls to the allocator.

To solve this, the idea is to have a function that can allocate many objects in a single call, guaranteeing that each object can be individually deallocated. Some memory management algorithms offer this possibility, even with the option of specifying a different buffer size for each allocation in the group. For example, DLmalloc offers the following two functions:

void** independent_calloc(size_t n_elements, size_t element_size, void*chunks[]);

Explanation: independent_calloc is similar to calloc, but instead of returning a single cleared space, it returns an array of pointers to n_elements independent elements that can hold contents of size elem_size, each of which starts out cleared, and can be independently freed, realloc'ed etc. The elements are guaranteed to be adjacently allocated (this is not guaranteed to occur with multiple callocs or mallocs), which may also improve cache locality in some applications. [...]

void** independent_comalloc(size_t n_elements, size_t sizes[], void* chunks[]);

Explanation: independent_comalloc allocates, all at once, a set of n_elements chunks with sizes indicated in the "sizes" array. It returns an array of pointers to these elements, each of which can be independently freed, realloc'ed etc. The elements are guaranteed to be adjacently allocated (this is not guaranteed to occur with multiple callocs or mallocs), which may also improve cache locality in some applications. [...]

For these two functions each element must be individually freed when it is no longer needed and objects are guaranteed to be contiguously allocated (to help cache locality). Also, the "chunks" argument is optional (i.e., may be null, which is probably the most typical usage). If it is null, the returned array is itself dynamically allocated and should also be freed when it is no longer needed. Otherwise, the chunks array must be of at least n_elements in length. It is filled in with the pointers to the chunks.

According to DLmalloc documentation independent_calloc simplifies and speeds up implementations of many kinds of pools. It may also be useful when constructing large data structures that initially have a fixed number of fixed-sized nodes, but the number is not known at compile time, and some of the nodes may later need to be freed:

  struct Node { int item; struct Node* next; };

  struct Node* build_list() {
    struct Node** pool;
    int i, n = read_number_of_nodes_needed();
    if (n <= 0) return 0;
    pool = (struct Node**) independent_calloc(n, sizeof(struct Node), 0);
    if (pool == 0) die();
    // organize into a linked list...
    struct Node* first = pool[0];
    for (i = 0; i < n-1; ++i)
      pool[i]->next = pool[i+1];
    free(pool);     // Can now free the array (or not, if it is needed later)
    return first;
  }

According to DLmalloc documentation independent_comalloc can be used to speed up allocation in cases where several structs or objects must always be allocated at the same time:

  struct Head { ... };
  struct Foot { ... };

  void send_message(char* msg) {
    int msglen = strlen(msg);
    size_t sizes[3] = { sizeof(struct Head), msglen, sizeof(struct Foot) };
    void* chunks[3];
    if (independent_comalloc(3, sizes, chunks) == 0)
      die();
    struct Head* head = (struct Head*)(chunks[0]);
    char*        body = (char*)(chunks[1]);
    struct Foot* foot = (struct Foot*)(chunks[2]);
    // ...
  }

Unfortunately, these two functions have some downsides when working with some containers:

Burst allocation infrastructure

The first issue (the required external or internally allocated array) can be avoided if every memory node is linked into an intrusive singly linked list and an iterator-like interface is offered. The intrusive list is built in the first bytes reserved for user data, because the minimum node size guarantees 12 bytes on 32 bit systems and 24 bytes on 64 bit systems. Obviously, once the user has overwritten that memory there is no way to iterate to the next node, so the programmer must be careful. Let's present these new functions:
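Purely as an illustration (this is not one of the library's actual types), the intrusive link can be pictured like this:

   //While a node sits in a multiallocation chain, its first bytes act as the intrusive
   //"next" link; the minimum allocation size guarantees room for one pointer. Once user
   //data is written over the node, the chain can no longer be traversed.
   struct intrusive_link { struct intrusive_link *next; };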

/* Iterator functions */
typedef /**/ multialloc_it_t;                      /* Iterator type */
BOOST_ALLOCPLUS_INIT_END_IT(IT)                    /* Postcondition: BOOST_ALLOCPLUS_IS_END_IT(IT) != 0 */
BOOST_ALLOCPLUS_IS_END_IT(IT)                      /* Is an end iterator? */
BOOST_ALLOCPLUS_NEXT_IT(IT)                        /* operator ++ emulation */
BOOST_ALLOCPLUS_ADDR_IT(IT)                        /* operator* emulation, returns the address of the memory */
These functions offer an emulation of a C++ input iterator.
/* Multiallocation chain functions */
typedef /**/ multialloc_it_chain_t;                /* Chain type */
BOOST_ALLOCPLUS_IT_CHAIN_INIT(CHAIN)               /* Postcondition: BOOST_ALLOCPLUS_IT_CHAIN_IT(IT_CHAIN) is end it */
BOOST_ALLOCPLUS_IT_CHAIN_PUSH_BACK(CHAIN, MEM)     /* Push back the node MEM in the intrusive linked list chain */
BOOST_ALLOCPLUS_IT_CHAIN_PUSH_FRONT(CHAIN, MEM)    /* Push front the node MEM in the intrusive linked list chain */
BOOST_ALLOCPLUS_IT_CHAIN_SPLICE_BACK(CHAIN, CHAIN2)/* Splice chain2 in the end of the first chain */
BOOST_ALLOCPLUS_IT_CHAIN_IT(IT_CHAIN)              /* Return an input iterator that traverse the chain */
BOOST_ALLOCPLUS_IT_CHAIN_SIZE(IT_CHAIN)            /* Returns the size of the chain */
These functions offer the possibility of building an intrusive chain (list) of nodes previously allocated by DLmalloc and obtaining an iterator that can traverse them. These functions are used to build a range that can be deleted in a single call using dl_multidealloc() (see below).
/* Some defines for "contiguous_elements" parameters */
DL_MULTIALLOC_ALL_CONTIGUOUS                       /* Allocate all elements contiguously in memory */
DL_MULTIALLOC_DEFAULT_CONTIGUOUS                   /* The implementation chooses the maximum number of contiguous elements */

/* Burst allocation: 
   Allocates "n_elements nodes" of size "elem_size" with a maximum (if possible) of "contiguous_elements".
   On success returns an input iterator. On failure returns an "end" iterator.*/
multialloc_it_t dl_multialloc_nodes (size_t n_elements, size_t elem_size, size_t contiguous_elements);

/* Burst allocation: 
   Allocates "n_arrays" of elements, whose size is "sizeof_element" and whose lengths are specified in the "sizes"
   parameter. Contiguous memory won't be (if possible) bigger than "contiguous_elements"*sizeof_element.
   On success returns an input iterator. On failure returns an "end" iterator.*/
multialloc_it_t dl_multialloc_arrays (size_t n_arrays, const size_t *lengths, size_t sizeof_element, size_t contiguous_elements);

/* Burst deallocation: deallocates all the range specified by the iterator */
void dl_multidealloc(multialloc_it_t it);

dl_multialloc_nodes is a function that allocates multiple nodes of the same size in a single call, returning an iterator to the range. dl_multialloc_arrays is a function that allocates multiple arrays in a single call, returning an iterator to the range. The user can specify the (suggested) number of contiguous elements in both functions. dl_multidealloc deallocates a range of arrays or nodes that have been allocated by dl_multialloc_nodes, dl_multialloc_arrays, or allocated by other DLmalloc functions and chained using the BOOST_ALLOCPLUS_IT_CHAIN_XXX functions.

Burst interface for allocators

Now that we have a low-level Swiss Army knife, we can define a C++ interface for burst allocation:

template<class T>
class allocator
{
   //...
   typedef /**/ multiallocation_iterator; //wrapper over multialloc_it_t
   typedef /**/ multiallocation_chain;    //wrapper over multialloc_it_chain_t

   //------------------------------------------------------------------
   // Functions for array allocation (expandable memory)
   //------------------------------------------------------------------

   //Allocates many elements of size elem_size in a contiguous block
   //of memory. Elements must be individually deallocated with deallocate()
   multiallocation_iterator allocate_many(size_type elem_size, size_type n_elements);

   //Allocates n_elements elements, each one of size elem_sizes[i] 
   //Elements must be individually deallocated with deallocate()
   multiallocation_iterator allocate_many(const size_type *elem_sizes, size_type n_elements);

   //Deallocates the memory arrays pointed by the iterator range starting at it
   void deallocate_many(multiallocation_iterator it);

   //------------------------------------------------------------------
   //Functions for node (size == 1, non-expandable memory) allocation 
   //------------------------------------------------------------------

   //Allocates many elements of size == 1.
   //Elements must be individually deallocated with deallocate_one()
   multiallocation_iterator allocate_individual(size_type num_elements);

   //Deallocates the memory nodes pointed by the iterator range starting at it
   void deallocate_individual(multiallocation_iterator it);
};

Burst allocation also follows the separation between node (size == 1) and array allocation functions, so that node allocation can use segregated storage mechanisms, whereas array allocation can use another approach that can be combined with the expansion functions (allocation_command).

We have all that we need to implement burst allocation and deallocation for STL containers. Now, let's measure if this leads to any speed improvement.
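As a hedged sketch (the real container code differs; the Node type, the end-iterator convention and the way the iterator is dereferenced are assumptions made for illustration), a node container inserting n copies of a value could use the Version 2 interface like this:

   #include <new>
   #include <cstddef>

   struct Node { int value; Node *next; };  //illustrative node type

   void burst_fill(std::size_t n, int value)
   {
      typedef ba::allocator<Node, 2> node_allocator_t;
      node_allocator_t a;
      //Burst-allocate n nodes with (ideally) a single call into the underlying allocator
      node_allocator_t::multiallocation_iterator it = a.allocate_individual(n);
      node_allocator_t::multiallocation_iterator end_it;  //assumption: default-constructed == end
      for(; it != end_it; ++it){
         //Assumption: dereferencing yields a reference to the node's raw memory
         Node *node = ::new(&*it) Node();
         node->value = value;
         //...link "node" into the container's internal structure...
      }
   }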

Testing burst allocation

A good place to take advantage of burst allocation is range insertion. Other operations (range assignment, copy construction) are also good places to implement burst allocation. So let's try this code with several allocators to test insertion time:

   tini = microsec_clock::universal_time();

   bi::list<MyInt, IntAllocator> l;
   for(unsigned int r = 0; r != num_iterations; ++r){
      l.insert(l.end(), num_elements, MyInt(r));
   }

   tend = microsec_clock::universal_time();

That is, we'll try insertions of different range sizes (num_elements) and different iteration counts (num_iterations). We'll end up with the same number of elements in the list, so we can compare results more easily. To measure burst deallocation we'll preprocess several iterator ranges and measure the time needed to completely empty the list:

      //Now preprocess ranges to erase
      std::vector <typename bi::list<MyInt, IntAllocator>::iterator> ranges_to_erase;
      ranges_to_erase.push_back(l.begin());

      for(unsigned int r = 0; r != num_iterations; ++r){
         typename bi::list<MyInt, IntAllocator>::iterator next_pos(ranges_to_erase[r]);
         std::size_t n = num_elements;
         while(n--){ ++next_pos; }
         ranges_to_erase.push_back(next_pos);
      }

      //Measure range erasure function
      tini = microsec_clock::universal_time();

      for(unsigned int r = 0; r != num_iterations; ++r){
         l.erase(ranges_to_erase[r], ranges_to_erase[r+1]);
      }

      tend = microsec_clock::universal_time();

As shown, good places to take advantage of burst deallocation are range erasure, clear() and destructors.

We'll test 3 different allocators:

  1. StdAllocator: equivalent to std::allocator<int>, the standard allocator that comes with the C++ compiler
  2. AllocatorPlusV1: equivalent to ba::allocator<int, 1>, a STL conformant allocator around dlmalloc and dlfree (no burst allocation).
  3. AllocatorPlusV2: equivalent to ba::allocator<int, 2>, burst-allocation capable allocator.
and we'll use these test values for [num_iterations, num_elements] combinations: [200, 10000], [2000, 1000], [20000, 100] and [200000, 10]

For each allocator we'll show the following information:

Now let's see the results of the test in a Windows XP machine (AMD 1.47 Ghz) using Visual C++ 2003:

The graph shows that allocation speed per element is constant for AllocatorPlusV1 (surprisingly, not so constant for Visual's standard allocator), which is logical since each element needs an allocation. AllocatorPlusV2 is using DL_MULTIALLOC_DEFAULT_CONTIGUOUS, which allocates contiguous nodes up to 4096 bytes. That's why AllocatorPlusV2's performance stabilizes for big ranges, reaching a 3x speedup.

Burst deallocation improvements are more modest. Why?

The reason is that burst deallocation is essentially a loop of plain deallocations, amortizing only costs like mutex locks, checks, function call overhead and other common operations. Nevertheless, the test shows a 30% speedup compared to node-by-node deallocation with quite big ranges (>1000) and still noticeable speedups (20%) with small ranges (>=10).

Now we'll execute the same test in a Linux (Suse 10.1) virtual machine (VMWare 6) using gcc-4.1:

In Linux, the standard allocator (ptmalloc2, based on DLmalloc) performs very well, and the lock reduction achieved by its per-thread-pool approach offers performance similar to burst allocation (AllocatorPlusV2) for small ranges. With bigger ranges (>=100), AllocatorPlusV2 maintains (more or less) its performance while StdAllocator can't keep up. Like in the WinXP Visual 7.1 tests, AllocatorPlusV2 is faster (up to 2x in some cases) than AllocatorPlusV1 for big ranges.

Burst deallocation is not faster than the standard allocator in Linux, surely because ptmalloc2 reduces mutex locks just like burst deallocation does, while burst deallocation needs to traverse the range to be erased twice, as explained before. When compared to AllocatorPlusV1 (that is, a wrapper around dlmalloc and dlfree), burst deallocation (AllocatorPlusV2) is 30% faster for big ranges.

Chapter 3: Adaptive pools for node containers

Simple segregated storage: advantages and problems

When developing allocators to improve node container performance, one widely used approach is simple segregated storage (e.g. Boost.Pool). The idea behind simple segregated storage is to divide a big memory portion ("block") allocated through a general purpose memory manager into fixed-size "nodes". A memory manager ("pool") holds a list of "blocks" and a singly linked list of free nodes. Since all nodes are the same size, a simple segregated storage pool can't serve nodes of different sizes; to solve this, an application creates a different pool for each size, and usually different types with the same size share the same pool.

The free node list is intrusive (since the node is free, the list pointer is built inside the node itself), so there is no size overhead per node. Simple segregated storage is very fast: allocating a node just pops the first node from the free list, and deallocating it just pushes the node back.
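A minimal sketch of the idea (not Boost.Pool's actual code) shows why both operations are little more than a pointer swap:

   //Simplified simple segregated storage: the free list lives inside the free nodes themselves.
   struct free_node { free_node *next; };

   class simple_pool
   {
      free_node *free_list_;   //head of the intrusive free node list

      public:
      void *allocate_node()
      {  //pop the first free node (a real pool refills the list from a new block when empty)
         free_node *n = free_list_;
         free_list_ = n->next;
         return n;
      }

      void deallocate_node(void *p)
      {  //push the node back; note that we never learn which block it came from
         free_node *n = static_cast<free_node*>(p);
         n->next = free_list_;
         free_list_ = n;
      }
   };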

Now the question is: when does the pool return fully free "blocks" (allocated each time the free node list was empty) to the general purpose memory manager, so that they can be reused by other pools or non-pool allocators? The common answer is: usually never. Why? Simple segregated storage erases all bookkeeping data from the node, and this makes "block" deallocation very expensive: the pool needs to traverse the free node list (nodes are not ordered by their original block) and check the address of each node against the address of each "block", just to find out whether all the nodes of a "block" are free and the block can be safely deallocated. If we have B blocks and N nodes, this is an O(N*B) operation. Usually N (and B) can be quite high and this operation can take minutes if the application has allocated several tens of thousands of nodes. That's why the option to trim the memory is provided only as an optional and explicit operation.

A manual operation makes allocator changes non-portable (how is that manual trimming activated? how long does it take?), so many applications just ignore it. This has a drawback: if an application allocates a big number of nodes in a container and then destroys the container, the pool will hold a big list of free nodes, but that memory won't be reused by any container that is not using a pool with the same node size, leading to excessive memory waste.

Adaptive pools explained

Adaptive pool and memory alignment

Can we obtain the small size overhead of simple segregated storage and a fast, continuous memory trimming operation? The idea behind adaptive pools is simple: when deallocating a node, an efficient self-trimming node pool needs fast access to the "block" containing the node. With it, the pool can check how many free nodes are in that "block" (the "block" has a header storing the number of free nodes) and, if all nodes are free, the pool can return that block to the general purpose memory manager.

The problem is that storing bookkeeping data (a pointer to the block containing that node) in every node has a size overhead. But there is an extremely fast way to obtain a pointer to the block containing the node with zero size overhead per node: memory alignment.

Suppose a 4KB "block" also aligned to 4KB: a node just needs a mask operation to obtain the address of the block. Once we have it, we can check the number of free nodes and deallocate the block if all the nodes are free. And once we have fast access to the "block" we can develop a more complicated trimming scheme that favors the creation of free "blocks" and improves trimming probabilities.

Alignment can lead to excessive memory waste

Allocators usually have a payload that stores bookkeeping information for deallocation. If an adaptive pool allocates 4KB aligned to 4KB, the underlying allocator will really allocate 4KB + payload bytes, where the start of the non-payload section is aligned to 4KB. This means that the memory adjacent to this aligned block won't be suitable for a new 4KB aligned allocation, leading to excessive memory waste (each 4KB allocation would waste the next 4KB - payload bytes of memory).

There is a simple solution: allocate (4KB - payload) bytes aligned to 4KB. This simple solution makes the contiguous memory usable for a future 4KB aligned allocation, leading to optimal memory usage. That's why knowing the size of the payload is important for aligned memory allocations where the size of the allocation is not fixed.
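A hedged sketch of the idea using dl_memalign (presented below); payload_bytes is a hypothetical constant standing for DLmalloc's per-allocation overhead (4 bytes on 32 bit systems), not a real library constant:

   //Request (4096 - payload) bytes so that the payload plus the block fill exactly one
   //aligned 4KB slot and the next aligned slot remains usable for another block.
   const std::size_t payload_bytes = 4;   //illustrative value only
   void *block = dl_memalign(4096 - payload_bytes, 4096);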

Managing blocks

Once we have access from a node to its block, we can implement different policies for blocks. Once a block is completely free we can simply deallocate it, but the allocation pattern could lead to several half-used blocks, increasing in practice the memory usage we're trying to minimize.

Alternatively, we can implement a strategy that minimizes the number of in-use blocks. We can try to allocate nodes always from the blocks with the fewest free nodes, leading, at least in theory, to more fully used blocks and more fully free blocks that can be deallocated sooner. Obviously, this block management must be very fast: searching for the most used block can ruin performance, especially when compared with the efficient simple segregated storage. The same happens when deallocating a node: if block management needs too many operations, deallocation can perform very badly.

Minimizing alignment requirements

Adaptive pools can have excessive alignment requirements when many nodes are allocated per block. If a block is configured to hold at least 256 nodes and the size of a node is 24 bytes, the required size for the block is (block payload + 24*256) > 4096 -> 8192 bytes, which also needs 8192 byte alignment. This means that increasing the number of nodes a block can hold also increases the required alignment.

We can reduce the excessive alignment if we divide each block into sub-blocks. The first sub-block contains the payload of the block, and the rest of the sub-blocks just store a pointer to the first one. This leads to an additional degree of freedom: the block overhead. The overhead per block ((number of nodes * node size) / block size) determines the size (and the alignment) of a sub-block, whereas the number of nodes per block determines the number of sub-blocks per block. This trick keeps the alignment requirement lower than the approach without sub-blocks and constant with respect to the number of nodes per block. On the other hand, deallocation requires an additional indirection (from node to sub-block and from sub-block to the block header) and a slightly more complicated block initialization.
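An illustrative picture of the layout (these are not the library's actual types):

   //The block header lives in the first sub-block; every other sub-block only stores a
   //back-pointer to it, so reaching the header from a node costs one extra indirection.
   struct block_header    { std::size_t free_nodes; /* free node list, block tree hook... */ };
   struct subblock_header { block_header *block;    };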

Presenting adaptive_pool

adaptive_pool is built on top of the aligned memory allocation function from our modified DLmalloc library. Alignment must be a power of two:

void* dl_memalign(size_t bytes, size_t alignment);

adaptive_pool is implemented as follows:

This is the declaration of adaptive_pool:

const std::size_t ADP_nodes_per_block    = unspecified;
const std::size_t ADP_max_free_blocks    = unspecified;
const std::size_t ADP_overhead_percent   = unspecified;
const std::size_t ADP_only_alignment     = unspecified;

template < class T
         , unsigned Version            = 2
         , std::size_t NodesPerBlock   = ADP_nodes_per_block
         , std::size_t MaxFreeBlocks   = ADP_max_free_blocks
         , std::size_t OverheadPercent = ADP_overhead_percent
         >
class adaptive_pool;

Description of the template parameters:

  1. The type of the objects to be allocated
  2. The version of the allocator. Similarly to other allocators presented in previous chapters, a Version 2 allocator offers burst allocation.
  3. NodesPerBlock is the minimum number of nodes per block we want to allocate.
  4. MaxFreeBlocks is the maximum number of fully free blocks that the pool will hold in the intrusive red-black tree (the rest of fully free blocks will be deallocated).
  5. OverheadPercent is a parameter that will configure the maximum memory overhead that each block will have (and thus, the number of sub-blocks). If this parameter is equal to ADP_only_alignment, then no sub-blocks will be used and NodesPerBlock will determine the alignment required by blocks.
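For example, this hedged usage sketch (the parameter values mirror the test configuration used below: 256 nodes per block, at most 2 cached free blocks, 2% overhead) shows how a list could be configured to use an adaptive pool:

   typedef ba::adaptive_pool<MyInt, 2, 256, 2, 2> pool_allocator_t;
   bi::list<MyInt, pool_allocator_t> l;   //list nodes now come from the adaptive pool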

To allocate a new node adaptive_pool needs to:

To deallocate a node adaptive_pool needs to:

Testing adaptive_pool

In this test we'll measure 8 allocators:

  1. AllocatorPlusV1: equivalent to ba::allocator<int, 1>, a STL conformant allocator around dlmalloc and dlfree (no burst allocation).
  2. AllocatorPlusV2: equivalent to ba::allocator<int, 2>, burst-allocation capable allocator.
  3. AdPoolAlignOnlyV1: an adaptive pool with no sub-blocks and no burst allocation.
  4. AdPoolAlignOnlyV2: an adaptive pool with no sub-blocks but with burst allocation.
  5. AdPool2PercentV1: an adaptive pool with sub-blocks (configured with 2% overhead per block) and no burst allocation.
  6. AdPool2PercentV2: an adaptive pool with sub-blocks (configured with 2% overhead per block) but with burst allocation.
  7. SimpleSegregatedStorageV1: a classical simple segregated storage allocator with no burst allocation.
  8. SimpleSegregatedStorageV2: a classical simple segregated storage allocator with burst allocation.

All pool allocators (allocators 3-8) are configured with a minimum of 256 nodes per block. For an int list, on 32 bit systems, the required alignment is 4096 bytes for allocators 3 and 4 (adaptive pools with no sub-blocks) and 2048 bytes for allocators 5 and 6 (adaptive pools with sub-blocks).

For each allocator we'll show the following information:

The test will be the same as the used in the section Testing burst allocation. Now let's see the results of the test in a Windows XP machine (AMD 1.47 Ghz) using Visual C++ 2003:

In general, insertion speeds are better for pool allocators than for DLmalloc. Analyzing only V1 allocators (no burst allocation), we can see that SimpleSegregatedStorageV1 performs best, followed by AdPoolAlignOnlyV1 and AdPool2PercentV1. This is quite logical: simple segregated storage minimizes all block operations because it does not deallocate free blocks on the fly. AdPoolAlignOnlyV1 and AdPool2PercentV1 perform quite similarly.

When comparing V2 allocators (burst-allocation enabled) we see that the performance of all pool allocators is very similar. The reason for this is that the slightly higher costs of adaptive pools are practically amortized when the pool allocates several nodes in a row: all needed blocks are preallocated in a single call, error handling in loops can be simplified leading to tighter loops... Curiously, the performance of a burst-enabled DLmalloc (AllocatorPlusV2) is on par with pools with burst allocation. This is not a surprise: DLmalloc burst allocation uses exactly the same mechanism as pools, allocating big blocks and creating nodes from them. The difference, though, is in the overhead per node.

When analyzing deallocations, results for V1 allocators are pretty similar, with SimpleSegregatedStorageV1 leading the test. We must remember that SimpleSegregatedStorageV1 does not perform any block management or deallocation, so this is the expected behavior. Adaptive pool deallocation times lie between DLmalloc (AllocatorPlusV1) and segregated storage times. Adaptive pools perform quite complicated block management, ordering free blocks by free node count and deallocating completely free blocks when their number exceeds a configured value (2 blocks in this test). As expected, AdPoolAlignOnlyV2 performs marginally better than AdPool2PercentV2, because deallocating a node allocated with AdPool2PercentV2 (an adaptive pool with sub-blocks) needs an additional indirection to obtain the block header from the node (first we reach the header of the sub-block, and this header stores the address of the block header).

When analyzing burst-enabled allocators, however, the picture changes: adaptive pools perform similarly to SimpleSegregatedStorageV2. Again, burst allocation allows the elimination of some redundant checks and management operations, which yields better performance and amortization of block management costs.

For big ranges, V2 (burst) adaptive pools are even better than simple segregated storage by a small margin.

Now let's see the amount of memory allocated from the operating system by DLmalloc with each allocator (V1 and V2 allocate practically the same amount of memory, so we only represent V2 allocators). The DLmalloc wrapper (AllocatorPlusV2) allocates more memory than the rest of the allocators. Allocated nodes are 12 bytes on 32 bit systems, so DLmalloc needs 12 bytes + overhead (4 bytes) = 16 bytes per node. With 2000000 nodes this yields approximately 32 MB. For pools, since the overhead is practically nonexistent, the total memory is around 24 MB (12 bytes per node).

In the previous graph we can see the overhead (in percentage) for each allocator. As expected, DLmalloc shows a 33% overhead (4 bytes per 12 byte node) whereas pools show less than 2% overhead (remember that we've configured AdPool2PercentVX with a maximum of 2 percent overhead). As expected, SimpleSegregatedStorageVX offers the minimum overhead, but adaptive pools (with their additional headers for block management) are also very size-efficient.

The previous picture shows how much in-use memory DLmalloc reports after deallocating all the nodes. Note that this does not count the OS memory DLmalloc still caches, but memory that has been allocated through DLmalloc and has not yet been deallocated through DLmalloc. In other words, memory still held by the pools.

As expected, the DLmalloc wrappers (AllocatorPlusVX) have deallocated all their memory (no leaks!). Also, the simple segregated storage allocators have not deallocated a single block, so they still hold all the memory. Adaptive pools (AdPoolAlignOnlyVX and AdPool2PercentVX) only hold the number of blocks we've configured: 2 blocks x 4096 bytes per block = 8192 bytes. So we clearly show that adaptive pools not only offer low memory overhead and fast allocation times, but they also return memory to the underlying memory allocator!

Once allocators return memory to DLmalloc, DLmalloc can trim its cached OS memory and return that memory to the OS. The following graph shows the amount of memory DLmalloc caches after all nodes have been deallocated:

Cached memory usually depends on the number of contiguous free pages that DLmalloc holds internally, so the exact numbers are not important. The only thing this graph shows is that adaptive pools, after returning memory to DLmalloc, also encourage memory trimming, so that DLmalloc can return memory to the operating system.

Now we'll execute the same test in a Linux (Suse 10.1) virtual machine (VMWare 6) using gcc-4.1:

Insertion speeds for V1 allocators are more or less as expected, with segregated storage leading the group and the non-pool DLmalloc wrapper being the slowest. V2 (burst) allocators are faster, with similar performance between adaptive pools and simple segregated storage as the size of the insertion range grows.

Erasure speeds are similar to Windows, with SimpleSegregatedStorageVX leading the test and adaptive pools coming next.