This is an overview of a new allocator based on Doug Lea's pmalloc, used in glibc. The new allocator design is based on input from JohnMoser and JohnBerthels; it may be appropriate to consider it the Berthels-Lea-Moser design.
The Gnome foundation has taken interest in reducing the memory footprint of Gnome. By reducing the memory needed by the Gnome desktop, Gnome will become more appealing to both users with limited memory and users wanting to maximize the potential of an ample amount of memory.
JohnMoser noticed a quirk in the pmalloc allocator by Doug Lea that causes excessive memory usage over time. This flaw occurs because unused portions of the heap are kept resident; and heap fragmentation is plentiful. With the constant allocation and deallocation of memory, the heap grows and then decays slowly, leaving holes throughout it that are filled in at the first available area big enough for any future allocation. Because of this behavior, allocations can become spread out and leave much fragmentation.
Originally, JohnMoser had considered that a complete redesign was needed. To fix the allocator, he wanted to allocate smaller "miniheap" areas that were balanced such that the most used was preferred for any new allocation and the least used was less preferred. The end result would be that the least used areas would "starve" and be released back to the system eventually.
JohnBerthels supplied some insight which allowed for a minor tweak to the current allocator to solve the problem. In theory this method will work; however, JohnMoser still believes that a small but significant degree of improvement in memory usage can be gained by controlling the allocation patterns so that allocations group together. The end goal is to leave a fragmented heap as a series of large used and large unused areas, with the unused areas having neither resident memory footprint nor swap space.
JohnBerthels came up with a less invasive approach. It turns out that the madvise() function, appearing in the Single Unix Specification v3 as posix_madvise() as part of the Advanced Realtime option, allows an application to specify that an area of memory will not be needed in the near future. By using madvise() to indicate that an area of memory is not needed, the kernel can be allowed to remove the mapping and reduce the resident size of the application. Future access to the affected memory areas will create zeroed pages in those areas.
Unfortunately, madvise() is not POSIX; and posix_madvise() does not specify what "Specifies that the application expects that it will not access the specified range in the near future." However, Linux handles madvise() in a useful manner:
- Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in re-loading of the memory contents from the underlying mapped file (see mmap) or zero-fill-on-demand pages for mappings without an underlying file.
Although this is ideal, it is not written in standard and thus not guaranteeable. Nevertheless, it is a good idea to take advantage of this behavior for the time being by teaching glibc and pmalloc to utilize madvise() on unused pages in the heap. By doing so, the impact of Berthel's enhancements may be more closely studied, and a case to preserve this behavior can be made in case it changes in the future.
This behavior would potentially be something to rally for in the next revision of POSIX. As the MADV_DONTNEED advice is quite vague in SUSv3 for posix_madvise(), a new advice such as MADV_DESTROY could be supplied in SUSv4. The behavior would be as follows:
- Advises the operating system that the memory may be destroyed and returned to the system. The contents of the memory are considered by the application to be uninitialized at this point. Following the next access to the memory region, the MADV_DESTROY advice is no longer valid and the system may not tamper with the memory region.
This would not force an implementation to reclaim the memory; but it would clearly define the behavior of applications with regard to the memory region. The posix_madvise() MADV_DONTNEED could be interpreted to mean that the memory won't be accessed for a while, and thus may be swapped out at the time of the advice. With this assumption, the memory must be returned in the state it was in before the call. Having to store the memory is sub-optimal for our purposes.
JohnMoser is considering the impact of fragmentation on the allocator. To maximize the effectiveness of Berthel's method, the allocator must produce as many completely unused pages of memory as possible. Pages on the end of the heap are elimitated via brk(); while pages inside the heap are eliminated via madvise().
JohnMoser suggested that some form of allocation crowding be used to reduce fragmentation. The idea of allocation crowding is to consolidate allocations into major areas rather than spread them out. Rather than using the first available area on the heap, the allocator would find the area of ample size closest to a large number of allocations. In this way, dense areas would get denser, and thin areas would get thinner.
The theory is that if the heap suffers little fragmentation, then allocation patterns don't matter. However, if the heap is heavily fragmented, then there will be some areas that are denser than others; that is, some areas of the heap will have many allocations, while others will have sparse allocations. By actively avoiding the sparsely allocated areas, we deny them the chance to grow in the hopes that as allocations are freed, sparsely allocated areas of the heap will decay and become empty. Once a page is empty, it can be advised to be freed to the system.
One possible way to quickly create allocation crowding would be to view the heap as an array of areas of given length and keep track of how many allocations are in each area. When selecting a place to create a new allocation, the allocator would chose from the most crowded area or from the area closest to the most crowded area. Utilizing Lea's design, other considerations must be made; areas of the heap are already defined to allow threaded applications to avoid locking in malloc(), and following this method would lead us to have to consider exactly how to hand out heap areas to threads.
Use a drop-in malloc()/free() replacement?
There's a purely mmap based multi-heap allocator implementation from OpenBSD libc which was ported to Linux: http://mr.himki.net/OpenBSD_malloc_Linux.c The allocator is known to correctly work with such memory-hungry applications as Mozilla based browsers (this was tested before jemalloc adoption), KDE 3 desktop environment, etc.