At ESN, we recently ran into problems with Python not giving memory back to the OS on Linux. It reuses allocated memory internally, but never releases free memory back to the OS. This causes problems with monitoring, as it becomes difficult to see trends or temporary spikes in memory usage. At first we thought we had a Python memory leak on our hands. Others seem to have similar problems; for example, there is a Stack Overflow entry about it. We investigated the problem and solved it using TCMalloc, a malloc replacement that is part of Google Performance Tools, together with some appropriate tuning. This post details the results of our investigation and our solution.
Python’s default memory management is based on two basic techniques:
- Using malloc, on Linux systems usually the GLIBC version, to allocate memory from the OS.
- A custom memory allocator for small objects, to reduce the number of malloc calls.
To understand why Python does not give back memory to the OS, we have to dig into how GLIBC’s malloc works and in particular how Linux memory management works.
In Linux, there are two ways to allocate memory:
- Through the brk()/sbrk() syscalls. These grow or shrink a single contiguous region of memory allocated to the process. Because the region is always contiguous, memory can only be released at its end; you cannot have “holes”.
- Through the mmap() syscall. With mmap, you can allocate a region of arbitrary size (rounded up to whole pages) and map it almost anywhere in the virtual address space of the process. You can release the region again using munmap(), meaning you can have “holes” in your allocated memory. In many respects, allocating memory through mmap() is similar to, and as flexible as, using malloc. There is, however, a performance penalty for using mmap: every call is a syscall, and the OS has to zero the memory before giving it to the process so that no data from another process leaks through. Because of this, mmap is traditionally only used for larger allocations that are not so frequent.
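The mmap() behaviour described above can be observed from Python itself. This is a minimal sketch using the standard library `mmap` module. The 4 MiB size is an arbitrary value chosen for illustration.

```python
import mmap

SIZE = 4 * 1024 * 1024  # 4 MiB, an arbitrary size for illustration

# fileno=-1 requests an anonymous mapping, i.e. memory not backed by a file.
region = mmap.mmap(-1, SIZE)

# The kernel hands out zero-filled pages.
first_page_is_zero = region[:mmap.PAGESIZE] == bytes(mmap.PAGESIZE)

# The mapping behaves like a mutable byte buffer.
region[:5] = b"hello"
written = region[:5]

# close() unmaps the region, so this memory goes back to the OS
# independently of any other mapping or of the brk segment.
region.close()

print(first_page_is_zero, written, region.closed)
```

Because each mapping is released on its own, where it sits relative to other allocations does not matter.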
The picture below shows the virtual address space of a process. The first segment, marked as brk, is the memory allocated using brk()/sbrk() calls. The end of the brk segment is called the program break (which is the reason for the syscall names). Using sbrk()/brk(), it is possible to move the program break. With mmap() you can place chunks of memory at arbitrary locations in the address space.
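To see where the brk segment currently ends, one can ask the C library directly. The sketch below assumes a Linux system whose libc exports `sbrk`. The helper name `current_program_break` is made up for this example, and it returns `None` when the call is not available. The sbrk man page calls the address it reports the program break.

```python
import ctypes


def current_program_break():
    """Return the end of the brk segment as an integer address, or None."""
    try:
        libc = ctypes.CDLL(None)  # symbols already loaded into this process
        sbrk = libc.sbrk
    except (OSError, AttributeError, TypeError):
        return None  # no usable sbrk on this platform
    sbrk.restype = ctypes.c_void_p
    sbrk.argtypes = [ctypes.c_ssize_t]
    addr = sbrk(0)  # an increment of 0 only reports the current end
    if addr is None or addr == ctypes.c_void_p(-1).value:
        return None  # sbrk signals failure with (void *) -1
    return addr


end_of_brk = current_program_break()
if end_of_brk is None:
    print("sbrk is not available here")
else:
    print("brk segment ends at", hex(end_of_brk))
```

Calling the helper before and after a burst of small allocations gives a rough picture of how far the brk segment has grown and whether it shrinks again.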
GLIBC’s malloc uses both brk and mmap. It uses brk for small allocations and mmap for larger ones. The threshold between the two defaults to 128 KiB, but it is adjusted dynamically (up to 32 MiB on 64-bit systems) and can be tuned, as explained in a message from the libc mailing list. Allocations inside the memory obtained through brk are managed by malloc internally, potentially leading to fragmentation.
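As an illustration of the tuning mentioned above, GLIBC exposes the threshold through `mallopt()` with the `M_MMAP_THRESHOLD` parameter. The sketch below calls it via `ctypes`. It assumes a GLIBC-based system where the library can be loaded as `libc.so.6`. The helper name `set_mmap_threshold` and the 64 KiB value are chosen only for the example. This is not the approach we ended up using.

```python
import ctypes

M_MMAP_THRESHOLD = -3  # parameter number from GLIBC's <malloc.h>


def set_mmap_threshold(nbytes):
    """Ask GLIBC's malloc to use mmap for requests of nbytes or more.

    Returns True on success and False if GLIBC's mallopt is unavailable
    or rejects the value.
    """
    try:
        libc = ctypes.CDLL("libc.so.6")  # assumption: GLIBC under this name
        mallopt = libc.mallopt
    except (OSError, AttributeError):
        return False
    mallopt.argtypes = [ctypes.c_int, ctypes.c_int]
    mallopt.restype = ctypes.c_int
    # mallopt returns 1 on success and 0 on error. Setting the threshold
    # explicitly also switches off GLIBC's dynamic adjustment.
    return mallopt(M_MMAP_THRESHOLD, nbytes) == 1


changed = set_mmap_threshold(64 * 1024)
print("threshold changed:", changed)
```

The same setting can be applied without code through the `MALLOC_MMAP_THRESHOLD_` environment variable, set before the process starts.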
The problem arises when many allocations are followed by almost all of the memory being freed. If the memory that is still allocated sits high up in the brk segment, malloc cannot release the free memory below it to the OS. A typical scenario is a long, memory-consuming computation whose result is stored afterwards. The result is then likely to be in the upper part of the brk segment.
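The effect can be made concrete with a toy model. The sketch below is not GLIBC's actual algorithm. It only encodes the one rule that matters here: the brk segment can shrink only from its top. The block sizes and the helper name `releasable_bytes` are invented for the example.

```python
def releasable_bytes(blocks):
    """Bytes that could be returned to the OS by lowering the top of the
    brk segment.

    `blocks` is a list of (size, in_use) tuples ordered from the lowest
    address to the highest. Only free blocks at the very top count,
    because the segment cannot contain holes.
    """
    free_at_top = 0
    for size, in_use in reversed(blocks):
        if in_use:
            break  # a live block pins everything below it
        free_at_top += size
    return free_at_top


# 1000 temporaries of 1 KiB were freed; a 64-byte result is still live.
temporaries = [(1024, False)] * 1000
result_on_top = temporaries + [(64, True)]
result_at_bottom = [(64, True)] + temporaries

print(releasable_bytes(result_on_top))     # 0
print(releasable_bytes(result_at_bottom))  # 1024000
```

In the first layout a single small live block at the top keeps roughly 1 MB of free memory inside the process, which matches the scenario described above.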
While malloc can be tuned to use mmap at lower thresholds, it cannot pack several smaller allocations into one block allocated using mmap, so every allocation above the threshold gets its own mapping. Python’s own allocator for small objects can help, but Python does not use it for all objects.
Our solution was to use TCMalloc, a malloc replacement that is part of Google Performance Tools. TCMalloc can be tuned to use only mmap, and it delays releasing memory to the OS, which reduces the number of OS calls for applications that call malloc frequently.
We compiled a version of Python with TCMalloc that only uses mmap. When testing the new Python in one of our largest projects, we found that not only did Python give memory back to the OS correctly, it also used less memory and showed no apparent CPU penalty for using mmap instead of brk.
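For readers who want to experiment without recompiling Python, a different technique is to preload TCMalloc into an unmodified interpreter. The sketch below only builds the environment for such a child process. It is not the build described above. The library path is hypothetical. The variable names `TCMALLOC_SKIP_SBRK` and `TCMALLOC_RELEASE_RATE` are taken from the gperftools documentation as we understand it and should be checked against the installed version. The release rate of 10 is an arbitrary example value.

```python
import os


def tcmalloc_env(lib_path, base_env=None):
    """Return a copy of the environment that preloads TCMalloc.

    lib_path is the location of libtcmalloc on the target machine
    (an assumption; it varies between distributions).
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = lib_path
    # Assumed gperftools settings: do not obtain memory through sbrk,
    # so that all of it comes from mmap.
    env["TCMALLOC_SKIP_SBRK"] = "true"
    # Assumed gperftools setting: higher values release free memory to
    # the OS faster. The value 10 is only an example.
    env["TCMALLOC_RELEASE_RATE"] = "10"
    return env


# Hypothetical usage, with a made-up library path and script name:
#   import subprocess, sys
#   subprocess.run([sys.executable, "app.py"],
#                  env=tcmalloc_env("/usr/lib/libtcmalloc.so"))
example = tcmalloc_env("/usr/lib/libtcmalloc.so", base_env={"PATH": "/bin"})
print(sorted(example))
```

Building the environment separately from launching the process keeps the settings easy to inspect and to compare against a run with the default allocator.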
Planet, our development platform for the social real-time web, now ships with Python using TCMalloc as standard.