Introduction:
In my quest to find a reliable tool that can show per-process swap usage, I stumbled upon a program called smem. Although smem does the job well, it took around 1.5 seconds to run, with the runtime depending on the number of running processes. smem works by parsing /proc/<pid>/smaps and summing specific fields. To improve upon this, I developed my own version called zmem, inspired by memory management mechanisms like zram and zswap. My initial Rust implementation already showed a significant performance boost, taking approximately 600ms to run. However, I wanted to optimize it further so it could serve as continuous monitoring software. In this blog post, I will describe my journey of optimizing zmem and the techniques I employed along the way.
Initial Rust Implementation:
As a newcomer to Rust, I started by writing a simple implementation of zmem. I followed a similar approach to smem, but aimed to optimize it from the very beginning. My initial implementation took around 600ms to run, which was already a significant improvement over smem. However, I wanted to push the limits and achieve even better performance.
Here’s a snippet of the initial implementation:
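In essence, the first version walked /proc, read each process's smaps, and accumulated the Swap: fields into the process struct. A minimal sketch of that approach (the identifiers are illustrative, not the exact zmem code):

```rust
use std::fs;

// Illustrative struct; the real zmem type tracks more fields.
#[derive(Default)]
struct ProcessMemory {
    pid: u32,
    swap_kb: u64,
}

impl ProcessMemory {
    fn update(&mut self) -> std::io::Result<()> {
        let smaps = fs::read_to_string(format!("/proc/{}/smaps", self.pid))?;
        for line in smaps.lines() {
            // Each mapping contributes its own "Swap:  N kB" line; sum them all.
            if let Some(rest) = line.strip_prefix("Swap:") {
                self.swap_kb += rest
                    .trim()
                    .trim_end_matches("kB")
                    .trim()
                    .parse::<u64>()
                    .unwrap_or(0);
            }
        }
        Ok(())
    }
}

fn main() -> std::io::Result<()> {
    // Every numeric directory under /proc is a pid.
    for entry in fs::read_dir("/proc")? {
        let entry = entry?;
        if let Ok(pid) = entry.file_name().to_string_lossy().parse::<u32>() {
            let mut proc_mem = ProcessMemory { pid, ..Default::default() };
            // Processes can exit mid-scan, so ignore read errors.
            if proc_mem.update().is_ok() && proc_mem.swap_kb > 0 {
                println!("{:>8}  {:>10} kB", proc_mem.pid, proc_mem.swap_kb);
            }
        }
    }
    Ok(())
}
```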
Profiling with Cargo Flamegraphs:
To identify bottlenecks in my code, I turned to flamegraphs via the cargo flamegraph subcommand (provided by the flamegraph crate). Generating one was as easy as running cargo flamegraph, and it helped me visualize how much time was spent in different parts of my code.
Optimizing Update Function:
One of the first issues I discovered was that I was calling the update function twice for no apparent reason. By removing the redundant call, I brought the runtime down to around 300ms.
Parallelizing Update Function:
Next, I parallelized the update function using tokio, so that each read of /proc/<pid>/smaps was performed asynchronously. This change resulted in a significant performance improvement, reducing the runtime to just 70ms. All I had to do was make the call that creates each new process entry asynchronous, as sketched below.
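A rough sketch of what that looks like with tokio, spawning one task per process so the smaps reads overlap instead of running one after another (again, the names are illustrative rather than the exact zmem code):

```rust
// Cargo.toml: tokio = { version = "1", features = ["full"] }
use tokio::task::JoinSet;

// Read one process's smaps asynchronously and return its swap total in kB.
async fn swap_kb_for(pid: u32) -> Option<u64> {
    let smaps = tokio::fs::read_to_string(format!("/proc/{pid}/smaps")).await.ok()?;
    Some(
        smaps
            .lines()
            .filter_map(|l| l.strip_prefix("Swap:"))
            .filter_map(|v| v.trim().trim_end_matches("kB").trim().parse::<u64>().ok())
            .sum(),
    )
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let mut tasks = JoinSet::new();
    for entry in std::fs::read_dir("/proc")? {
        if let Ok(pid) = entry?.file_name().to_string_lossy().parse::<u32>() {
            // One task per process; tokio interleaves the file reads.
            tasks.spawn(async move { (pid, swap_kb_for(pid).await) });
        }
    }
    while let Some(res) = tasks.join_next().await {
        if let Ok((pid, Some(swap))) = res {
            println!("{pid:>8}  {swap:>10} kB");
        }
    }
    Ok(())
}
```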
Improving Data Locality:
In an effort to further optimize the program, I added local variables for the sums computed while parsing the smaps file. This improved data locality, which in turn improved the performance of my code.
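A sketch of the idea, with the per-mapping sums accumulated in locals and written back to the struct once (field and helper names are illustrative):

```rust
// Before (illustrative): every matching line wrote through &mut self:
//     if let Some(v) = line.strip_prefix("Swap:") { self.swap_kb += parse_kb(v); }
// After: accumulate in locals the compiler can keep in registers, then store once.

struct ProcessMemory {
    rss_kb: u64,
    swap_kb: u64,
}

impl ProcessMemory {
    fn sum_smaps(&mut self, smaps: &str) {
        let (mut rss, mut swap) = (0u64, 0u64);
        for line in smaps.lines() {
            if let Some(v) = line.strip_prefix("Rss:") {
                rss += parse_kb(v);
            } else if let Some(v) = line.strip_prefix("Swap:") {
                swap += parse_kb(v);
            }
        }
        self.rss_kb = rss;
        self.swap_kb = swap;
    }
}

fn parse_kb(v: &str) -> u64 {
    v.trim().trim_end_matches("kB").trim().parse().unwrap_or(0)
}
```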
Including Debug Information:
I realized that my release profile did not include debug information, which limited the flamegraph output and stripped a lot of symbol data. By adding debug = true to the release profile, I was able to get much more detail about which parts of my program were taking the longest to execute.
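For reference, this is a two-line addition to Cargo.toml:

```toml
[profile.release]
debug = true   # keep symbols in release builds so flamegraphs stay readable
```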
Utilizing filter_map and Reusing a String:
To further optimize my code and make it cleaner, I started using filter_map in some sections. Additionally, I realized I was creating a new line string on each iteration, leading to repeated calls to the allocator. By reusing the same string and clearing it after each iteration, I improved performance by 5-10%.
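A sketch combining both ideas, collecting pids with filter_map and reusing a single line buffer (illustrative rather than verbatim zmem code):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

// filter_map drops unreadable entries and non-numeric names (non-processes)
// in a single pass over /proc.
fn pids() -> std::io::Result<Vec<u32>> {
    Ok(std::fs::read_dir("/proc")?
        .filter_map(|entry| entry.ok())
        .filter_map(|entry| entry.file_name().to_string_lossy().parse().ok())
        .collect())
}

// Sum the Swap: lines of one smaps file, reusing a single String buffer.
fn swap_kb(pid: u32) -> std::io::Result<u64> {
    let mut reader = BufReader::new(File::open(format!("/proc/{pid}/smaps"))?);
    let mut line = String::new(); // allocated once, reused every iteration
    let mut swap = 0u64;
    while reader.read_line(&mut line)? != 0 {
        if let Some(v) = line.strip_prefix("Swap:") {
            swap += v.trim().trim_end_matches("kB").trim().parse().unwrap_or(0);
        }
        line.clear(); // keeps the capacity, so no new allocation next time
    }
    Ok(swap)
}
```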
Switching to smaps_rollup:
While using smaps to read per-process data, I initially assumed it was the best option available. During my research, however, I discovered that the slowness of smaps had been a concern for others as well. To complicate things further, the default value of kernel.perf_event_paranoid was 2, which made my flamegraphs inaccurate. After setting it to 1, I noticed that the call to __show_smap was taking the most time. For more information about the kernel.perf_event_paranoid sysctl value, see https://sysctl-explorer.net/kernel/perf_event_paranoid/.
Before:
After:
Relevant section of the Linux kernel source: https://sourcegraph.com/github.com/torvalds/linux/-/blob/fs/proc/task_mmu.c?L817:13
To solve this problem, I began searching for an alternative and found smaps_rollup just below in the kernel source. This more efficient alternative sums up the data for me, eliminating the need for manual accumulation. After replacing my original implementation with smaps_rollup, I achieved an additional 35-40% performance improvement, bringing the runtime down to an impressive 30ms.
Final version of the method after minor adjustments:
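In outline, the method now reads the pre-summed /proc/<pid>/smaps_rollup file instead of iterating over every mapping; a sketch of that shape (illustrative, the real method lives in the repo):

```rust
// smaps_rollup already contains one pre-summed "Rss:"/"Pss:"/"Swap:" line per
// process, so the per-mapping accumulation loop disappears entirely.
#[derive(Default)]
struct ProcessMemory {
    rss_kb: u64,
    swap_kb: u64,
}

async fn read_rollup(pid: u32) -> std::io::Result<ProcessMemory> {
    // The file is tiny (a single rolled-up header), so one read is enough.
    let data = tokio::fs::read_to_string(format!("/proc/{pid}/smaps_rollup")).await?;
    let mut mem = ProcessMemory::default();
    for line in data.lines() {
        if let Some(v) = line.strip_prefix("Rss:") {
            mem.rss_kb = parse_kb(v);
        } else if let Some(v) = line.strip_prefix("Swap:") {
            mem.swap_kb = parse_kb(v);
        }
    }
    Ok(mem)
}

fn parse_kb(v: &str) -> u64 {
    v.trim().trim_end_matches("kB").trim().parse().unwrap_or(0)
}
```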
Conclusion:
Throughout this optimization journey, I managed to reduce the runtime of my Rust program zmem from an initial 600ms down to just 30ms. By employing various optimization techniques, such as profiling with cargo flamegraph, parallelizing the update function, improving data locality, reducing redundant allocations, and switching to an alternative data source (smaps_rollup), I achieved a much faster and more efficient program.
My experience with Rust has shown me the power of the language and the optimization possibilities it offers. I'm excited to continue exploring Rust and its performance capabilities, and I hope that my journey in optimizing zmem has provided you with some valuable insights for your own Rust projects.
Happy coding!
Source repo for Zmem: https://github.com/xeome/Zmem