Zram Performance Analysis

Introduction

Zram is a kernel module that creates a compressed block device in RAM, which can be used as a swap device for more efficient memory management. In this document we analyze the performance of the compression algorithms available to zram and their impact on the system. We also discuss how different page-cluster values affect latency and throughput.

Compression Algorithm Comparison

The following table compares the performance of different compression algorithms used in Zram, in terms of compression time, data size, compressed size, total size, and compression ratio.

Data from Linux Reviews:

| Algorithm | Cp time | Data | Compressed | Total | Ratio |
|-----------|---------|------|------------|-------|-------|
| lzo       | 4.571s  | 1.1G | 387.8M     | 409.8M | 2.689 |
| lzo-rle   | 4.471s  | 1.1G | 388M       | 410M   | 2.682 |
| lz4       | 4.467s  | 1.1G | 403.4M     | 426.4M | 2.582 |
| lz4hc     | 14.584s | 1.1G | 362.8M     | 383.2M | 2.872 |
| 842       | 22.574s | 1.1G | 538.6M     | 570.5M | 1.929 |
| zstd      | 7.897s  | 1.1G | 285.3M     | 298.8M | 3.961 |
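To make the trade-off in the table explicit, a small sketch (using only the numbers above) can rank the algorithms by compression ratio and by compression time:

```python
# Compression results from the Linux Reviews table above:
# algorithm -> (compression time in seconds, compression ratio)
results = {
    "lzo": (4.571, 2.689),
    "lzo-rle": (4.471, 2.682),
    "lz4": (4.467, 2.582),
    "lz4hc": (14.584, 2.872),
    "842": (22.574, 1.929),
    "zstd": (7.897, 3.961),
}

# Best compression ratio (higher is better)
best_ratio = max(results, key=lambda a: results[a][1])
# Fastest compression (lower is better)
fastest = min(results, key=lambda a: results[a][0])

print(f"Best ratio: {best_ratio}")  # zstd
print(f"Fastest:    {fastest}")     # lz4
```

zstd leads on ratio by a wide margin, while lz4 is the fastest compressor; the rest of the document quantifies this trade-off further.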

Data from u/VenditatioDelendaEst:

| algo    | page-cluster | MiB/s | IOPS    | Mean Latency (ns) | 99% Latency (ns) | comp_ratio |
|---------|--------------|-------|---------|-------------------|------------------|------------|
| lzo     | 0 | 5821  | 1490274 | 2428  | 7456  | 2.77 |
| lzo     | 1 | 6668  | 853514  | 4436  | 11968 | 2.77 |
| lzo     | 2 | 7193  | 460352  | 8438  | 21120 | 2.77 |
| lzo     | 3 | 7496  | 239875  | 16426 | 39168 | 2.77 |
| lzo-rle | 0 | 6264  | 1603776 | 2235  | 6304  | 2.74 |
| lzo-rle | 1 | 7270  | 930642  | 4045  | 10560 | 2.74 |
| lzo-rle | 2 | 7832  | 501248  | 7710  | 19584 | 2.74 |
| lzo-rle | 3 | 8248  | 263963  | 14897 | 37120 | 2.74 |
| lz4     | 0 | 7943  | 2033515 | 1708  | 3600  | 2.63 |
| lz4     | 1 | 9628  | 1232494 | 2990  | 6304  | 2.63 |
| lz4     | 2 | 10756 | 688430  | 5560  | 11456 | 2.63 |
| lz4     | 3 | 11434 | 365893  | 10674 | 21376 | 2.63 |
| zstd    | 0 | 2612  | 668715  | 5714  | 13120 | 3.37 |
| zstd    | 1 | 2816  | 360533  | 10847 | 24960 | 3.37 |
| zstd    | 2 | 2931  | 187608  | 21073 | 48896 | 3.37 |
| zstd    | 3 | 3005  | 96181   | 41343 | 95744 | 3.37 |

Data from my Raspberry Pi 4 (2 GB model):

| algo    | page-cluster | MiB/s   | IOPS      | Mean Latency (ns) | 99% Latency (ns) | comp_ratio |
|---------|--------------|---------|-----------|-------------------|------------------|------------|
| lzo     | 0 | 1275.19 | 326448.93 | 9965.14  | 18816.00  | 1.62 |
| lzo     | 1 | 1892.08 | 242186.68 | 14178.77 | 31104.00  | 1.62 |
| lzo     | 2 | 2451.65 | 156905.52 | 23083.55 | 56064.00  | 1.62 |
| lzo     | 3 | 2786.33 | 89162.46  | 42224.49 | 107008.00 | 1.62 |
| lzo-rle | 0 | 1271.53 | 325511.42 | 9997.72  | 20096.00  | 1.62 |
| lzo-rle | 1 | 1842.69 | 235863.95 | 14627.23 | 34048.00  | 1.62 |
| lzo-rle | 2 | 2404.35 | 153878.65 | 23592.19 | 60160.00  | 1.62 |
| lzo-rle | 3 | 2766.61 | 88531.46  | 42579.14 | 114176.00 | 1.62 |
| lz4     | 0 | 1329.87 | 340447.83 | 9421.35  | 15936.00  | 1.59 |
| lz4     | 1 | 2004.43 | 256567.19 | 13238.78 | 25216.00  | 1.59 |
| lz4     | 2 | 2687.75 | 172015.93 | 20807.00 | 43264.00  | 1.59 |
| lz4     | 3 | 3157.29 | 101033.42 | 36901.36 | 80384.00  | 1.59 |
| zstd    | 0 | 818.88  | 209633.97 | 16672.13 | 38656.00  | 1.97 |
| zstd    | 1 | 1069.07 | 136840.50 | 26777.05 | 69120.00  | 1.97 |
| zstd    | 2 | 1286.17 | 82314.84  | 46059.39 | 127488.00 | 1.97 |
| zstd    | 3 | 1427.75 | 45688.14  | 84876.56 | 246784.00 | 1.97 |
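For comparison, the same weighted-sum scoring used on the desktop numbers (weights of 0.4 for latency, 0.4 for compression ratio, 0.2 for throughput) can be applied to the Raspberry Pi data above. This is a sketch using the table values directly:

```python
# Raspberry Pi 4 (2 GB) results from the table above:
# (algo, page-cluster) -> (MiB/s, compression ratio, mean latency ns)
pi_data = {
    ('lzo', 0): (1275.19, 1.62, 9965.14),
    ('lzo', 1): (1892.08, 1.62, 14178.77),
    ('lzo', 2): (2451.65, 1.62, 23083.55),
    ('lzo', 3): (2786.33, 1.62, 42224.49),
    ('lzo-rle', 0): (1271.53, 1.62, 9997.72),
    ('lzo-rle', 1): (1842.69, 1.62, 14627.23),
    ('lzo-rle', 2): (2404.35, 1.62, 23592.19),
    ('lzo-rle', 3): (2766.61, 1.62, 42579.14),
    ('lz4', 0): (1329.87, 1.59, 9421.35),
    ('lz4', 1): (2004.43, 1.59, 13238.78),
    ('lz4', 2): (2687.75, 1.59, 20807.00),
    ('lz4', 3): (3157.29, 1.59, 36901.36),
    ('zstd', 0): (818.88, 1.97, 16672.13),
    ('zstd', 1): (1069.07, 1.97, 26777.05),
    ('zstd', 2): (1286.17, 1.97, 46059.39),
    ('zstd', 3): (1427.75, 1.97, 84876.56),
}

weights = {'latency': 0.4, 'ratio': 0.4, 'throughput': 0.2}
max_tp = max(v[0] for v in pi_data.values())
max_ratio = max(v[1] for v in pi_data.values())
max_lat = max(v[2] for v in pi_data.values())

def score(tp, ratio, lat):
    # Lower latency is better, so the latency term uses the inverse
    # of the normalized latency, as in the scoring script later on
    return (weights['latency'] * (max_lat / lat)
            + weights['ratio'] * ratio / max_ratio
            + weights['throughput'] * tp / max_tp)

best = max(pi_data, key=lambda k: score(*pi_data[k]))
print(best)  # ('lz4', 0) comes out on top here as well
```

Even on the Pi, where zstd's ratio advantage over lz4 is larger (1.97 vs 1.59), lz4 with page-cluster 0 still wins under these weights because its latency is much lower.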

The tables present the performance of LZO, LZO-RLE, LZ4, and ZSTD in terms of throughput, IOPS, latency, and compression ratio, the key factors when selecting a compression algorithm. To rank the algorithm/page-cluster combinations, we used a weighted sum with weights of 0.4 for latency, 0.4 for compression ratio, and 0.2 for throughput. LZ4 with page-cluster 0 achieved the highest weighted score, making it the optimal choice for this dataset under these weights. Overall, this evaluation provides a way to balance compression ratio, throughput, and latency when selecting the most suitable algorithm.

Code used to calculate the weighted sums:

data = {
    ('lzo', 0): (5821, 2.77, 2428),
    ('lzo', 1): (6668, 2.77, 4436),
    ('lzo', 2): (7193, 2.77, 8438),
    ('lzo', 3): (7496, 2.77, 16426),
    ('lzo-rle', 0): (6264, 2.74, 2235),
    ('lzo-rle', 1): (7270, 2.74, 4045),
    ('lzo-rle', 2): (7832, 2.74, 7710),
    ('lzo-rle', 3): (8248, 2.74, 14897),
    ('lz4', 0): (7943, 2.63, 1708),
    ('lz4', 1): (9628, 2.63, 2990),
    ('lz4', 2): (10756, 2.63, 5560),
    ('lz4', 3): (11434, 2.63, 10674),
    ('zstd', 0): (2612, 3.37, 5714),
    ('zstd', 1): (2816, 3.37, 10847),
    ('zstd', 2): (2931, 3.37, 21073),
    ('zstd', 3): (3005, 3.37, 41343),
}
 
weights = {'latency': 0.4, 'ratio': 0.4, 'throughput': 0.2}
 
# Find the maximum value for each metric
max_throughput = max(x[0] for x in data.values())
max_ratio = max(x[1] for x in data.values())
max_latency = max(x[2] for x in data.values())
 
best_score = 0
best_algo = None
best_page_cluster = None
 
for (algo, page_cluster), (throughput, ratio, latency) in data.items():
    # Normalize each metric against the best (maximum) observed value
    throughput_norm = throughput / max_throughput
    ratio_norm = ratio / max_ratio
    latency_norm = latency / max_latency
    # Lower latency is better, so the latency term uses the inverse of the
    # normalized latency; this term dominates when latency is very low
    score = (weights['latency'] * (1 / latency_norm)
             + weights['ratio'] * ratio_norm
             + weights['throughput'] * throughput_norm)
    print(f"{algo}, page-cluster {page_cluster}: {score:.4f}")
    if score > best_score:
        best_score = score
        best_algo = algo
        best_page_cluster = page_cluster
 
print(f"Best algorithm: {best_algo}")
print(f"Best page cluster: {best_page_cluster}")

Data from me:

Compiling memory-intensive code (vtm). Test was done on a Raspberry Pi 4B with 2 GB of RAM.

| algo | time    |
|------|---------|
| lz4  | 433.63s |
| zstd | 459.34s |
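The gap between the two is small in practice; a quick calculation from the numbers above:

```python
# Compile times from the table above, in seconds
lz4_time = 433.63
zstd_time = 459.34

# Relative slowdown of zstd versus lz4 on this workload
slowdown = (zstd_time - lz4_time) / lz4_time * 100
print(f"zstd was {slowdown:.1f}% slower than lz4")  # about 5.9%
```

So on this swap-heavy compile, choosing zstd over lz4 cost roughly 6% in wall-clock time while offering a substantially better compression ratio.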

Page-cluster Values and Latency

The page-cluster value controls the number of pages that are read in from swap in a single attempt, similar to page cache readahead. The pages are consecutive not in virtual or physical address space, but in swap space, meaning they were swapped out together.

The page-cluster value is a logarithmic value. Setting it to zero means one page, setting it to one means two pages, setting it to two means four pages, etc. A value of zero disables swap readahead completely.

The default value is 3 (eight pages at a time). However, tuning it may provide small benefits if the workload is swap-intensive. Lower values mean lower latency for the initial fault, but also extra faults and I/O delays for subsequent faults if those pages would have been part of the consecutive pages that readahead would have brought in.
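The mapping from page-cluster to pages per readahead attempt is simply 2 raised to the value, which a tiny sketch makes concrete (on Linux the current value can be read from /proc/sys/vm/page-cluster):

```python
def pages_per_readahead(page_cluster: int) -> int:
    """Number of pages brought in per swap readahead attempt.

    page-cluster is logarithmic: 0 -> 1 page (readahead effectively
    disabled), 1 -> 2 pages, 2 -> 4 pages, 3 -> 8 pages (the default).
    """
    return 2 ** page_cluster

for pc in range(4):
    print(f"page-cluster={pc}: {pages_per_readahead(pc)} pages per attempt")
```

With 4 KiB pages, the default of 3 therefore turns each swap-in into a 32 KiB read, which is why higher values raise both throughput and latency in the tables above.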

Conclusion

In the analysis of Zram performance, it was determined that the zstd algorithm provides the highest compression ratio while still maintaining acceptable speeds. The high compression ratio allows more of the working set to fit in uncompressed memory, reducing the need for swap and ultimately improving performance.

For daily (non-latency-sensitive) use, zstd with page-cluster=0 is recommended, since the majority of swapped data is likely stale (e.g., old browser tabs). However, systems that swap constantly may benefit from the lz4 algorithm instead, due to its higher throughput and lower latency.

It is important to note that zstd decompression is slow enough that readahead yields no throughput gain, so page-cluster=0 should be used with zstd. This is the default setting on ChromeOS and appears to be standard practice on Android.

The default page-cluster value of 3 is better suited to physical swap devices. It dates back to at least 2005, when the kernel switched to git, i.e., before the widespread adoption of SSDs. It is recommended to consider the specific requirements of the system and workload when configuring zram.
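As a practical starting point, the settings recommended above could be applied with something like the following sketch. The device name /dev/zram0 and the 4G size are assumptions to adjust for your system, and zramctl (from util-linux) must be available; the script only prints the commands rather than running them, since actually applying them requires root.

```python
# Sketch: print the commands that would set up a zstd-compressed zram swap
# device with page-cluster=0. Assumes util-linux's zramctl; the size and
# device name below are placeholders, not recommendations from benchmarks.
ALGO = "zstd"
SIZE = "4G"            # assumption: pick a size appropriate for your RAM
DEVICE = "/dev/zram0"  # assumption: first free zram device

commands = [
    f"zramctl --find --size {SIZE} --algorithm {ALGO}",  # allocate the device
    f"mkswap {DEVICE}",                                  # format it as swap
    f"swapon --priority 100 {DEVICE}",                   # prefer it over disk swap
    "sysctl vm.page-cluster=0",                          # disable swap readahead
]

for cmd in commands:
    print(cmd)
```

For the lz4 variant discussed above, only the --algorithm argument changes.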

Sources and See Also

https://linuxreviews.org/Zram

https://docs.kernel.org/admin-guide/sysctl/vm.html

https://www.reddit.com/r/Fedora/comments/mzun99/new_zram_tuning_benchmarks/