Pcompress: a parallel compression utility
2012-10-29 17:14:13 阿炯

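Pcompress is a utility that compresses, decompresses, and deduplicates data in parallel by splitting the input into chunks. It has a modular structure and supports multiple algorithms, including LZMA, Bzip2, and PPMD, along with CRC64 chunk checksums. SSE optimizations for the bundled LZMA are included.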

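It also implements content-aware deduplication and chunk-level delta compression based on a Semi-Rabin fingerprinting scheme. It has low metadata overhead and overlaps I/O with compression to achieve maximum parallelism. It bundles a simple slab allocator to speed up repeated allocation of similar-sized chunks. It can work in pipe mode, reading from stdin and writing to stdout. It also provides adaptive compression modes that try several algorithms on each chunk to determine the best one for that chunk.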
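The adaptive mode described above can be illustrated with a minimal Python sketch. It is not Pcompress's implementation: the codec table, the "raw" fallback, and the function names are assumptions made for illustration, and the selection rule here is simply "keep the smallest output".

```python
import bz2
import lzma
import zlib

# Hypothetical codec table for this sketch; Pcompress's real algorithm set differs.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

def compress_adaptive(chunk: bytes) -> tuple:
    """Try every codec on the chunk and keep the smallest result."""
    best_name, best_blob = "raw", chunk  # fall back to storing the chunk as-is
    for name, (compress, _) in CODECS.items():
        blob = compress(chunk)
        if len(blob) < len(best_blob):
            best_name, best_blob = name, blob
    return best_name, best_blob

def decompress_adaptive(name: str, blob: bytes) -> bytes:
    """Undo compress_adaptive using the codec name stored with the chunk."""
    if name == "raw":
        return blob
    return CODECS[name][1](blob)
```

Trying every codec costs roughly the sum of all compression times per chunk, which is why a real tool would more likely use a cheap heuristic to guess the best codec than compress each chunk several times.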

It is written in C/C++ and licensed under the LGPL.


Pcompress is a utility that compresses and decompresses data in parallel by splitting the input into chunks. It has a modular structure and supports multiple algorithms such as LZMA, Bzip2, and PPMD, with SKEIN/SHA checksums for data integrity. It can also apply Lempel-Ziv pre-compression (derived from libbsc) to improve compression ratios across the board. SSE optimizations for the bundled LZMA are included.
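As a rough illustration of chunk-parallel compression, the following Python sketch splits a buffer into fixed-size chunks and compresses them concurrently with the standard library. The chunk size, worker count, and function names are assumptions for this example, and the output is just a list of independent LZMA streams, not Pcompress's file format.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # 1 MiB; an arbitrary size chosen for this sketch

def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Cut the input into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def compress_parallel(data: bytes, size: int = CHUNK_SIZE, workers: int = 4) -> list:
    """Compress every chunk independently; map() returns results in input order."""
    # In CPython the lzma module releases the GIL while compressing,
    # so worker threads can run on separate cores.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lzma.compress, split_chunks(data, size)))

def decompress_parallel(blobs: list, workers: int = 4) -> bytes:
    """Decompress the chunks concurrently and join them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(lzma.decompress, blobs))
```

Compressing chunks independently trades some compression ratio for parallelism, because matches cannot reach across chunk borders; larger chunks narrow that gap at the cost of memory and fewer units of parallel work.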

It also implements chunk-level Content-Aware Deduplication and Delta Compression based on a Semi-Rabin Fingerprinting scheme. Delta Compression is done via the widely used bsdiff algorithm. Similarity is detected using a technique based on MinHashing. When doing chunk-level dedupe, it attempts to merge the index entries of adjacent non-duplicate blocks into a single larger entry to reduce metadata. In addition, it can internally split chunks at Rabin boundaries to help dedupe and compression.
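To make content-aware deduplication more concrete, here is a small Python sketch of content-defined chunking followed by exact-match dedupe. It uses a plain polynomial rolling hash rather than Pcompress's Semi-Rabin scheme, identifies chunks by SHA-256, and leaves out delta compression and MinHash similarity entirely; every constant and name below is an assumption for illustration.

```python
import hashlib

WINDOW = 16            # bytes covered by the rolling hash (sketch value)
BASE = 257
MOD = (1 << 31) - 1
MASK = (1 << 11) - 1   # cut when the low 11 bits are zero (about 2 KiB apart)
MIN_CHUNK = 256
MAX_CHUNK = 16384

def chunk_boundaries(data: bytes) -> list:
    """Return the end offsets of content-defined chunks."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    ends, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # take in the newest byte
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            ends.append(i + 1)
            start, h = i + 1, 0
    if start < len(data):
        ends.append(len(data))
    return ends

def dedupe(data: bytes) -> tuple:
    """Store each distinct chunk once; the recipe lists chunk digests in order."""
    store, recipe, start = {}, [], 0
    for end in chunk_boundaries(data):
        chunk = data[start:end]
        digest = hashlib.sha256(chunk).digest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
        start = end
    return store, recipe

def rebuild(store: dict, recipe: list) -> bytes:
    """Reassemble the original data from the chunk store and the recipe."""
    return b"".join(store[d] for d in recipe)
```

Because cut points depend on the last WINDOW bytes of content rather than on absolute offsets, inserting a few bytes near the start of a file shifts only the nearby chunks, and later chunks usually hash to the same digests as before.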

Latest version: 2.2

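This release is mainly bug fixes, including crashes caused by certain invalid inputs and build problems on Debian 6 and on older processors without SSE4. It improves the speed and accuracy of Min-heap based similarity matching, raises the accuracy of scalable segmented global deduplication by 95%, and adds more test cases.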

Project homepage: https://github.com/moinakg/pcompress/
This article was last updated by 阿炯 on 2015-02-06 09:24:25 and is currently at revision 2.