An Introduction to the HAMMER Filesystem - FreeOA

2009-09-26 18:02:41

Administrator

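DragonFly is the operating system that Matthew Dillon began developing on a FreeBSD 4.8 base after he left the FreeBSD development team in June 2003 over differences in philosophy. Five years on, DragonFly has not attracted much attention and does not seem to be widely deployed; it remains a niche system. With the release of Hammer, a killer feature, DragonFly is likely to draw many more eyes in the near future. Why is Hammer so attractive? The answer starts with a few of its distinctive features.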

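1. No fsck. fsck is an old problem, and a great deal of effort has gone into solving it: journaling filesystems are the most common answer, and FreeBSD has Soft Updates and background fsck. Every new filesystem now claims that it needs no fsck. In practice this only means that it almost never needs one, and when it does, the pain follows. Even with background fsck, filesystem performance drops sharply while the check runs. The problem is not the filesystem code but the approach. Hardware is advancing rapidly, yet some areas have barely moved, and two of them are awkward for filesystems. First, hardware is no more reliable than it used to be; when hardware fails, even the best filesystem ends up with errors, so fsck never becomes obsolete. Second, disk capacity has ballooned and the price per GB keeps falling, but disk speed has not grown in proportion, and for random or concurrent I/O there has been almost no visible improvement. Together these mean that fsck will still be needed, and when it is, checking today's TB-scale filesystems is a long and painful process. Calling Hammer "no fsck" does not mean that the filesystem never crashes; it means that it needs no explicit fsck and, more importantly, that the cost of checking is low. How does it manage this? Start with Hammer's layout. A Hammer filesystem consists of multiple volumes (typically disk partitions), each volume consists of multiple clusters, and each cluster consists of multiple 16KB buffers (that is, blocks). In Hammer, every operation that could leave the filesystem inconsistent after a failure is confined to a single cluster. In other words, if a failure occurs, each cluster's inconsistency is independent, and repairing one cluster does not require reading or writing any other cluster. Whenever Hammer opens a cluster for reading or writing it sets a mount flag on it; if a crash leaves the flag set, the cluster is checked and repaired the next time it is opened. This keeps the area that needs checking within an acceptable size. The idea is not new; ChunkFS, for example, proposed something similar.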

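2. Snapshots / history. Snapshots are nothing new, and file history has been done before, but Hammer's approach is more distinctive. By default Hammer keeps a history of every write to disk: each time the buffer cache is flushed to disk a new historical snapshot is formed, at a granularity of roughly 30-60 seconds. This is (in my view) a somewhat controversial feature. On a system with frequent changes it could use up all the space very quickly, so it is not suited to small filesystems. Used properly, though, it is a very powerful feature for both backups and recovering deleted files. It is also half the foundation for the next feature.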

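3. Mirroring. Hammer currently supports mirroring from one master to multiple slaves. A slave can be an exact copy of the master, or it can keep somewhat more or less history. What makes this mirroring special is that it is queue-less and asynchronous. A mirror does not need to pull information from the master synchronously and continuously; it can sync once a day, or over an unreliable network. As long as it is connected to the master, it can access any content on the master while mirroring is in progress, because that content does not have to be transferred in any particular queue order. This works because Hammer holds the history of all filesystem activity (unless history has been disabled or deleted) and every piece of content is written only once. Hammer's ambitions go further: future versions are intended to support multi-master, multi-slave clustering and mirroring.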

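Hammer also has a number of smaller features: background optimization of the filesystem layout (one of the features planned for ext4), CRC checking of data (currently this appears to cover only metadata, with file data to follow), no inode limit (some older filesystems such as ext2/3 still have one), support for filesystems up to 1EB, and live growing and shrinking of partitions (not yet implemented, but one of the goals). According to the author, current performance is quite good. There is one more feature: the BSD license. With a filesystem this shiny, I suspect it will not be long before someone steps up to port it to Linux. Linux really is short of a filesystem that shines like this.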

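It may be some time before Hammer is ready for practical use; the author and the users will need to test and tune it together. But Hammer's design philosophy is already out in front: hardware is cheap and error-prone, data is precious, and you can never have too much speed.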

--------------------------------
The HAMMER Filesystem

The HAMMER filesystem is a new addition to DragonFly. Being a brand-new filesystem we consider HAMMER to be in an early Beta state as of the 2.0 release, and expect it to become production-ready by the end of 2008. All major features except the mirroring are quite well tested as-of the release. The mirroring was the last big-ticket item to go in prior to the 2.0 release and should be considered more in a late-alpha light.

Now that it has been released, HAMMER is going to be given some settling time through the end of 2008. We will be testing the waters on porting interest and making adjustments to the source files to support that work, mainly by wrapping the DragonFly-specific calls and globals used by the filesystem into more OS-specific files and leaving as much of the meat of the filesystem as possible as machine-independent code.

If you are interested in porting HAMMER to another OS, please drop me (Matthew Dillon) a line at dillon at backplane.com. I will be creating a new DragonFly mailing list specifically for HAMMER porting as well as a git or mercurial repository (I haven't decided which yet) separate from the DragonFly repository.

I can't stress enough that HAMMER is designed for large storage media. The minimum I would be comfortable with would be around a 40G partition, though I often create smaller partitions for testing purposes. HAMMER can address up to 1-Exabyte of space and the use target is really designed for 500G and up. What does this mean for people who want to use HAMMER on a smaller partition? Well, there are two issues. First, the filesystem needs to reserve several hundred megabytes of media space to serve as buffer space for the reblocker. If you create a small filesystem the space efficiency is not going to be all that good. More importantly, HAMMER recovers space via pruning and reblocking cron jobs which early adopters must set up manually and are intended to run a few minutes every night to incrementally clean up the filesystem. You don't get instant gratification when you 'rm' something, so if the filesystem is too small normal use may run you out of space before the pruning and reblocking can catch up.

If you remember only two things about this filesystem, the first should be the large-media nature of the filesystem's management functions, and the second should be the historical data retention. Most systems sync their mounts every 30-60 seconds. For HAMMER this means that you effectively get a snapshot every 30-60 seconds. The filesystem's fine-grained nature shows up when you use the hammer or undo commands to sift through the history, but the absolute best way to utilize the fine-grained nature of the filesystem is to create a cron job which creates a snapshot softlink at the desired interval, for example once an hour, using the hammer snapshot command. Creating an actual softlink via this command guarantees you a consistent view of the filesystem state.
--------------------------------
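The HAMMER filesystem is the one I am most looking forward to in 2009, but it will certainly not be finished in 2009. We will have to wait; a filesystem takes a long time to mature.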

Related material follows:

HAMMER Filesystem Design
Submitted by Jeremy on October 10, 2007 - 8:51pm.

"I am going to start committing bits and pieces of the HAMMER filesystem over the next two months," announced Matthew Dillon on the Dragonfly BSD kernel mailing list. He noted that the filesystem should be functional by the 2.0 release in December, "I am making good progress and I believe it will be beta quality by the release. It took nearly the whole year to come up with a workable design. I thought I had it at the beginning of the year but I kept running into issues and had to redesign the thing several times since then." Matthew then posted a detailed design document for the new filesystem.


During the followup discussion, Matthew was asked if HAMMER would be a ZFS killer. He responded, "ZFS serves a different purpose and I think it is cool, but as time has progressed I find myself liking ZFS's design methodology less and less, and I am very glad I decided against trying to port it." He noted it is essential to have redundant copies of data, but added, "the problem ZFS has is that it is TOO redundant. You just don't need that scale of redundancy if you intend to operate in a multi-master replicated environment because you not only have wholely independant (logical) copies of the filesystem, they can also all be live and online at the same time." As for how Dragonfly's new filesystem will address redundancy, he explained:


"HAMMER's approach to redundancy is logical replication of the entire filesystem. That is, wholely independant copies operating on different machines in different locations. Ultimately HAMMER's mirroring features will be used to further our clustering goals. The major goal of this project is transparent clustering and a major requirement for that is to have a multi-master replicated environment. That is the role HAMMER will eventually fill. We wont have multi-master in 2.0, but there's a good chance we will have it by the end of next year."

--------------------------------------------------------------------------------
From: Matthew Dillon <dillon@...>
Subject: HAMMER filesystem update - design document
Date: Oct 10, 3:33 pm 2007
Ok, here's the final design document that I am now implementing. Again, I expect most or all of these features to be ready and the filesystem to be beta-quality by the December release.

Hammer Filesystem

(I) General Storage Abstraction
HAMMER uses a basic 16K filesystem buffer for all I/O. Buffers are collected into clusters, clusters are collected into volumes, and a single HAMMER filesystem may span multiple volumes.

HAMMER maintains a small hinted radix tree for block management in each layer. A small radix tree in the volume header manages cluster allocations within a volume, one in the cluster header manages buffer allocations within a cluster, and most buffers (pure data buffers excepted) will embed a small tree to manage item allocations within the buffer.

Volumes are typically specified as disk partitions, with one volume designated as the root volume containing the root cluster. The root cluster does not need to be contained in volume 0 nor does it have to be located at any particular offset.

Data can be migrated on a cluster-by-cluster or volume-by-volume basis and any given volume may be expanded or contracted while the filesystem is live. Whole volumes can be added and (with appropriate data migration) removed.

HAMMER's storage management limits it to 32768 volumes, 32768 clusters per volume, and 32768 16K filesystem buffers per cluster. A volume is thus limited to 16TB and a HAMMER filesystem as a whole is limited to 524288TB. HAMMER's on-disk structures are designed to allow future expansion through expansion of these limits. In particular, the volume id is intended to be expanded to a full 32 bits in the future and using a larger buffer size will also greatly increase the cluster and volume size limitations by increasing the number of elements the buffer-restricted radix trees can manage.

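
The 16TB and 524288TB figures follow directly from the three limits of 32768 and the 16K buffer size. The short Python check below reproduces them; the variable names are illustrative only, and "TB" is assumed to mean 2**40 bytes, which is the reading under which the quoted numbers come out exactly.

```python
# Limits quoted in the design document.
VOLUMES_PER_FS = 32768
CLUSTERS_PER_VOLUME = 32768
BUFFERS_PER_CLUSTER = 32768
BUFFER_SIZE = 16 * 1024          # one 16K filesystem buffer, in bytes

MB = 1024 ** 2
TB = 1024 ** 4                   # assumption: "TB" in the text means 2**40 bytes

cluster_bytes = BUFFERS_PER_CLUSTER * BUFFER_SIZE
volume_bytes = CLUSTERS_PER_VOLUME * cluster_bytes
filesystem_bytes = VOLUMES_PER_FS * volume_bytes

print(cluster_bytes // MB, "MB per cluster")
print(volume_bytes // TB, "TB per volume")
print(filesystem_bytes // TB, "TB per filesystem")
```

Each cluster therefore covers 512MB, which is also the largest region a single cluster-level recovery would have to examine under this layout.
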
HAMMER breaks all of its information down into objects and records. Records have a creation and deletion transaction id which allows HAMMER to maintain a historical store. Information is only physically deleted based on the data retention policy. Those portions of the data retention policy affecting near-term modifications may be acted upon by the live filesystem but all historical vacuuming is handled by a helper process.


All information in a HAMMER filesystem is CRCd to detect corruption.


(II) Filesystem Object Topology

The objects and records making up a HAMMER filesystem are organized into a single, unified B-Tree. Each cluster maintains a B-Tree of the records contained in that cluster and a unified B-Tree is constructed by linking clusters together. HAMMER issues PUSH and PULL operations internally to open up space for new records and to balance the global B-Tree. These operations may have the side effect of allocating new clusters or freeing clusters which become unused.

B-Tree operations tend to be limited to a single cluster. That is, the B-Tree insertion and deletion algorithm is not extended to the whole unified tree. If insufficient space exists in a cluster HAMMER will allocate a new cluster, PUSH a portion of the existing cluster's record store to the new cluster, and link the existing cluster's B-Tree to the new one.

Because B-Tree operations tend to be restricted and because HAMMER tries to avoid balancing clusters in the critical path, HAMMER employs a background process to keep the topology as a whole in balance. One side effect of this is that HAMMER is fairly loose when it comes to inserting new clusters into the topology.

HAMMER objects revolve around the concept of an object identifier. The obj_id is a 64 bit quantity which uniquely identifies a filesystem object for the entire life of the filesystem. This uniqueness allows backups and mirrors to retain varying amounts of filesystem history by removing any possibility of conflict through identifier reuse. HAMMER typically iterates object identifiers sequentially and expects to never run out. At a creation rate of 100,000 objects per second it would take HAMMER around 6 million years to run out of identifier space. The characteristics of the HAMMER obj_id also allow HAMMER to operate in a multi-master clustered environment.
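
The "around 6 million years" estimate can be checked with a few lines of arithmetic. This Python snippet assumes the full 64-bit space is usable and a 365.25-day year; both are simplifications made here, not statements from the design document.

```python
ID_SPACE = 2 ** 64                    # number of distinct 64-bit obj_id values
CREATION_RATE = 100_000               # objects created per second, as in the text
SECONDS_PER_YEAR = 365.25 * 24 * 3600

years = ID_SPACE / CREATION_RATE / SECONDS_PER_YEAR
print(f"{years / 1e6:.1f} million years")
```

The result is a little under 6 million years, consistent with the rounded figure quoted above.
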

A filesystem object is made up of records. Each record references a variable-length store of related data, a 64 bit key, and a creation and deletion transaction id which is indexed along with the key.

HAMMER utilizes a 64 bit key to index all records. Regular files use the base data offset of the record as the key while directories use a namekey hash as the key and store one directory entry per record. For all intents and purposes a directory can store an unlimited number of files.

HAMMER is also capable of associating any number of out-of-band attributes with a filesystem object using a separate key space. This key space may be used for extended attributes, ACLs, and anything else the user desires.

(III) Access to historical information
A HAMMER filesystem can be mounted with an as-of date to access a snapshot of the system. Snapshots do not have to be explicitly taken but are instead based on the retention policy you specify for any given HAMMER filesystem. It is also possible to access individual files or directories (and their contents) using an as-of extension on the file name.


HAMMER uses the transaction ids stored in records to present a snapshot view of the filesystem as-of any time in the past, with a granularity based on the retention policy chosen by the system administrator. This feature also effectively implements file versioning.
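
One way to picture the as-of view described in this section is as a filter over records that carry a creation and a deletion transaction id. The Python sketch below is a toy model, not HAMMER's on-disk format: the `Record` class, the `visible` rule, and the convention that a `delete_tid` of 0 means "still live" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    key: int
    data: bytes
    create_tid: int
    delete_tid: int = 0  # assumed convention: 0 means the record is still live


def visible(rec: Record, as_of_tid: int) -> bool:
    """A record belongs to the as-of view if it existed at that transaction id."""
    created = rec.create_tid <= as_of_tid
    not_yet_deleted = rec.delete_tid == 0 or rec.delete_tid > as_of_tid
    return created and not_yet_deleted


def snapshot(records: List[Record], as_of_tid: int) -> List[Record]:
    """Return the records that make up the filesystem view as-of as_of_tid."""
    return [r for r in records if visible(r, as_of_tid)]


# Key 0 is overwritten at tid 20; key 1 exists only between tid 12 and tid 15.
history = [
    Record(key=0, data=b"v1", create_tid=10, delete_tid=20),
    Record(key=0, data=b"v2", create_tid=20),
    Record(key=1, data=b"tmp", create_tid=12, delete_tid=15),
]

for tid in (5, 12, 15, 25):
    print(tid, [r.data for r in snapshot(history, tid)])
```

Because an overwrite is modeled as "delete the old record, create a new one" under the same key, old versions stay readable until a retention policy removes them, which is why the same mechanism doubles as file versioning.
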

(IV) Mirrors and Backups

HAMMER is organized in a way that allows an information stream to be generated for mirroring and backup purposes. This stream includes all historical information available in the source. No queueing is required so there is no limit to the number of mirrors or backups you can have and no limit to how long any given mirror or backup can be taken offline. Resynchronization of the stream is not considered to be an expensive operation.

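
The "no queueing" property can be illustrated with a toy model in which the source keeps every record together with its transaction ids, so a mirror only has to remember the last transaction id it has seen. The Python sketch below is an assumption-laden illustration, not HAMMER's mirroring protocol: `Record`, `changes_since`, and `Mirror` are hypothetical names, and identifying a record version by `(key, create_tid)` is a simplification chosen here.

```python
from typing import Dict, List, NamedTuple, Tuple


class Record(NamedTuple):
    key: int
    create_tid: int
    delete_tid: int  # assumed convention: 0 means "not deleted"
    data: bytes


def changes_since(source: List[Record], last_tid: int) -> List[Record]:
    """Every record created or deleted after last_tid, in no particular order."""
    return [r for r in source if max(r.create_tid, r.delete_tid) > last_tid]


class Mirror:
    def __init__(self) -> None:
        self.records: Dict[Tuple[int, int], Record] = {}
        self.synced_tid = 0  # the only state the mirror has to remember

    def pull(self, source: List[Record], source_tid: int) -> int:
        """Catch up with the source; returns how many records were transferred."""
        delta = changes_since(source, self.synced_tid)
        for rec in delta:
            # (key, create_tid) names one version of one item, so a resend is harmless.
            self.records[(rec.key, rec.create_tid)] = rec
        self.synced_tid = source_tid
        return len(delta)


# The master at transaction 10, then again at transaction 30 after more activity.
master = [Record(0, 5, 0, b"a"), Record(1, 8, 0, b"b")]
mirror = Mirror()
first = mirror.pull(master, source_tid=10)

master[1] = master[1]._replace(delete_tid=20)  # key 1 deleted at tid 20
master.append(Record(2, 25, 0, b"c"))          # key 2 created at tid 25
second = mirror.pull(master, source_tid=30)    # the mirror was offline in between

print(first, second, sorted(mirror.records.values()) == sorted(master))
```

The source never tracks its mirrors or buffers changes for them, so in this model the number of mirrors and the length of time one stays offline cost the source nothing.
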

Mirrors and backups are maintained logically, not physically, and may have their own, independent retention policies. For example, your live filesystem could have a fairly rough retention policy, even none at all, then be streamed to an on-site backup and from there to an off-site backup, each with different retention policies.

(V) Transactions and Recovery

HAMMER implements an instant-mount capability and will recover information on a cluster-by-cluster basis as it is being accessed.

HAMMER numbers each record it lays down and stores a synchronization point in the cluster header. Clusters are synchronously marked 'open' when undergoing modification. If HAMMER encounters a cluster which is unexpectedly marked open it will perform a recovery operation on the cluster and throw away any records beyond the synchronization point.

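
The open flag and synchronization point described above can be sketched as a small state machine. The Python model below is a simplification under stated assumptions: the `Cluster` class, its method names, and the in-memory list standing in for on-disk records are all hypothetical, and real recovery would also have to repair the cluster's B-Tree and allocation maps.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Cluster:
    records: List[Tuple[int, str]] = field(default_factory=list)  # (record number, payload)
    next_record_no: int = 1
    sync_point: int = 0    # highest record number known to be safely synchronized
    is_open: bool = False  # set while the cluster is undergoing modification

    def begin_modify(self) -> None:
        self.is_open = True

    def append(self, payload: str) -> None:
        assert self.is_open, "cluster must be marked open before modification"
        self.records.append((self.next_record_no, payload))
        self.next_record_no += 1

    def commit(self) -> None:
        self.sync_point = self.next_record_no - 1
        self.is_open = False


def access(cluster: Cluster) -> bool:
    """Recover lazily on access; returns True if a recovery pass was needed."""
    if not cluster.is_open:
        return False
    # Unexpectedly open: throw away everything beyond the synchronization point.
    cluster.records = [(n, p) for n, p in cluster.records if n <= cluster.sync_point]
    cluster.next_record_no = cluster.sync_point + 1
    cluster.is_open = False
    return True


crashed = Cluster()
crashed.begin_modify()
crashed.append("a")
crashed.append("b")
crashed.commit()
crashed.begin_modify()
crashed.append("c")  # simulated crash: commit() is never reached

clean = Cluster()
print(access(crashed), crashed.records, access(clean))
```

Only the cluster that was open at the time of the crash is touched, and only when something actually accesses it, which is what makes an instant mount possible in this model.
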

HAMMER supports a userland transactional facility. Userland can query the current (filesystem wide) transaction id, issue numerous operations and on recovery can tell HAMMER to revert all records with a greater transaction id for any particular set of files. Multiple userland applications can use this feature simultaneously as long as the files they are accessing do not overlap. It is also possible for userland to set up an ordering dependency and maintain completely asynchronous operation while still being able to guarantee recovery to a fairly recent transaction id.

(VI) Database files

HAMMER uses 64 bit keys internally and makes key-based files directly available to userland. Key-based files are not regular files and do not operate using a normal data offset space.

You cannot copy a database file using a regular file copier. The file type will not be S_IFREG but instead will be S_IFDB. The file must be opened with O_DATABASE. Reads which normally seek the file forward will instead iterate through the records and lseek/qseek can be used to acquire or set the key prior to the read/write operation.