ext3文件系统
2010-08-03 22:10:16 阿炯

ext2和ext3是许多Linux操作系统发行版本的默认文件系统,ext基于ufs,是一种快速、稳定的文件系统。Linux ext2/ext3文件系统使用索引节点来记录文件信息,作用像windows的文件分配表。索引节点是一个结构,它包含了一个文件的长度、创建及修改时间、权限、所属关系、磁盘中的位置等信息。一个文件系统维护了一个索引节点的数组,每个文件或目录都与索引节点数组中的唯一一个元素对应。系统给每个索引节点分配了一个号码,也就是该节点在数组中的索引号,称为索引节点号。linux文件系统将文件索引节点号和文件名同时保存在目录中,所以目录只是将文件的名称和它的索引节点号结合在一起的一张表,目录中每一对文件名称和索引节点号称为一个连接。对于一个文件来说有唯一的索引节点号与之对应,对于一个索引节点号,却可以有多个文件名与之对应,因此在磁盘上的同一个文件可以通过不同的路径去访问它。

随着Linux系统在关键业务中的应用,Linux文件系统的弱点也渐渐显露出来了;其中ext2文件系统是非日志文件系统,这在关键行业的应用是一个致命的弱点,ext3文件系统弥补了这一缺点。

ext3文件系统是直接从ext2文件系统发展而来,目前ext3文件系统已经非常稳定可靠。它完全兼容ext2文件系统,用户可以平滑地过渡到一个日志功能健全的文件系统中来,这实际上了也是ext3日志文件系统初始设计的初衷。ext3 基于ext2 的代码,它的磁盘格式和 ext2 的相同;这意味着,一个干净卸装的ext3 文件系统可以作为 ext2 文件系统重新挂装,ext3文件系统仍然能被加载成ext2文件系统来使用,你可以把一个文件系统在ext3和ext2自由切换。

linux系统下ext3文件系统内给文件/目录命名,最长只能支持127个中文字符,英文则可以支持255个字符。
ext3文件系统一级子目录的个数为31998(个)。这是Linux为了cpu的搜索效率而规定的,要想改变数目大概要重新编译内核。
ext3文件系统下单个目录里的最大文件数,单个目录下的最大文件数似乎没什么特别限制,也是受限于所在文件系统的inode数限制:使用df -i或者使用tune2fs -l /dev/sdaX或者dumpe2fs -h /dev/sdaX查看可用inode数,后两个命令输出结果是一样的,但是跟df所得出的可用inode数会有些误差。
打开文件数限制(文件句柄、文件描述符),在终端里用ulimit -n 65535设置,或者/etc/security/limit.conf里设置用户打开文件数、进程数、CPU等。

ext2 文件系统
ext2文件系统应该说是Linux正宗的文件系统,早期的Linux都是用ext2,但随着技术的发展,大多Linux的发行版本目前并不用这个文件系统了;比如Redhat和Fedora 大多都建议用ext3,ext3文件系统是由ext2发展而来的。对于Linux新手,我们还是建议您不要用ext2文件系统;ext2支持undelete(反删除),如果您误删除文件,有时是可以恢复的,但操作上比较麻烦,ext2支持大文件。ext2 从一开始就被设计为一个商业级文件系统,沿用 BSD 的 Berkeley 文件系统的设计原理。ext2 提供了 GB 级别的最大文件大小和 TB 级别的文件系统大小,使其在 20 世纪 90 年代的地位牢牢巩固在文件系统大联盟中。很快它被广泛地使用,无论是在 Linux 内核中还是最终在 MINIX 中,且利用第三方模块可以使其应用于 MacOS 和 Windows。但这里仍然有一些问题需要解决:ext2 文件系统与 20 世纪 90 年代的大多数文件系统一样,如果在将数据写入到磁盘的时候,系统发生崩溃或断电,则容易发生灾难性的数据损坏。随着时间的推移,由于碎片(单个文件存储在多个位置,物理上其分散在旋转的磁盘上),它们也遭受了严重的性能损失。尽管存在这些问题,但今天 ext2 还是用在某些特殊的情况下 -- 最常见的是,作为便携式 USB 驱动器的文件系统格式。更多信息请访问ext2文件系统的官方主页

ext3 文件系统
ext3 is a Journalizing file system for Linux(ext3是一个用于Linux的日志文件系统),ext3支持大文件;但不支持反删除(undelete)操作,ext3日志文件系统的特点:
1、高可用性
系统使用了ext3文件系统后,即使在非正常关机后,系统也不需要检查文件系统。宕机发生后,恢复ext3文件系统的时间只要数十秒钟。

2、数据的完整性
ext3文件系统能够极大地提高文件系统的完整性,避免了意外宕机对文件系统的破坏。在保证数据完整性方面,ext3文件系统有2种模式可供选择。其中之一就是“同时保持文件系统及数据的一致性”模式,采用这种方式,你永远不再会看到由于非正常关机而存储在磁盘上的垃圾文件。

3、文件系统的速度
尽管使用ext3文件系统时,有时在存储数据时可能要多次写数据,但从总体上看来,ext3比ext2的性能还要好一些。这是因为ext3的日志功能对磁盘的驱动器读写头进行了优化,所以文件系统的读写性能较之ext2文件系统并来说,性能并没有降低。

4、数据转换
由ext2文件系统转换成ext3文件系统非常容易,只要简单地键入两条命令即可完成整个转换过程,用户不用花时间备份、恢复、格式化分区等。另外,ext3文件系统可以不经任何更改,而直接加载成为ext2文件系统。

5、多种日志模式
ext3有多种日志模式,一种工作模式是对所有的文件数据及metadata(定义文件系统中数据的数据,即数据的数据)进行日志记录(data=journal模式);另一种工作模式则是只对metadata记录日志,而不对数据进行日志记录,也即所谓data=ordered或者data=writeback模式。系统管理人员可以根据系统的实际工作要求,在系统的工作速度与文件数据的一致性之间作出选择。

ext3 于 2001 年 11 月在 2.4.15 内核版本中被采用到 Linux 内核主线中。ext3 和 20 世纪 90 年代后期的其它文件系统,如微软的 NTFS,使用日志来解决这个问题。日志是磁盘上的一种特殊的分配区域,其写入被存储在事务中;如果该事务完成磁盘写入,则日志中的数据将提交给文件系统自身。如果系统在该操作提交前崩溃,则重新启动的系统识别其为未完成的事务而将其进行回滚,就像从未发生过一样。这意味着正在处理的文件可能依然会丢失,但文件系统本身保持一致,且其它所有数据都是安全的。

在使用 ext3 文件系统的 Linux 内核中实现了三个级别的日志记录方式:
日记(journal)、顺序(ordered)和回写(writeback)

日记 是最低风险模式,在将数据和元数据提交给文件系统之前将其写入日志。这可以保证正在写入的文件与整个文件系统的一致性,但其显著降低了性能。

顺序 是大多数 Linux 发行版默认模式;顺序模式将元数据写入日志而直接将数据提交到文件系统。顾名思义,这里的操作顺序是固定的:首先,元数据提交到日志;其次,数据写入文件系统,然后才将日志中关联的元数据更新到文件系统。这确保了在发生崩溃时,那些与未完整写入相关联的元数据仍在日志中,且文件系统可以在回滚日志时清理那些不完整的写入事务。在顺序模式下,系统崩溃可能导致在崩溃期间文件的错误被主动写入,但文件系统它本身 -- 以及未被主动写入的文件 -- 确保是安全的。

回写 是第三种模式 -- 也是最不安全的日志模式。在回写模式下,像顺序模式一样,元数据会被记录到日志,但数据不会。与顺序模式不同,元数据和数据都可以以任何有利于获得最佳性能的顺序写入。这可以显著提高性能,但安全性低很多。尽管回写模式仍然保证文件系统本身的安全性,但在崩溃或崩溃之前写入的文件很容易丢失或损坏。

跟之前的 ext2 类似,ext3 使用 16 位内部寻址,这意味着对于有着 4K 块大小的 ext3 在最大规格为 16 TiB 的文件系统中可以处理的最大文件大小为 2 TiB。

实际使用ext3文件系统
创建新的ext3文件系统,例如要把磁盘上的hda6分区格式化ext3文件系统,并将日志记录在/dev/hda1分区,那么操作过程如下:
[root@debian root]# mke2fs -j /dev/hda6
mke2fs 1.24a (02-Sep-2001)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
.. .. ..
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 30 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

在创建新的文件系统时,可以看到,ext3文件系统执行自动检测的时间为180天或每第31次被mount时,实际上这个参数可以根据需要随意调节。以下将新的文件系统mount到主分区/data目录下:
[root@freeoa root]# mount -t ext3 /dev/hda6 /data

说明:以上将已格式化为ext3文件系统的/dev/hda6分区加载到/data目录下。ext3 基于ext2 的代码,它的磁盘格式和 ext2 的相同;这意味着,一个干净卸装的 ext3 文件系统可以作为 ext2 文件系统重新挂装。Ext3文件系统仍然能被加载成ext2文件系统来使用,你可以把一个文件系统在ext3和ext2自由切换。这时在ext2文件系统上的ext3日志文件仍然存在,只是ext2不能认出日志而已。

将ext2文件系统转换为ext3文件系统
将linux系统的文件系统由ext2转至ext3,有以下几处优点:第一系统的可用性增强了,第二数据集成度提高,第三启动速度提高了,第四ext2与ext3文件系统之间相互转换容易。以转换文件系统为例,将ext2文件系统转换为ext3文件系统,命令如下:
[root@freeoa root]# tune2fs -j /dev/hda9
tune2fs 1.24a (02-Sep-2001)
Creating journal inode: done
This filesystem will be automatically checked every 31 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

这样,原来的ext2文件系统就转换成了ext3文件系统。注意将ext2文件系统转换为ext3文件系统时,不必要将分区缷载下来转换。转换完成后,不要忘记将/etc/fstab文件中所对应分区的文件系统由原来的ext2更改为ext3。

ext3日志的存放位置
可以将日志放置在另外一个存储设备上,例如存放到分区/dev/hda8。例如要在/dev/hda8上创建一个ext3文件系统,并将日志存放在外部设备/dev/hda2上,则运行以下命令:
[root@freeoa root]#mke2fs -J device=/dev/hda8 /dev/hda2

ext3文件系统修复
新的e2fsprogs中的e2fsck支持ext3文件系统。当一个ext3文件系统被破坏时,先卸载该设备,再用e2fsck修复:
[root@freeoa root] # umount /dev/hda5
[root@freeoa root] #e2fsck -fy /dev/hda5

总而言之,ext3日志文件系统是目前linux系统由ext2文件系统过度到日志文件系统最为简单的一种选择,实现方式也最为简洁。由于是直接从 ext2文件系统发展而来,系统由ext2文件系统过渡到ext3日志文件系统升级过程平滑,可以最大限度地保证系统数据的安全性。目前linux系统要使用日志文件系统,最保险的方式就是选择ext3文件系统。

Linux文件系统ext3升级版本Next3

ext3文件系统久经考验,但它确实缺乏许多实用的功能,例如快照技术(Snapshot)在任意时间创建文件系统的一个副本。虽然ext3文件系统可以通过间接方法创建快照,但受很多限制,一个名叫Next3的文件系统试图为 Linux 用户提供一种简单灵活的方法直接在ext3文件系统中实现快照。

Next3由CTERA Networks开发,目前主要用于其C200网络附加存储设备。CTERA表示Next3 的快照能创建文件系统的任意时间点数据副本,可以随时从灾难中恢复到先前的状态。Next3为Linux用户提供了一种免费的、GPL授权的文件系统级别的快照技术。Next3的源代码已发布在SourceForge上,开发者提议将其合并到mainline kernel,不过内核维护者Ted Ts'o对此表示疑虑。Next3是一种新类型的文件系统,而不仅仅是ext3的一个扩展,它通过创建一种特殊的文件去代表文件系统的一个快照,总体上它与存储容量的大小相同,但它们是稀疏文件,因此一开始几乎不占任何空间。当硬盘上一个块发生变化,文件系统会首先检查这个块是否已经保存在最近的快照中。如果没有,受影响的块会移动到快照文件,然后分配一个新块去替代它。

----------------------------------------------------
The ext3 or third extended filesystem is a journaled file system that is commonly used by the Linux kernel. It is the default file system for many popular Linux distributions. Stephen Tweedie first revealed that he was working on extending ext2 in Journaling the Linux ext2fs Filesystem in a 1998 paper and later in a February 1999 kernel mailing list posting, and the filesystem was merged with the mainline Linux kernel in November 2001 from 2.4.15 onward. Its main advantage over ext2 is journaling which improves reliability and eliminates the need to check the file system after an unclean shutdown. Its successor is ext4.

Advantages
Although its performance (speed) is less attractive than competing Linux filesystems such as JFS, ReiserFS and XFS, it has a significant advantage in that it allows in-place upgrades from the ext2 file system without having to back up and restore data. Ext3 also uses less CPU power than ReiserFS and XFS. It is also considered safer than the other Linux file systems due to its relative simplicity and wider testing base.[citation needed]
The ext3 file system adds, over its predecessor:
* A Journaling file system
* Online file system growth
* Htree indexing for larger directories. An HTree is a specialized version of a B-tree (not to be confused with the H tree fractal).

Without these, any ext3 file system is also a valid ext2 file system. This has allowed well-tested and mature file system maintenance utilities for maintaining and repairing ext2 file systems to also be used with ext3 without major changes. The ext2 and ext3 file systems share the same standard set of utilities, e2fsprogs, which includes an fsck tool. The close relationship also makes conversion between the two file systems (both forward to ext3 and backward to ext2) straightforward.

While in some contexts the lack of "modern" filesystem features such as dynamic inode allocation and extents could be considered a disadvantage, in terms of recoverability this gives ext3 a significant advantage over file systems with those features. The file system metadata is all in fixed, well-known locations, and there is some redundancy inherent in the data structures that may allow ext2 and ext3 to be recoverable in the face of significant data corruption, where tree-based file systems may not be recoverable.

Size limits
ext3 has a maximum size for both individual files and the entire filesystem. For the filesystem as a whole that limit is 232 blocks. Both limits are dependent on the block size of the filesystem; the following chart summarizes the limits:
Block size     Max file size     Max filesystem size
1 KB     16 GB     2 TB
2 KB     256 GB     8 TB
4 KB     2 TB     16 TB
8 KB[limits 1]     2 TB     32 TB

1. ^ In Linux, 8 KB block size is only available on architectures which allow 8 KB pages, such as Alpha.

Journaling levels
There are three levels of journaling available in the Linux implementation of ext3:

Journal (lowest risk)
Both metadata and file contents are written to the journal before being committed to the main file system. Because the journal is relatively continuous on disk, this can improve performance in some circumstances. In other cases, performance gets worse because the data must be written twice - once to the journal, and once to the main part of the filesystem.

Ordered (medium risk)
Only metadata is journaled; file contents are not, but it's guaranteed that file contents are written to disk before associated metadata is marked as committed in the journal. This is the default on many Linux distributions. If there is a power outage or kernel panic while a file is being written or appended to, the journal will indicate the new file or appended data has not been "committed", so it will be purged by the cleanup process. (Thus appends and new files have the same level of integrity protection as the "journaled" level.) However, files being overwritten can be corrupted because the original version of the file is not stored. Thus it's possible to end up with a file in an intermediate state between new and old, without enough information to restore either one or the other (the new data never made it to disk completely, and the old data is not stored anywhere). Even worse, the intermediate state might intersperse old and new data, because the order of the write is left up to the disk's hardware. XFS uses this form of journaling.

Writeback (highest risk)
Only metadata is journaled; file contents are not. The contents might be written before or after the journal is updated. As a result, files modified right before a crash can become corrupted. For example, a file being appended to may be marked in the journal as being larger than it actually is, causing garbage at the end. Older versions of files could also appear unexpectedly after a journal recovery. The lack of synchronization between data and journal is faster in many cases. JFS uses this level of journaling, but ensures that any "garbage" due to unwritten data is zeroed out on reboot.

Disadvantages
Functionality
Since ext3 aims to be backwards compatible with the earlier ext2, many of the on-disk structures are similar to those of ext2. Because of that, ext3 lacks a number of features of more recent designs, such as extents, dynamic allocation of inodes, and block suballocation. There is a limit of 31998 sub-directories per one directory, stemming from its limit of 32000 links per inode.

ext3, like most current Linux filesystems, cannot be fsck-ed while the filesystem is mounted for writing. Attempting to check a file system that is already mounted may detect bogus errors where changed data has not reached the disk yet, and corrupt the file system in an attempt to "fix" these errors.

Defragmentation
There is no online ext3 defragmentation tool that works on the filesystem level. An offline ext2 defragmenter, e2defrag, exists but requires that the ext3 filesystem be converted back to ext2 first. But depending on the feature bits turned on in the filesystem, e2defrag may destroy data; it does not know how to treat many of the newer ext3 features.

There are userspace defragmentation tools like Shake and defrag. Shake works by allocating space for the whole file as one operation, which will generally cause the allocator to find contiguous disk space. It also tries to write files used at the same time next to each other. Defrag works by copying each file over itself. However they only work if the filesystem is reasonably empty. A true defragmentation tool does not exist for ext3.

That being said, as the Linux System Administrator Guide states, "Modern Linux filesystem(s) keep fragmentation at a minimum by keeping all blocks in a file close together, even if they can't be stored in consecutive sectors. Some filesystems, like ext3, effectively allocate the free block that is nearest to other blocks in a file. Therefore it is not necessary to worry about fragmentation in a Linux system."

While ext3 is more resistant to file fragmentation than the FAT filesystem, nonetheless ext3 filesystems can get fragmented over time or on specific usage patterns, like slowly-writing large files. Consequently the successor to the ext3 filesystem, ext4, is planned to eventually include an online filesystem defragmentation utility, and currently supports extents (contiguous file regions).

Recovery
There is no support of deleted file recovery in file system design. Ext3 driver actively deletes files by wiping file inodes for crash safety reasons. That's why accidental 'rm -rf ...' may cause permanent data loss.There are still several techniques and some commercial software like UFS Explorer Standard Recovery version 4 for recovery of deleted or lost files using file system journal analysis; however, they do not guarantee any specific file recovery.There is no chance of file recovery after file system format.

Compression
Support for transparent compression is available as an unofficial patch for ext3. This patch is a direct port of e2compr and still needs further development, it compiles and boots well with upstream kernels[citation needed] but journaling is not implemented yet. The current patch is named e3compr.

Lack of snapshots support
Unlike a number of modern file systems, Ext3 does not have native support for snapshots - the ability to quickly capture the state of the filesystem at arbitrary times, instead relying on less space-efficient volume level snapshots provided by the Linux LVM. The Next3 file system is a modified version of Ext3 which offers snapshots support, yet retains compatibility to the EXT3 on-disk format.

No checksumming in journal
Ext3 does not do checksumming when writing to the journal. If barrier=1 is not enabled as a mount option (in /etc/fstab), and if the hardware is doing out-of-order write caching, one runs the risk of severe filesystem corruption during a crash.

Consider the following scenario: If hard disk writes are done out-of-order (due to modern hard disks caching writes in order to amortize write speeds), it is likely that one will write a commit block of a transaction before the other relevant blocks are written. If a power failure or unrecoverable crash should occur before the other blocks get written, the system will have to be rebooted. Upon reboot, the file system will replay the log as normal, and replay the "winners" (transactions with a commit block, including the invalid transaction above which happened to be tagged with a valid commit block). The unfinished disk write above will thus proceed, but using corrupt journal data. The file system will thus mistakenly overwrite normal data with corrupt data while replaying the journal. There is a test program available to trigger the problematic behavior. If checksums had been used, where the blocks of the "fake winner" transaction were tagged with a mutual checksum, the file system could have known better and not replayed the corrupt data onto the disk. Journal checksumming has been added to ext4.

Filesystems going through the device mapper interface (including software RAID and LVM implementations) may not support barriers, and will issue a warning if that mount option is used. There are also some disks that do not properly implement the write cache flushing extension necessary for barriers to work, which causes a similar warning. In these situations, where barriers are not supported or practical, reliable write ordering is possible by turning off the disk's write cache and using the data=journal mount option. Turning off the disk's write cache may be required even when barriers are available. Applications like databases expect a call to fsync() will flush pending writes to disk, and the barrier implementation doesn't always clear the drive's write cache in response to that call. There is also a potential issue with the barrier implementation related to error handling during events such as a drive failure.

ext4
An enhanced version of the filesystem was announced by Theodore Ts'o on June 28, 2006 under the name of ext4. On October 11, 2008, the patches that mark ext4 as stable code were merged in the Linux 2.6.28 source code repositories, marking the end of the development phase and recommending its adoption.



该文章最后由 阿炯 于 2020-03-20 13:37:23 更新,目前是第 3 版。