LizardFS与MooseFS常见问题
2019-06-15 21:06:29 阿炯

本文整理了我因工作需要使用过的两款分布式存储解决方案软件(LizardFS 与 MooseFS)的常见问题,其中的翻译借助了工具,希望能对所见者有所帮助。

--------------------------------------------------------------
LizardFS Frequently Asked Questions
LizardFS常见问题

共有6个章节
General Questions(一般性问题)
Server related(服务相关)
Dummy-Clients(无)
Dummy-Platforms(无)
Networking(网络)
High Availability(高可用)

-------------------------------
General Questions
一般性问题

Why is it called LizardFS ?
为何叫做LizardFS?

It's a metaphor for something that can grow back if partially lost.
这是一个比喻,指即使部分丢失也能重新长出来的东西。

Can I move my systems from MooseFS to LizardFS ?
我能把我的系统从MooseFS移到LizardFS上吗?

Yes. This is documented in the Upgrading Lizardfs manual.
可以。LizardFS 升级手册(Upgrading LizardFS)中对此有说明。

Why a fork from MooseFS ?
为什么要从MooseFS派生?

At the time we did the fork there was nearly no movement in the Git repo of MooseFS. Only 2 devs had access to the SourceForge repo, there was no way to increase the number of possible contributors, and there was a simultaneous lack of response on the part of MooseFS maintenance.
在我们做派生的时候,MooseFS 的 Git 仓库中几乎没有任何更新,只有 2 名开发者拥有 SourceForge 仓库的访问权限,无法增加潜在贡献者的数量,同时 MooseFS 维护方也缺乏响应。

What kind of erasure coding does LizardFS use ?
LizardFS使用哪种擦除编码?

Reed-Solomon
里德-所罗门(Reed-Solomon)纠删码

Is there support for different storage sizes in one lizardFS deployment ?
在一个Lizardfs部署中是否支持不同的存储大小?

Yes
是的

Is there support for different network bandwidth in one lizardFS deployment?
在一个lizardfs部署中是否支持不同的网络带宽?

Yes. You can have different chunkservers having different bandwidth. As for different bandwidth per client, check out: Configuring QoS.
可以。不同的 chunkserver 可以拥有不同的带宽。至于为每个客户端设置不同的带宽,请查看“配置QoS”章节。
    
Is there support for different RAM amount in one lizardFS deployment?
在一个Lizardfs部署中是否支持不同的RAM数量?

Yes. Different chunkservers can have different amounts of RAM.
可以。不同的 chunkserver 可以配置不同数量的内存。

Is there support for encryption, if yes what kind?
是否支持加密,如果支持,将是什么类型?

Encryption of any kind supported by the platform you run your master and chunkservers on is supported since we make use of the underlying POSIX filesystems to store the bits and pieces.
只要运行 master 和 chunkserver 的平台本身支持某种加密即可使用,因为我们使用底层的 POSIX 文件系统来存储各个数据分片。
    
How are the deletes managed, if there's a garbage collector?
如果有垃圾收集器,如何管理删除?

Deleted files are sent to trash and removed when trashtime expires.
已删除的文件将被发送到垃圾箱,并在垃圾时间到期时删除。

Are the meta data servers "inside" lizard or "outside"?
元数据服务器是“内部”还是“外部”?

Inside
内部的

How do LizardFS chunks get distributed?
LizardFS中的块是如何分布的?

The operation is daisy chained. The process actually looks roughly the following way:
该操作以菊花链(daisy chain)方式进行,过程大致如下:
Client starts writing chunk to the first chunkserver.
客户端开始向第一个chunkserver写入区块。
As soon as the header and the data part of the first slice of the chunk arrive at the chunkserver it starts writing to the next one if goal >=2.
当块的第一个分片的头部和数据部分到达该 chunkserver 时,如果 goal >= 2,它就开始向下一个 chunkserver 写入。

Same goes for the next chunkserver if goal >=3
如果目标大于等于3,那么下一个chunkserver也会出现同样的情况。
As soon as the client has finished writing the chunk, it selects another chunkserver to start the same process for the next chunk, unless you define something else in your topology setup of course.
一旦客户机完成了区块的写入,它就会选择另一个chunkserver为下一个区块启动相同的进程,当然,除非在拓扑设置中定义了其它内容。

This, of course, is only true for replication goals. In EC mode it will be distributed writes from the client to as many chunkservers as defined in the EC goal so nothing of the above would apply.
当然,这只适用于复制目标。在EC模式下,它将从客户机向EC目标中定义的尽可能多的chunkserver进行分布式写入,因此上述任何内容都不适用。

-------------------------------
Server related
服务相关

How do I completely reset a cluster ?
如何完全重置群集?

The simplest way is to create a new metadata file. Go to your metadata directory on your current master server (look at the DATA_PATH in the mfsmaster.cfg file), then stop the master and create a new empty metadata file by executing:
最简单的方法是创建一个新的元数据文件。进入当前 master 服务器上的元数据目录(参见 mfsmaster.cfg 文件中的 DATA_PATH),然后停止 master,并执行以下命令创建一个新的空元数据文件:
echo -n "MFSM NEW" > metadata.mfs

Start the master and your cluster will be clear, all remaining chunks will be deleted.
启动 master 后,集群将被清空,所有残留的块都会被删除。
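
下面给出一个示意性的完整操作顺序,假设 DATA_PATH 为默认的 /var/lib/mfs,且可直接用 mfsmaster 命令管理进程(若使用 systemd,请替换为对应的服务名,此处仅为草稿):
# 停止 master
mfsmaster stop
# 在元数据目录中创建新的空元数据文件
cd /var/lib/mfs
echo -n "MFSM NEW" > metadata.mfs
# 重新启动 master,集群将被清空,残留的块会被删除
mfsmaster start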

How do I create a full backup of all my metadata (in case of recreating a master etc...)
如何创建所有元数据的完整备份(以备重建 master 等情况…)

Copy your data directory somewhere safe (default path: /var/lib/mfs).
将数据目录复制到安全的地方(默认路径为/var/lib/mfs)。

Files you should be interested in keeping are primarily:
应该感兴趣的文件主要包括:

metadata.mfs - your metadata set in binary form. This file is updated hourly and on master server shutdown. You can also trigger a metadata dump with lizardfs-admin save-metadata HOST PORT, but an admin password needs to be set in the mfsmaster.cfg first.
metadata.mfs - 二进制格式的元数据集。此文件每小时更新一次,并在 master 服务器关闭时更新。您还可以使用 lizardfs-admin save-metadata 主机 端口 触发元数据转储,但首先需要在 mfsmaster.cfg 中设置管理密码。

sessions.mfs - additional information on user sessions.
sessions.mfs-有关用户会话的其他信息。

changelog*.mfs - changes to metadata that weren't dumped to metadata.mfs yet.
changelog*.mfs-对尚未转储到metadata.mfs的元数据的更改。
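
下面是一个示意性的备份命令序列,假设 DATA_PATH 为默认的 /var/lib/mfs,备份目录 /backup/mfs-meta 只是举例;HOST、PORT 请替换为实际的 master 地址和端口:
# 先触发一次元数据转储(需要事先在 mfsmaster.cfg 中设置管理密码)
lizardfs-admin save-metadata HOST PORT
# 将关键文件复制到安全位置
mkdir -p /backup/mfs-meta
cp /var/lib/mfs/metadata.mfs /var/lib/mfs/sessions.mfs /backup/mfs-meta/
cp /var/lib/mfs/changelog*.mfs /backup/mfs-meta/ 2>/dev/null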

How do I speed up/slow down the main chunk loop execution during which chunks are checked if they need replication/deletion/rebalance/etc
如何加快/减慢主块循环(chunk loop)的执行?该循环会检查各个块是否需要复制/删除/重新平衡等。

Adjust the following settings in the master server configuration file:
在主服务器配置文件中调整以下设置:

CHUNKS_LOOP_PERIOD
Time in milliseconds between chunks loop execution (default is 1000).
块循环执行之间的时间(毫秒)(默认值为1000)。

CHUNKS_LOOP_MAX_CPU
Hard limit on CPU usage by chunks loop (percentage value, default is 60).
块循环对CPU使用的硬限制(百分比值,默认值为60)。
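
例如,若希望块循环执行得更频繁,同时限制它的 CPU 占用,可以在 mfsmaster.cfg 中这样调整(数值仅为示例,改动后请观察系统负载):
# mfsmaster.cfg 片段(示例值)
CHUNKS_LOOP_PERIOD = 500
CHUNKS_LOOP_MAX_CPU = 40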

-------------------------------
Networking
网络

What do you recommend for interface bonding ?
对于接口连接推荐用什么?
 
This depends largely on your policies. Since LizardFS does round robin between chunkservers if using goals, rr would probably gain the best results. If you use erasure coding, advanced balancing in LACP would be probably the most optimal way to do it.
这在很大程度上取决于你的策略。由于在使用 goal(副本)模式时 LizardFS 会在 chunkserver 之间轮询,rr(round robin)模式可能会取得最佳效果。如果使用纠删码,LACP 中的高级负载均衡可能是最理想的方式。
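
下面用 iproute2 给出一个创建 bond 接口的极简草稿,接口名 eth0/eth1 为假设;副本(goal)场景可用 balance-rr,纠删码场景可考虑 802.3ad(LACP,需交换机配合):
ip link add bond0 type bond mode balance-rr
# 如走 LACP: ip link add bond0 type bond mode 802.3ad
ip link set eth0 down && ip link set eth0 master bond0
ip link set eth1 down && ip link set eth1 master bond0
ip link set bond0 up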

-------------------------------
High Availability
高可用

What do you mean by High Availability?
高可用性是什么意思?

With the help of multiple chunk servers and good goals, files can be stored multiple times. Therefore, a certain level of high availability on a file level can be achieved easily.
在多台 chunkserver 和合理的 goal 设置的帮助下,文件可以被存储多份。因此在文件级别上可以很容易地实现一定程度的高可用性。

In addition, it is important to know that per default, the master service only can be active in a master role on one node at the time. If this node fails, e.g. because of broken hardware or out-of-memory situations, the current master has to be demoted (if still possible) and an existing shadow has to be promoted manually.
此外需要知道的是,默认情况下,同一时刻只能有一个节点以 master 角色运行 master 服务。如果该节点出现故障(例如硬件损坏或内存不足),则必须降级当前 master(如果仍有可能),并手动将现有的 shadow 节点提升为 master。

If the fail over happens automatically, a good state of high availability is achieved on a service level. Thus the term "High Availability" refers to keeping the master role alive when everything goes down under.
如果故障转移是自动发生的,那么就可以在服务级别上实现高可用性的良好状态。因此,术语“高可用性”是指当一切都陷入困境时,保持主角色的活动状态。

How can I achieve High Availability of the master?
如何实现主机的高可用性?

There are multiple ways of keeping the master highly available.
保持主节点服务高度可用的方法有多种。

One would be to demote and promote manually if you need to. The better way would be to delegate that task to a mechanism which knows the current state of all (possible) master nodes and can perform the fail over procedure automatically.
如果需要的话,可以手动降级和提升。更好的方法是将该任务指派给一个机制,该机制知道所有(可能的)主节点的当前状态,并且可以自动执行故障转移过程。

Known methods, when only using open-source software, are building Pacemaker/ Corosync clusters with self-written OCF agents. Another way could be using keepalived.
当只使用开源软件时,已知的方法是使用自写的OCF代理构建pacemaker/corosync集群。另一种方法是使用keepalived。
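
作为示意,下面是一段非常简化的 keepalived 配置草稿,仅用来说明“浮动 IP 跟随 master 角色”的思路;接口名、IP 与检查脚本都是假设的,master 的健康检查与升降级逻辑需要自行实现:
# /etc/keepalived/keepalived.conf(示意)
vrrp_script chk_mfsmaster {
    script "/usr/local/bin/check_mfsmaster.sh"
    interval 2
}
vrrp_instance MFS_MASTER {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    virtual_ipaddress {
        192.168.10.100/24
    }
    track_script {
        chk_mfsmaster
    }
}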

This is too complicated! I need a better solution for High Availability.
这太复杂了!我需要一个更好的高可用性解决方案。

An officially supported way to achieve high availability of the master is to obtain the uRaft component from Skytechnology Sp. z o.o., the company behind LizardFS. Based on the raft algorithm, the uRaft service makes sure that all master nodes talk to each other and exchange information regarding their health states.
实现 master 高可用的一个官方支持的方法,是从 LizardFS 背后的公司 Skytechnology Sp. z o.o. 获得 uRaft 组件。基于 Raft 算法,uRaft 服务确保所有 master 节点相互通信,并交换有关其健康状态的信息。

In order to ensure that a master exists, the nodes participate in votes. If the current master fails, uRaft moves a floating IP from the formerly active node to the new designated master. All uRaft nodes have to be part of one network and must be able to talk to each other.
为了确保主节点存在,节点参与投票。如果当前主机发生故障,URAFT会将浮动IP从以前的活动节点移动到新的指定主机。所有URAFT节点必须是一个网络的一部分,并且必须能够相互通信。

The uRaft component can be obtained by signing a support contract with SkyTechnology Sp. z o.o..
uRaft 组件可通过与 Skytechnology Sp. z o.o. 签订支持合同获得。

--------------------------------------------------------------
MooseFS Frequently Asked Questions
MooseFS常见问题

1. What average write/read speeds can we expect?
1.我们期望的平均读/写速度是多少?

Aside from common (for most filesystems) factors like: block size and type of access (sequential or random), in MooseFS the speeds depend also on hardware performance. Main factors are hard drives performance and network capacity and topology (network latency). The better the performance of the hard drives used and the better throughput of the network, the higher performance of the whole system.
除了常见的(对于大多数文件系统)因素,如块大小和访问类型(顺序或随机),在moosefs中,速度还取决于硬件性能。主要因素是硬盘性能、网络容量和拓扑结构(网络延迟)。所用硬盘的性能越好,网络的吞吐量越好,整个系统的性能就越高。

2. Does the goal setting influence writing/reading speeds?
2.目标(goal)设置是否影响写入/读取速度?
Generally speaking, it does not. In case of reading a file, goal higher than one may in some cases help speed up the reading operation, i. e. when two clients access a file with goal two or higher, they may perform the read operation on different copies, thus having all the available throughtput for themselves. But in average the goal setting does not alter the speed of the reading operation in any way.
一般来说不会。在读取文件的情况下,goal 大于 1 在某些情况下可能有助于加快读取操作,例如当两个客户端访问一个 goal 为 2 或更高的文件时,它们可以分别读取不同的副本,从而各自独享全部可用吞吐量。但平均而言,goal 设置不会以任何方式改变读取操作的速度。

Similarly, the writing speed is negligibly influenced by the goal setting. Writing with goal higher than two is done chain-like: the client sends the data to one chunk server and the chunk server simultaneously reads, writes and sends the data to another chunk server (which may in turn send them to the next one, to fulfill the goal). This way the client's throughput is not overburdened by sending more than one copy and all copies are written almost simultaneously. Our tests show that writing operation can use all available bandwidth on client's side in 1Gbps network.
同样,写入速度受 goal 设置的影响也微乎其微。goal 大于 2 的写入是链式进行的:客户端把数据发送给一个 chunkserver,该 chunkserver 同时读取、写入并把数据转发给另一个 chunkserver(后者又可能继续转发给下一个,以满足 goal)。这样,客户端的带宽不会因为发送多份副本而超负荷,而且所有副本几乎是同时写入的。我们的测试表明,在 1Gbps 网络中,写操作可以用满客户端的全部可用带宽。

3. Are concurrent read and write operations supported?
3.是否支持并发读写操作?

All read operations are parallel - there is no problem with concurrent reading of the same data by several clients at the same moment. Write operations are parallel, except operations on the same chunk (fragment of file), which are synchronized by Master server and therefore need to be sequential.
所有读操作都是并行的,多个客户端在同一时刻读取同一数据没有问题。写操作也是并行的,但对同一个块(文件片段)的操作除外,这些操作由 Master 服务器同步,因此需要顺序执行。

4. How much CPU/RAM resources are used?
4.使用了多少CPU/RAM资源?

In our environment (ca. 1 PiB total space, 36 million files, 6 million folders distributed on 38 million chunks on 100 machines) the usage of chunkserver CPU (by constant file transfer) is about 15-30% and chunkserver RAM usually consumes in between 100MiB and 1GiB (dependent on amount of chunks on each chunk server). The master server consumes about 50% of modern 3.3 GHz CPU (ca. 5000 file system operations per second, of which ca. 1500 are modifications) and 12GiB RAM. CPU load depends on amount of operations and RAM on the total number of files and folders, not the total size of the files themselves. The RAM usage is proportional to the number of entries in the file system because the master server process keeps the entire metadata in memory for performance. HDD usage on our master server is ca. 22 GB.
在我们的环境中(约 1 PiB 总空间,3600 万个文件,600 万个文件夹,分布在 100 台机器上的 3800 万个块中),chunkserver 的 CPU 使用率(在持续文件传输下)约为 15-30%,chunkserver 的内存占用通常在 100MiB 到 1GiB 之间(取决于每台 chunkserver 上的块数量)。master 服务器大约占用一颗现代 3.3GHz CPU 的 50%(每秒约 5000 个文件系统操作,其中约 1500 个是修改操作)以及 12GiB 内存。CPU 负载取决于操作量,而内存占用取决于文件和文件夹的总数,而不是文件本身的总大小。内存使用量与文件系统中的条目数成正比,因为 master 服务器进程为了性能将全部元数据保存在内存中。我们 master 服务器上的硬盘占用约为 22 GB。

5. Is it possible to add/remove chunkservers and disks on the fly?
5.是否可以动态添加/删除chunkserver和磁盘?

You can add/remove chunk servers on the fly. But keep in mind that it is not wise to disconnect a chunk server if this server contains the only copy of a chunk in the file system (the CGI monitor will mark these in orange). You can also disconnect (change) an individual hard drive. The scenario for this operation would be:
    Mark the disk(s) for removal (see How to mark a disk for removal?)
    Reload the chunkserver process
    Wait for the replication (there should be no "undergoal" or "missing" chunks marked in yellow, orange or red in CGI monitor)
    Stop the chunkserver process
    Delete entry(ies) of the disconnected disk(s) in mfshdd.cfg
    Stop the chunkserver machine
    Remove hard drive(s)
    Start the machine
    Start the chunkserver process
您可以动态添加/删除chunk服务器。但请记住,如果此服务器包含文件系统中块的唯一副本(CGI监视器将用橙色标记这些副本),则断开块服务器是不明智的。还可以断开(更改)单个硬盘,此操作的方案是:
标记要删除的磁盘(请参阅如何标记要删除的磁盘?)
重新加载chunkserver进程
等待复制(CGI监视器中不应存在以黄色、橙色或红色标记的"undergoal"或"missing"块)
停止chunkserver进程
删除mfshdd.cfg中已断开连接的磁盘的条目
停止ChunkServer计算机
移除硬盘
启动机器
启动chunkserver进程
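
把上述步骤串成一个示意性的命令序列,以磁盘 /mnt/hdd 为例,假设使用 MooseFS Pro 的 Debian/Ubuntu 打包方式,配置文件路径以 /etc/mfs/mfshdd.cfg 为例(请以实际环境为准):
# 1. 在 mfshdd.cfg 中给待移除磁盘行加上 *(标记为待移除)
sed -i 's|^/mnt/hdd$|*/mnt/hdd|' /etc/mfs/mfshdd.cfg
# 2. 重新加载 chunkserver 进程
service moosefs-pro-chunkserver reload
# 3. 在 CGI 监视器中确认没有 undergoal/missing 块后,停止 chunkserver 进程
service moosefs-pro-chunkserver stop
# 4. 从 mfshdd.cfg 中删除该磁盘条目
sed -i '\|^\*/mnt/hdd$|d' /etc/mfs/mfshdd.cfg
# 5. 关机、移除硬盘、开机后再启动 chunkserver 进程
service moosefs-pro-chunkserver start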

If you have hotswap disk(s) you should follow these:
    Mark the disk(s) for removal (see How to mark a disk for removal?)
    Reload the chunkserver process
    Wait for the replication (there should be no "undergoal" or "missing" chunks marked in yellow, orange or red in CGI monitor)
    Delete entry(ies) of the disconnected disk(s) in mfshdd.cfg
    Reload the chunkserver process
    Unmount disk(s)
    Remove hard drive(s)
如果您有热插拔磁盘,请按照以下步骤操作:
标记要删除的磁盘(请参阅如何标记要删除的磁盘?)
重新加载chunkserver进程
等待复制(CGI监视器中不应存在以黄色、橙色或红色标记的"undergoal"或"missing"块)
删除mfshdd.cfg中已断开连接的磁盘的条目
重新加载chunkserver进程
卸载磁盘
移除硬盘

If you follow the above steps, work of client computers won't be interrupted and the whole operation won't be noticed by MooseFS users.
如果按照上述步骤操作,客户机的工作不会中断,整个操作也不会被moosefs用户注意到。

6. How to mark a disk for removal?
6.如何标记要删除的磁盘?

When you want to mark a disk for removal from a chunkserver, you need to edit the chunkserver's mfshdd.cfg configuration file and put an asterisk '*' at the start of the line with the disk that is to be removed. For example, in this mfshdd.cfg we have marked "/mnt/hdd" for removal:
/mnt/hdb
/mnt/hdc
*/mnt/hdd
/mnt/hde

如果要将某个磁盘标记为待从 chunkserver 中移除,需要编辑该 chunkserver 的 mfshdd.cfg 配置文件,并在要移除的磁盘所在行的开头加上星号“*”。例如,在下面的 mfshdd.cfg 中,我们将“/mnt/hdd”标记为待移除:
/mnt/hdb
/mnt/hdc
*/mnt/hdd
/mnt/hde

After changing the mfshdd.cfg you need to reload chunkserver (on Linux Debian/Ubuntu: service moosefs-pro-chunkserver reload).
更改 mfshdd.cfg 后,需要重新加载 chunkserver(在 Linux Debian/Ubuntu 上可执行:service moosefs-pro-chunkserver reload)。

Once the disk has been marked for removal and the chunkserver process has been restarted, the system will make an appropriate number of copies of the chunks stored on this disk, to maintain the required "goal" number of copies.
一旦磁盘被标记为要删除并且chunkserver进程重新启动,系统将为此磁盘上存储的块制作适当数量的副本,以保持所需的“目标”副本数量。

Finally, before the disk can be disconnected, you need to confirm there are no "undergoal" chunks on the other disks. This can be done using the CGI Monitor. In the "Info" tab select "Regular chunks state matrix" mode.
最后,在断开磁盘连接之前,需要确认其它磁盘上没有“undergoal”块。这可以使用 CGI 监视器来查看:在“Info”选项卡中选择“Regular chunks state matrix”模式。

7. My experience with clustered filesystems is that metadata operations are quite slow. How did you resolve this problem?
7.我对集群文件系统的经验是元数据操作非常缓慢。你是如何解决这个问题的?

During our research and development we also observed the problem of slow metadata operations. We decided to alleviate some of the speed issues by keeping the file system structure in RAM on the metadata server. This is why metadata server has increased memory requirements. The metadata is frequently flushed out to files on the master server.
在我们的研究和开发过程中,我们还观察到元数据操作缓慢的问题。我们决定通过将文件系统结构保存在元数据服务器的RAM中来缓解一些速度问题。这就是元数据服务器增加内存需求的原因。元数据经常被刷新到主服务器上的文件中。

Additionally, in CE version the metadata logger server(s) also frequently receive updates to the metadata structure and write these to their file systems.
此外,在开源社区版本中,元数据记录器服务器还经常接收元数据结构的更新,并将其写入文件系统。

In Pro version metaloggers are optional, because master followers are keeping synchronised with leader master. They're also saving metadata to the hard disk.
在 Pro 版本中,metalogger 是可选的,因为 follower master 会与 leader master 保持同步,它们也会把元数据保存到硬盘上。

8. What does value of directory size mean on MooseFS? It is different than standard Linux ls -l output. Why?
8.在moosefs上,目录大小的值意味着什么?它不同于标准的Linux ls -l输出。为什么?

Folder size has no special meaning in any filesystem, so our development team decided to provide extra information there. The number represents the total length of all files inside (like in mfsdirinfo -h -l) displayed in exponential notation.
文件夹大小在任何文件系统中都没有特殊的意义,所以我们的开发团队决定在那里提供额外的信息。数字表示以指数表示法显示的内部所有文件(如mfsdirinfo -h -l)的总长度。

You can "translate" the directory size by the following way:
There are 7 digits: xAAAABB. To translate this notation to number of bytes, use the following expression:
您可以通过以下方式“转换”目录大小:
有7个数字:xAAAABB。要将此符号转换为字节数,请使用以下表达式:

AAAA.BB xBytes

Where x:
0 =
1 = kibi
2 = Mebi
3 = Gibi
4 = Tebi

Example:
To translate the following entry:

drwxr-xr-x 164 root root 2010616 May 24 11:47 test
xAAAABB

Folder size 2010616 should be read as 106.16 MiB.

When x = 0, the number might be smaller:

Example:
Folder size 10200 means 102 Bytes.
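
按照上面的规则,可以用一小段 bash 粗略地解码目录大小(仅作示意,未处理所有边界情况):
# 用法:decode_mfs_dirsize 2010616  ->  106.16 MiB
decode_mfs_dirsize() {
    s=$(printf "%07d" "$1")            # 补齐为 7 位:xAAAABB
    x=${s:0:1}; aaaa=${s:1:4}; bb=${s:5:2}
    units=(B KiB MiB GiB TiB)
    echo "$((10#$aaaa)).$bb ${units[$x]}"
}
decode_mfs_dirsize 2010616             # 输出:106.16 MiB
decode_mfs_dirsize 10200               # 输出:102.00 B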

9. When I perform df -h on a filesystem the results are different from what I would expect taking into account actual sizes of written files.
9.当我在一个文件系统上执行df -h时,结果与考虑到实际写入文件大小的期望值不同。

Every chunkserver sends its own disk usage increased by 256MB for each used partition /hdd, and the master sends a sum of these values to the client as total disk usage. If you have 3 chunkservers with 7 hdd each, your disk usage will be increased by 3*7*256MB (about 5GB).
每个chunkserver发送自己的磁盘使用量,每个使用的分区/hdd增加256MB,master将这些值的总和作为磁盘使用总量发送给客户机。如果您有3个chunkserver,每个7个HDD,您的磁盘使用量将增加3*7*256MB(大约5GB)。

The other reason for differences is, when you use disks exclusively for MooseFS on chunkservers df will show correct disk usage, but if you have other data on your MooseFS disks df will count your own files too.
差异的另一个原因是:当 chunkserver 上的磁盘专门用于 MooseFS 时,df 会显示正确的磁盘使用情况;但如果 MooseFS 磁盘上还有其它数据,df 也会把这些文件统计进去。

If you want to see the actual space usage of your MooseFS files, use mfsdirinfo command.
如果要查看moosefs文件的实际空间使用情况,请使用mfsdirinfo命令。

10. Can I keep source code on MooseFS? Why do small files occupy more space than I would have expected?
10.我可以在moosefs上保留源代码吗?为什么小文件占用的空间比我想象的要大?

The system was initially designed for keeping large amounts (like several thousands) of very big files (tens of gigabytes) and has a hard-coded chunk size of 64MiB and block size of 64KiB. Using a consistent block size helps improve the networking performance and efficiency, as all nodes in the system are able to work with a single 'bucket' size. That's why even a small file will occupy 64KiB plus additionally 4KiB of checksums and 1KiB for the header.
该系统最初设计用于保存大量(如数千个)非常大的文件(数十 GB),其 chunk 大小硬编码为 64MiB,block 大小为 64KiB。使用一致的 block 大小有助于提高网络性能和效率,因为系统中的所有节点都可以使用同一个“桶”大小。这就是为什么即使是一个很小的文件也会占用 64KiB,外加 4KiB 的校验和与 1KiB 的头部。

The issue regarding the occupied space of a small file stored inside a MooseFS chunk is really more significant, but in our opinion it is still negligible. Let's take 25 million files with a goal set to 2. Counting the storage overhead, this could create about 50 million 69 KiB chunks, that may not be completely utilized due to internal fragmentation (wherever the file size was less than the chunk size). So the overall wasted space for the 50 million chunks would be approximately 3.2TiB. By modern standards, this should not be a significant concern. A more typical, medium to large project with 100,000 small files would consume at most 13GiB of extra space due to block size of used file system.
关于存储在moosefs块中的小文件的占用空间的问题确实更为重要,但在我们看来,这仍然可以忽略不计。让我们拿2500万个文件,目标设为2。计算存储开销,这可能会创建约5000万个69kib块,由于内部碎片(文件大小小于块大小的地方),可能无法完全利用这些块。因此,5000万块的总浪费空间大约为3.2tib。按照现代标准,这不应该是一个重大问题。由于所用文件系统的块大小,具有100000个小文件的更典型的中型到大型项目最多会消耗13Gib的额外空间。
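
上面的数字可以用 bc 粗略验算一下(仅为量级估算,69KiB 为 64KiB 数据块加 4KiB 校验和加 1KiB 头部):
echo "50000000 * 69 / 1024 / 1024 / 1024" | bc -l    # 约等于 3.2,即约 3.2 TiB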

So it is quite reasonable to store source code files on a MooseFS system, either for active use during development or for long term reliable storage or archival purposes.
因此,在moosefs系统上存储源代码文件是非常合理的,无论是用于开发期间还是用于长期可靠的存储或存档目的。

Perhaps the larger factor to consider is the comfort of developing the code taking into account the performance of a network file system. When using MooseFS (or any other network based file system such as NFS, CIFS) for a project under active development, the network filesystem may not be able to perform file IO operations at the same speed as a directly attached regular hard drive would.
也许更需要考虑的因素是,在网络文件系统的性能条件下进行代码开发的舒适度。在活跃开发的项目中使用 MooseFS(或任何其它基于网络的文件系统,如 NFS、CIFS)时,网络文件系统可能无法以与直接连接的普通硬盘相同的速度执行文件 IO 操作。

Some modern integrated development environments (IDE), such as Eclipse, make frequent IO requests on several small workspace metadata files. Running Eclipse with the workspace folder on a MooseFS file system (and again, with any other networked file system) will yield slightly slower user interface performance, than running Eclipse with the workspace on a local hard drive.
一些现代集成开发环境(IDE),如Eclipse,对几个小的工作空间元数据文件频繁地发出IO请求。在moosefs文件系统上运行带有workspace文件夹的Eclipse(同样,在任何其他联网的文件系统上)会比在本地硬盘驱动器上运行带有workspace的Eclipse产生稍慢的用户界面体验。

You may need to evaluate for yourself if using MooseFS for your working copy of active development within an IDE is right for you.
如果在一个IDE中使用moosefs作为活动开发的工作副本是适合的,那么可能需要自己进行评估。

In a different example, using a typical text editor for source code editing and a version control system, such as Subversion, to check out project files into a MooseFS file system, does not typically result in any performance degradation. The IO overhead of the network file system nature of MooseFS is offset by the larger IO latency of interacting with the remote Subversion repository. And the individual file operations (open, save) do not have any observable latencies when using simple text editors (outside of complicated IDE products).
在另一个例子中,使用普通文本编辑器编辑源代码,并使用 Subversion 之类的版本控制系统把项目文件检出到 MooseFS 文件系统中,通常不会导致任何性能下降。MooseFS 作为网络文件系统的 IO 开销,会被与远程 Subversion 仓库交互的更大 IO 延迟所抵消。而使用简单的文本编辑器(而非复杂的 IDE 产品)时,单个文件操作(打开、保存)不会有任何可察觉的延迟。

A more likely situation would be to have the Subversion repository files hosted within a MooseFS file system, where the svnserver or Apache + mod_svn would service requests to the Subversion repository and users would check out working sandboxes onto their local hard drives.
更常见的做法是把 Subversion 仓库文件托管在 MooseFS 文件系统中,由 svnserver 或 Apache + mod_svn 对 Subversion 仓库提供请求服务,用户再把工作副本检出到各自的本地硬盘上。

11. Do Chunkservers and Metadata Server do their own checksumming?
11.Chunkserver 和元数据服务器是否自行进行校验和(checksum)计算?

Chunk servers do their own checksumming. Overhead is about 4B per a 64KiB block which is 4KiB per a 64MiB chunk.
Chunkserver 会自行计算校验和。每个 64KiB 的 block 开销约为 4 字节,相当于每个 64MiB 的 chunk 开销为 4KiB。

Metadata servers don't. We thought it would be CPU consuming. We recommend using ECC RAM modules.
元数据服务器没有。我们认为这会消耗CPU,建议使用ECC RAM模块。

12. What resources are required for the Master Server?
12.主服务器需要哪些资源?

The most important factor is RAM of MooseFS Master machine, as the full file system structure is cached in RAM for speed. Besides RAM, MooseFS Master machine needs some space on HDD for main metadata file together with incremental logs.
最重要的因素是moosefs Master 主机的内存,因为完整的文件系统结构被高速缓存在内存中。除了内存之外,moosefs主机还需要一些硬盘空间来存放主元数据文件和增量日志。

The size of the metadata file is dependent on the number of files (not on their sizes). The size of incremental logs depends on the number of operations per hour, but length (in hours) of this incremental log is configurable.
元数据文件的大小取决于文件的数量(而不是其大小)。增量日志的大小取决于每小时的操作数,但此增量日志的长度(以小时为单位)是可配置的。

13. When I delete files or directories, the MooseFS size doesn't change. Why?
13.当我删除文件或目录时,moosefs的大小不会改变。为什么?

MooseFS does not immediately erase files on deletion, to allow you to revert the delete operation. Deleted files are kept in the trash bin for the configured amount of time before they are deleted.
MooseFS 不会在删除时立即清除文件,以便能够撤销删除操作。已删除的文件会在回收站中保留所配置的时间,之后才被真正删除。

You can configure for how long files are kept in trash and empty the trash manually (to release the space). There are more details in Reference Guide in section "Operations specific for MooseFS".
您可以配置文件在回收站中保留的时间,并手动清空回收站(释放空间)。在参考指南的“MooseFS专用操作”一节中有更多详细信息。

In short - the time of storing a deleted file can be verified by the mfsgettrashtime command and changed with mfssettrashtime.
简而言之,已删除文件在回收站中的保留时间可以用 mfsgettrashtime 命令查看,并用 mfssettrashtime 更改。

14. When I added a third server as an extra chunkserver, it looked like the system started replicating data to the 3rd server even though the file goal was still set to 2.
14.当我添加第三个服务器作为额外的chunkserver时,看起来系统开始将数据复制到第三个服务器,即使文件目标仍然设置为2。

Yes. Disk usage balancer uses chunks independently, so one file could be redistributed across all of your chunkservers.
是的。磁盘使用均衡器独立处理各个块,因此一个文件可能被重新分布到你所有的 chunkserver 上。

15. Is MooseFS 64bit compatible?
15.MooseFS 64位兼容吗?

Yes!
是的!

16. Can I modify the chunk size?
16.我可以修改块大小吗?

No. File data is divided into fragments (chunks) with a maximum of 64MiB each. The value of 64 MiB is hard coded into system so you cannot modify its size. We based the chunk size on real-world data and determined it was a very good compromise between number of chunks and speed of rebalancing / updating the filesystem. Of course if a file is smaller than 64 MiB it occupies less space.
不可以。文件数据被分为多个片段(块),每个片段的最大值为64Mib。64 Mib的值是硬编码到系统中的,因此不能修改其大小。我们根据实际数据确定块的大小,可以确定这是块的数量和重新平衡/更新文件系统的速度之间的一个很好的折衷值。当然,如果一个文件小于64Mib,它占用的空间就更少了。

In the systems we take care of, several file sizes significantly exceed 100GB with no noticeable chunk size penalty.
在我们维护的系统中,有不少文件的大小显著超过 100GB,却没有因块大小带来明显的性能损失。

17. How do I know if a file has been successfully written to MooseFS?
17.如何知道文件是否已成功写入MooseFS?

Let's briefly discuss the process of writing to the file system and what programming consequences this bears.
让我们简单地讨论一下写入文件系统的过程,以及这会带来什么编程后果。

In all contemporary filesystems, files are written through a buffer (write cache). As a result, execution of the write command itself only transfers the data to a buffer (cache), with no actual writing taking place. Hence, a confirmed execution of the write command does not mean that the data has been correctly written on a disk. It is only with the invocation and completion of the fsync (or close) command that causes all data kept within the buffers (cache) to get physically written out. If an error occurs while such buffer-kept data is being written, it could cause the fsync (or close) command to return an error response.
在所有现代的文件系统中,文件都是通过缓冲区(写缓存)写入的。因此,执行写命令本身只将数据传输到缓冲区(缓存),而不进行实际的写入。因此,确认执行写入命令并不意味着数据已正确写入磁盘。只有在调用并完成fsync(或close)命令后,才会导致保存在缓冲区(缓存)中的所有数据物理写入。如果在写入缓冲区保留的数据时发生错误,则可能导致fsync(或close)命令返回错误的响应。

The problem is that a vast majority of programmers do not test the close command status (which is generally a very common mistake). Consequently, a program writing data to a disk may "assume" that the data has been written correctly from a success response from the write command, while in actuality, it could have failed during the subsequent close command.
问题是绝大多数程序员不测试关闭命令状态(这通常是一个非常常见的错误)。因此,将数据写入磁盘的程序可能“假定”数据是从写入命令的成功响应中正确写入的,而实际上,它可能在随后的关闭命令中失败。

In network filesystems (like MooseFS), due to their nature, the amount of data "left over" in the buffers (cache) on average will be higher than in regular file systems. Therefore the amount of data processed during execution of the close or fsync command is often significant and if an error occurs while the data is being written [from the close or fsync command], this will be returned as an error during the execution of this command. Hence, before executing close, it is recommended (especially when using MooseFS) to perform an fsync operation after writing to a file and then checking the status of the result of the fsync operation. Then, for good measure, also check the return status of close as well.
在网络文件系统(如 MooseFS)中,由于其特性,缓冲区(缓存)中“剩余”的数据量平均会高于普通文件系统。因此在执行 close 或 fsync 命令期间处理的数据量往往很大,如果在写入这些数据时发生错误,错误会在该命令执行期间返回。因此,建议(特别是在使用 MooseFS 时)在写入文件后、执行 close 之前先执行 fsync 并检查其返回状态;然后,为了保险起见,也检查 close 的返回状态。

NOTE! When stdio is used, the fflush function only executes the "write" command, so correct execution of fflush is not sufficient to be sure that all data has been written successfully - you should also check the status of fclose.
注意!当使用stdio时,fflush函数只执行"write"命令,因此,正确执行fflush不足以确保所有数据都已成功写入-还应检查fclose的状态。

The above problem may occur when redirecting a standard output of a program to a file in shell. Bash (and many other programs) do not check the status of the close execution. So the syntax of "application > outcome.txt" type may wrap up successfully in shell, while in fact there has been an error in writing out the "outcome.txt" file. You are strongly advised to avoid using the above shell output redirection syntax when writing to a MooseFS mount point. If necessary, you can create a simple program that reads the standard input and writes everything to a chosen file, where this simple program would correctly employ the appropriate check of the result status from the fsync command. For example, "application | mysaver outcome.txt", where mysaver is the name of your writing program instead of application > outcome.txt.
当在 shell 中把程序的标准输出重定向到文件时,就可能出现上述问题。Bash(和许多其它程序)不会检查 close 执行的状态。因此“application > outcome.txt”这种写法可能在 shell 中看似成功结束,而实际上写出“outcome.txt”文件时出了错。强烈建议在写入 MooseFS 挂载点时避免使用上述 shell 输出重定向写法。如有必要,可以编写一个简单的程序,读取标准输入并把所有内容写入指定文件,并在其中正确检查 fsync 命令的返回状态。例如用“application | mysaver outcome.txt”(其中 mysaver 是你编写的写入程序)来代替“application > outcome.txt”。
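
如果不想自己编写 mysaver 这类小程序,一种折中的 shell 写法(仅为示意,application 与输出路径均为占位)是借助 GNU dd 的 conv=fsync 并检查退出状态,dd 会在写入并 fsync 之后才退出,失败时返回非零:
application | dd of=/mnt/mfs/outcome.txt conv=fsync status=none || echo "write to MooseFS failed" >&2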

Please note that the problem discussed above is in no way exceptional and does not stem directly from the characteristics of MooseFS itself. It may affect any system of files - network type systems are simply more prone to such difficulties. Technically speaking, the above recommendations should be followed at all times (also in cases where classic file systems are used).
请注意,上述讨论的问题绝不是特例,也不是直接源于MooseFS本身的特点。它可能会影响到任何文件系统--网络类型的系统更容易出现这种困难。从技术上讲,应始终遵循上述建议(也适用于使用传统文件系统的情况)。

18. What are limits in MooseFS (e.g. file size limit, filesystem size limit, max number of files, that can be stored on the filesystem)?
18.MooseFS中的限制是什么(例如,文件大小限制、文件系统大小限制、可以存储在文件系统上的最大文件数)?
    The maximum file size limit in MooseFS is 2^57 bytes = 128 PiB.
    MooseFS 中单个文件的最大大小限制为 2^57 字节 = 128 PiB。
    The maximum filesystem size limit is 2^64 bytes = 16 EiB = 16 384 PiB.
    最大文件系统大小限制为 2^64 字节 = 16 EiB = 16384 PiB。
    The maximum number of files, that can be stored on one MooseFS instance is 2^31 - over 2.1 bln.
    一个 MooseFS 实例上最多可以存储 2^31 个文件,即超过 21 亿个。
    
19. Can I set up HTTP basic authentication for the mfscgiserv?
19.我可以为mfscgiserv设置HTTP基本身份验证吗?

mfscgiserv is a very simple HTTP server written just to run the MooseFS CGI scripts. It does not support any additional features like HTTP authentication. However, the MooseFS CGI scripts may be served from another full-featured HTTP server with CGI support, such as lighttpd or Apache. When using a full-featured HTTP server such as Apache, you may also take advantage of features offered by other modules, such as HTTPS transport. Just place the CGI and its data files (index.html, mfs.cgi, chart.cgi, mfs.css, acidtab.js, logomini.png, err.gif) under chosen DocumentRoot. If you already have an HTTP server instance on a given host, you may optionally create a virtual host to allow access to the MooseFS CGI monitor through a different hostname or port.
mfscgiserv 是一个非常简单的 HTTP 服务器,只是为运行 MooseFS CGI 脚本而编写,它不支持 HTTP 身份验证之类的附加功能。不过,MooseFS CGI 脚本可以由另一个支持 CGI 的全功能 HTTP 服务器(如 lighttpd 或 Apache)来提供。使用 Apache 这类全功能 HTTP 服务器时,还可以利用其它模块提供的功能,例如 HTTPS 传输。只需将 CGI 及其数据文件(index.html、mfs.cgi、chart.cgi、mfs.css、acidtab.js、logomini.png、err.gif)放到所选的 DocumentRoot 下即可。如果给定主机上已有 HTTP 服务器实例,还可以选择创建虚拟主机,以便通过其它主机名或端口访问 MooseFS CGI 监控页面。
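
作为示意,下面是一段极简的 Apache 虚拟主机配置草稿,假设 CGI 及数据文件已复制到 /var/www/mfscgi,且已启用 mod_cgi 并用 htpasswd 创建了口令文件;路径与主机名均为假设:
<VirtualHost *:80>
    ServerName mfs-monitor.example.com
    DocumentRoot /var/www/mfscgi
    <Directory /var/www/mfscgi>
        Options +ExecCGI
        AddHandler cgi-script .cgi
        AuthType Basic
        AuthName "MooseFS CGI monitor"
        AuthUserFile /etc/apache2/mfscgi.htpasswd
        Require valid-user
    </Directory>
</VirtualHost>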

20. Can I run a mail server application on MooseFS? Mail server is a very busy application with a large number of small files - will I not lose any files?
20.我可以在MooseFS上运行邮件服务器应用程序吗?邮件服务器是一个非常繁忙的应用程序,有大量的小文件-我不会丢失任何文件吗?

You can run a mail server on MooseFS. You won't lose any files under a large system load. When the file system is busy, it will block until its operations are complete, which will just cause the mail server to slow down.
您可以在moosefs上运行邮件服务器。在大负载系统下不会丢失任何文件。当文件系统繁忙时,它将阻塞,直到其操作完成,这只会导致邮件服务器处理速度减慢。

21. Are there any suggestions for the network, MTU or bandwidth?
21.对网络、MTU或带宽有什么建议吗?

We recommend using jumbo-frames (MTU=9000). With a greater amount of chunkservers, switches should be connected through optical fiber or use aggregated links.
我们建议使用巨型帧(MTU=9000)。当 chunkserver 数量较多时,交换机之间应通过光纤连接或使用链路聚合。
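
设置 MTU 的基本命令示意(接口名 eth0 为假设,需要交换机及所有节点端到端都支持巨型帧):
ip link set dev eth0 mtu 9000
ip link show dev eth0 | grep mtu    # 确认已生效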

22. Does MooseFS support supplementary groups?
22.MooseFS是否支持额外的组?

Yes.
是的。

23. Does MooseFS support file locking?
23.MooseFS支持文件锁定吗?

Yes, since MooseFS 3.0.
是的,从3.0版本开始。

24. Is it possible to assign IP addresses to chunk servers via DHCP?
24.是否可以通过DHCP将IP地址分配给区块服务器?

Yes, but we highly recommend setting "DHCP reservations" based on MAC addresses.
是的,但是我们强烈建议基于MAC地址设置“DHCP保留”。

25. Some of my chunkservers utilize 90% of space while others only 10%. Why does the rebalancing process take so long?
25.我的一些ChunkServer占用了90%的空间,而其他的只有10%。为什么再平衡过程需要这么长时间?

Our experiences from working in a production environment have shown that aggressive replication is not desirable, as it can substantially slow down the whole system. The overall performance of the system is more important than equal utilization of hard drives over all of the chunk servers. By default replication is configured to be a non-aggressive operation. In our environment normally it takes about 1 week for a new chunkserver to get to a standard hdd utilization. Aggressive replication would make the whole system considerably slow for several days.
我们在生产环境中的经验表明,积极的复制是不可取的,因为它可以大大降低整个系统的速度。系统的整体性能比在所有块服务器上平均利用硬盘更重要。默认情况下,复制配置为非激进性操作。在我们的环境中,新的chunkserver要达到标准的HDD利用率通常需要1周左右。激进性的复制会使整个系统在几天内相当缓慢。

Replication speeds can be adjusted on master server startup by setting these two options:
通过设置以下两个选项,可以在主服务器启动时调整复制速度:

    CHUNKS_WRITE_REP_LIMIT
    Maximum number of chunks to replicate to one chunkserver (default is 2,1,1,4).
    要复制到一个chunkserver的最大块数(默认值为2,1,1,4)。
    One number is equal to four same numbers separated by colons.
    一个数字等于用冒号分隔的四个相同数字。
        First limit is for endangered chunks (chunks with only one copy)
        第一个限制是针对濒危块(只有一个副本的块)
        Second limit is for undergoal chunks (chunks with number of copies lower than specified goal)
        第二个限制针对副本数不足(undergoal)的块(副本数低于设定目标的块)
        Third limit is for rebalance between servers with space usage around arithmetic mean
        第三个限制针对空间使用率接近算术平均值的服务器之间的重新平衡
        Fourth limit is for rebalance between other servers (very low or very high space usage)
        第四个限制是在其他服务器之间重新平衡(空间使用率非常低或非常高)
    Usually first number should be greater than or equal to second, second greater than or equal to third, and fourth greater than or equal to third (1st >= 2nd >= 3rd <= 4th).
    通常,第一个数字应大于或等于第二个,第二个大于或等于第三个,第四个大于或等于第三个(1st >= 2nd >= 3rd <= 4th)。
    
    CHUNKS_READ_REP_LIMIT
    Maximum number of chunks to replicate from one chunkserver (default is 10,5,2,5).
    从一个chunkserver复制的最大块数(默认值为10,5,2,5)。
    One number is equal to four same numbers separated by colons. Limit groups are the same as in write limit, also relations between numbers should be the same as in write limits (1st >= 2nd >= 3rd <= 4th).
    一个数字等于用冒号分隔的四个相同数字。极限组与写入极限相同,数字之间的关系也应与写入极限值相同(1st >= 2nd >= 3rd <= 4th)。

Tuning these in your environment will require some experimentation.
在您的环境中调优这些需要做一些实验。
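
例如,如果希望新加入的 chunkserver 更快达到平均利用率,可以在 mfsmaster.cfg 中适当放宽与再平衡相关的第 3、4 个数值(以下为示例值,需满足 1st >= 2nd >= 3rd <= 4th,调整后请观察整体性能):
# mfsmaster.cfg 片段(示例值,默认分别为 2,1,1,4 与 10,5,2,5)
CHUNKS_WRITE_REP_LIMIT = 4,2,2,6
CHUNKS_READ_REP_LIMIT = 10,5,4,8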

26. I have a Metalogger running - should I make additional backup of the metadata file on the Master Server?
26.我有一个metalogger正在运行-我应该在主服务器上对元数据文件进行额外备份吗?

Yes, it is highly recommended to make additional backup of the metadata file. This provides a worst case recovery option if, for some reason, the metalogger data is not useable for restoring the master server (for example the metalogger server is also destroyed).
是的,强烈建议对元数据文件做额外备份。这样,如果由于某种原因 metalogger 的数据无法用于恢复 master 服务器(例如 metalogger 服务器也被损毁),还有一个最坏情况下的恢复手段。

The master server flushes metadata kept in RAM to the metadata.mfs.back binary file every hour on the hour (xx:00). So a good time to copy the metadata file is every hour on the half hour (30 minutes after the dump). This would limit the amount of data loss to about 1.5h of data. Backing up the file can be done using any conventional method of copying the metadata file - cp, scp, rsync, etc.
主服务器每小时(xx:00)将RAM中保存的元数据刷新到metadata.mfs.back二进制文件。因此,复制元数据文件的最佳时间是每半小时(转储后30分钟)。这将把数据丢失量限制在1.5h左右。备份文件可以使用复制元数据文件的任何常规方法(cp、scp、rsync等)来完成。
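
一个示意性的 cron 条目(写入 root 的 crontab),假设 DATA_PATH 为 /var/lib/mfs,备份目录 /backup/mfs-meta 只是举例:
# 每小时第 30 分钟复制一次 metadata.mfs.back(即每次整点转储后约 30 分钟)
30 * * * * cp /var/lib/mfs/metadata.mfs.back /backup/mfs-meta/metadata.mfs.back.$(date +\%H)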

After restoring the system based on this backed up metadata file the most recently created files will have been lost. Additionally files, that were appended to, would have their previous size, which they had at the time of the metadata backup. Files that were deleted would exist again. And files that were renamed or moved would be back to their previous names (and locations). But still you would have all of data for the files created in the X past years before the crash occurred.
基于这份备份的元数据文件恢复系统后,最近创建的文件将会丢失。另外,被追加写入过的文件会恢复到元数据备份时的大小;已删除的文件会重新出现;被重命名或移动的文件会恢复到之前的名称(和位置)。但崩溃发生之前许多年里创建的文件数据仍然全部保留。

In MooseFS Pro version, master followers flush metadata from RAM to the hard disk once an hour. The leader master downloads saved metadata from followers once a day.
在 MooseFS 专业版中,follower master 每小时将元数据从内存刷新到硬盘一次,leader master 每天从 follower 下载一次已保存的元数据。

27. I think one of my disks is slower / damaged. How should I find it?
27.我想我的一个磁盘慢了/损坏了。我该如何找到它?

In the CGI monitor go to the "Disks" tab and choose "switch to hour" in "I/O stats" column and sort the results by "write" in "max time" column. Now look for disks which have a significantly larger write time. You can also sort by the "fsync" column and look at the results. It is a good idea to find individual disks that are operating slower, as they may be a bottleneck to the system.
在CGI监控页面中,转到“磁盘”选项卡,在“I/O状态”列中选择“切换到小时”,并在“最大时间”列中按“写入”对结果进行排序。现在,寻找写时间大得多的磁盘。还可以按“fsync”列排序并查看结果。最好找到运行速度较慢的单个磁盘,因为它们可能是系统的瓶颈。

It might be helpful to create a test operation, that continuously copies some data to create enough load on the system for there to be observable statisics in the CGI monitor. On the "Disks" tab specify units of "minutes" instead of hours for the "I/O stats" column.
创建一个持续复制数据的测试操作,在系统上制造足够的负载,使 CGI 监视器中出现可观测的统计数据,这可能会有所帮助。在“磁盘”选项卡上,将“I/O stats”列的单位从“小时”改为“分钟”。

Once a "bad" disk has been discovered to replace it follow the usual operation of marking the disk for removal, and waiting until the color changes to indicate that all of the chunks stored on this disk have been replicated to achieve the sufficient goal settings.
一旦发现需要更换的“坏”磁盘,按照常规流程将其标记为待移除,然后等待颜色变化,表明该磁盘上存储的所有块都已复制到足够的目标份数。

28. How can I find the master server PID?
28.如何找到主服务器进程ID?

Issue the following command:
使用以下命令:
# mfsmaster test

29. Web interface shows there are some copies of chunks with goal 0. What does it mean?
29.Web界面显示目标为0的块有一些副本,这意味着什么?

This is a way to mark chunks belonging to the non-existing (i.e. deleted) files. Deleting a file is done asynchronously in MooseFS. First, a file is removed from metadata and its chunks are marked as unnecessary (goal=0). Later, the chunks are removed during an "idle" time. This is much more efficient than erasing everything at the exact moment the file was deleted.
这是用来标记属于不存在(即已删除)文件的块的一种方式。在 MooseFS 中,删除文件是异步完成的:首先从元数据中删除文件,并把其块标记为不再需要(goal=0);之后这些块会在系统“空闲”时被移除。这比在文件被删除的那一刻就擦除所有数据要高效得多。

Unnecessary chunks may also appear after a recovery of the master server, if they were created shortly before the failure and were not available in the restored metadata file.
如果在故障发生前不久创建了不必要的块,并且这些块在还原的元数据文件中不可用,则在主服务器恢复后也可能出现不必要的块。

30. Is every error message reported by mfsmount a serious problem?
30.mfsmount报告的每个错误消息都是严重问题吗?

No. mfsmount writes every failure encountered during communication with chunkservers to the syslog. Transient communication problems with the network might cause IO errors to be displayed, but this does not mean data loss or that mfsmount will return an error code to the application. Each operation is retried by the client (mfsmount) several times and only after the number of failures (reported as try counter) reaches a certain limit (typically 30), the error is returned to the application that data was not read/saved.
否。mfsmount 会把与 chunkserver 通信期间遇到的每个故障写入系统日志。网络的短暂通信问题可能会导致显示 IO 错误,但这并不意味着数据丢失,也不意味着 mfsmount 会向应用程序返回错误代码。客户端(mfsmount)会对每个操作重试多次,只有当失败次数(以 try counter 报告)达到一定上限(通常为 30)后,才会向应用程序返回数据未能读取/保存的错误。

Of course, it is important to monitor these messages. When messages appear more often from one chunkserver than from the others, it may mean there are issues with this chunkserver - maybe hard drive is broken, maybe network card has some problems - check its charts, hard disk operation times, etc. in the CGI monitor.
当然,监控这些消息很重要。当消息从一个chunkserver出现的频率比从其他chunkserver出现的频率更高时,这可能意味着chunkserver有问题-可能硬盘损坏,可能网卡有问题-检查相关的图表,尤其是在CGI监视器中的硬盘操作时间之类的。

Note: XXXXXXXX in examples below means IP address of chunkserver. In mfsmount version < 2.0.42 chunkserver IP is written in hexadecimal format. In mfsmount version >= 2.0.42 IP is "human-readable".
注:以下示例中的 XXXXXXXX 表示 chunkserver 的 IP 地址。在 mfsmount < 2.0.42 版本中,chunkserver IP 以十六进制格式写出;在 mfsmount >= 2.0.42 版本中,IP 以“人类可读”的格式显示。

What does

file: NNN, index: NNN, chunk: NNN, version: NNN - writeworker: connection with (XXXXXXXX:PPPP) was timed out (unfinished writes: Y; try counter: Z)
文件:nnn,索引:nnn,块:nnn,版本:nnn-WriteWorker:与(xxxxxxxx:pppp)的连接超时(未完成的写入:y;尝试计数:z)

message mean?
信息意味着什么?

This means that Zth try to write the chunk was not successful and writing of Y blocks, sent to the chunkserver, was not confirmed. After reconnecting these blocks would be sent again for saving. The limit of trials is set by default to 30.
这意味着第z次尝试写入块失败,而发送到chunkserver的y块的写入未得到确认。重新连接后,将再次发送这些块以进行保存。尝试限制次数默认设置为30。

This message is for informational purposes and doesn't mean data loss.
此消息仅供参考,并不意味着数据丢失。

What does

file: NNN, index: NNN, chunk: NNN, version: NNN, cs: XXXXXXXX:PPPP - readblock error (try counter: Z)

message mean?

This means that Zth try to read the chunk was not successful and system will try to read the block again. If value of Z equals 1 it is a transitory problem and you should not worry about it. The limit of trials is set by default to 30.
这意味着第Z次尝试读取块失败,系统将再次尝试读取块。如果z值等于1,这是一个暂时的问题,不应该担心它。尝试限制次数默认设置为30。

31. How do I verify that the MooseFS cluster is online? What happens with mfsmount when the master server goes down?
31.如何验证MooseFS集群是否联机?主服务器停机时mfsmount会发生什么?

When the master server goes down while mfsmount is already running, mfsmount doesn't disconnect the mounted resource, and files awaiting to be saved would stay quite long in the queue while trying to reconnect to the master server. After a specified number of tries they eventually return EIO - "input/output error". On the other hand it is not possible to start mfsmount when the master server is offline.
当 mfsmount 已在运行而 master 服务器宕机时,mfsmount 不会断开已挂载的资源;在尝试重新连接 master 期间,等待保存的文件会在队列中停留相当长的时间,经过指定次数的尝试后最终返回 EIO(“input/output error”)。另一方面,当 master 服务器脱机时,无法启动 mfsmount。

There are several ways to make sure that the master server is online, we present a few of these below.
有几种方法可以确保主服务器联机在线,下面我们将介绍其中的一些方法。

Check if you can connect to the TCP port of the master server (e.g. socket connection test).
检查是否可以连接到主服务器的TCP端口(如socket connection test)。

In order to assure that a MooseFS resource is mounted it is enough to check the inode number - MooseFS root will always have inode equal to 1. For example if we have MooseFS installation in /mnt/mfs then stat /mnt/mfs command (in Linux) will show:
要确认 MooseFS 资源已挂载,只需检查 inode 编号即可:MooseFS 根目录的 inode 始终等于 1。例如,如果 MooseFS 挂载在 /mnt/mfs,那么(在 Linux 中)执行 stat /mnt/mfs 命令会显示:

$ stat /mnt/mfs
File: `/mnt/mfs'
Size: xxxxxx Blocks: xxx IO Block: 4096 directory
Device: 13h/19d Inode: 1 Links: xx
(...)

Additionaly mfsmount creates a virtual hidden file .stats in the root mounted folder. For example, to get the statistics of mfsmount when MooseFS is mounted we can cat this .stats file, eg.:
另外,mfsmount在根装载文件夹中创建虚拟隐藏文件.stats。例如,要在moosefs装载时获取mfsmount的统计信息,我们可以对该.stats文件使用cat指令查看,例如:

$ cat /mnt/mfs/.stats
fuse_ops.statfs: 241
fuse_ops.access: 0
fuse_ops.lookup-cached: 707553
fuse_ops.lookup: 603335
fuse_ops.getattr-cached: 24927
fuse_ops.getattr: 687750
fuse_ops.setattr: 24018
fuse_ops.mknod: 0
fuse_ops.unlink: 23083
fuse_ops.mkdir: 4
fuse_ops.rmdir: 1
fuse_ops.symlink: 3
fuse_ops.readlink: 454
fuse_ops.rename: 269
(...)

If you want to be sure that master server properly responds you need to try to read the goal of any object, e.g. of the root folder:
如果要确保 master 服务器能正确响应,则需要尝试读取任意对象的 goal,例如根目录的:

$ mfsgetgoal /mnt/mfs
/mnt/mfs: 2

If you get a proper goal of the root folder, you can be sure that the master server is up and running.
如果能正确读取根目录的 goal,就可以确定 master 服务器已启动并正在运行。
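
可以把上面几种检查合并成一个简单的示意脚本(挂载点 /mnt/mfs 为假设,stat 为 Linux 下的 GNU stat):
#!/bin/sh
# MooseFS 简单健康检查示意
MNT=/mnt/mfs
# 1. 根目录的 inode 应为 1
[ "$(stat -c %i "$MNT" 2>/dev/null)" = "1" ] || { echo "MooseFS not mounted at $MNT"; exit 1; }
# 2. 能读到根目录的 goal,说明 master 正常响应
mfsgetgoal "$MNT" >/dev/null 2>&1 || { echo "master not responding"; exit 1; }
echo "MooseFS OK"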

32. I found that MooseFS trash takes too much of my disk space. How to clean it up manually?
32.我发现MooseFS垃圾占用了我太多的磁盘空间。如何手动清理?

In order to purge MooseFS' trash, you need to mount special directory called "MooseFS Meta".
要清空 MooseFS 的回收站,需要挂载一个名为“MooseFS Meta”的特殊目录。

Create mountdir for MooseFS Meta directory first:
首先为MooseFS元数据目录创建挂载点:
mkdir /mnt/mfsmeta

and mount mfsmeta:
然后挂载 mfsmeta:
mfsmount -o mfsmeta /mnt/mfsmeta

If your Master Server Host Name differs from default mfsmaster and/or port differs from default 9421, use appropriate switch, e.g.:
如果 master 服务器主机名不是默认的 mfsmaster,或端口不是默认的 9421,请使用相应的参数指定,例如:
mfsmount -H master.host.name -P PORT -o mfsmeta /mnt/mfsmeta

Then you can find your deleted files in /mnt/mfsmeta/trash/SUBTRASH directory. Subtrash is a directory inside /mnt/mfsmeta named 000..FFF. Subtrashes are helpful if you have many (e.g. millions) of files in trash, because you can easily operate on them using Unix tools like find, whereas if you had all the files in one directory, such tools may fail.
可以在/mnt/mfsmeta/trash/subtrash目录中找到已删除的文件。子目录是/mnt/mfsmeta中名为000..fff的目录。如果垃圾箱中有许多(例如数百万)文件,则子目录可能会帮上忙,因为可以使用诸如find之类的Unix工具轻松地对它们进行操作,而如果一个目录中包含所有的文件,则此类工具可能会失败。

If you do not have many files in trash, mount Meta with mfsflattrash parameter:
如果回收站中文件不多,可以使用 mfsflattrash 参数挂载 Meta:
mfsmount -o mfsmeta,mfsflattrash /mnt/mfsmeta

or if you use Master Host Name or Port other than default:
或者使用的主机名或端口(非默认值情况):
mfsmount -H master.host.name -P PORT -o mfsmeta,mfsflattrash /mnt/mfsmeta

In this case your deleted files will be available directly in /mnt/mfsmeta/trash (without subtrash).
在这种情况下,已删除的文件将直接出现在 /mnt/mfsmeta/trash 中(没有 subtrash 子目录)。

In both cases you can remove files by simply using rm file or undelete them by moving them to undel directory available in trash or subtrash (mv file undel).
在这两种情况下,都可以直接用 rm 删除文件,或者把文件移动到 trash 或 subtrash 中的 undel 目录(mv file undel)来恢复它们。
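
举个示意性的例子,假设已按上文挂载 /mnt/mfsmeta 且未使用 mfsflattrash 参数:
# 统计回收站中的文件数量
find /mnt/mfsmeta/trash -type f -not -path '*/undel/*' | wc -l
# 彻底清空回收站(不可恢复,请谨慎)
find /mnt/mfsmeta/trash -type f -not -path '*/undel/*' -delete
# 恢复某个已删除的文件:把它移入所在 subtrash 下的 undel 目录(FILE 为 find 列出的条目,此处仅为占位)
# mv "/mnt/mfsmeta/trash/000/FILE" /mnt/mfsmeta/trash/000/undel/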

Remember, that if you do not want to have certain files moved to trash at all, set "trash time" (in seconds) for these files to 0 prior to deletion. If you set specific trash time for a directory, all the files created in this directory inherit trash time from parent, e.g.:
记住,如果根本不想将某些文件移到垃圾桶中,请在删除之前将这些文件的“垃圾桶时间”(秒)设置为0。如果为目录设置了特定的垃圾回收时间,则此目录中创建的所有文件都会从父目录继承垃圾回收时间,例如:
mfssettrashtime 0 /mnt/mfs/directory

You can also set a trash time to other value, e.g. 1 hour:
还可以将垃圾时间设置为其它值,例如1小时:
mfssettrashtime 3600 /mnt/mfs/directory

For more information on specific parameters passed to mfsmount or mfssettrashtime, see man mfsmount and man mfstrashtime.
有关传递给 mfsmount 或 mfssettrashtime 的具体参数的详细信息,请参见 man mfsmount 和 man mfstrashtime。