Perl grep function usage reference - FreeOA


2013-04-02 15:00:09

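(1) How to use the grep function

grep has two forms:
grep BLOCK LIST
grep EXPR, LIST

BLOCK is a block of code, written inside {}; EXPR is an expression, often a regular expression. According to the original documentation, EXPR can be anything, including one or more variables, operators, literals, function calls, or subroutine calls. LIST is the list to be filtered.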

This is similar in spirit to, but not the same as, grep(1) and its relatives. In particular, it is not limited to using regular expressions.

Evaluates the BLOCK or EXPR for each element of LIST (locally setting $_ to each element) and returns the list value consisting of those elements for which the expression evaluated to true. In scalar context, returns the number of times the expression was true.

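grep evaluates BLOCK or EXPR for each element of the list, temporarily setting $_ to the current element as it iterates. In list context, grep returns all the elements that matched, so the result is itself a list. In scalar context, grep returns the number of elements that matched.

In its block form, grep takes two arguments: a code block and a list.

For each element of the list, $_ (Perl's default scalar variable) is aliased to that element and the code block is executed. If the block returns false, the element is discarded. If the block returns true, the element becomes part of the return value.

Note: there is no comma between the code block and the second argument!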
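
The two forms and the two contexts can be sketched as follows. This is a minimal illustration; the @words data and the variable names are made up for the example and do not come from the original text.

```perl
use strict;
use warnings;

# Hypothetical sample data, not from the original article.
my @words = qw(apple Banana cherry avocado);

# BLOCK form: no comma between the block and the list.
my @from_block = grep { /^a/ } @words;

# EXPR form: a comma separates the expression from the list.
my @from_expr = grep /^a/, @words;

# Scalar context: grep returns the number of matching elements.
my $count = grep { /^a/ } @words;

print "@from_block\n";   # apple avocado
print "@from_expr\n";    # apple avocado
print "$count\n";        # 2
```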

The syntax of the grep function is as follows:
grep BLOCK LIST
grep (EXPR, LIST)

where:

BLOCK – contains one or more statements delimited by braces; the last statement in the block determines whether the block will be evaluated true or false. The block will be evaluated for each element of the list and if the result is true, that element will be added to the returned list. If you need to apply a more sophisticated filter that consists of multiple code lines, you may consider using the grep function with a block.

EXPR – it is any expression that supports $_, in particular a regular expression. The expression is applied against each element of the list and if the result of evaluation is true, the current element will be appended to the returned list.

LIST – it is a list or an array

Note that in a scalar context the Perl grep function will return the number of times the BLOCK or the EXPR is evaluated true.

How does it work? Well, the grep function iterates through the elements of the list and at each iteration step:
(1) sets $_ to the current element of the list;
(2) evaluates the BLOCK or EXPR against $_;
(3) if the result of evaluation is true, in list context adds the value of $_ to the output list and in scalar context increments the count of matched elements.
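
The iteration steps can be written out as an ordinary loop. The sketch below is only an illustration: my_grep is a hypothetical helper name, not a Perl built-in, and it is not how grep is actually implemented internally.

```perl
use strict;
use warnings;

# my_grep is a hypothetical helper, written only to illustrate the steps.
# The (&@) prototype lets it be called with a bare block, like grep.
sub my_grep(&@) {
    my ($code, @list) = @_;
    my @out;
    for (@list) {                      # sets $_ to the current element
        push @out, $_ if $code->();    # evaluates the block against $_
    }
    # List context: the matching elements. Scalar context: their count.
    return wantarray ? @out : scalar @out;
}

my @even = my_grep { $_ % 2 == 0 } 1 .. 6;
my $n    = my_grep { $_ % 2 == 0 } 1 .. 6;

print "@even\n";   # 2 4 6
print "$n\n";      # 3
```

One difference from the real grep: this helper copies its arguments into @list, so changes made to $_ inside the block would not reach the caller's data.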

Because the elements of the list are stored in the special scalar variable $_ (that is an alias to the current element) you can modify them. However, try to avoid this feature if you want to create clear and robust code - to modify the elements of an array by running a particular expression or function against them, you should use the map function instead (keep in mind: grep to filter, map to modify).
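
A short sketch of the aliasing behaviour, and of the "grep to filter, map to modify" alternative. The @prices data is made up for illustration.

```perl
use strict;
use warnings;

my @prices = (5, 12, 20);

# Modifying $_ inside grep changes the original array.
# Shown only to illustrate the side effect; avoid this in real code.
my @big = grep { $_ *= 2; $_ > 20 } @prices;
print "@prices\n";     # 10 24 40
print "@big\n";        # 24 40

# Clearer: map builds the modified copy, grep only filters.
my @original = (5, 12, 20);
my @doubled  = map  { $_ * 2 }  @original;
my @large    = grep { $_ > 20 } @doubled;
print "@original\n";   # 5 12 20
print "@large\n";      # 24 40
```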

The Perl grep function is very convenient to use any time you need to loop through a list in order to extract a subset of elements from it, elements that match a certain condition.

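A simple match can be used as a test
Be careful with grep($match, @array): it does not search for $match. The expression $match is simply evaluated once per element, so when $match is true, the call in scalar context always returns the number of elements in @array. To test whether any element matches, use a pattern instead:
if (grep /$match/, @array) {
    print "found it\n";
}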
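
The difference between the two calls can be checked directly. In this sketch the @array contents and the value of $match are made up for illustration.

```perl
use strict;
use warnings;

my @array = qw(red green blue);
my $match = 'green';

# Pitfall: $match is just a true value here, so every element "passes".
my $wrong = grep($match, @array);

# Exact string comparison against each element.
my $exact = grep { $_ eq $match } @array;

# Pattern match; \Q...\E quotes any regex metacharacters in $match.
my $by_pattern = grep { /\Q$match\E/ } @array;

print "$wrong $exact $by_pattern\n";   # 3 1 1
```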

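You can also use the smart matching operator (available from Perl 5.10 onward; it has been flagged as experimental since Perl 5.18, so it is best avoided in new code):
print "found it\n" if ($match ~~ @array);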

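(2) grep vs. loops

open FILE, "<myfile" or die "Can't open myfile: $!";
print grep /terrorism|nuclear/i, <FILE>;
This opens the file myfile and prints the lines that contain terrorism or nuclear. In list context, <FILE> returns a list holding the entire contents of the file. As you may have noticed, this is memory-hungry for large files, because every line of the file is read into memory at once.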

The alternative is to use a loop:
while ($line = <FILE>) {
    if ($line =~ /terrorism|nuclear/i) { print $line }
}
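
The two approaches can be compared side by side. To keep the sketch runnable without a file on disk, it reads from an in-memory filehandle; the sample text and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# In-memory "file" so the sketch runs without myfile on disk
# (an assumption for illustration; the article reads a real file).
my $text = "Nuclear talks resume\nWeather is mild\nReport on terrorism\n";

# grep: <$fh> in list context reads every line into memory first.
open(my $fh, '<', \$text) or die "Can't open in-memory file: $!";
my @with_grep = grep { /terrorism|nuclear/i } <$fh>;
close $fh;

# Loop: only one input line is read at a time.
open($fh, '<', \$text) or die "Can't open in-memory file: $!";
my @with_loop;
while (my $line = <$fh>) {
    push @with_loop, $line if $line =~ /terrorism|nuclear/i;
}
close $fh;

print @with_grep;
print "same result\n" if "@with_grep" eq "@with_loop";
```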

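The code above shows that a loop can do anything grep can do. So why use grep at all? The short answer is that grep is more Perl-style, while the loop is C-style.

A better explanation is:
(1) grep makes it obvious to the reader that you are selecting certain elements from a list;
(2) grep is more concise than a loop.

Advice: if you are new to Perl, sticking with plain loops is fine; once you are comfortable with Perl, you can reach for grep, which is a powerful tool.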

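Compared with Unix grep

In short, the built-in grep function is a generalization of the Unix grep command.

Unix grep filters the lines of a file based on a regular expression.

Perl grep can filter any list based on any condition.

The following Perl code is a simple implementation of Unix grep: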
my $regex = shift;
print grep { $_ =~ /$regex/ } <>;

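The first line takes a regular expression from the command line; the remaining command-line arguments should be file names. The diamond operator <> reads every line from all of those files, grep filters the lines with the regular expression, and the lines that pass the filter are printed.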

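(3) Some grep examples

1. Count the list elements that match an expression
$num_apple = grep /^apple$/i, @fruits;

In scalar context, grep returns the number of matching elements; in list context, it returns a list of the matching elements. The code above therefore returns how many times the word apple (in any letter case) occurs in the @fruits array. Because $num_apple is a scalar, the assignment forces grep into scalar context.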
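
The context rule has one common trap: parentheses around the variable on the left turn the assignment into a list assignment. A small sketch, with made-up @fruits data:

```perl
use strict;
use warnings;

my @fruits = qw(apple Banana APPLE pineapple Apple cherry);

# Scalar assignment: grep runs in scalar context and returns a count.
my $num_apple = grep { /^apple$/i } @fruits;

# Parentheses on the left make this a list assignment:
# $first_apple receives the first matching element, not the count.
my ($first_apple) = grep { /^apple$/i } @fruits;

# scalar() forces the count where a list would otherwise be expected.
print "count: ", scalar(grep { /^apple$/i } @fruits), "\n";   # count: 3
print "$num_apple $first_apple\n";                            # 3 apple
```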

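2. Extract the unique elements of a list
@unique = grep { ++$count{$_} < 2 } qw(a b a c d d e f g f h h);
print "@unique\n";

Running the code above prints: a b c d e f g h

That is, each distinct element of the list qw(a b a c d d e f g f h h) is returned once. Why does this work? Let's take a look:

%count is a hash whose keys are the list elements visited while iterating over the qw() list, and ++$count{$_} increments the hash value stored for $_. In this comparison, ++$count{$_} and $count{$_}++ do not mean the same thing: the former increments the value before the comparison, and the latter increments it after the comparison. So ++$count{$_} < 2 adds 1 to $count{$_} and then compares the result with 2. The value of $count{$_} starts out as undef, which counts as 0 in numeric context. So the first time an element such as a is used as a hash key, its value becomes 1 after the increment; the second time, the value becomes 2. Once it reaches 2 the condition no longer holds, so a is not returned a second time.

This is how the code above extracts each element of the list exactly once.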
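
The same idea is often written with a post-increment and a negation. The sketch below uses a lexical hash named %seen (a name chosen here for illustration): $seen{$_}++ returns the old value, which is false the first time an element appears, so the negation is true only for first occurrences.

```perl
use strict;
use warnings;

my @letters = qw(a b a c d d e f g f h h);

# A lexical hash avoids sharing counts with other code in the same script.
my %seen;
my @unique = grep { !$seen{$_}++ } @letters;

print "@unique\n";   # a b c d e f g h
```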

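2.1 Extract the elements that occur exactly twice in a list
@crops = qw(wheat corn barley rice corn soybean hay alfalfa rice hay beets corn hay);
@duplicates = grep { $count{$_} == 2 } grep { ++$count{$_} > 1 } @crops;
print "@duplicates\n";

The output is: rice

Note that grep is used twice here and that evaluation runs from right to left. First, 'grep { ++$count{$_} > 1 } @crops' returns a list containing the second and later occurrences of every element of @crops that appears more than once (corn rice hay corn hay). By the time it finishes, %count holds the total count of every element. Then grep { $count{$_} == 2 } is applied to that temporary list and returns the elements whose total count equals 2. So the code above returns rice: it occurs more than once, and exactly twice.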
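
An alternative that some readers may find easier to follow is to count everything first and filter afterwards. This is a sketch under the assumption that each qualifying element should be reported once; %count and %seen are lexical hashes introduced here for illustration.

```perl
use strict;
use warnings;

my @crops = qw(wheat corn barley rice corn soybean hay alfalfa rice hay beets corn hay);

# Pass 1: count every element.
my %count;
$count{$_}++ for @crops;

# Pass 2: keep elements whose total is exactly 2, reporting each one once.
my %seen;
my @twice = grep { $count{$_} == 2 && !$seen{$_}++ } @crops;

print "@twice\n";   # rice
```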

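3. List the text files in the current directory
@files = grep { -f and -T } glob '* .*';
print "@files\n";

This one is easy to follow. glob returns a list of the entries in the current directory: the pattern * matches every name that does not start with '.', and the pattern .* adds the names that do. The {} is a code block containing the condition applied to the list that follows it. This is just the other form of grep, and it works much like grep EXPR, LIST. -f and -T are tested against each element: it must first be a regular file, and then it must be a text file. Writing the tests in this order is said to be more efficient, because -T is the more expensive test, so -f is checked before -T.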
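
To try the filter without depending on what happens to be in the current directory, the sketch below builds a temporary directory with File::Temp. The file names are hypothetical, and the glob call assumes the temporary path contains no spaces.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Build a throwaway directory (hypothetical file names) to test against.
my $dir = tempdir(CLEANUP => 1);
mkdir "$dir/subdir" or die "mkdir failed: $!";

open(my $txt, '>', "$dir/notes.txt") or die "open failed: $!";
print {$txt} "plain text\n";
close $txt;

open(my $bin, '>:raw', "$dir/data.bin") or die "open failed: $!";
print {$bin} "\0\1\2\3" x 64;     # zero bytes make -T treat it as binary
close $bin;

# Regular file first (cheap), then the text heuristic (reads the file).
my @text_files = grep { -f $_ && -T $_ } glob("$dir/*");

print "$_\n" for @text_files;     # only the path ending in notes.txt
```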

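4. Remove duplicates from an array
@array = qw(To be or not to be that is the question);
@found_words = grep { $_ =~ /b|o/i and ++$counts{$_} < 2 } @array;
print "@found_words\n";

The output is: To be or not to question

The condition in {} means: for each element of @array, first check whether it contains the character b or o (case-insensitively), and then require that the element has been seen fewer than 2 times so far (that is, this is its first occurrence). Note that To and to are different hash keys, so both are kept.

grep returns a list containing the elements of @array that satisfy both conditions.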
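
If words that differ only in letter case should count as duplicates, the hash key can be lowercased. This variation is an assumption about what a reader might want, not something the original example does.

```perl
use strict;
use warnings;

my @array = qw(To be or not to be that is the question);

# lc() normalizes the key, so "To" and "to" share one counter.
my %seen;
my @found_words = grep { /b|o/i && !$seen{lc $_}++ } @array;

print "@found_words\n";   # To be or not question
```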

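5. Select the elements of a two-dimensional array where x < y
# An array of references to anonymous arrays
@data_points = ( [ 5, 12 ], [ 20, -3 ], [ 2, 2 ], [ 13, 20 ], [ 5, 8 ] );
@y_gt_x = grep { $_->[0] < $_->[1] } @data_points;
foreach $xy (@y_gt_x) { print "$xy->[0],$xy->[1]\n" }

The output is:
5,12
13,20
5,8

To follow this you need to understand anonymous arrays. [] creates an anonymous array and yields a reference to it (similar to a pointer in C), and the elements of @data_points are such references. For example:
foreach (@data_points) { print $_->[0] }

This accesses the first element of each anonymous array; replace 0 with 1 to get the second element. So { $_->[0] < $_->[1] } should now be clear: it is true when the first element of an anonymous array is less than its second element. And 'grep { $_->[0] < $_->[1] } @data_points' returns the list of anonymous arrays that satisfy this condition.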
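
Two related points can be sketched briefly: grep can filter index positions instead of elements, and the references it returns are the same references stored in the source array, not copies. The variable names @indexes and @y_gt_x below are used for illustration.

```perl
use strict;
use warnings;

my @data_points = ( [ 5, 12 ], [ 20, -3 ], [ 2, 2 ], [ 13, 20 ], [ 5, 8 ] );

# Filter the index range instead of the elements themselves.
my @indexes = grep { $data_points[$_][0] < $data_points[$_][1] } 0 .. $#data_points;
print "@indexes\n";   # 0 3 4

# The filtered list holds the same references as the source array.
my @y_gt_x = grep { $_->[0] < $_->[1] } @data_points;
print "same reference\n" if $y_gt_x[0] == $data_points[0];
```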

How complex the {} block of a grep can be is limited only by the amount of virtual memory available to the program.

6. Filter out small numbers

my @numbers = qw(8 2 5 3 1 7);
my @big_numbers = grep { $_ > 4 } @numbers;
print "@big_numbers\n"; # prints: 8 5 7

grep returns the values greater than 4 and filters out the values that are not.

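7. Find old files that are more than a year old

my @files = glob "*.log";
my @old_files = grep { -M $_ > 365 } @files;
print join "\n", @old_files;

glob "*.log" returns all files in the current directory whose extension is .log.

-M $path_to_file returns the number of days since the file was last modified (measured from the time the script started).

This example filters out the files modified within the last 365 days and keeps the files that have not been modified for more than a year.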
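
The -M filter can be exercised without waiting a year by backdating a file with utime. The sketch below creates two hypothetical log files in a temporary directory; it assumes the temporary path contains no spaces, since glob splits its argument on whitespace.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Create two hypothetical log files.
for my $name (qw(old.log new.log)) {
    open(my $fh, '>', "$dir/$name") or die "open failed: $!";
    print {$fh} "entry\n";
    close $fh;
}

# Backdate old.log by 400 days (access time and modification time).
my $then = time - 400 * 24 * 60 * 60;
utime($then, $then, "$dir/old.log") or die "utime failed: $!";

my @old_files = grep { -M $_ > 365 } glob("$dir/*.log");

print "$_\n" for @old_files;   # only the path ending in old.log
```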

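8. Find particular files in a directory

foreach my $infile (grep { !/^\./ && -f "$indir/$_" } readdir(DIR)) {
...
}

readdir(DIR) reads the names of all entries (including hidden entries whose names start with '.') from the directory handle DIR and returns them as a list. grep then processes the names one at a time, with each name held in $_: !/^\./ first checks that the name does not start with '.', which excludes hidden files, and -f "$indir/$_" then checks that the entry exists and is a regular file.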
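
A complete version of this pattern, including the opendir call that the snippet leaves out, might look like the sketch below. It uses a lexical directory handle instead of the bareword DIR, and the directory contents are hypothetical files created just for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# $indir and the file names below are hypothetical.
my $indir = tempdir(CLEANUP => 1);
mkdir "$indir/subdir" or die "mkdir failed: $!";
for my $name ('.hidden', 'a.txt', 'b.txt') {
    open(my $fh, '>', "$indir/$name") or die "open failed: $!";
    close $fh;
}

opendir(my $dh, $indir) or die "Can't open $indir: $!";
# readdir returns bare names, so -f needs the directory prepended.
my @infiles = sort grep { !/^\./ && -f "$indir/$_" } readdir($dh);
closedir $dh;

print "@infiles\n";   # a.txt b.txt
```

The sort is there only because readdir returns entries in no particular order.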

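9. Combined example

@foundlist = grep(pattern, @searchlist);
The search-and-match function. Like the UNIX search tool of the same name, it extracts the elements of a list that match a given pattern. The pattern argument is the pattern to search for, and the return value is the list of matching elements.

@list = ("This", "is", "a", "test");
@foundlist = grep(/^[tT]/, @list); # find the words that start with T (case-insensitive)
Result: @foundlist = ("This", "test");

# Remove the words made up of exactly 4 letters
print join(" ", (grep { !/^\w{4}$/ } (qw(Here are freeOA some four letter words.))));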
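
When the pattern is held in a variable, it can be compiled once with qr// and reused. The sketch below repeats both snippets in that style; $pattern and @not_four are names introduced here for illustration.

```perl
use strict;
use warnings;

my @list = ("This", "is", "a", "test");

# A precompiled pattern stored in a variable; /i makes it case-insensitive.
my $pattern   = qr/^t/i;
my @foundlist = grep { $_ =~ $pattern } @list;
print "@foundlist\n";   # This test

# Negated match: keep everything except words of exactly 4 word characters.
my @not_four = grep { !/^\w{4}$/ } qw(Here are freeOA some four letter words.);
print "@not_four\n";    # are freeOA letter words.
```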

Reference: the official Perl documentation for grep (perldoc -f grep)