深入理解perl的grep
2013-04-02 20:59:29 阿炯

Perl grep函数从LIST中提取符合EXPR的值。grep函数在标量环境中,返回为真的次数;在列表环境中,返回EXPR为真的元素的列表,它是Perl的内建函数。本文就其用法做深入的分析和对比,可以在一定程度上加深对'perl'与'grep'的使用认知,尤其是'TIMTOWTDI'规则。

关于内置'grep'指令,可参考:perl grep函数使用参考
---------------
grep函数的用法入门
@matches = grep( /^MyItem$/, @someArray );
@matches = grep( $_ == $val, @a );

例子:
#!/usr/bin/perl
@list = (1,"Test", 0, "foo", 20 );
@has_digit = grep ( /\d/, @list );
print "@has_digit\n";

结果如下:
1 0 20

再来,取得非真的元素:
#!/usr/bin/perl
@list = (1,"Test", 0, "foo", 20 );
@has_digit = grep ( !/\d/, @list );  #!非真
print "@has_digit\n";

结果如下:
Test foo

在标量环境下返回的结果:
#!/usr/bin/perl
@list = (1,"Test", 0, "foo", 20 );
$has_digit = grep ( /\d/, @list );
print "$has_digit\n";

结果如下:
3 #匹配次数

将'array'转换为'hash':
my %params = map {$_=>1} @badparams;
判断所要查找的值是否在其中:
if(exists($params{$someparam})) { ... }

我们来对比一下不同的处理方式所带来的不同性能差异。

use Benchmark;
my @list;
for (1..10_000) {
 push @list, $_;
}

timethese(10000, {
 'grep'=> sub {
 if ( grep(/^5000$/o, @list) ) {
  # code
 }},
 'hash'=> sub {
 my %params = map { $_ => 1 } @list;
 if ( exists($params{5000}) ) {
  # code
 }},
});

结果如下:
Benchmark: timing 10000 iterations of grep, hash...
 grep:  8 wallclock secs ( 7.82 usr +  0.01 sys =  7.83 CPU) @ 1277.14/s (n=10000)
 hash: 60 wallclock secs (60.07 usr +  0.04 sys = 60.11 CPU) @ 166.36/s (n=10000)

修正版本:使用了模块'List::Util::first'与'List::MoreUtils::any':
use List::Util qw(first);
use List::MoreUtils qw(any);
use Benchmark;

my @list = ( 1..10_000 );
my $hit = 5_000;
my $hit_regex = qr/^$hit$/; # precompute regex
my %params;
$params{$_} = 1 for @list;  # precompute hash
timethese(
 100_000, {
 'any' => sub {
  die unless ( any { $hit_regex } @list );
 },
 'first' => sub {
  die unless ( first { $hit_regex } @list );
 },
 'grep' => sub {
  die unless ( grep { $hit_regex } @list );
 },
 'hash' => sub {
  die unless ( $params{$hit} );
 },
});

结果如下:
Benchmark: timing 100000 iterations of any, first, grep, hash...
 any:  0 wallclock secs ( 0.67 usr +  0.00 sys =  0.67 CPU) @ 149253.73/s (n=100000)
 first:  1 wallclock secs ( 0.63 usr +  0.01 sys =  0.64 CPU) @ 156250.00/s (n=100000)
 grep: 42 wallclock secs (41.95 usr +  0.08 sys = 42.03 CPU) @ 2379.25/s (n=100000)
 hash:  0 wallclock secs ( 0.01 usr +  0.00 sys =  0.01 CPU) @ 10000000.00/s (n=100000)
(warning: too few iterations for a reliable count)

可见使用'hash'式式效率很高,上面的示例中的'hash'可能在'map'导致了性能下降。

---------------
检查Perl数组是否包含特定值

grep函数可使用正则表达式检查给定的值是否存在于给定的数组中,它采用两个参数:value和array。
value: Provides value to search for.
array: Provides array to search in.

返回值
它返回一个布尔值。

如果数组已排序,请使用“二分查找(Binary Search)”法。

如果同一数组被多次重复搜索,请先将其转换为哈希,然后检查哈希。如果内存吃紧,那么将数组中的每个项移动到散列中。内存效率更高,但会破坏原始数组。若在数组中重复搜索相同的值,则延迟构建缓存(在搜索每个项目时,首先检查搜索结果是否存储在持久化哈希中。如果在哈希中找不到搜索结果,则搜索数组并将结果放入持久化哈希,以便下次在哈希中找到它并跳过搜索)。

注意:这些优化只会在处理长数组时更快,不要过度优化。

数组转哈希:
my %params = map { $_ => 1 } @badparams;

if(exists($params{$someparam})) { ... }

还可以向列表中添加更多(唯一|unique)参数:
$params{$newparam} = 1;

稍后返回(唯一)参数列表:
@badparams = keys %params;

smartmatch就不介绍了。List::Util是非常好的选择。

grep { $_ eq $value } @array

If you insist on using regular expressions, the safe way would be
grep /^\Q$value\E$/, @array


如果只用grep来进行元素的查找,直接比较比用正则在性能上要好一些。
use v5.32;
use Benchmark;

my (@list,%params);
for (1..10_000) {
    push @list, $_;
    $params{%_}=1;
}

timethese(10_000, {
    'grep1' => sub {
        if (grep(/^5000$/o, @list)){
            # code
        }
    },
    'grep2' => sub {
        if (grep({$_ == 5000} @list)){
            # code
        }
    },
});


Benchmark: timing 10000 iterations of grep1, grep2...
grep1:  3 wallclock secs ( 3.64 usr +  0.00 sys =  3.64 CPU) @ 2747.25/s (n=10000)
grep2:  2 wallclock secs ( 2.17 usr +  0.00 sys =  2.17 CPU) @ 4608.29/s (n=10000)


新版本已经将List::Util做为内置模块了,且其中的3个用于范围运算的子函数都具备了:

use v5.32;
use List::Util qw(none any all);

my @numbers = qw(7 4 1 3 78);
if(none {$_ > 100} @numbers) {print "No elements over 100\n"};
if(any {$_ > 50} @numbers) {print "Some elements over 50\n"};
if(all {$_ <80} @numbers) {print "All elements less than 80\n"};

---------------
使用'List::Util'模块中的'first'函数来代替

use List::Util qw/first/;
my @a = qw(foo bar baz);
if ( first { $_ eq 'bar' } @a ) { say "Found bar!" }

NB. first returns the first element it finds and so doesn't have to iterate through the complete list (which is what grep will do).
List::Util::first
$foo = first { ($_ && $_ eq "value" } @list; # first defined value in @list

下面提供一个类'first'实现的函数,它运行的效率更高。
It works by stopping once it finds the element. It's written in C for speed, and its Perl equivalent looks like this subroutine:
sub first (&@) {
 my $code = shift;
 foreach (@_) {
  return $_ if &{$code}();
 }
undef;
}

This is answered in perlfaq4's answer to "How can I tell whether a certain element is contained in a list or array?".

---------------
使用模块'List::MoreUtils'中的'any'函数

use List::MoreUtils qw/any/;
my @array = qw(foo bar baz);
print "Exist\n" if any {($_ eq "foo")} @array;

-------------------
仿'List::'中的方法
如果不想引入'List::'系列模块,也可以自己来实现它们的功能:

sub first (&@) {
  my $code = shift;
  $code->() and return $_ foreach @_;
  undef
}

sub any (&@) {
  my $code = shift;
  $code->() and return 1 foreach @_;
  undef
}

--------------------------
新版本中可以使用的方法

在perl 5.10中,可以使用智能匹配(~~)。
print "Exist\n" if $var ~~ @array;

use 5.010;
if( $item ~~ @array ){
 say "The array contains $item"
}
if( $item ~~ %hash ){
 say "The hash contains $item"
}
The smart match ~~ operator.

if( $element ~~ @list ){ ... }
if( $element ~~ [ 1, 2, 3 ] ){ ... }

You could also use the given/when construct. Which uses the smart match functionality internally.
given( $element ){
 when( @list ){ ... }
}

You can also use a for loop as a "topicalizer" ( meaning it sets $_ ).
for( @elements ){
 when( @list ){ ... }
}

One thing that will come out in Perl 5.12 is the ability to use the post-fix version of when. Which makes it even more like if and unless.
given( $element ){
 when @list;
}

If you have to be able to run on older versions of Perl, there still are several options.

You might think you can get away with using List::Util::first, but there are some edge conditions that make it problematic.

You can safely use grep however.

if( grep { $element eq $_ } 0..9 ){ ... }

This is safe because grep gets called in a scalar context. Arrays return the number of elements when called in scalar context. So this will continue to work even if we try to match against undef.

If you're only matching against strings, you could also use a hash. This can speed up your program if @list is large and, you are going to match against %hash several times. Especially if @array doesn't change, because then you only have to load up %hash once.
my %hash = map { $_, 1 } @array;
if( $hash{ $element } ){ ... }

You could also make your own subroutine. This is one of the cases where it is useful to use prototypes.
sub in(&@){
local $_;
my $code = shift;
for( @_ ){ # sets $_
 if( $code->() ){
  return 1;
 }
}
return 0;
}
if( in { $element eq $_ } @list ){ ... }
if (grep { $_ eq $element } @list) {....}

sub is (&@) {
  my $test = shift;
  $test->() and return 1 for @_;
}

sub in (@) {@_}
if( is {$_ eq "a"} in qw(d c b a) ) {
  print "Welcome in perl!\n";
}

---------------
更为常规的方法
当然旧的版本里最好还是将数组转为哈希,通过判断其键来确定其是否在其中。

If the values in question are integers instead of strings, you can save quite a lot of space by using bit strings instead:
@articles = ( 1..10, 150..2000, 2017 );
undef $read;
for (@articles) { vec($read,$_,1) = 1 }

Now check whether vec($read,$n,1) is true for some $n.
These methods guarantee fast individual tests but require a re-organization of the original list or array. They only pay off if you have to test multiple values against the same array.

如果处理速度不重要,你也可以得到列表里有哪些元素包含你想找到的字串。
If speed is of little concern, the common idiom uses grep in scalar context (which returns the number of items that passed its condition) to traverse the entire list. This does have the benefit of telling you how many matches it found.

将得到有多少个匹配量
my $is_there = grep $_ eq $whatever, @array;

将得到一个匹配的数组
my @matches = grep $_ eq $whatever, @array;