Perl split function reference
2013-11-10 21:24:57 阿炯

Syntax
split /PATTERN/, EXPR, LIMIT
split /PATTERN/, EXPR
split /PATTERN/
split

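Function
Splits the string EXPR into a list of strings; each element of that list is called a field. In list context split returns the list of fields; in scalar context it returns the number of fields.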

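Splitting on a character
Splitting on a string, with no limit
Splitting on a string, with a limit
Splitting on an undefined value
Splitting on a pattern (regular expression)
Splitting into a hash
Splitting on whitespace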

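Note: before Perl 5.11, split in scalar or void context (that is, when nothing receives the returned list) also overwrote @_ with the list of fields, so take care if your Perl is older than that.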

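split also takes optional arguments that control the splitting more precisely. For example:
Limit the number of fields: @fields = split /pattern/, $string [, $limit]
Keep the separators in the output: put the pattern in capturing parentheses, as in @fields = split /(pattern)/, $string [, $limit] (split has no fourth argument; without the parentheses the matched separators are discarded)
Split on several separators: use an alternation or a character class in the pattern, for example /[,\t]/ splits on a comma or a tab

PATTERN is a regular expression that defines where the string is split. EXPR is the string to split. LIMIT caps the number of fields at n, so splitting stops after the (n-1)th separator.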

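Return value: what split returns depends on the context:
1. In list context it returns the list of fields found in EXPR. If EXPR is not given, $_ is split.

2. In scalar context it returns the number of fields found in EXPR. (Before Perl 5.11 it also stored the fields in @_ in this case.)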
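The two contexts can be compared with a minimal sketch. The sample string and variable names are made up for illustration, and a reasonably modern Perl is assumed, where scalar-context split no longer touches @_.

```perl
use strict;
use warnings;

my $record = "alpha,beta,gamma";

# List context: split returns the fields themselves.
my @fields = split /,/, $record;

# Scalar context: split returns the number of fields.
my $count = split /,/, $record;

print "fields: @fields\n";   # fields: alpha beta gamma
print "count:  $count\n";    # count:  3
```

Beware of the idiom `my $n = () = split ...` for counting: assigning to an empty list makes the implicit LIMIT 1, so the count is always at most 1.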

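Key features:

1. If split is called with only the PATTERN argument, EXPR defaults to $_.

2. Anything in EXPR that matches PATTERN is treated as a separator and does not appear in the result. (A separator may be longer than one character, or may be empty: an empty PATTERN produces zero-width matches.)

3. The pattern need not be a constant; an expression can supply a pattern that varies at run time.

4. If PATTERN matches the empty string, EXPR is split at the match position (between characters).

5. If LIMIT is negative, it is treated as arbitrarily large: split produces as many fields as possible and returns all of them.

6. If LIMIT is omitted or 0, split produces as many fields as possible, then removes all trailing empty fields, however many there are, and returns the rest.

7. split avoids producing more fields than needed: when its return value is assigned directly to a list and LIMIT is omitted or 0, LIMIT is treated as one more than the number of variables in that list.

8. If EXPR is the empty string, the result is an empty list, whatever PATTERN and LIMIT are.

9. If PATTERN matches (with non-zero width) at the very start of EXPR, a leading empty field is produced.

10. If PATTERN contains capture groups, each separator additionally produces one field per group (in group order); a group that did not take part in the match yields undef rather than the empty string. These extra fields are produced for every separator, and they do not count toward LIMIT.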
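A few of the features listed above can be checked with a short sketch. The input strings are invented for illustration; only the element counts are printed.

```perl
use strict;
use warnings;

# Feature 8: an empty EXPR always yields an empty list.
my @none = split /,/, "";

# Feature 9: a match at the very start yields a leading empty field.
my @leading = split /,/, ",a,b";          # ("", "a", "b")

# Feature 6: trailing empty fields are dropped when LIMIT is omitted.
my @trailing = split /,/, "a,b,,";        # ("a", "b")

# Feature 10: captured separators are returned too and do not count
# toward LIMIT (LIMIT is 2, yet three elements come back).
my @captured = split /(,)/, "a,b,c", 2;   # ("a", ",", "b,c")

printf "%d %d %d %d\n",
    scalar @none, scalar @leading, scalar @trailing, scalar @captured;
# prints "0 3 2 3"
```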

Excerpts from the official Perl v5.38 manual page:

If omitted, PATTERN defaults to a single space, " ", triggering the previously described awk emulation.

If LIMIT is specified and positive, it represents the maximum number of fields into which the EXPR may be split; in other words, LIMIT is one greater than the maximum number of times EXPR may be split. Thus, the LIMIT value 1 means that EXPR may be split a maximum of zero times, producing a maximum of one field (namely, the entire value of EXPR).

If LIMIT is negative, it is treated as if it were instead arbitrarily large; as many fields as possible are produced.

If LIMIT is omitted (or, equivalently, zero), then it is usually treated as if it were instead negative but with the exception that trailing empty fields are stripped (empty leading fields are always preserved); if all fields are empty, then all fields are considered to be trailing (and are thus stripped in this case).
That is heavy reading. In short: when LIMIT is zero or omitted, empty fields at the front are kept and empty fields at the end are removed.

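Thus, the following:
my @x = split(/,/, "a,b,c,,,"); # ("a", "b", "c")
produces only a three-element list.

my @y = split(/,/, ",,a,b,c,,,"); # ("","","a", "b", "c")
produces a five-element list with two leading empty elements.

my @z = split(/,/, "a,b,c,,,", -1); # ("a", "b", "c", "", "", "")
produces a six-element list, that is, as many fields as possible.

my @w = split(/,/, ",,a,b,c,,,", -1); # ("","","a", "b", "c", "", "", "")
produces an eight-element list: both the leading and the trailing empty fields are kept.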

In time-critical applications, it is worthwhile to avoid splitting into more fields than necessary. Thus, when assigning to a list, if LIMIT is omitted (or zero), then LIMIT is treated as though it were one larger than the number of variables in the list; for the following, LIMIT is implicitly 3:
(Only two fields are needed here, so splitting stops early; the example assumes $_ holds a line in /etc/passwd format.)
my ($login, $passwd) = split(/:/);

Note that splitting an EXPR that evaluates to the empty string always produces zero fields, regardless of the LIMIT specified.

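In the example below the implicit LIMIT is clearly 4 (three variables plus one), so both calls print the same values:
use feature 'say';
my $epf='freeoa:x:1:9:freeoa:/freeoa:/bin/bash';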
my ($u1,$p1,$d1)=split(/:/,$epf);
say "U1:$u1,P1:$p1,D1:$d1";
say '-' x 9;
my ($u2,$p2,$d2)=split(/:/,$epf,4);
say "U2:$u2,P2:$p2,D2:$d2";


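LIMIT in a nutshell:
-1: split into as many fields as possible, keeping trailing empty fields
0: same as omitting it: split fully and let the context decide how many fields are used; this is the default, and once more, empty fields are kept at the front but dropped at the end
1: no splitting; the result is EXPR itself
>1: split from left to right, stopping after the (n-1)th separator; a LIMIT above the number of fields in a full split plus one makes no further difference (the whitespace special case is an exception)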
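The four cases can be compared side by side. This is only an illustrative sketch; the input string is made up and chosen so that it ends in two empty fields.

```perl
use strict;
use warnings;

my $s = "a:b:c::";

my @neg  = split /:/, $s, -1;   # ("a", "b", "c", "", "")
my @zero = split /:/, $s, 0;    # ("a", "b", "c")
my @one  = split /:/, $s, 1;    # ("a:b:c::")
my @two  = split /:/, $s, 2;    # ("a", "b:c::")

# Join with '|' so that empty fields stay visible in the output.
print join('|', @neg),  "\n";   # a|b|c||
print join('|', @zero), "\n";   # a|b|c
print join('|', @one),  "\n";   # a:b:c::
print join('|', @two),  "\n";   # a|b:c::
```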


Working with regular expressions

To keep the separators in the result, put the separator inside capturing parentheses.

split(/-|,/,"1-10,20",3);
## ('1','10','20')

split(/(-|,)/,"1-10,20",3);
## ('1','-','10',',','20')

split(/(-)|,/,"1-10,20",3);
## ('1','-','10',undef,'20')

split(/-|(,)/  , "1-10,20", 3);
# ("1", undef, "10", ",", "20")

split(/(-)|(,)/,"1-10,20",3);
## ('1','-',undef,'10',undef,',','20')


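The official documentation advises against using split() to parse CSV (comma-separated values) files. If the data itself may contain commas, use Text::CSV instead.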
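A small sketch of why. The sample line is invented and contains a quoted field with an embedded comma, which a plain split cannot tell apart from a real separator.

```perl
use strict;
use warnings;

# Three logical fields; the middle one is quoted and contains a comma.
my $csv_line = 'Caine,"14, Leafy Drive",Actor';

# A naive split cuts inside the quoted field as well.
my @naive = split /,/, $csv_line;

print scalar(@naive), " pieces\n";   # 4 pieces
print "[$_]\n" for @naive;
```

A CSV-aware parser such as Text::CSV (a CPAN module, not part of core Perl) understands the quoting and would return three fields here.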

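split is flexible and useful: it splits a string and puts the pieces into an array. It also works with regular expressions (RE), and it operates on $_ when no string is given.

split can be used like this: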
$info = "Caine:Michael:Actor:14, Leafy Drive";
@personal = split(/:/, $info);

The result is:
@personal = ("Caine", "Michael", "Actor", "14, Leafy Drive");

If the data is already in $_, this is enough:
@personal = split(/:/);

If the fields are separated by any number of colons, a regular expression handles it:
$_ = "Capes:Geoff::Shot putter:::Big Avenue";
@personal = split(/:+/);

The result is:
@personal = ("Capes", "Geoff", "Shot putter", "Big Avenue");

The following code, however:
$_ = "Capes:Geoff::Shot putter:::Big Avenue";
@personal = split(/:/);

gives:
@personal = ("Capes", "Geoff", "", "Shot putter", "", "", "Big Avenue");

A word can be split into characters, a sentence into words, and a paragraph into sentences:
@chars = split(//, $word);
@words = split(/ /, $sentence);
@sentences = split(/\./, $paragraph);

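In the first statement the empty pattern matches between every pair of characters, so @chars is an array of single characters.
The part between the two slashes is the regular expression (the separator rule) that split uses.
\s matches any single whitespace character (space, tab, newline and so on).
+ means one or more repetitions.
So \s+ matches a run of one or more whitespace characters.

split(/\s+/, $line) splits the string $line at each run of whitespace.
For example, with $line = "hello friend welcome freeoa.net";
split(/\s+/, $line) returns:
("hello", "friend", "welcome", "freeoa.net")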

Parsing /etc/passwd
my ($username, $password, $uid, $gid, $real_name, $home, $shell) = split /:/, $passwd;

my @fields = split /:/, $passwd;

Picking out only some of the fields:
my ($username, $real_name) = @fields[0, 4];

my ($username, $real_name) = (split /:/, $passwd)[0, 4];

Several separator characters
my @words = split /[=&]/, $str;

Splitting a string into a hash

Goal: convert a string into a hash in the following format.
$string="1:one;2:two;3:three";
=>
%hash=(1=>"one", 2=>"two", 3=>"three");

Correct usage:
my %hash = split /[;:]/, $string;

%hash = map{split /\:/, $_}(split /;/, $string);

Getting the per-state connection counts from 'ss -s'

# ss -s
TCP: 74 (estab 14, closed 30, orphaned 4, synrecv 0, timewait 30/0), ports 0
$ncc='estab 14, closed 30, orphaned 4, synrecv 0, timewait 30/0';
%nctype=split(/[,\s]+/,$ncc);
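To see what ends up in the hash, the same split can be run on the sample text above and the pairs printed. This is a sketch only: the string is copied from the sample output rather than read from a live ss command.

```perl
use strict;
use warnings;

my $ncc = 'estab 14, closed 30, orphaned 4, synrecv 0, timewait 30/0';

# Runs of commas and whitespace separate names from counts, so the
# resulting list alternates key, value, key, value, ...
my %nctype = split /[,\s]+/, $ncc;

print "$_ => $nctype{$_}\n" for sort keys %nctype;
# closed => 30
# estab => 14
# orphaned => 4
# synrecv => 0
# timewait => 30/0
```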

------------------------------------------------------------
A peculiarity of split
Last week I wrote a script on Linux to kill a process. The code that obtained the PID of the process was:

# get current PID of DB RACGIMON
my $process = qx/$PSEF | $GREP "racgimon startd" | $GREP -v grep/;
$process = (split(' ', $process))[1];

The split call was originally written as:
$process = (split(/ /, $process))[1];

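At run time, however, the second form failed to return the PID. I was in a hurry, so I went with the first form without looking into it; only today, after reading the explanation of split in Programming Perl, did I understand the difference.

The canonical syntax of split is: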
split /PATTERN/, EXPR

The first form, which passes a quoted single space (in single or double quotes) to split on whitespace, is a special case:
As a special case, specifying a space " " will split on whitespace just as split with no arguments does. Thus, split(" ") can be used to emulate awk's default behavior, whereas split(/ /) will give you as many null initial fields as there are leading spaces.

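That explains it: split(/ /, EXPR) treats every single space as a separator, so leading spaces and runs of consecutive spaces each add empty (null) fields to the returned list. The columns of ps output are padded with several spaces, so element [1] was an empty string rather than the PID.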
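The difference is easy to reproduce. The line below is a made-up stand-in for one line of ps -ef output (user name, PID and columns are hypothetical), not real data from the script.

```perl
use strict;
use warnings;

# Hypothetical ps -ef line: columns are padded with several spaces.
my $ps_line = "oracle    4711     1  0 10:02 ?  00:00:00 racgimon startd";

my @awk_style = split ' ', $ps_line;   # special case: split on runs of whitespace
my @literal   = split / /, $ps_line;   # every single space is a separator

print "awk style [1]: '$awk_style[1]'\n";   # awk style [1]: '4711'
print "literal   [1]: '$literal[1]'\n";     # literal   [1]: ''
```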

Assigning to a hash:
($v,$k) =split(/\t/,$line);
%z=map {split /\t/} @lines;

use strict;

my $myfile = "test.txt";
my %hash=();
open(FILE,$myfile) || die "Cannot open $myfile: $!";
{
local $/ = undef;   # slurp the whole file in a single read
%hash = split /\||\n/,<FILE>; # munch munch
close FILE;
}

%z=map {reverse split /\t/} @lines;

-----------------------------
while (<IN>) {
 chomp;
 my %rec;
 @rec{@fields} = split /:/;
 $users{$rec{name}} = \%rec;
}

-----------------------------
while (<IN>) {
  chomp;
  my ($name) = /([^:]+)/;
  @{ $users{$name} }{ @fields } = split /:/;
}

-----------------------------
chomp(@{$users{(/([^:]+)/)[0]}}{@fields} = split /:/) while <IN>;

-----------------------------
use Unix::PasswdFile;

my $pw = Unix::PasswdFile->new('/etc/passwd') or die 'Could not open /etc/passwd';

my %users = map { $_ => [$pw->user($_)] } $pw->users;

-----------------------------
String to hash keys with undef values

use feature 'say';
use Data::Dumper;

my $str='ID,IMPL_LIST,LEGAL_TIME_LIMIT,SERIES_NUMBER,BUSINESS_ID';
my (%h1,%h2);

@h1{split /,/,$str}=();
%h2=map{$_=>undef} split(/,/,$str);

say Dumper(\%h1,\%h2);

-----------------------------


Using the Perl split() function

The split() function is used to split a string into smaller sections. You can split a string on a single character, a group of characters or a regular expression (a pattern). You can also specify how many pieces to split the string into. This is better explained in the examples below.

Example 1. Splitting on a character

A common use of split() is when parsing data from a file or from another program. In this example, we will split the string on the comma ','. Note that you typically should not use split() to parse CSV (comma separated value) files in case there are commas in your data: use Text::CSV instead.

my $data = 'Becky Alcorn,25,female,Melbourne';
my @values = split(',', $data);
foreach my $val (@values){
 print "$val\n";
}

This program produces the following output:
Becky Alcorn
25
female
Melbourne

Example 2. Splitting on a string

In the same way you use a character to split, you can use a string. In this example, the data is separated by three tildes '~~~'.
my $data = 'Bob the Builder~~~10:30am~~~1,6~~~ABC';
my @values = split('~~~', $data);

foreach my $val (@values){
 print "$val\n";
}

This outputs:
Bob the Builder
10:30am
1,6
ABC

Example 3. Splitting on a pattern

In some cases, you may want to split the string on a pattern (regular expression) or a type of character. We'll assume here that you know a little about regular expressions. In this example we will split on any run of digits:

my $data = 'Home1Work2Cafe3Work4Home';

# \d+ matches one or more digits
my @values = split(/\d+/, $data);

foreach my $val (@values){
 print "$val\n";
}

The output of this program is:
Home
Work
Cafe
Work
Home

Example 4. Splitting on an undefined value

If you split on an undefined value, the string will be split on every character, the same result as the empty pattern //, which is the clearer way to write it:
my $data = 'Becky Alcorn';
my @values = split(undef,$data);

foreach my $val (@values){
 print "$val\n";
}

The results of this program are:
B
e
c
k
y

A
l
c
o
r
n

Example 5. Splitting on a space

If you use a space ' ' to split on, it will actually split on any kind of space including newlines and tabs (regular expression /\s+/) rather than just a space. In this example we print 'aa' either side of the values so we can see where the split took place:
my $data = "Becky\n\nAlcorn";
my @values = split(' ',$data);
 
# Print 'aa' either side of the value, so we can see where it split
foreach my $val (@values) {
 print "aa${val}aa\n";
}

This produces:
aaBeckyaa
aaAlcornaa

As you can see, it has split on the newlines that were in our data. If you really want to split on a space, use regular expressions:
my @values = split(/ /,$data);

Example 6. Delimiter at the start of the string

If the delimiter is at the start of the string then the first element in the array of results will be empty. We'll print fixed text with each line so that you can see the blank one:
my $data = ',test,data';
my @values = split(',',$data);

# We print "Val: " with each line so that you can see the blank one
foreach my $val (@values) {
 print "Val: $val\n";
}

The output of this program is:
Val:
Val: test
Val: data

Example 7. Split and context

If you do not pass in a string to split, then split() will use $_. If you do not pass an expression or string to split on, then split() will use ' ':

  foreach ('Bob the Builder', 'Thomas the TankEngine', 'B1 and B2') {
    my @values = split;
    print "Split $_:\n";
    foreach my $val (@values) {
      print "  $val\n";
    }
  }

This produces:
  Split Bob the Builder:
    Bob
    the
    Builder
  Split Thomas the TankEngine:
    Thomas
    the
    TankEngine
  Split B1 and B2:
    B1
    and
    B2

Example 8. Limiting the split

You can limit the number of sections the string will be split into. You can do this by passing in a positive integer as the third argument. In this example, we're splitting our data into 3 fields, even though the 3 occurrences of the delimiter would otherwise give 4:

  my $data = 'Becky Alcorn,25,female,Melbourne';

  my @values = split(',', $data, 3);

  foreach my $val (@values) {
    print "$val\n";
  }

This program produces:
  Becky Alcorn
  25
  female,Melbourne

Example 9. Keeping the delimiter

Sometimes, when splitting on a pattern, you want the delimiter in the result of the split. You can do this by capturing the characters you want to keep inside parentheses. Let's do our regular expression example again, but this time we'll keep the numbers in the result:

  my $data = 'Home1Work2Cafe3Work4Home';

  # \d+ matches one or more digits
  # The parentheses () mean we keep the digits we match
  my @values = split(/(\d+)/, $data);

  foreach my $val (@values) {
    print "$val\n";
  }

The output is:
  Home
  1
  Work
  2
  Cafe
  3
  Work
  4
  Home

Example 10. Splitting into a hash

If you know a bit about your data, you could split it directly into a hash instead of an array:
  my $data = 'FIRSTFIELD=1;SECONDFIELD=2;THIRDFIELD=3';

  my %values =  split(/[=;]/, $data);

  foreach my $k (keys %values) {
    print "$k: $values{$k}\n";
  }

The output of this program is:
 FIRSTFIELD: 1
 THIRDFIELD: 3
 SECONDFIELD: 2

The problem is that if the data does not contain exactly what you think, for example FIRSTFIELD=1;SECONDFIELD=2;THIRDFIELD= then you will get an 'Odd number of elements in hash assignment' warning. Here is the output of the same program but with this new data:
  Odd number of elements in hash assignment at ./test.pl line 8.
  FIRSTFIELD: 1
  Use of uninitialized value in concatenation (.) or string at ./test.pl line 11.
  THIRDFIELD:
  SECONDFIELD: 2
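One way around the problem, sketched here under the assumption that every pair has the form KEY=VALUE with a possibly empty value, is to split into pairs first and then split each pair with a LIMIT of 2. A positive LIMIT keeps a trailing empty field, so an empty value becomes an empty string instead of going missing.

```perl
use strict;
use warnings;

my $data = 'FIRSTFIELD=1;SECONDFIELD=2;THIRDFIELD=';

my %values;
for my $pair (split /;/, $data) {
    # LIMIT 2: at most one cut per pair, and an empty value is kept.
    my ($key, $value) = split /=/, $pair, 2;
    $values{$key} = defined $value ? $value : '';
}

foreach my $k (sort keys %values) {
    print "$k: '$values{$k}'\n";
}
# FIRSTFIELD: '1'
# SECONDFIELD: '2'
# THIRDFIELD: ''
```

For this particular data, split(/[=;]/, $data, -1) would also yield an even number of elements, but the two-step version additionally tolerates '=' or ';' problems in a single pair without shifting every later key.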

Map with Split & Trim in Perl
map takes two inputs:
 an expression or block: this would be the trim expression (you don't have to write your own -- it's on CPAN)
 and a list to operate on: this should be split's output:

use String::Util 'trim';
my @values = map { trim($_) } split /\t/, $line;

Without a module:
my @values = map { s/^\s+|\s+$//g; $_ } split(/\t/, $line, 5);
my @trimmed = grep { s/^\s*|\s*$//g } split /\t/, $line;

grep acts as a filter on lists. This is why the \s+s need to be changed to \s*s inside the regex. Forcing matches on 0 or more spaces prevents grep from filtering out items in the list that have no leading or trailing spaces.
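If Perl 5.14 or later can be assumed, the /r flag of the substitution operator gives a variant that returns the trimmed copy directly, so neither the trailing $_ nor the grep trick is needed. The sample line is invented for illustration.

```perl
use 5.014;      # the /r flag needs Perl 5.14+
use warnings;

my $line = "  a \t b\t c  ";

# s///r leaves $_ alone and returns the modified copy.
my @values = map { s/^\s+|\s+$//gr } split /\t/, $line;

print join('|', @values), "\n";   # a|b|c
```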


Split a string into array in Perl

The problem (splitting on the empty string '' breaks the line into single characters):
my $line = "file1.gz file2.gz file3.gz";
my @abc = split('', $line);
print "@abc\n";

Expected result:
file1.gz
file2.gz
file3.gz

Splitting a string by whitespace is very simple:
print $_, "\n" for split ' ', 'file1.gz file1.gz file3.gz';

my $line = "file1.gz file1.gz file3.gz";
my @abc = split(/\s+/, $line);

This is a special form of split actually (as this function usually takes patterns instead of strings):
As another special case, split emulates the default behavior of the command line tool awk when the PATTERN is either omitted or a literal string composed of a single space character (such as ' ' or "\x20"). In this case, any leading whitespace in EXPR is removed before splitting occurs, and the PATTERN is instead treated as if it were /\s+/; in particular, this means that any contiguous whitespace (not just a single space character) is used as a separator.

Matching with a regular expression directly:
my $line = "file1.gz file2.gz file3.gz";
my @abc =  ($line =~ /(\w+[.]\w+)/g);
print @abc,"\n";

Or split right after the extension (here .gz):
my $line = "file1.gzfile1.gzfile3.gz";
my @abc = split /(?<=\.gz)/, $line;
print $_, "\n" for @abc;

Here I used the (?<=...) construct, a look-behind assertion, which makes split cut at each point in the line that is preceded by the substring .gz.

If you work with the fixed set of extensions, you can extend the pattern to include them all:
my $line2 = "file1.gzfile2.txtfile5.bz2file2.gzfile3.xls";
my @exts = ('txt', 'xls', 'gz', 'bz2');
my $patt = join '|', map { '(?<=\.' . $_ . ')' } @exts;
say "PATT:$patt";
my @abc2 = split /$patt/, $line2;
print "$_\n" for @abc2;

Splitting into characters

my $string = "Hello, how are you?";
my @chars = $string =~ /./sg;
# equivalent, with an explicit capture group:
# my @chars = $string =~ /(.)/sg;
print "Fourth char: " . $chars[3] . "\n";


Processing a String One Character at a Time

Problem: You want to process a string one character at a time.

Solution:
Use split with a null pattern to break up the string into individual characters, or use unpack if you just want their ASCII values:
@array = split(//, $string);
@array = unpack("C*", $string);

Or extract each character in turn with a loop:
while (/(.)/g) {# . is never a newline here
    # do something with $1
}

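Discussion:
Perl's fundamental unit is the string, not the character, so processing one character at a time is rarely needed. Some higher-level Perl operation, such as pattern matching, usually solves the problem more easily.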

Here’s an example that prints the characters used in the string "an apple a day", sorted in ascending ASCII order:
%seen = ();
$string = "an apple a day";
foreach $byte (split //, $string) {
    $seen{$byte}++;
}
print "unique chars are: ", sort(keys %seen), "\n";

unique chars are:  adelnpy

These split and unpack solutions give you an array of characters to work with. If you don’t want an array, you can use a pattern match with the /g flag in a while loop, extracting one character at a time:
%seen = ();
$string = "an apple a day";
while ($string =~ /(.)/g) {
    $seen{$1}++;
}
print "unique chars are: ", sort(keys %seen), "\n";

unique chars are:  adelnpy

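In general, if you find yourself processing character by character, there is probably a better way. Instead of using index and substr, or split and unpack, it may be easier to use a pattern. And instead of computing a 32-bit checksum by hand, as in the next example, the unpack function can compute it far more efficiently.

The following example computes the checksum of $string with a foreach loop. There are better checksums; this one just happens to be the basis of a traditional and computationally simple checksum. If you want a more robust checksum, see the MD5 module on CPAN.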

$sum = 0;
foreach $ascval (unpack("C*", $string)) {
    $sum += $ascval;
}
print "sum is $sum\n";
# prints "1248" if $string was "an apple a day"

This does the same thing, but much faster:
$sum = unpack("%32C*", $string);

This lets us emulate the SysV checksum program:
# sum - compute 16-bit checksum of all input files
$chksum = 0;
while (<>) { $chksum += unpack("%16C*", $_) }
$chksum %= (2 ** 16) - 1;
print "$chksum\n";

Here’s an example of its use:
% perl sum /etc/termcap
1510

If you have the GNU version of sum, you’ll need to call it with the --sysv option to get the same answer on the same file.
% sum --sysv /etc/termcap
1510 851 /etc/termcap

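slowcat is a small program that processes its input one character at a time. The idea is to pause after printing each character, so that text scrolls past an audience slowly enough for them to read it.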
# slowcat - emulate a s l o w line printer
# usage: slowcat [-DELAY] [files ...]
$DELAY = ($ARGV[0] =~ /^-([.\d]+)/) ? (shift, $1) : 1;
$| = 1;
while (<>) {
    for (split(//)) {
        print;
        select(undef,undef,undef, 0.01 * $DELAY);
    }
}

See also:
The split and unpack functions in perlfunc(1) and Chapter 3 of Programming Perl; the use of expanding select for timing is explained in Section 3.10


References
http://perldoc.perl.org/functions/split.html