Perl Character Encoding Module: Encode (FreeOA)

2016-11-09 19:21:47

阿炯

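Encode is a core module bundled with Perl for handling character encodings. It converts text between encodings, and its submodules cover the encodings of most languages.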

Encode consists of a collection of modules whose details are too extensive to fit in one document. This one itself explains the top-level APIs and general topics at a glance. For other topics and more details, see the documentation for these modules:
Encode::Alias - Alias definitions to encodings
Encode::Encoding - Encode Implementation Base Class
Encode::Supported - List of Supported Encodings
Encode::CN - Simplified Chinese Encodings
Encode::JP - Japanese Encodings
Encode::KR - Korean Encodings
Encode::TW - Traditional Chinese Encodings
Encode::Guess - Guesses encoding from data
charnames - access to Unicode character names and named character sequences; also define character names

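To use Perl's Encode module, simply add use Encode to a program. Two functions do most of the work, encode and decode:
result = encode(encoding_a, string_to_convert);
result = decode(encoding_b, string_to_convert);

encode turns a string into bytes in "encoding a"; decode interprets a string of bytes that is in "encoding b". Note that the string passed to encode must already have been decoded. To convert a GBK-encoded string to UTF-8, therefore, you cannot call encode("utf8", gbk_string) directly; you need encode("utf8", decode("gbk", gbk_string)).
In other words, Encode works with an internal format that sits between decode and encode: decode converts some encoding into that format, and encode converts that format into a specific encoding.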
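
The two-step conversion described above can be sketched as follows. This is a minimal illustration, not code from the original article: the variable names are made up, and the GBK input is written as hex escapes (the bytes for U+4E2D U+6587) so that the result does not depend on how the source file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Raw GBK bytes for the two characters U+4E2D U+6587, written as hex
# escapes so the example does not depend on the source file's encoding.
my $gbk_bytes = "\xD6\xD0\xCE\xC4";

# Step 1: bytes -> Perl character string (the intermediate form).
my $chars = decode('gbk', $gbk_bytes);

# Step 2: character string -> UTF-8 bytes.
my $utf8_bytes = encode('UTF-8', $chars);

printf "characters: %d, GBK bytes: %d, UTF-8 bytes: %d\n",
    length $chars, length $gbk_bytes, length $utf8_bytes;
```

The same two characters take four bytes in GBK and six in UTF-8, while the decoded string has a length of two.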

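Understanding these two functions requires a few concepts:

1. A Perl string is made up of Unicode characters rather than single bytes. Internally Perl stores such strings in a UTF-8-based form, in which each character takes a variable number of bytes (1 to 4 for a Unicode character).

2. Text entering or leaving the Perl environment (printing to the screen, reading and writing files, and so on) is not a Perl string; it has to be converted to or from a byte stream. Which encoding is used for the conversion is entirely up to you (or Perl picks one for you). Once a Perl string has been encoded into bytes, the notion of a character is gone: what remains is a plain sequence of bytes, and interpreting it is your job.

So if Perl is to treat text according to our notion of characters, the text has to be kept as a Perl string throughout. Ordinarily, though, text (including string literals written in a program) is stored as a plain sequence of bytes, one byte per "character" as if it were ASCII. This is where encode and decode come in.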

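As its name suggests, encode encodes a Perl string: it converts the characters of the string to the given encoding and returns them as a byte stream, so it is needed whenever Perl talks to the outside world. Its form is simple:
$octets = encode(ENCODING, $string[, CHECK])

$string: the Perl string
ENCODING: the target encoding
$octets: the resulting byte stream
CHECK: how to handle malformed characters (ones that Perl cannot convert) during the conversion. It can usually be omitted.

The set of available encodings varies widely with the environment; utf8, ascii, ascii-ctrl, iso-8859-1 and a few others are recognized by default.

decode does the opposite and decodes a byte stream. It interprets the given bytes according to the encoding you name and turns them into a Perl string. As a rule, text obtained from a terminal or a file should be converted to a Perl string with decode. Its form is:
$string = decode(ENCODING, $octets[, CHECK])
$string, ENCODING, $octets and CHECK have the same meaning as above.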
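
To make the optional CHECK argument more concrete, here is a small sketch that compares the default behaviour with Encode::FB_CROAK, one of the constants documented in Encode. The input bytes and variable names are illustrative assumptions.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bad_bytes = "abc\xFF";   # \xFF can never appear in well-formed UTF-8

# Default CHECK: the malformed byte is replaced by U+FFFD.
my $lenient = decode('UTF-8', $bad_bytes);

# Encode::FB_CROAK: die on malformed input instead. decode may modify its
# input when CHECK is set, so a copy is passed.
my $copy = $bad_bytes;
my $strict_ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };

printf "lenient result ends with U+%04X\n", ord substr($lenient, -1);
print  "strict decode ", ($strict_ok ? "succeeded" : "died"), "\n";
```

Which behaviour is preferable depends on the application; dying early is usually easier to debug than silently substituted characters.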

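Consider a program that splits a string of Chinese text, written out literally in the source, into single characters. Because the string is written literally, it is already stored as a byte stream and has lost its meaning as characters, so the first step is to turn it into a Perl string with decode. Chinese text of this kind is commonly encoded as GBK, so decode is called with GBK. After the conversion Perl treats characters the way we do, and the usual string functions handle them correctly, except for those that treat a string as a bunch of bytes by design (vec, pack, unpack and the like). split can therefore cut the string into single characters. Finally, since a Perl string cannot be sent to the output as is, each piece is encoded back to GBK bytes with encode and then printed.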
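
A program of the kind described in this paragraph might look like the sketch below. It is a reconstruction under assumptions, not the author's original listing: the GBK text is given as hex escapes, and the pieces are printed as hex rather than raw GBK so that the output is readable on any terminal.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# GBK bytes for a two-character Chinese string (U+4E2D U+6587), given as hex
# escapes so the sketch runs regardless of how this file is saved.
my $gbk_bytes = "\xD6\xD0\xCE\xC4";

# Splitting the raw bytes yields four one-byte pieces, not two characters.
my @byte_pieces = split //, $gbk_bytes;

# Decode first, then split: each element is now one whole character.
my @chars = split //, decode('gbk', $gbk_bytes);

# Encode every character back to GBK; the bytes are shown as hex here.
print join(' ', map { unpack 'H*', encode('gbk', $_) } @chars), "\n";
```

On a GBK terminal the last line could print encode('gbk', $_) directly instead of its hex form.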

Listing supported encodings

use Encode;
my @list = Encode->encodings();

# To get a list of all available encodings, including those that have not yet been loaded:
my @all_encodings = Encode->encodings(":all");
# List the encodings related to Chinese
my @with_cn = Encode->encodings("Encode::CN");

For encoding via PerlIO (handling encodings at the PerlIO layer), see the reference at the end of this article.

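Encoding and JSON

Data containing non-ASCII (UTF-8) characters needs some care; handled naively, it comes out garbled. The JSON module's encode_json and decode_json work with UTF-8 themselves, but Perl, for the sake of simplicity and speed, assumes by default that a program's source is not UTF-8, so UTF-8 support has to be declared at the top of the program. If JSON encoding (that is, encode_json) is used, the Encode module is needed as well. In short, add the following at the start of the program: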
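use utf8;      # the source file itself is UTF-8
use Encode;

# $freeoa is the string to be handled. _utf8_off is an Encode-internal
# function that only clears the UTF8 flag without converting anything.
Encode::_utf8_off($freeoa);
$freeoa = decode("utf8", $freeoa);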

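Also, content in an encoding other than UTF-8 is best converted to UTF-8 with Encode's from_to function before it is passed to the JSON functions. For example, simplified Chinese in GBK (typically from files created on older versions of Windows) can be converted like this:
use JSON;
use Encode 'from_to';

# Assume $json holds GBK-encoded bytes
my $json = '{"test" : "我是来自FreeOA的GBK编码的哦"}';

# from_to converts the bytes in place, here from GBK to UTF-8
from_to($json, 'GBK', 'UTF-8');

my $data = decode_json($json);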
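
The sketch below walks through the same flow end to end. To keep it self-contained it makes two substitutions that are not in the article: it uses the core JSON::PP module instead of JSON (both export decode_json and encode_json), and it builds the GBK input from code-point escapes rather than from a literal in the source file.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use JSON::PP qw(decode_json encode_json);

# Simulate GBK-encoded JSON input (the value is U+4E2D U+6587).
my $gbk_json = encode('gbk', qq({"test":"\x{4E2D}\x{6587}"}));

# GBK bytes -> characters -> UTF-8 bytes, which is what decode_json expects.
my $utf8_json = encode('UTF-8', decode('gbk', $gbk_json));

# decode_json returns Perl character strings inside the data structure.
my $data = decode_json($utf8_json);

# encode_json goes the other way and returns UTF-8 bytes again.
my $round_trip = encode_json($data);

printf "value length in characters: %d\n", length $data->{test};
```

The point of the design is that JSON functions sit at the byte boundary: bytes go into decode_json, characters live inside the program, and bytes come out of encode_json.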

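Other encoding-related notes:

Encoding and decoding can be handled at the I/O layer. binmode() sets the layer on a handle that is already open:
binmode $fh, ':encoding(7bit-jis)';
Or name the layer when opening the file, which is less error-prone:
open my $fh, '<:encoding(7bit-jis)', $file or die $!;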
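
As a small illustration of an encoding layer at work, the sketch below writes to and reads from an in-memory file, so no real file is needed. The buffer and variable names are assumptions made for the example, and UTF-8 is used instead of 7bit-jis to keep the byte counts easy to follow.

```perl
use strict;
use warnings;

my $text = "\x{4E2D}\x{6587}\n";   # two characters plus a newline

# Write through an encoding layer: characters go in, UTF-8 bytes come out.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write: $!";
print {$out} $text;
close $out or die "close: $!";

# Read the same bytes back through the layer: bytes in, characters out.
open my $in, '<:encoding(UTF-8)', \$buffer or die "open for read: $!";
my $line = <$in>;
close $in;

printf "bytes stored: %d, characters read: %d\n", length $buffer, length $line;
```

With the layer in place the program never calls encode or decode itself; the conversion happens at the handle.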

If the encoding of the source data is truly unknown, the Encode::Guess module, part of Encode (first shipped with Perl v5.7.3), can help detect it.
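
A minimal sketch of Encode::Guess follows, assuming the input happens to be UTF-8. With no extra suspects, guess_encoding considers ASCII, UTF-8 and BOM-marked UTF-16/32; it returns an encoding object on success and an error string otherwise.

```perl
use strict;
use warnings;
use Encode;
use Encode::Guess;   # exports guess_encoding

my $bytes = "\xE4\xB8\xAD\xE6\x96\x87";   # UTF-8 bytes for U+4E2D U+6587

# Returns an encoding object on success, or a plain error string.
my $guess = guess_encoding($bytes);

my $decoded = ref $guess ? $guess->decode($bytes) : undef;

if (defined $decoded) {
    printf "guessed %s, %d characters\n", $guess->name, length $decoded;
}
else {
    print "could not guess: $guess\n";
}
```

Additional suspects can be passed as extra arguments to guess_encoding, but short inputs are often ambiguous between legacy encodings, in which case the error string is returned.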

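The 'use utf8' pragma only tells Perl that the source code is encoded in UTF-8; it does not make input and output UTF-8. The encodings for input and output have to be specified separately.

# List the encodings supported by the current perl, from the terminal
perl -MEncode -e 'print join "\n"=>Encode->encodings(":all")'

:utf8 and :encoding(UTF-8) are not the same. The latter says "this filehandle is guaranteed to be UTF-8": data passed through it is checked, and invalid data is not accepted as is. The former says "this filehandle is UTF-8" without verifying it. A program using the ':utf8' layer may therefore pass invalid data along, which is a security hole, so avoid the ':utf8' layer whenever possible. This is also discussed on perlmonks.org (node 644786).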

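Technical background

Perl supports Unicode strings, whose internal encoding is either ISO-8859-1 or UTF8. A flag called "SvUTF8" (the "UTF8 flag") records which one: it is set to 1 for strings stored internally as UTF-8 and to 0 for strings stored as ISO-8859-1 (or raw binary). Whatever the internal encoding, a Perl string is a sequence of characters, not bytes (FreeOA: is this referring to dualvars?).

Once the UTF8 flag is set, Perl does not check the validity of the UTF8 sequence any further. Normally this is fine, because it was Perl that set the flag in the first place. Some people, however, set the UTF8 flag by hand, bypassing the protection built into the encode/decode functions and the PerlIO layers, whether because it is easier (less typing), for performance, or without even realizing that they are doing something wrong.

The :utf8 PerlIO layer sets the UTF8 flag on incoming data without checking the byte sequence. This is not a bug or a defect but the purpose of that layer: it is used internally after another layer (most importantly the :encoding layer) has safely converted the input to UTF8. The function that sets the UTF8 flag, _utf8_on, is available from the Encode module.

Several XS modules set the UTF8 flag on data coming in from files or sockets (think of databases and network protocols), sometimes without checking the validity of the UTF8 sequence.

Perl's functions use Unicode semantics by default (apart from some bugs; see Unicode::Semantics for a workaround). This means that \w matches any alphanumeric character or underscore, which covers a very large number of Unicode characters. The same holds for \d and \s, yet many people believe \w is short for [A-Za-z0-9_], \d for [0-9] and \s for [ \f\t\r\n]. That is not true: they have matched with Unicode semantics since 5.8, which had been out for five years when the original note was written.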

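Recommendations

Do not set the UTF8 flag unless you are absolutely certain that your data is valid UTF8. Remember once more: ':utf8' sets the flag without checking for valid UTF8.

Use :encoding(utf8) or :encoding(UTF-8) instead of the 'utf8' PerlIO layer.

Use utf8::decode, Encode::decode_utf8, Encode::decode("utf8", …) or Encode::decode("UTF-8", …) instead of _utf8_on. In XS code, do not use SvUTF8_on; use sv_utf8_decode, or check validity first with is_utf8_string.

If you do not want non-ASCII characters to match, either filter or reject non-ASCII characters (those with a code point above 127) beforehand, or write a literal character class instead of \w, \d or \s.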
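
The difference between \d and a literal character class can be seen in a short sketch. The test character is an arbitrary choice, and the /a modifier shown last is an addition that the text above does not mention; it is available from Perl 5.14 and restricts \d, \s and \w to ASCII.

```perl
use v5.14;      # the /a modifier needs Perl 5.14 or later
use warnings;

my $arabic_one = "\x{0661}";   # ARABIC-INDIC DIGIT ONE, a Unicode digit

# \d follows Unicode semantics and accepts the character.
my $matches_d = $arabic_one =~ /\A\d\z/ ? 1 : 0;

# A literal class accepts ASCII digits only.
my $matches_class = $arabic_one =~ /\A[0-9]\z/ ? 1 : 0;

# With /a, \d is restricted to ASCII as well.
my $matches_d_ascii = $arabic_one =~ /\A\d\z/a ? 1 : 0;

say "\\d: $matches_d, [0-9]: $matches_class, \\d with /a: $matches_d_ascii";
```

For input validation, such as numeric form fields, the literal class or /a is usually what is intended.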

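Flaws in the Perl documentation

Some official Perl documentation still uses ':utf8' in code examples. This was changed in the development version years ago.

If the terminal and the editor/IDE are all set up to handle UTF-8 correctly and garbled text still shows up on screen, check whether a suitable font is installed.

Perl has offered quite good official (and safe) Unicode support since v5.14, which is well worth using in every environment.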

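Input may come from files, the command line, sockets or other sources, and output may go to STDOUT, files or other sinks. Setting the PERL_UNICODE environment variable to AS tells Perl to treat @ARGV and the three standard streams as UTF-8; the letters A and S are described in the -C section of perldoc perlrun. This cannot be done by setting the environment variable from inside your code, though: it has to be in place before the program starts. On Linux:
PERL_UNICODE=AS perl program.pl
Or export the variable so that it applies to every program:
export PERL_UNICODE=AS

On Windows the syntax is:
set PERL_UNICODE=AS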

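The Encode::is_utf8() function reports whether Perl is treating a string internally as Latin-1 or as UTF-8. A set UTF-8 flag does not mean that the string really is valid UTF-8. Like Encode::Guess, it amounts to a guess, so the encoding layers in use still have to be set explicitly beforehand.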
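
A short sketch shows why the flag describes internal storage rather than the content. The strings are illustrative: the same character, U+00E9, appears once as decoded text and once as a plain Latin-1 string.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

my $bytes   = "\xC3\xA9";                 # UTF-8 encoding of U+00E9, still bytes
my $decoded = decode('UTF-8', $bytes);    # one character, stored as UTF-8 internally
my $latin1  = "\xE9";                     # the same character, stored as Latin-1

printf "bytes:   length %d, flag %s\n", length $bytes,   is_utf8($bytes)   ? 'on' : 'off';
printf "decoded: length %d, flag %s\n", length $decoded, is_utf8($decoded) ? 'on' : 'off';
printf "latin1:  length %d, flag %s\n", length $latin1,  is_utf8($latin1)  ? 'on' : 'off';

# The flag differs, yet Perl considers the two strings equal.
my $same_text = $decoded eq $latin1 ? 1 : 0;
```

Because $decoded and $latin1 compare equal while their flags differ, is_utf8 is not a reliable test for "has this been decoded".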

The non-core 'utf8::all' module offers a compact way to reduce the encoding confusion described above.

https://unicode.org/charts/ lists character names; in Perl, characters can be written by code point:
say "\N{U+263A}";
say chr(0x263a);
say "\x{2603}";
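
Names and code points can also be converted into each other with the charnames module. The sketch below uses charnames::viacode and charnames::vianame; the two characters are the ones from the lines above, and the output is kept ASCII-only to avoid "Wide character" warnings.

```perl
use strict;
use warnings;
use charnames ':full';

# Three ways to write the same character, U+263A.
my $by_name = "\N{WHITE SMILING FACE}";
my $by_code = "\N{U+263A}";
my $by_chr  = chr(0x263A);

# Look up a name from a code point, and a code point from a name.
my $name = charnames::viacode(0x263A);
my $code = charnames::vianame('SNOWMAN');

printf "U+263A is %s; SNOWMAN is U+%04X\n", $name, $code;
```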

Why does modern Perl avoid UTF-8 by default?

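The article lists seven points for standardizing the use of Unicode in Perl, which could be called best practices for Perl Unicode. The original summary follows: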

1.Set your PERL_UNICODE envariable to AS. This makes all Perl scripts decode @ARGV as UTF‑8 strings, and sets the encoding of all three of stdin, stdout, and stderr to UTF‑8. Both these are global effects, not lexical ones.

2.At the top of your source file (program, module, library, dohickey), prominently assert that you are running perl version 5.12 or better via:
use v5.12; # minimal for unicode string feature
use v5.14; # optimal for unicode string feature

3.Enable warnings, since the previous declaration only enables strictures and features, not warnings. I also suggest promoting Unicode warnings into exceptions, so use both these lines, not just one of them. Note however that under v5.14, the utf8 warning class comprises three other subwarnings which can all be separately enabled: nonchar, surrogate, and non_unicode. These you may wish to exert greater control over.

use warnings;
use warnings qw( FATAL utf8 );

4.Declare that this source unit is encoded as UTF‑8. Although once upon a time this pragma did other things, it now serves this one singular purpose alone and no other:
use utf8;

5.Declare that anything that opens a filehandle within this lexical scope but not elsewhere is to assume that that stream is encoded in UTF‑8 unless you tell it otherwise. That way you do not affect other module’s or other program’s code.
use open qw(:encoding(UTF-8) :std);

6.Enable named characters via \N{CHARNAME}.
use charnames qw( :full :short );

7.If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF‑8, then say:
binmode(DATA, ":encoding(UTF-8)");

There is of course no end of other matters with which you may eventually find yourself concerned, but these will suffice to approximate the stated goal to “make everything just work with UTF‑8”, albeit for a somewhat weakened sense of those terms.

One other pragma, although it is not Unicode related, is:
use autodie;

It is strongly recommended.
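
The effect of the unicode_strings feature mentioned in point 2 can be checked with a small sketch. It compares the same operations with the feature on (via use v5.12) and explicitly off inside a block; the variable names are illustrative.

```perl
use v5.12;      # enables strict and the unicode_strings feature
use warnings;

# With unicode_strings, code points 128-255 get Unicode semantics.
my $upper_with = uc "\xDF";                     # LATIN SMALL LETTER SHARP S
my $word_with  = "\xE9" =~ /\w/ ? 1 : 0;        # e with acute

my ($upper_without, $word_without);
{
    no feature 'unicode_strings';
    # Without it, these byte-sized characters are treated as plain bytes.
    $upper_without = uc "\xDF";
    $word_without  = "\xE9" =~ /\w/ ? 1 : 0;
}

printf "with feature: uc gives %d chars, \\w matches: %d\n",
    length $upper_with, $word_with;
printf "without:      uc gives %d chars, \\w matches: %d\n",
    length $upper_without, $word_without;
```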

Some experts have also turned this into a standard template that can be used as is whenever a script is written:
use 5.014;
use utf8;
use strict;
use autodie;
use warnings;
use warnings    qw< FATAL utf8     >;
use open        qw< :std :utf8     >;
use charnames   qw< :full >;
use feature     qw< unicode_strings >;

use File::Basename      qw< basename >;
use Carp                qw< carp croak confess cluck >;
use Encode              qw< encode decode >;
use Unicode::Normalize qw< NFD NFC >;

END { close STDOUT }

if (grep /\P{ASCII}/ => @ARGV) {
   @ARGV = map { decode("UTF-8", $_) } @ARGV;
}

$0 = basename($0); # shorter messages
$| = 1;

binmode(DATA, ":encoding(UTF-8)");

# give a full stack dump on any untrapped exceptions
local $SIG{__DIE__} = sub {
    confess "Uncaught exception: @_" unless $^S;
};

# now promote run-time warnings into stack-dumped
# exceptions *unless* we're in a try block, in
# which case just cluck the stack dump instead
local $SIG{__WARN__} = sub {
    if ($^S) { cluck   "Trapped warning: @_" }
    else     { confess "Deadly warning: @_" }
};

while (<>) {
    chomp;
    $_ = NFD($_);
    ...
} continue {
    say NFC($_);
}

__END__

All perl source code should be in UTF-8 by default. You can get that with use utf8 or export PERL5OPTS=-Mutf8.

The perl DATA handle should be UTF-8. You will have to do this on a per-package basis, as in binmode(DATA, ":encoding(UTF-8)").

Program arguments to perl scripts should be understood to be UTF-8 by default. export PERL_UNICODE=A, or perl -CA, or export PERL5OPTS=-CA.

The standard input, output, and error streams should default to UTF-8. export PERL_UNICODE=S for all of them, or I, O, and/or E for just some of them. This is like perl -CS.

Any other handles opened by perl should be considered UTF-8 unless declared otherwise; export PERL_UNICODE=D or with i and o for particular ones of these; export PERL5OPTS=-CD would work. That makes -CSAD for all of them.

Cover both bases plus all the streams you open with export PERL5OPTS=-Mopen=:utf8,:std. See uniquote.

You don't want to miss UTF-8 encoding errors. Try export PERL5OPTS=-Mwarnings=FATAL,utf8. And make sure your input streams are always binmoded to :encoding(UTF-8), not just to :utf8.

Code points between 128–255 should be understood by perl to be the corresponding Unicode code points, not just unpropertied binary values. use feature "unicode_strings" or export PERL5OPTS=-Mfeature=unicode_strings. That will make uc("\xDF") eq "SS" and "\xE9" =~ /\w/. A simple export PERL5OPTS=-Mv5.12 or better will also get that.

Named Unicode characters are not by default enabled, so add export PERL5OPTS=-Mcharnames=:full,:short,latin,greek or some such. See uninames and tcgrep.

You almost always need access to the functions from the standard Unicode::Normalize module for various types of decompositions. export PERL5OPTS=-MUnicode::Normalize=NFD,NFKD,NFC,NFKC, and then always run incoming stuff through NFD and outbound stuff from NFC. There's no I/O layer for these yet that I'm aware of, but see nfc, nfd, nfkd, and nfkc.
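
The "NFD on the way in, NFC on the way out" advice from the last item can be illustrated with a minimal sketch, using a single accented letter as an assumed input.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

my $composed   = "\x{E9}";            # e with acute as one code point
my $decomposed = NFD($composed);      # "e" followed by U+0301 COMBINING ACUTE ACCENT
my $recomposed = NFC($decomposed);    # back to the single code point

printf "composed: %d, decomposed: %d, recomposed: %d code points\n",
    length $composed, length $decomposed, length $recomposed;
```

Working on the decomposed form makes it easy to match the base letter regardless of accents; recomposing before output keeps the text compact and predictable for other programs.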

Notes on Windows and Linux

"Unicode" on Windows is UTF-16LE, and each character is 2 or 4 bytes. Linux uses UTF-8, and each character is between 1 and 4 bytes.

Windows uses CRLF (\r\n, 0D 0A) line endings while Unix just uses LF (\n, 0A).
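
The byte counts and line endings mentioned above can be checked with Encode. This is an illustrative sketch; the sample characters and text are arbitrary choices.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bmp_char    = "\x{4E2D}";    # a character inside the Basic Multilingual Plane
my $astral_char = "\x{1F6B4}";   # a character outside it (needs a surrogate pair)

my $bmp_utf16    = length encode('UTF-16LE', $bmp_char);
my $astral_utf16 = length encode('UTF-16LE', $astral_char);
my $bmp_utf8     = length encode('UTF-8', $bmp_char);
my $astral_utf8  = length encode('UTF-8', $astral_char);

printf "U+4E2D:  UTF-16LE %d bytes, UTF-8 %d bytes\n", $bmp_utf16, $bmp_utf8;
printf "U+1F6B4: UTF-16LE %d bytes, UTF-8 %d bytes\n", $astral_utf16, $astral_utf8;

# Convert Windows line endings (CRLF) to Unix line endings (LF).
my $windows_text = "line one\r\nline two\r\n";
(my $unix_text = $windows_text) =~ s/\r\n/\n/g;
```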

In September 2025, support for displaying Unicode code points in Tk-804.036 was tested under Perl v5.3x:
use v5.32;
use Tk;
use utf8;
use Tk::ROText;
use charnames ':full';
use open qw(:encoding(UTF-8) :std);

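1. On Linux, the Tk interface displays the Unicode pictographic character (\N{U+1F6B4}) correctly.
2. On Windows 1x, the Tk interface cannot display the Unicode pictographic character (\N{U+1F6B4}); a question mark in a slanted box appears instead, even with the system setting "Beta: Use Unicode UTF-8 for worldwide language support" enabled. Chinese text still displays without problems.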

References:
Perl UTF-8 character handling issues

Latest version: 2.86

Project home page: https://metacpan.org/release/Encode