Technical Trivia: Perl tips - Unicode

Thursday, February 4, 2010

Perl tips - Unicode

First, use Encode;

Reading/Writing

$string = Encode::decode('UTF-8', $text); turns the raw bytes in $text into a character string (assuming the input file or STDIN is encoded in UTF-8).
You can now handle $string as you would a normal string (e.g. split(//, $string) splits it at character (code point) boundaries rather than byte boundaries).
Do $text = Encode::encode('UTF-8', $string); before writing it out (assuming you want the output file or STDOUT in that encoding).
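
A minimal sketch of the decode / process / encode round trip described above. The helper name reverse_utf8 and the sample bytes are made up for illustration; only Encode's decode and encode come from the module itself.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: reverse the characters of a UTF-8 encoded byte string.
sub reverse_utf8 {
    my ($bytes) = @_;
    my $string   = decode('UTF-8', $bytes);             # bytes -> characters
    my $reversed = join '', reverse split //, $string;  # split at character boundaries
    return encode('UTF-8', $reversed);                  # characters -> bytes
}

my $bytes = "\xC3\x80b";                 # UTF-8 bytes for 'À' followed by 'b'
my $chars = decode('UTF-8', $bytes);
printf "%d bytes, %d characters\n", length($bytes), length($chars);
# prints "3 bytes, 2 characters"
```

Reversing the raw bytes instead would split the two-byte sequence for 'À' and produce invalid UTF-8, which is why the string is decoded first.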

Regex

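\p{L} - letter (e.g. 'A'; a precomposed 'À' is also a single letter)
\p{M} - combining mark (e.g. the grave accent U+0300 that follows 'A' to give 'À' in decomposed form)
\p{N} - numeric character (use \p{Nd} for decimal digits only)
\p{P} - punctuation
\p{Kannada} - any character in the Kannada script
\P{...} - negation of the property (e.g. \P{L} matches anything that is not a letter)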
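
A small sketch of how these properties behave, assuming the core Unicode::Normalize module is available. The helper count_marks and the sample code points are chosen for illustration only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: count the combining marks in a character string.
sub count_marks {
    my ($s) = @_;
    my $n = () = $s =~ /\p{M}/g;
    return $n;
}

my $precomposed = "\x{00C0}";          # 'À' as one code point
my $decomposed  = NFD($precomposed);   # 'A' followed by U+0300 COMBINING GRAVE ACCENT

printf "precomposed: %d chars, %d marks\n", length($precomposed), count_marks($precomposed);
printf "decomposed:  %d chars, %d marks\n", length($decomposed),  count_marks($decomposed);

my $ka = "\x{0C95}";                   # KANNADA LETTER KA
print "Kannada letter\n" if $ka =~ /^\p{Kannada}$/ && $ka =~ /^\p{L}$/;
print "not a letter\n"   if "7" =~ /^\P{L}$/;
```

Whether an accented letter shows up as one \p{L} or as \p{L} plus \p{M} depends on how the input was normalized, so it can be worth normalizing before matching.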

E.g. to match a line only if it contains no numerals and no punctuation do
$line !~ m/[\p{N}\p{P}]/
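
The same test as a runnable sketch. The helper name is_plain and the sample lines are assumptions for illustration; the lines are decoded first so the properties are checked against characters, not bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: 1 if the (already decoded) line contains
# no numerals and no punctuation, 0 otherwise.
sub is_plain {
    my ($line) = @_;
    return $line !~ m/[\p{N}\p{P}]/ ? 1 : 0;
}

# Sample UTF-8 byte strings; the last one is two Kannada letters.
my @samples = ("hello world", "hello, world", "room 101", "\xE0\xB2\x95\xE0\xB2\xA8");

for my $bytes (@samples) {
    my $line = decode('UTF-8', $bytes);
    printf "%s (%d chars)\n", (is_plain($line) ? 'keep' : 'skip'), length $line;
}
```

Note that the negated match (!~) with a character class is what expresses "contains neither"; a positive match on the same class would select the lines that do contain a numeral or punctuation mark.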
