Thursday, February 4, 2010

Perl tips - Unicode

First, use Encode;

Reading/Writing
  • $string = Encode::decode('UTF-8', $text); converts the raw bytes in $text into a Perl character string (assuming the input file or STDIN is encoded in UTF-8).
  • You can now handle $string as you would a normal string (e.g. split(//, $string) will split it at character boundaries rather than byte boundaries).
  • Do $text = Encode::encode('UTF-8', $string); before writing it out (assuming you want the output file or STDOUT in that encoding).
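A minimal sketch of the decode / process / encode round trip described above. The sample bytes and the variable names $byte_count, $char_count and $out are assumptions made for illustration; in a real script $text would come from a file or STDIN.

```perl
use strict;
use warnings;
use Encode;

# Raw UTF-8 bytes for "Àb", as they would arrive from a file or STDIN:
# "À" is two bytes (0xC3 0x80), "b" is one byte.
my $text = "\xC3\x80b";

# Decode the bytes into a Perl character string.
my $string = Encode::decode('UTF-8', $text);

# Character-level operations now work on characters, not bytes.
my @chars      = split //, $string;
my $byte_count = length $text;      # counts bytes
my $char_count = length $string;    # counts characters

# Encode back to UTF-8 bytes before writing out.
my $out = Encode::encode('UTF-8', $string);

print "bytes=$byte_count chars=$char_count\n";
```

The byte count and the character count differ because 'À' takes two bytes in UTF-8 but is a single character once decoded.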
Regex
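  • \p{L} - letter (e.g. 'A')
  • \p{M} - mark, i.e. a combining character (e.g. the combining grave accent that follows 'A' to give 'À')
  • \p{N} - number (digits and other numeric characters; use \p{Nd} for decimal digits only)
  • \p{P} - punctuation
  • \p{Kannada} - any character in the Kannada script
  • \P{...} - invert the condition (e.g. \P{L} matches any character that is not a letter)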
E.g. to match a line only if it contains no numerals and no punctuation, negate the match with !~:
$line !~ m/\p{N}|\p{P}/
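As a rough sketch of these properties in use: a line with no numerals and no punctuation can be tested with the negated match operator !~. The helper name is_clean and the sample strings are hypothetical, chosen only for illustration. Note that the \p{...} properties only behave as expected on decoded character strings, not on raw bytes.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: true when the line contains no numerals
# and no punctuation.
sub is_clean {
    my ($line) = @_;
    return $line !~ m/\p{N}|\p{P}/;
}

# Decode from UTF-8 bytes first so that \p{...} sees characters.
my $kannada = Encode::decode('UTF-8', "\xE0\xB2\x95");  # U+0C95 KANNADA LETTER KA
my $accent  = "A\x{0300}";                               # 'A' + combining grave accent

my @results = (
    is_clean("hello world")        ? 1 : 0,  # letters and a space only
    is_clean("hello, world")       ? 1 : 0,  # the comma is punctuation
    is_clean("room 42")            ? 1 : 0,  # contains digits
    ($kannada =~ m/^\p{Kannada}$/) ? 1 : 0,  # a single Kannada character
    ($accent  =~ m/^\p{L}\p{M}$/)  ? 1 : 0,  # a letter followed by a mark
);
print join(",", @results), "\n";
```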
