Thursday, May 2, 2013

Unicode tips for Python

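  • To use non-Latin characters in regular expressions in Python 2, write the pattern as a Unicode literal u'...' rather than a raw byte string r'...', even though this means escaping every backslash; e.g. the regex u'(?u)[०-९]\\s' matches a Devanagari digit followed by whitespace. The (?u) flag makes \s match Unicode whitespace as well as ASCII whitespace.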
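  • Remove zero-width joiners (U+200D) and non-joiners (U+200C) from Unicode text to get a normalized representation; otherwise words that render identically in a browser or editor can be stored as different code point sequences and will compare unequal; e.g. use the regex u'[\u200D\u200C]' and replace all matches with u'' (the empty string). Use a single backslash here so that the string literal itself produces the two characters; Python 2's re module does not interpret \u escapes, so a doubled backslash would match the literal letters and digits instead.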
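
Both tips can be tried out with the short sketch below. It is only an illustration, not a complete normalization routine: the names DEVANAGARI_DIGIT_THEN_SPACE, ZERO_WIDTH and strip_zero_width, and the sample word, are made up for this example. The characters are written as \u escapes so the file needs no encoding declaration.

```python
import re

# Devanagari digits run from U+0966 to U+096F.
# The backslash before "s" is doubled because this is not a raw string.
DEVANAGARI_DIGIT_THEN_SPACE = re.compile(u'(?u)[\u0966-\u096F]\\s')

# U+200D is the zero-width joiner, U+200C the zero-width non-joiner.
# The string literal resolves the escapes, so re sees the characters themselves.
ZERO_WIDTH = re.compile(u'[\u200D\u200C]')


def strip_zero_width(text):
    """Return text with every zero-width joiner and non-joiner removed."""
    return ZERO_WIDTH.sub(u'', text)


# Sample data (hypothetical): a Devanagari word, and the same word with a
# stray zero-width non-joiner inserted after its second character.
plain = u'\u0928\u092E\u0938\u094D\u0924\u0947'
stray = plain[:2] + u'\u200C' + plain[2:]

# The two strings differ until the invisible character is stripped.
same_before = (plain == stray)
same_after = (plain == strip_zero_width(stray))

# A Devanagari digit followed by a space is found by the first pattern.
digit_match = DEVANAGARI_DIGIT_THEN_SPACE.search(u'abc \u0967 def')
```

Stripping these characters is a reasonable choice for comparison or search keys; whether it is also safe for text that will be displayed depends on the script, since a joiner or non-joiner can be placed deliberately to control how letters combine.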