Processing UTF-8 Files with Perl

I thought I'd point out a common text-processing problem—processing text in files that contain multibyte characters—and show a solution I recently discovered for Perl. Consider the following text:

Oracle® Database Administrator’s Guide

Without careful programming, the registered trademark symbol and the curled apostrophe can easily get mangled. Here's how the text is represented in CP-1252, the single-byte Western European character set used by default on Microsoft Windows:

4f 72 61 63 6c 65 ae 20 44 61 74 61 62 61 73 65 20 41 64 6d 69 6e 69 73 74 72
 O  r  a  c  l  e  ®     D  a  t  a  b  a  s  e     A  d  m  i  n  i  s  t  r
61 74 6f 72 92 73 20 47 75 69 64 65
 a  t  o  r  ’  s     G  u  i  d  e

The text cannot be represented in ASCII, because neither the registered trademark symbol nor the curled apostrophe is an ASCII character. (Byte values ae and 92 are greater than 7f, the highest byte value defined for ASCII.) If the high bit of each byte is dropped to force the text into seven bits, ae becomes 2e (a period) and 92 becomes 12 (a control character):

4f 72 61 63 6c 65 2e 20 44 61 74 61 62 61 73 65 20 41 64 6d 69 6e 69 73 74 72
 O  r  a  c  l  e  .     D  a  t  a  b  a  s  e     A  d  m  i  n  i  s  t  r
61 74 6f 72 12 73 20 47 75 69 64 65
 a  t  o  r  ?  s     G  u  i  d  e

It also cannot be represented in ISO-8859-1, the default character encoding for HTML 2.0. (HTML 3.2 acknowledges ISO-8859-1 as the best-supported HTML character set, and HTML 4.01 explicitly does not define a default encoding.) While the registered trademark symbol is an ISO-8859-1 character, the curled apostrophe is not. Byte value 92 is a control character in the ISO-8859-1 encoding.

4f 72 61 63 6c 65 ae 20 44 61 74 61 62 61 73 65 20 41 64 6d 69 6e 69 73 74 72
 O  r  a  c  l  e  ®     D  a  t  a  b  a  s  e     A  d  m  i  n  i  s  t  r
61 74 6f 72 92 73 20 47 75 69 64 65
 a  t  o  r  ?  s     G  u  i  d  e

UTF-8, on the other hand, is the default character encoding for XML, and, by extension, XHTML. In UTF-8, the registered trademark symbol is a two-byte character (c2 ae) and the curled apostrophe is a three-byte character (e2 80 99). Here's a representation:

4f 72 61 63 6c 65 c2 ae 20 44 61 74 61 62 61 73 65 20 41 64 6d 69 6e 69 73 74 72
 O  r  a  c  l  e     ®     D  a  t  a  b  a  s  e     A  d  m  i  n  i  s  t  r
61 74 6f 72 e2 80 99 73 20 47 75 69 64 65
 a  t  o  r        ’  s     G  u  i  d  e
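
The byte sequences shown above can be reproduced with the Encode module. The following is only a minimal sketch: the helper name hex_dump is made up for illustration, and it assumes the encoding names 'cp1252' and 'UTF-8' as Encode spells them.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The sample text as Perl characters: U+00AE is the registered trademark
# symbol and U+2019 is the curled apostrophe.
my $text = "Oracle\x{ae} Database Administrator\x{2019}s Guide";

# Hypothetical helper: return the bytes of a string as space-separated
# lowercase hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my $cp1252 = hex_dump(encode('cp1252', $text)); # One byte per character
my $utf8   = hex_dump(encode('UTF-8',  $text)); # One to three bytes each

print "CP-1252: $cp1252\n";
print "UTF-8:   $utf8\n";
```

In the CP-1252 line, the two special characters should appear as the single bytes ae and 92. In the UTF-8 line they should appear as c2 ae and e2 80 99.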

The following program replaces non-ASCII characters with XML numeric character references:

open(IN, "<$ARGV[0]") or die "$!"; # Input as default encoding
my $file = do { local $/; <IN> }; # Read file contents into scalar
close(IN);

$file =~ s/(.)/asciiize($1)/eg; # Process by char

sub asciiize {
    return $_[0] if (ord($_[0]) < 128);     # ASCII
    return sprintf('&#x%04X;', ord($_[0])); # Non-ASCII
}

print $file;

Because the program assumes one character per byte, it treats each byte of a multibyte character as a character of its own. Here is how it encodes the text when read from a UTF-8 file:

Oracle&#x00C2;&#x00AE; Database Administrator&#x00E2;&#x0080;&#x0099;s Guide

This will be rendered in browsers like this:

OracleÂ® Database Administratorâ€™s Guide
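
The garbled rendering is what results when UTF-8 bytes are read as CP-1252, which is how browsers commonly interpret references such as &#x0080; and &#x0099;. The sketch below reproduces the effect in Perl. The function name misread_as_cp1252 is hypothetical, chosen for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: encode a character as UTF-8, then misread the
# resulting bytes as CP-1252. Returns the code points of the mangled text.
sub misread_as_cp1252 {
    my ($char) = @_;
    my $bytes   = encode('UTF-8', $char);   # Character to UTF-8 bytes
    my $mangled = decode('cp1252', $bytes); # Each byte becomes a character
    return map { sprintf 'U+%04X', ord } split //, $mangled;
}

my @registered = misread_as_cp1252("\x{ae}");
my @apostrophe = misread_as_cp1252("\x{2019}");

print "registered: @registered\n"; # U+00C2 U+00AE
print "apostrophe: @apostrophe\n"; # U+00E2 U+20AC U+2122
```

U+00C2 is Â, U+00E2 is â, U+20AC is the euro sign, and U+2122 is the trademark sign, which matches the mangled text shown above.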

The repaired version of the program uses the three-argument form of the open function with an I/O layer, :encoding(utf8). The three-argument form itself dates from Perl 5.6, but encoding layers are new in Perl 5.8:

open(IN, '<:encoding(utf8)', $ARGV[0]) or die "$!"; # Input as UTF-8
my $file = do { local $/; <IN> }; # Read file contents into scalar
close(IN);

$file =~ s/(.)/asciiize($1)/eg; # Process by char

sub asciiize {
    return $_[0] if (ord($_[0]) < 128);     # ASCII
    return sprintf('&#x%04X;', ord($_[0])); # Non-ASCII
}

print $file;

With the encoding layer in place, Perl decodes the bytes into characters as it reads the file, so a character is no longer assumed to be a single byte and character-based processing can proceed accurately. Here is how the improved version encodes the text:

Oracle&#x00AE; Database Administrator&#x2019;s Guide

The text is now safe for ASCII-based processing and non-UTF-8-aware text editors, but it will still be correctly rendered in browsers like this:

Oracle® Database Administrator’s Guide
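
To compare the two ways of reading without creating test files, the same bytes can be read through an in-memory filehandle. This is only a sketch: the convert helper and the use of an in-memory file are assumptions made for the illustration, while asciiize is the function from the programs above.

```perl
use strict;
use warnings;

# The sample text as UTF-8 bytes (hex values taken from the UTF-8 dump).
my $bytes = "Oracle\xc2\xae Database Administrator\xe2\x80\x99s Guide";

# Same conversion as in the article's programs.
sub asciiize {
    return $_[0] if ord($_[0]) < 128;       # ASCII: pass through
    return sprintf('&#x%04X;', ord($_[0])); # Non-ASCII: numeric reference
}

# Hypothetical helper: read the in-memory "file" using the given open mode,
# then convert the contents character by character.
sub convert {
    my ($mode) = @_;
    open(my $in, $mode, \$bytes) or die "Cannot open in-memory file: $!";
    my $text = do { local $/; <$in> };      # Slurp the whole file
    close($in);
    $text =~ s/(.)/asciiize($1)/eg;
    return $text;
}

my $as_bytes = convert('<');                # One character per byte
my $as_utf8  = convert('<:encoding(utf8)'); # Bytes decoded as UTF-8

print "$as_bytes\n";
print "$as_utf8\n";
```

The first line printed should contain five references (one per byte), and the second should contain two (one per character).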

This is all well and good when you know for sure that your files are in UTF-8. In the HTML you process, however, it's likely that some files will be in UTF-8 and some will be in ISO-8859-1. If you try to process an ISO-8859-1 file containing non-ASCII characters as UTF-8, Perl reports an error. Here's the result of processing an ISO-8859-1 file containing a registered trademark symbol (ae) as UTF-8:

Malformed UTF-8 character (unexpected continuation byte 0xae, with no preceding
start byte) in substitution iterator at deutf.pl line 14.
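
The reason for the complaint is that a lone ae byte is not well-formed UTF-8. The sketch below shows this with a strict decode from the Encode module. It is not part of the article's programs, and the name is_valid_utf8 is made up for the illustration. The exact error text will likely differ from the message above, because here the failure comes from Encode rather than from the substitution.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return 1 if the byte string is well-formed UTF-8,
# 0 otherwise. FB_CROAK makes decode() die on malformed input; LEAVE_SRC
# keeps decode() from modifying its input string.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

my $latin1 = "Oracle\xae Database";     # ISO-8859-1: lone byte ae
my $utf8   = "Oracle\xc2\xae Database"; # UTF-8: c2 ae

print "ISO-8859-1 bytes valid as UTF-8? ", is_valid_utf8($latin1), "\n"; # 0
print "UTF-8 bytes valid as UTF-8?      ", is_valid_utf8($utf8),   "\n"; # 1
```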

A simple way around this is to read the file as bytes and sample it for UTF-8 indicators: an encoding in the XML declaration or a charset in an HTML meta element. If either is found, decode the UTF-8 into Perl's internal, Unicode-friendly text representation. The decode function comes from the Encode module, which has been a standard module since Perl 5.8, so it's nearly assured to be part of any Perl 5.8 or later installation. Here's a program showing the strategy:

use Encode;

open(IN, "<$ARGV[0]") or die "$!"; # Input as default encoding
my $file = do { local $/; <IN> }; # Read file contents into scalar
close(IN);
# Look for a UTF-8 declaration in an XML declaration or an HTML meta element
if ($file =~ /<\?xml[^>]+encoding\s*=\s*['"]utf-?8/i ||
    $file =~ /<meta[^>]+charset\s*=\s*['"]?utf-?8/i) {
    $file = decode('utf8', $file); # Bytes to Perl characters
}

$file =~ s/(.)/asciiize($1)/eg; # Process by char

sub asciiize {
    return $_[0] if (ord($_[0]) < 128);     # ASCII
    return sprintf('&#x%04X;', ord($_[0])); # Non-ASCII
}

print $file;
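
The sampling step can be tried on its own with a few strings. In this sketch the function name declares_utf8 and the sample documents are hypothetical. The patterns are adapted from the program above, with the question mark in the XML declaration escaped and an optional quote allowed before the charset value.

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if the text declares UTF-8 in an XML
# declaration or in an HTML meta element, 0 otherwise.
sub declares_utf8 {
    my ($text) = @_;
    return 1 if $text =~ /<\?xml[^>]+encoding\s*=\s*['"]utf-?8/i;
    return 1 if $text =~ /<meta[^>]+charset\s*=\s*['"]?utf-?8/i;
    return 0;
}

# Made-up sample documents, for illustration only.
my $xhtml  = qq{<?xml version="1.0" encoding="UTF-8"?>\n<html></html>};
my $html   = qq{<meta http-equiv="Content-Type" content="text/html; charset=utf-8">};
my $latin1 = qq{<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">};

printf "%-7s %d\n", 'xhtml',  declares_utf8($xhtml);  # 1
printf "%-7s %d\n", 'html',   declares_utf8($html);   # 1
printf "%-7s %d\n", 'latin1', declares_utf8($latin1); # 0
```

A declaration is only a hint: a file can claim UTF-8 and still contain bytes in another encoding, so this check is a heuristic rather than a guarantee.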

When processing UTF-8 files with Perl, remember to:

Use Perl 5.8 or later.
When you know your input is in UTF-8, use the three-argument open function and identify the input encoding as 'utf8'.
When processing HTML or XHTML, sample the text for UTF-8 indicators, and decode the text as necessary.