does perl's \W not match accented characters in UTF-8 locales, or am I doing something wrong?

Asked by Peter Cordes on 2008-07-15

I'm running AMD64 Ubuntu Hardy, with locales 2.7.9-4. The locales on my system are the default English ones, plus
en_CA, fr_CA, and fr_CA.UTF-8. (added to /var/lib/locales/supported.d/local, and ran whatever to generate /usr/lib/locale/xx_YY)

 I'm trying to write a perl script that does some regex matching on UTF-8 (English and French) text. I'm finding that \w doesn't match accented vowels, even with LANG=fr_CA.utf8. (I can understand accented vowels not counting as English word characters, but they are definitely valid French word characters.) The fr_CA locale's LC_CTYPE definition seems to work, though, and somehow this hack actually works with UTF-8.

#!/usr/bin/perl -wCDS
# -CD: UTF-8 is the default PerlIO layer for input and output streams. -CS: STDIN/STDOUT/STDERR are UTF-8.

use locale;
use utf8;
use POSIX qw(locale_h);
#setlocale(LC_ALL, "fr_CA.ISO8859-1");
# we want accented chars to count as part of words,
# but Ubuntu's fr_CA.utf8 locale doesn't include accented characters as words.

my $eaccent = "é\n";
$eaccent =~ s/\w//;
print $eaccent;
print "word characters are: ", +(sort grep /\w/, map { chr } 0..255), "\n";

With setlocale commented, the script outputs
é (so the s/// didn't match it)
word characters are: _0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz

With setlocale uncommented:

word characters are: µ_0123456789AaÁáÀàÂâÅåÄäÃãªÆæBbCcÇçDdÐðEeÉéÈèÊêËëFfGgHhIiÍíÌìÎîÏïJjKkLlMmNnÑñOoÓóÒòÔôÖöÕõØøºPpQqRrSsßTtUuÚúÙùÛûÜüVvWwXxYyÝýÿZzÞþ

 With 0..65535 instead of 0..255, you get a whole lot of word characters either way, but I think é is still not part of it.
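(Editorial aside, not part of the original question: the per-character check can be reduced to a short standalone test. This sketch assumes perl 5.8 or later; it shows that once the string uses Perl's internal Unicode representation, \w matches e-acute with no locale involved at all.)

```perl
#!/usr/bin/perl -w
# Sketch: \w matches e-acute once the string is flagged as UTF-8
# internally -- no "use locale", no LANG setting needed.
use strict;
use charnames ':full';

my $e = "\N{LATIN SMALL LETTER E WITH ACUTE}";
utf8::upgrade($e);    # force wide-character (Unicode) match semantics
print $e =~ /\w/ ? "matches\n" : "no match\n";    # prints "matches"
```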

 I'm not familiar enough with how locales are supposed to work to report a bug right away, but it certainly seems to me that something's weird, either in the locale files or in perl. (I'm surprised setlocale works, too. Maybe I'm just lucky that EN and FR both fit in ISO 8859-1.)

 thanks.

Question information

Language:
English
Status:
Solved
For:
Ubuntu langpack-locales
Assignee:
No assignee
Solved by:
Jonathan Marsden
Solved:
2008-07-23
Last query:
2008-07-23
Last reply:
2008-07-22
Best answer: Jonathan Marsden (jmarsden) said : #1

I think you may be confused, and may be over-complicating your example?

If we remove the unnecessary complications and use charnames (so we can easily create an accented e without worrying about whether the script file itself is encoded as UTF-8 or ISO 8859-1), we get a test more like:

#!/usr/bin/perl -w
use charnames ':full';
my $eaccent = "\N{LATIN SMALL LETTER E WITH ACUTE} is an e acute\n";

print "e accent string before subst: " . $eaccent;
$eaccent =~ s/\w//;
print "e accent after subst: " . $eaccent;

This "just works" for me here, under Ubuntu 8.04 on x86, the script does the right thing and considers the e acute to be a part of a word. It does so whether I run it using

LANG="fr_CA.UTF-8" ./test2.pl

or just in my default (en_US.UTF-8) locale:

./test2.pl

I suggest carefully reading the man pages

  man perluniintro

and then

  man perlunicode

for getting up to speed on the current state of Perl Unicode support.

Jonathan

Peter Cordes (peter-cordes) said : #2

Thanks for the "use charnames" hint; that definitely removes a complication. You're right that my example was over-complicated! I got into all kinds of weird stuff trying to work around the \w problem, and I didn't think of removing "use locale" once I'd found out that I needed perl -CDS.

It seems that perl's \w works as long as I don't "use locale". (Without that, perl ignores LANG and LC_* environment variables anyway.) So maybe Ubuntu's locale defs really are broken for UTF-8 locales. tr -d '[:alnum:]' doesn't delete accented characters (at least not LATIN SMALL LETTER E WITH ACUTE). See below...

 BTW, what I'm really trying to do is read in UTF-8 text files (generated by pdftotext) and find strings in them. The patterns come from another UTF-8 text file, read via the <> default (ARGV) filehandle. I have a table of metadata about the pdfs, which I want to verify by searching the text of the pdfs for the listed title, authors, etc. They're in English and French, but we might at some point have some docs in Inuktitut, which can be written with the Latin alphabet or its own syllabary.

 Anyway, I need perl -CDS, but I guess not "use locale", since I don't actually need sorting where accented characters go with unaccented, or locale-sensitive string comparisons.
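(Editorial sketch of that setup, with made-up stand-ins for the real metadata and pdftotext output: once both the text and the pattern are decoded to Perl characters, plain string matching works with no locale at all.)

```perl
#!/usr/bin/perl -CDS
use strict;
use warnings;
use utf8;    # this source file itself is UTF-8

# Hypothetical stand-ins for pdftotext output and a metadata entry:
my $text  = "Résumé des auteurs : Ève Curie, Pierre Curie";
my $title = "Résumé des auteurs";

# index() counts characters, not bytes, once both strings are decoded
print index($text, $title) >= 0 ? "title found\n" : "title missing\n";
```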

 Your script also needs perl -CDS, or it prints the Latin1 encoding of LATIN SMALL LETTER E WITH ACUTE, like this: � is an e acute. (This looks like a question mark in reverse video.)

  So that actually solved my problem, but I think there's still a bug with locales. (My problem was solved by not using locales at all.)

 If I "use locale", \w doesn't match LATIN SMALL LETTER E WITH ACUTE except in non-UTF-8 locales. perl -CDS still prints out the UTF-8 encoding of the character, even with LANG=fr_CA and "use locale", whatever that means.

 It's not just perl: as I said, tr -d can use character sets.
To feed it some UTF-8 input, I used this:
perl -wCDS -e 'use charnames ":full";print "some characters: \N{LATIN SMALL LETTER E WITH ACUTE}e\N{LATIN SMALL LETTER E WITH GRAVE}\n";'
which prints
some characters: éeè

perl -wCDS ... | LANG=en_CA.utf8 tr -d '[:alnum:]' < utf8.txt
 : éè
(same result with LANG=en_US.UTF-8)

 in a LANG=fr_CA xterm (or en_CA), perl without -CDS:
perl -w ... | tr -d '[:alnum:]'
 :
 So it leaves only the spaces and the colon.

tr simply uses isalnum(3) to ask libc if a character is in [:alnum:]. (coreutils, tr.c, line 375: is_char_class()). So I'm almost certain this is a bug. I'm going to go ahead and file a bug report. Thanks for helping me figure out what was going on.
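(Editorial aside: for comparison, here is the same filter written in pure Perl without "use locale" -- a sketch, assuming a UTF-8-flagged string. A POSIX class inside a bracketed character class follows Unicode rules when the target string is flagged, so it strips the accented letters that byte-wise tr misses.)

```perl
#!/usr/bin/perl -CS
use strict;
use warnings;
use utf8;    # the literal below is decoded to characters

my $s = "some characters: éeè";
$s =~ s/[[:alnum:]]//g;    # deletes the accented letters too
print "$s\n";              # only the spaces and the colon remain
```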

Peter Cordes (peter-cordes) said : #3

Thanks Jonathan Marsden, that solved my question.

Peter Cordes (peter-cordes) said : #4

argh, just realized that tr is totally byte-oriented, as is isalnum(3). There's no possible way for it to work properly on multi-byte encodings. isalnum(3) is only defined for integer arguments "which must have the value of an unsigned char or EOF". So isalnum(3) can only hope to work for single-byte encodings like Latin1.

iswalnum(3) is the wide-char version, and might work.

 Perl's \w with "use locale" behaves exactly like tr -d '[:alnum:]', so maybe that's perl's problem, too. Maybe its locale support is based on narrow characters?

 I guess I don't know where the bug is anymore. I'll work on it, though.

Peter Cordes (peter-cordes) said : #5

glibc's iswalnum(3) works fine. Probably perl's "use locale" is just broken for UTF-8 locales.

 The following program filters exactly the way it should on UTF-8 input.
--- locale-test.c ---
#define _GNU_SOURCE
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

int main(void)
{
 setlocale (LC_ALL, "");
 wint_t c;  /* getwc() returns wint_t; wchar_t can't portably represent WEOF */
 while( WEOF != (c = getwc(stdin)) ){
  if(iswalnum(c) || c == btowc('\n') )
   putwc(c, stdout); // || die "maybe EILSEQ"
 }
 return 0;
}
---

Peter Cordes (peter-cordes) said : #6

Ah, reading perlunicode more closely, I see in the bugs section:
       Interaction with Locales

       Use of locales with Unicode data may lead to odd results. Currently, Perl attempts to attach 8-bit locale info to characters
       in the range 0..255, but this technique is demonstrably incorrect for locales that use characters above that range when mapped
       into Unicode. Perl’s Unicode support will also tend to run slower. Use of locales with Unicode is discouraged.

 Ok, so it's a known bug. damnit.

  Even without locales, it does say that \w and \W don't work inside [...].
But there are \p{foo} unicode character classes to play with.
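(Editorial sketch, not from the original post: a \p{} escape in the pattern forces Unicode rules for the match, and unlike locale-dependent \w it works inside bracketed classes.)

```perl
#!/usr/bin/perl -w
use strict;
use charnames ':full';

my $e = "\N{LATIN SMALL LETTER E WITH ACUTE}";
# \p{Word} inside [...] and \p{Alpha} both match e-acute,
# with no locale and no utf8::upgrade needed:
print $e =~ /[\p{Word}]/ ? "in [\\p{Word}]\n" : "not a word char\n";
print $e =~ /\p{Alpha}/  ? "alphabetic\n"     : "not alphabetic\n";
```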

 Ok, so that finally solves all the mysteries. Thanks again.