Is it possible to make lxml use hex instead of decimal for unicode entities?

Asked by usernamenumber

I am porting a perl/SAX tool to python/lxml. Ideally, given the same input, the new tool should produce the same output as the old tool. In fact, it introduces a number of problems for me if this is not the case. One annoying problem I am encountering is that SAX seems to store unicode entity IDs in hex, whereas lxml uses decimal, regardless of what value is used in the input:

>>> import lxml.etree as etree
>>> example_sax_output = "<foo>Copyright &#xA9; 2009 Foocorp, Inc</foo>" # Note: xA9
>>> e = etree.fromstring(example_sax_output)
>>> etree.tostring(e)
<foo>Copyright &#169; 2009 Foocorp, Inc</foo> # Note: 169

Is it possible to avoid this without doing something horribly kludgey like going through the output with a regex search and manually converting the values to hex?

Question information

Language: English
Status: Solved
For: lxml
Assignee: No assignee
Solved by: usernamenumber
scoder (scoder) said :
#1

usernamenumber wrote:
> I am porting a perl/SAX tool to python/lxml. Ideally, given the same
> input, the new tool should produce the same output as the old tool. In
> fact, it introduces a number of problems for me if this is not the case.

It's always bad style to make applications depend on a specific XML
serialisation done by a specific tool. That's exactly what canonical XML
(C14N) was designed for.

> One annoying problem I am encountering is that SAX seems to store unicode
> entity IDs in hex, whereas lxml uses decimal, regardless of what value is
> used in the input:
>
> >>> import lxml.etree as etree
> >>> example_sax_output = "<foo>Copyright &#xA9; 2009 Foocorp, Inc</foo>" # Note: xA9
> >>> e = etree.fromstring(example_sax_output)
> >>> etree.tostring(e)
> <foo>Copyright &#169; 2009 Foocorp, Inc</foo> # Note: 169
>
> Is it possible to avoid this without doing something horribly kludgey
> like going through the output with a regex search and manually
> converting the values to hex?

There isn't a straightforward way to do that. Decimal character references were
chosen for compatibility with ElementTree, which uses "xmlcharrefreplace".
However, if you have a bit of memory and do not care too much about raw
performance, you can do this:

    # Python 2.6
    from itertools import imap

    unicode_xml = etree.tostring(tree, encoding=unicode)
    # Keep ASCII as-is; write everything else as a hex character reference.
    bytes_xml = b''.join(chr(c) if c < 0x80 else b'&#x%X;' % c
                         for c in imap(ord, unicode_xml))

There's also a separate serialiser API in libxml2 that happens to output
hex entities. However, that's not used for backward compatibility reasons.
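
On Python 3 the same idea can be written without imap. The following is only a minimal sketch, not part of lxml: the helper name to_hex_charrefs is made up for illustration, and the standard library's xml.etree.ElementTree stands in for lxml.etree so that the snippet runs without extra packages. With lxml, etree.tostring(tree, encoding='unicode') should give the same kind of text string to start from.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree (assumption)


def to_hex_charrefs(root):
    """Serialise root to ASCII bytes, using hex references for non-ASCII."""
    # encoding="unicode" returns a text string with no character references.
    text = ET.tostring(root, encoding="unicode")
    # Keep ASCII characters; replace the rest with &#xHEX; references.
    escaped = "".join(ch if ord(ch) < 0x80 else "&#x%X;" % ord(ch)
                      for ch in text)
    return escaped.encode("ascii")


root = ET.fromstring("<foo>Copyright &#xA9; 2009 Foocorp, Inc</foo>")
hex_xml = to_hex_charrefs(root)
print(hex_xml)
```

Like the Python 2.6 version, this builds the whole serialised document in memory before escaping it, so it trades memory and speed for simplicity.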

Stefan

scoder (scoder) said :
#2

Stefan Behnel wrote:
> usernamenumber wrote:
>> I am porting a perl/SAX tool to python/lxml. Ideally, given the same
>> input, the new tool should produce the same output as the old tool. In
>> fact, it introduces a number of problems for me if this is not the case.
>
> It's always bad style to make applications depend on a specific XML
> serialisation done by a specific tool. That's exactly what canonical XML
> (C14N) was designed for.

And, as a matter of fact, C14N uses hex charrefs:

http://www.w3.org/TR/xml-c14n.html#Example-Chars

So maybe you should take a look at that.

http://codespeak.net/lxml/api.html#write-c14n-on-elementtree

Stefan

usernamenumber (usernamenumber) said :
#3

Thanks very much for the assistance, Stefan. You are a great help! As it turns out, write_c14n() actually uses yet another method of rendering entities (\xc2\xa9 as opposed to &#xA9;), so it looks like I may have to just suck it up and deal with the output of my port being slightly different from that of the original tool (I don't think it's worth the extra processing to translate all the entities after the fact). But being able to write out to C14N (which I hadn't known about before now) might at least help me avoid this problem in the future.

I do have one other question coming from this: I can find functions for writing out c14n content for ElementTree objects, but nothing for rendering an Element (the result of etree.fromstring(), for example) in this way. Am I missing something, or if I am working with a string do I just need to load it into a StringIO and run etree.parse() on it?

scoder (scoder) said :
#4

usernamenumber wrote:
> Thanks very much for the assistance, Stefan. You are a great help! As it
> turns out, write_c14n() actually uses yet another method of rendering
> entities, (\xc2\xa9 as opposed to &#xA9;)

Ah, right. Sure, it serialises to UTF-8, which doesn't require Unicode
character escaping. What you see is just what the Python prompt makes of
the byte series on output.
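
To illustrate the point about UTF-8: the two bytes \xc2\xa9 are the UTF-8 encoding of the copyright sign (U+00A9), and the backslash escapes only appear because the interpreter shows the repr of the byte string. A small Python 3 sketch using only built-ins (the variable names are just for illustration):

```python
copyright_sign = "\u00a9"  # the character behind &#xA9; and &#169;
utf8_bytes = copyright_sign.encode("utf-8")

# repr() is what the interactive prompt displays for a bytes object.
shown_at_prompt = repr(utf8_bytes)
print(shown_at_prompt)

# Decoding the bytes gives back the single original character.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == copyright_sign)
```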

> so it looks like I may have
> to just suck it up and deal with the output of my port being slightly
> different from that of the original tool (I don't think it's worth the
> extra processing to translate all the entities after the fact). But
> being able to write out to C14N (which I hadn't known about before now),
> might at least be able to avoid this problem in the future.

Wise choice.

> I do have one other question coming from this: I can find functions for
> writing out c14n content for ElementTree objects, but nothing for
> rendering an Element (the result of etree.fromstring(), for example) in
> this way. Am I missing something, or if I am working with a string do I
> just need to load it into a StringIO and run etree.parse() on it?

You can get an ElementTree either by calling parse() or by wrapping an
Element in it, i.e.

    tree = etree.ElementTree(root_node)

http://codespeak.net/lxml/tutorial.html#the-elementtree-class
http://effbot.org/zone/element.htm#reading-and-writing-xml-files
http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.ElementTree-class
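
A slightly fuller sketch of the wrapping approach, assuming Python 3. It uses the standard library's xml.etree.ElementTree in place of lxml.etree so that it runs without lxml installed; the ElementTree(element) constructor is the same call as above. The plain write() call here is only a stand-in for the lxml-specific write_c14n().

```python
import io
import xml.etree.ElementTree as ET  # stand-in for lxml.etree (assumption)

# fromstring() returns an Element, not an ElementTree.
root_node = ET.fromstring("<foo>Copyright &#xA9; 2009 Foocorp, Inc</foo>")

# Wrap the Element to get the tree-level API (write() and friends).
tree = ET.ElementTree(root_node)

# Serialise to an in-memory file as UTF-8; no character references needed.
buf = io.BytesIO()
tree.write(buf, encoding="utf-8")
print(buf.getvalue())
```

No StringIO round trip through parse() is needed: the already-parsed Element is reused as the root of the new tree.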

Stefan

usernamenumber (usernamenumber) said :
#5

Huh. Perhaps I'm misunderstanding you, but I was trying to do this, which fails...

>>> s = '<foo>Copyright \xc2\xa9 2009 Foocorp, Inc</foo>'
>>> etree.parse(etree.fromstring(s))
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "lxml.etree.pyx", line 2528, in lxml.etree.parse (src/lxml/lxml.etree.c:24676)
  File "parser.pxi", line 1339, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:57396)
TypeError: cannot parse from 'lxml.etree._Element'

What do you mean by a node, if not an etree._Element, as returned by fromstring()? In any case, I found something else that worked, as follows:

>>> etree.fromstring(s).getroottree().write_c14n(fh)

So unless that looks insane to you, I think I'm set. Thanks again! I've gone through a bunch of Python XML parsers, and lxml is miles beyond the rest!

usernamenumber (usernamenumber) said :
#6

D'oh! Never mind. I misread your instructions and didn't notice that you were actually suggesting something different from what I was doing:

>>> etree.ElementTree(etree.fromstring(s))
<lxml.etree._ElementTree object at 0xb770cdcc>

...which of course works.

Thanks!