lxml - the Python XML toolkit

How to set the libxml2 XML_PARSE_HUGE option in lxml?

Asked by bol on 2009-03-27

In the libxml2 changelog:

Daniel Veillard Sun Jan 18 15:06:05 CET

    * include/libxml/parserInternals.h SAX2.c: add a new define XML_MAX_TEXT_LENGTH limiting the maximum size of a single text node, the default is 10MB and can be removed with the HUGE parsing option

So how can I set this HUGE option in lxml?
At the moment I use a version of libxml2 that I modified myself (with XML_MAX_TEXT_LENGTH set to 100MB), which solves my problem. But I hope to find an lxml way to solve this.

Thanks in advance.

Question information

Language: English
Status: Answered
For: lxml
Assignee: No assignee
Last query: 2009-03-27
Last reply: 2009-03-27

scoder (scoder) said : #1

Hi,

forwarding this to the mailing list where it belongs.

bol wrote:
> In the libxml2 changelog:
>
> Daniel Veillard Sun Jan 18 15:06:05 CET
>
> * include/libxml/parserInternals.h SAX2.c: add a new define
> XML_MAX_TEXT_LENGTH limiting the maximum size of a single text node,
> the default is 10MB and can be removed with the HUGE parsing option

Yes, this was changed in libxml2 2.7.x.

> So how can I set this HUGE option in lxml?

You currently can't, and I wonder if this should really be an option and
what the default should be here.

> At the moment I use a version of libxml2 that I modified myself (with
> XML_MAX_TEXT_LENGTH set to 100MB), which solves my problem. But I hope to
> find an lxml way to solve this.

Could you say something about your use case?

Stefan

bol (bolsog-users) said : #2

Hi Stefan,

We are working with large text corpora (bigger than 10 MB).
lxml is used to split these corpus XML files and to pass the pieces via sockets to old binaries that do not handle XML, e.g. for POS tagging or tokenizing.

As in libxml2, the XML_PARSE_HUGE option should be off by default.
But being able to enable it would help us.
If this option were configurable, we could use the Red Hat libxml2 package as is, without having to change one character in the code and rebuild the RPM after each libxml2 update.

Greetings from Stuttgart,

Andreas

scoder (scoder) said : #3

bol wrote:
> We are working with large text corpora (bigger than 10 MB).
> lxml is used to split these corpus XML files and to pass the pieces via
> sockets to old binaries that do not handle XML, e.g. for POS tagging or
> tokenizing.

Ok, that gives you a) the bit of structure that you need and b) safe and
portable encoding support (which I assume is critical here), so that's
fine with me. After all, XML is used for all sorts of things these days...

> As in libxml2, the XML_PARSE_HUGE option should be off by default.

That's what I was wondering about. The HUGE behaviour is (sort of) on by
default if you use libxml2 2.6.x and 2.7.[012], but it's supposed to be off
by default if you use libxml2 2.7.3 and later. That's outside of the
control of lxml. So you would get one behaviour on one system and a
different behaviour on another system, even with the same version of lxml.

However, the restrictions are meant as a security measure to prevent traps
like the billion laughs attack. Therefore, I do understand that a) most
people won't notice them and b) having them enabled by default seems like
the right setting.

Is there any opposition to keeping the enforced parser restrictions
(limited tree depth and text node length) enabled by default in newer
libxml2 versions, and to providing a parser switch for disabling them? The
alternative would be to disable them by default on all libxml2 versions,
and to provide a switch that enables them if libxml2 supports it. But a
safe default sounds a lot better.

Stefan
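
Note for later readers: a parser switch along the lines proposed above is available in later lxml releases as the huge_tree keyword of etree.XMLParser. The following is a minimal sketch, assuming an lxml version that supports huge_tree and a libxml2 build that enforces the limits. The helper name parse_huge and the sample document are made up for illustration and are not part of lxml.

```python
from lxml import etree


def parse_huge(xml_bytes):
    """Parse XML with libxml2's text-length and tree-depth limits lifted.

    parse_huge is a hypothetical helper name; huge_tree is the lxml
    parser keyword that corresponds to libxml2's XML_PARSE_HUGE option.
    """
    parser = etree.XMLParser(huge_tree=True)
    return etree.fromstring(xml_bytes, parser)


if __name__ == "__main__":
    # The libxml2 version lxml runs against determines whether the
    # limits exist at all, so it is worth checking on each system.
    print("libxml2 version:", etree.LIBXML_VERSION)

    # Made-up sample: a single text node above the 10MB default
    # mentioned in the changelog entry.
    size = 11 * 1000 * 1000
    root = parse_huge(b"<corpus><text>" + b"a" * size + b"</text></corpus>")
    print(len(root.find("text").text))
```

Since the limits exist as a safeguard against attacks such as billion laughs, it seems sensible to use such a parser only for input you trust, and to keep the default parser for everything else.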
