Stream encodings
The streams passed to Parser are expected to be bytes-oriented. Decoding is performed according to the XML specification, i.e. based on the encoding attribute of the XML declaration:
<?xml version='1.0' encoding='ISO-8859-1'?>
Note
The XML declaration is optional for UTF-8 and UTF-16 encodings.
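To see the declaration drive decoding in isolation, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in parser. This is an assumption made purely for illustration, not how Parser is meant to be used, and the helper name root_text is hypothetical.

```python
import xml.etree.ElementTree as ET


def root_text(document: bytes) -> str:
    """Parse a bytes document and return the text of its root element."""
    return ET.fromstring(document).text


text = "àéïôù"

# ISO-8859-1 bytes: the declaration tells the parser how to decode them.
latin1_document = (
    b"<?xml version='1.0' encoding='ISO-8859-1'?>"
    + b"<root>" + text.encode("ISO-8859-1") + b"</root>"
)

# UTF-8 bytes: no declaration is needed, UTF-8 is assumed by default.
utf8_document = b"<root>" + text.encode("UTF-8") + b"</root>"

print(root_text(latin1_document) == text)
print(root_text(utf8_document) == text)
```

Both documents carry different bytes for the same characters, and both decode to the same text because the parser picks the codec from the declaration or, when it is absent, falls back to the default.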
Wrong encoding
Sometimes, the encoding of the stream to parse does not match the one specified in the XML declaration.
>>> from bigxml import Parser, xml_handle_element
>>> @xml_handle_element("root")
... def handler(node):
...     yield node.text
>>> stream_bytes = b"<root>\xe0\xe9\xef\xf4\xf9</root>" # ISO-8859-1
>>> Parser(stream_bytes).return_from(handler)
Traceback (most recent call last):
...
bigxml.exceptions.BigXmlError: Not well-formed (invalid token)...
If you know that there is no XML declaration, you can add one before the stream:
>>> Parser(
...     "<?xml version='1.0' encoding='ISO-8859-1'?>".encode("ISO-8859-1"),
...     stream_bytes,
... ).return_from(handler)
'àéïôù'
But if the stream already contains an XML declaration, you will need to transcode the stream yourself so that its bytes match the declared encoding. For bytes instances, decode then encode:
>>> stream_bytes_with_xml_declaration = (
...     b"<?xml version='1.0' encoding='UTF-8'?>"  # wrong encoding specified
...     b"<root>\xe0\xe9\xef\xf4\xf9</root>"  # ISO-8859-1
... )
>>> Parser(
...     stream_bytes_with_xml_declaration.decode("ISO-8859-1").encode("UTF-8"),
... ).return_from(handler)
'àéïôù'
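The decode-then-encode step can be wrapped in a small helper. This is only a sketch: the function name transcode and its parameters are hypothetical, not part of bigxml, and it assumes the whole document fits in memory.

```python
def transcode(data: bytes, actual_encoding: str, declared_encoding: str) -> bytes:
    """Re-encode data so its bytes match the encoding its XML declaration claims."""
    return data.decode(actual_encoding).encode(declared_encoding)


document = (
    b"<?xml version='1.0' encoding='UTF-8'?>"  # declares UTF-8...
    b"<root>\xe0\xe9\xef\xf4\xf9</root>"  # ...but the text is ISO-8859-1
)

# The raw bytes are not valid UTF-8, which is why parsing them fails.
try:
    document.decode("UTF-8")
    raw_is_valid_utf8 = True
except UnicodeDecodeError:
    raw_is_valid_utf8 = False

# After transcoding, the bytes agree with the declaration.
fixed = transcode(document, "ISO-8859-1", "UTF-8")
fixed_text = fixed.decode("UTF-8")  # now decodes cleanly
```

The transcoded bytes can then be passed to Parser in place of the original ones.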
For file-like objects, use codecs.EncodedFile:
>>> import io
>>> stream_file = io.BytesIO(stream_bytes_with_xml_declaration)
>>> import codecs
>>> Parser(
...     codecs.EncodedFile(stream_file, "UTF-8", "ISO-8859-1"),
... ).return_from(handler)
'àéïôù'
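The argument order of codecs.EncodedFile is easy to get backwards: the second argument is the encoding of the data you read out, the third is the encoding of the underlying file. A minimal standard-library sketch, reading in small chunks to show that transcoding happens incrementally (the chunk size of 8 is arbitrary and the variable names are illustrative):

```python
import codecs
import io

latin1_bytes = "<root>àéïôù</root>".encode("ISO-8859-1")
source = io.BytesIO(latin1_bytes)

# data_encoding="UTF-8": what read() returns
# file_encoding="ISO-8859-1": what the wrapped file actually contains
wrapped = codecs.EncodedFile(source, "UTF-8", "ISO-8859-1")

chunks = []
while True:
    chunk = wrapped.read(8)  # read a small piece at a time
    if not chunk:
        break
    chunks.append(chunk)

utf8_bytes = b"".join(chunks)
```

Because the wrapper transcodes as it is read, the whole file never needs to be held in memory, which is the reason to prefer it over decode-then-encode for large inputs.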