BigXML
Introduction
Parsing big XML files in Python is hard. On the one hand, regular XML libraries load the
whole file into memory, which will crash the process if the file is too big. On the other
hand, solutions such as iterparse do read the file as they parse it, but they are complex
to use if you don't want to run out of memory.
This is where the BigXML library shines:
- Works with XML files of any size
- No need to do memory management yourself
- Pythonic API (using decorators similar to what Flask does)
- Any stream can easily be parsed, not just files (e.g. usage with Requests)
- Secure against the usual attacks on XML parsers
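For comparison, here is a minimal sketch of the manual memory management that iterparse from the standard library typically requires. The function name iter_item_texts and the sample document are made up for illustration; the point is only that the caller has to clear already-processed elements by hand.

```python
import io
import xml.etree.ElementTree as ET


def iter_item_texts(stream, tag="item"):
    """Yield the text of each <tag> element while keeping memory bounded."""
    context = ET.iterparse(stream, events=("start", "end"))
    # The first event is the start of the root element; keep a reference to it.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem.text
            # Without this call, every parsed element stays attached to the
            # root and memory usage grows with the size of the file.
            root.clear()


# Hypothetical sample document, small enough to read at a glance.
sample = io.BytesIO(b"<root><item>a</item><item>b</item></root>")
texts = list(iter_item_texts(sample))
print(texts)
```

Forgetting the root.clear() call still gives correct results on small files, which is what makes this kind of bug easy to miss until a large file arrives.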
Philosophy
Because it needs to be able to handle big files, BigXML parses the input streams in one pass. This means that once an XML element has been seen, you cannot go back to it. In other words, all computations for a node need to be performed when it is encountered.
This library borrows ideas from event-based programming. Conceptually, you define handlers that react to XML elements with specific names. BigXML then dispatches the nodes of the stream being parsed to the matching handlers.
As the XML document is parsed, handlers of deeper nodes may yield some piece of information that will be gathered by parent handlers. At the end of the day, this produces a single iterable that will be handled by your application.
Tip
Think big and never go backward, or you will get an exception.
Installation
Install BigXML with pip:
$ python -m pip install bigxml
Imports
The most used imports are the following:
from bigxml import Parser, xml_handle_element, xml_handle_text
If you want to catch exceptions raised by this module:
from bigxml import BigXmlError
For type hints, you may also import:
from bigxml import HandlerTypeHelper, Streamable, XMLElement, XMLElementAttributes, XMLText
Warning
Always import directly from bigxml. Importing from submodules is unsupported.