BigXML

Parse big XML files and streams with ease

Introduction

Parsing big XML files in Python is hard. On the one hand, regular XML libraries load the whole file into memory, which will crash the process if the file is too big. On the other hand, solutions such as iterparse read the file incrementally as they parse it, but you must release already-processed elements yourself to avoid running out of memory, which makes them complex to use.

This is where the BigXML library shines:

  • Works with XML files of any size
  • No need to do memory management yourself
  • Pythonic API (using decorators similar to what Flask does)
  • Any stream can easily be parsed, not just files (e.g. usage with Requests)
  • Secure against the usual attacks on XML parsers
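
To make the memory-management point above concrete, here is a minimal sketch of the manual bookkeeping that the standard library's iterparse requires. It does not use BigXML at all. The sample document and the function name iter_child_texts are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_child_texts(stream):
    """Yield the text of each <child> element while keeping memory bounded."""
    # Request "start" events too, so the root element can be captured first.
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "child":
            yield elem.text
            # Manual cleanup: without this call, every parsed <child>
            # stays attached to the root and memory grows with the file.
            root.clear()


# Hypothetical sample input, small enough to inspect by hand
xml_bytes = b"<root><child>Hello</child><child>World</child></root>"
texts = list(iter_child_texts(io.BytesIO(xml_bytes)))
```

Forgetting the cleanup call does not raise an error; the program simply uses more and more memory, which is why this approach is easy to get wrong on large inputs.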

Philosophy

Because it needs to be able to handle big files, BigXML parses the input streams in one pass. This means that once an XML element has been seen, you cannot go back to it. In other words, all computations for a node need to be performed when it is encountered.

This library borrows ideas from event-based programming. Conceptually, you define handlers that react to XML elements with specific names. BigXML then dispatches the nodes of the stream being parsed to the matching handlers.

As the XML document is parsed, handlers of deeper nodes may yield pieces of information that are gathered by parent handlers. In the end, this produces a single iterable for your application to consume.

Tip

Think big and never go backward, or you will get an exception.

Installation

Install BigXML with pip:

$ python -m pip install bigxml

Imports

The most commonly used imports are:

from bigxml import Parser, xml_handle_element, xml_handle_text

If you want to catch exceptions raised by this module:

from bigxml import BigXmlError

For type hints, you may also import:

from bigxml import HandlerTypeHelper, Streamable, XMLElement, XMLElementAttributes, XMLText

Warning

Always import directly from bigxml. Importing from submodules is unsupported.