Recipes
Working with Requests
Requests is a library that simplifies making HTTP requests.
It is easy to stream an XML response from a server and parse it on the fly:
- When creating the request, specify
stream=True
so that the content is downloaded as needed if the server supports it; - Instead of getting the content from the response at once, use
Response.iter_content
to iterate over the response data in chunks.
>>> @xml_handle_element("root", "item")
... def handler(node):
... yield node.text
>>> response = requests.get("https://example.com/placeholder.xml", stream=True)
>>> parser = Parser(response.iter_content(None))
>>> for item in parser.iter_from(handler):
... print(item)
This example shows parsing in streaming with Requests.
It works quite well!
Note
The None
argument passed to Response.iter_content
asks to get the data as it is
received without buffering into chunks of a specific size. This allows to get
results as soon as possible.
Dataclasses
Although not mandatory, using a dataclass may seem natural to hold the parsed data:
<users>
<user id="13">
<firstname>Alice</firstname>
<lastname>Cooper</lastname>
</user>
<user id="37">
<firstname>Bob</firstname>
<lastname>Marley</lastname>
</user>
<user id="42">
<firstname>Carol</firstname>
</user>
</users>
>>> from dataclasses import dataclass
>>> @xml_handle_element("users", "user")
... @dataclass
... class User:
... firstname: str = 'N/A'
... lastname: str = 'N/A'
...
... @xml_handle_element("firstname")
... def handle_firstname(self, node):
... self.firstname = node.text
...
... @xml_handle_element("lastname")
... def handle_lastname(self, node):
... self.lastname = node.text
>>> with open("users.xml", "rb") as stream:
... for user in Parser(stream).iter_from(User):
... print(user)
User(firstname='Alice', lastname='Cooper')
User(firstname='Bob', lastname='Marley')
User(firstname='Carol', lastname='N/A')
Warning
All fields of the dataclass must have a default value.
If you need to get information from the node handled by the dataclass, you can do so in
the __post_init__
method:
>>> from dataclasses import dataclass, InitVar
>>> @xml_handle_element("users", "user")
... @dataclass
... class User:
... node: InitVar
... id: int = 0
... firstname: str = 'N/A'
... lastname: str = 'N/A'
...
... def __post_init__(self, node):
... self.id = int(node.attributes['id'])
...
... @xml_handle_element("firstname")
... def handle_firstname(self, node):
... self.firstname = node.text
...
... @xml_handle_element("lastname")
... def handle_lastname(self, node):
... self.lastname = node.text
>>> with open("users.xml", "rb") as stream:
... for user in Parser(stream).iter_from(User):
... print(user)
User(id=13, firstname='Alice', lastname='Cooper')
User(id=37, firstname='Bob', lastname='Marley')
User(id=42, firstname='Carol', lastname='N/A')
Warning
The node
attribute is an InitVar
, so that it is passed to __post_init__
but
not stored in class attributes. It must be the only mandatory field, since the class
is automatically instantiated with only one argument (the node). For more details,
see class handlers.
Yielding data in a class __init__
If you use a class handler, you may want to yield some data when
the class starts or ends to parse nodes. Of course, it is not possible to use the
yield
keyword in __init__
:
>>> @xml_handle_element("root", "cart")
... class Cart:
... def __init__(self, node):
... yield f"START cart parsing for user {node.attributes['user']}"
>>> with open("carts.xml", "rb") as stream:
... for item in Parser(stream).iter_from(Cart):
... print(item)
Traceback (most recent call last):
...
TypeError: __init__() should return None...
Instead, you can define a custom xml_handler
method:
>>> @xml_handle_element("root", "cart")
... class Cart:
... def __init__(self, node):
... self.user = node.attributes["user"]
...
... @xml_handle_element("product")
... def handle_product(self, node):
... yield f"product: {node.text}"
...
... def xml_handler(self, items):
... yield f"START cart parsing for user {self.user}"
... yield from items
... yield f"END cart parsing for user {self.user}"
>>> with open("carts.xml", "rb") as stream:
... for item in Parser(stream).iter_from(Cart):
... print(item)
START cart parsing for user Alice
product: 9781846975769
product: 9780008322052
END cart parsing for user Alice
START cart parsing for user Bob
product: 9780008117498
product: 9780340960196
product: 9780099580485
END cart parsing for user Bob
Streams without root
In some cases, you may be parsing a stream of XML elements that follow each other without starting with a common root.
For example, let's say a software outputs the following log file:
<log level="WARN">Main reactor overheat</log>
<log level="INFO">Starting emergency coolers</log>
<log level="DEBUG">Cooler 4 is online</log>
<log level="DEBUG">Cooler 2 is online</log>
<log level="INFO">Main reactor temperature is back to an acceptable level</log>
We can just wrap the stream in an XML element:
>>> @xml_handle_element("root", "log")
... def handler(node):
... yield f"{node.attributes['level']:5} {node.text}"
>>> with open("log.xml", "rb") as stream:
... parser = Parser(b"<root>", stream, b"</root>")
... for item in parser.iter_from(handler):
... print(item)
WARN Main reactor overheat
INFO Starting emergency coolers
DEBUG Cooler 4 is online
DEBUG Cooler 2 is online
INFO Main reactor temperature is back to an acceptable level
Infinite streams
Infinite streams are supported through file-like objects and iterables.
Here is an example using an infinite generator function as a stream:
>>> def collatz_generator(value):
... yield b"<root>"
... while True:
... yield b"<item>%d</item>" % value
... if value % 2:
... value = 3 * value + 1
... else:
... value //= 2
>>> stream = collatz_generator(42)
>>> @xml_handle_element("root", "item")
... def handler(node):
... yield int(node.text)
>>> items = Parser(stream).iter_from(handler)
>>> for _ in range(12):
... print(next(items))
42
21
64
32
16
8
4
2
1
4
2
1