Recently I’ve been working on processing an XML dump from Wikipedia. In just one file, we have every single Wikipedia article from the English version of the site. I’ve learned how to parse large XML files - specifically files so large that they don’t fit in memory.
Sample Code
Throughout this blog post, I refer to a file called colors.xml. Here’s the contents of that file, if you’d like to follow along.
<colors>
<color>red</color>
<color>green</color>
</colors>
Parse
Python includes a built-in tool to parse XML files. With the parse function, we can load an XML file into memory, and process it from there.
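from xml.etree import ElementTree
# Regular parsing: the whole document is loaded into memory.
tree = ElementTree.parse("colors.xml")
root = tree.getroot()
# Iterating over the root yields its direct children (the color elements).
for node in root:
    print(node)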
The key feature here is that parse works by reading the entire file into memory. As long as we have enough memory to hold the data, this method works fine. But it may well be the case that we don’t have enough memory.
If so, we’ll have to change our approach. Instead of loading the entire file at once, we’ll load it in bite-sized pieces, and process each piece as we go.
Iterparse
Python provides this functionality with the iterparse function.
from xml.etree import ElementTree
# Ask for both start and end events (only end events are reported by default).
context = ElementTree.iterparse("colors.xml", events=["start", "end"])
for event in context:
    print(event)
Iterparse works by going through the XML document one element at a time, emitting information as it goes. Each emission contains either a start event or an end event, accompanied by the XML element that was just processed.
This information is useful because it lets us respond to events for specific XML elements. We’ll come back to that later.
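As a quick preview, here’s a minimal sketch of responding to events for one specific element. The helper name color_names and the in-memory XML are my own stand-ins for illustration (they mirror colors.xml). It only looks at end events, because an element’s text is only guaranteed to be available once its closing tag has been seen.

```python
import io
from xml.etree import ElementTree

# In-memory stand-in for colors.xml, so the sketch runs on its own.
XML = b"<colors><color>red</color><color>green</color></colors>"


def color_names(source):
    """Collect the text of every <color> element (hypothetical helper)."""
    names = []
    # Only 'end' events are requested: by then the element is complete.
    for event_type, el in ElementTree.iterparse(source, events=["end"]):
        if el.tag == "color":
            names.append(el.text)
    return names


print(color_names(io.BytesIO(XML)))
```

Passing a filename such as "colors.xml" instead of the BytesIO object should work the same way, since iterparse accepts either.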
Iterparse Events
Here’s an example of a start event.
('start', <Element 'colors' at 0x01E2EF00>)
Iterparse emits a start event when it has processed the opening tag of an element. Then, iterparse goes on to process any child elements that element may have, further emitting start events for those.
When iterparse processes an element’s closing tag, it has processed that element in its entirety. At that point, iterparse emits an end event.
With all these start and end events, we eventually build up a sequence of tuples.
You can see the correspondence between the original XML and the generated sequence below.
Original XML
<colors>
<color>red</color>
<color>green</color>
</colors>
Iterparse Output:
('start', <Element 'colors' at 0x01E2EF00>)
('start', <Element 'color' at 0x01E2EF60>) # Red - Start
('end', <Element 'color' at 0x01E2EF60>) # Red - End
('start', <Element 'color' at 0x01E2EF90>) # Green - Start
('end', <Element 'color' at 0x01E2EF90>) # Green - End
('end', <Element 'colors' at 0x01E2EF00>)
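If you’d like to check the ordering yourself without the memory addresses getting in the way, here’s a small sketch that reduces each event to an (event, tag) pair. The function name event_sequence and the in-memory XML are assumptions of mine for illustration, not part of the standard library.

```python
import io
from xml.etree import ElementTree

# In-memory stand-in for colors.xml, so the sketch runs on its own.
XML = b"<colors><color>red</color><color>green</color></colors>"


def event_sequence(source):
    """Return (event_type, tag) pairs in the order iterparse emits them."""
    context = ElementTree.iterparse(source, events=["start", "end"])
    return [(event_type, el.tag) for event_type, el in context]


for pair in event_sequence(io.BytesIO(XML)):
    print(pair)
```

The pairs come out in the same nesting order as the listing above: the colors start event first, each color start followed by its end, and the colors end event last.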
Saving Memory
The whole reason we used iterparse was to reduce memory usage. But it’s not magic. We actually have to tell iterparse when to remove XML data from memory. We can do that by calling the clear method provided by each Element emitted by iterparse, after we are done processing that element.
for event in context:
    # Each event is an (event_type, element) tuple.
    event_type, el = event
    if event_type == "end" and el.tag == "color":
        # Finished processing the 'color' element.
        # Remove its contents from memory.
        el.clear()
It is important to note that iterparse still builds the entire XML document in memory as it iterates through the XML file. If you never call clear, you won’t save any memory, and you’ll essentially get the same result as calling parse directly.
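One caveat is worth a sketch. Calling clear on an element empties it, but the emptied element itself stays attached to its parent, so a file with millions of elements still leaves millions of empty shells under the root. A commonly used workaround is to grab the root from the first start event and clear it as well. This goes a step beyond the approach described above, and it assumes you no longer need anything that came earlier under the root. The generator name iter_colors and the in-memory XML are hypothetical stand-ins.

```python
import io
from xml.etree import ElementTree

# In-memory stand-in for colors.xml, so the sketch runs on its own.
XML = b"<colors><color>red</color><color>green</color></colors>"


def iter_colors(source):
    """Yield the text of each <color>, releasing memory along the way."""
    context = ElementTree.iterparse(source, events=["start", "end"])
    # The very first event is the 'start' of the root element.
    _, root = next(context)
    for event_type, el in context:
        if event_type == "end" and el.tag == "color":
            yield el.text
            # Remove the element's own contents from memory.
            el.clear()
            # Assumption: earlier siblings are no longer needed, so also
            # drop the emptied children that the root is still holding.
            root.clear()


for name in iter_colors(io.BytesIO(XML)):
    print(name)
```

Writing it as a generator keeps the processing code separate from the cleanup, so the caller never has to remember to call clear.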
Conclusion
Tackling large files with iterparse really just comes down to two key steps - process elements one at a time, and clear elements from memory when they’re no longer needed.
That’s all I have for today. Thanks for reading!