Recently I’ve been working on processing an XML dump from Wikipedia. In just one file, we have every single Wikipedia article from the English version of the site. I’ve learned how to parse large XML files - specifically files so large that they don’t fit in memory.

Sample Code

Throughout this blog post, I refer to a file called colors.xml. Here’s the contents of that file, if you’d like to follow along.

<colors>
    <color>red</color>
    <color>green</color>
</colors>
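
If you'd rather not create the file by hand, here's a small snippet that writes it out for you. The filename colors.xml is the same one used in the rest of the examples.

# Write the sample file to disk so the examples below have something to read.
sample = """<colors>
    <color>red</color>
    <color>green</color>
</colors>
"""

with open("colors.xml", "w") as f:
    f.write(sample)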

Parse

Python includes a built-in tool to parse XML files. With the parse function, we can load an XML file into memory, and process it from there.

from xml.etree import ElementTree

# Regular Parsing
tree = ElementTree.parse("colors.xml")
root = tree.getroot()
for node in root:
    print(node)

The key feature here is that parse works by reading the entire file into memory. As long as we have enough memory to hold the data, this method works fine. But we may not have enough memory.

If so, we’ll have to change our approach. Instead of loading the entire file at once, we’ll load it in bite-sized pieces, and process each piece as we go.

Iterparse

Python provides this functionality with the iterparse function.

from xml.etree import ElementTree

context = ElementTree.iterparse("colors.xml", events=["start", "end"])
for event in context:
    print(event)

Iterparse works by going through the XML document one element at a time, emitting information as it goes. Each emission is a tuple containing either a start event or an end event, along with the corresponding XML element.

This information is helpful because it helps us respond to events for specific XML elements. We’ll come back to that later.
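
In the meantime, since each emission is just a tuple of the event name and the element, we can unpack it directly in the loop and look at the element's tag. Here's a small sketch of that, building on the iterparse call above.

from xml.etree import ElementTree

context = ElementTree.iterparse("colors.xml", events=["start", "end"])
for event_type, el in context:
    # Each emission is a (event, element) tuple.
    print(event_type, el.tag)

Against colors.xml, this prints a start and an end line for each color element, plus one of each for the enclosing colors element.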

Iterparse Events

Here’s an example of a start event.

('start', <Element 'colors' at 0x01E2EF00>)

Iterparse emits a start event when it has processed the opening tag of an element. Then, iterparse goes on to process any child elements that element may have, further emitting start events for those.

When iterparse processes an element's closing tag, it has processed that element in its entirety. At that point, iterparse emits an end event.

With all these start and end events, we eventually build up a sequence of tuples.

You can see the correspondence between the original XML and the generated sequence below.

Original XML:

<colors>
    <color>red</color>
    <color>green</color>
</colors>

Iterparse Output:

('start', <Element 'colors' at 0x01E2EF00>)
('start', <Element 'color' at 0x01E2EF60>) # Red - Start
('end', <Element 'color' at 0x01E2EF60>) # Red - End
('start', <Element 'color' at 0x01E2EF90>) # Green - Start
('end', <Element 'color' at 0x01E2EF90>) # Green - End
('end', <Element 'colors' at 0x01E2EF00>)

Saving Memory

The whole reason we used iterparse was to reduce memory usage. But it’s not magic. We actually have to tell iterparse when to remove XML data from memory. We can do that by calling the clear method provided by each Element emitted by iterparse, after we are done processing that element.

from xml.etree import ElementTree

context = ElementTree.iterparse("colors.xml", events=["start", "end"])
for event_type, el in context:
    if event_type == "end" and el.tag == "color":
        # Finished processing the 'color' element.
        # Remove its contents from memory.
        el.clear()

It is important to note that iterparse still builds up the XML tree in memory as it works through the file. If you never call clear, you won't save any memory, and you'll end up with essentially the same result as calling parse directly.

Conclusion

Tackling large files with iterparse really just comes down to two key steps: process elements one at a time, and clear elements from memory when they're no longer needed.
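
As a recap, here's a compact sketch of both steps applied to colors.xml. It grabs the text of each color once its end event arrives, then clears the element.

from xml.etree import ElementTree

context = ElementTree.iterparse("colors.xml", events=["start", "end"])
for event_type, el in context:
    if event_type == "end" and el.tag == "color":
        # The element is complete, so its text is available.
        print(el.text)
        # Remove its contents from memory now that we're done with it.
        el.clear()

Running that against the sample file prints red and then green.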

That’s all I have for today. Thanks for reading!