What vulnerability does this XML parsing code have and how do you fix it?
Vulnerable code
from lxml import etree

def parse_upload(xml_bytes):
    tree = etree.fromstring(xml_bytes)
    return tree

Fixed
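from lxml import etree

def parse_upload(xml_bytes):
    # Build an explicitly hardened parser instead of relying on defaults.
    parser = etree.XMLParser(
        resolve_entities=False,  # leave entity references unexpanded
        no_network=True,         # never fetch remote resources
        load_dtd=False,          # do not load external DTDs
    )
    tree = etree.fromstring(xml_bytes, parser=parser)
    return tree

XML External Entity (XXE) injection. The vulnerable code calls etree.fromstring with lxml's default parser and no security configuration. In lxml releases before 5.0 the default is resolve_entities=True, so entity references are expanded, including external ones. An attacker uploads XML whose DOCTYPE declaration defines an entity pointing to file:///etc/passwd. When lxml parses the document, it reads that file and splices its contents into the XML tree, which the application then returns or logs. An internal service URL can be targeted the same way if network access has been enabled on the parser (no_network=False).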
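The fix sets three options explicitly: resolve_entities=False (leave entity references in place instead of substituting their values), no_network=True (don't fetch remote URLs), and load_dtd=False (don't load external DTDs). The last two are already lxml's defaults, but stating them keeps the configuration visible and protects against a later change. Note that load_dtd=False does not stop the parser from reading a DOCTYPE declared inside the document itself; it is resolve_entities=False that keeps declared entities from being expanded. Together these settings close the common XXE vectors in lxml: local file disclosure and requests to remote hosts.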
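A quick way to check the fix is to feed the hardened parser a document that declares an external entity and confirm the reference is left unexpanded. The sketch below is illustrative only: the payload and the path /nonexistent/secret.txt are made up for the demonstration, and it assumes lxml is installed.

```python
from lxml import etree


def parse_upload(xml_bytes):
    # Same hardened configuration as the fixed code above.
    parser = etree.XMLParser(
        resolve_entities=False,
        no_network=True,
        load_dtd=False,
    )
    return etree.fromstring(xml_bytes, parser=parser)


# Hypothetical malicious upload: declares an external entity and references it.
PAYLOAD = (
    b'<?xml version="1.0"?>\n'
    b'<!DOCTYPE r [<!ENTITY xxe SYSTEM "file:///nonexistent/secret.txt">]>\n'
    b"<r>&xxe;</r>"
)

root = parse_upload(PAYLOAD)

# With resolve_entities=False the reference stays in the tree as an
# entity node; no file is read and nothing is substituted into the text.
serialized = etree.tostring(root)
print(serialized)
```

If the serialized output still contains the literal reference &xxe; rather than file contents, the parser did not resolve the entity. A check like this fits well in a regression test, so that a later change to the parser options is caught early.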