XML External Entity Injection: Defusing XXE in Modern Parsers

July 01, 2026

Your application accepts XML, maybe from a document upload, a SOAP endpoint, or a configuration parser, and somewhere in the processing chain an XML parser is running with default settings. Depending on the library and version, those defaults can be dangerous. An attacker who can slip a crafted XML payload into that pipeline can read /etc/passwd, pivot to your internal network, or bring the server to its knees.

XML External Entity (XXE) injection had its own category in the OWASP Top 10 2017 and was folded into Security Misconfiguration in the 2021 edition. It has stayed on the list for a simple reason: it is easy to exploit and almost always preventable, yet it keeps appearing in production systems. This article walks you through how XXE works, what an attacker actually extracts, and, most importantly, how to shut it off in every parser you are likely to encounter.

What You'll Learn

  • How XXE payloads are constructed and why parsers trust them by default
  • The three main attack classes: file disclosure, SSRF, and entity expansion DoS
  • How to audit your codebase for vulnerable parser configurations
  • Concrete, copy-paste-ready fixes for Python, Java, Node.js, and PHP
  • Additional hardening steps beyond disabling external entities

What XXE Actually Does to Your Application

XML supports a feature called external entities: references that instruct the parser to fetch content from a URI or local file path and inline it into the document. This was designed for legitimate use cases like reusable document fragments. In practice, it is a foot-gun that attackers have been pulling for two decades.

When a parser processes an untrusted XML document with external entity resolution enabled, the attacker controls what URI or file path gets fetched. The result lands inside the parsed document tree, often flowing directly into your application's output, logs, or error messages. No authentication is required, just the ability to submit XML.

Anatomy of an XXE Payload

A basic XXE exploit looks deceptively simple. The attacker defines an entity in the document's DOCTYPE declaration and then references it somewhere in the document body.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<root>
  <data>&xxe;</data>
</root>

When the parser resolves &xxe;, it reads /etc/passwd and substitutes the file contents inline. If your application then echoes the <data> element in any response (an error message, an API reply, a generated report), the attacker sees the file.

Blind XXE works the same way, but instead of a file:// URI, the attacker points to an attacker-controlled server via http://. The outbound request itself is the signal: it proves the parser resolved the entity, and out-of-band techniques can exfiltrate data through DNS lookups or HTTP parameters.
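To see what a parser learns from the DOCTYPE before anything is fetched, the following sketch lists the entities a document declares. It uses the standard library's Expat binding, which reports declarations without resolving them; the helper name declared_entities is hypothetical and only for illustration.

```python
from typing import List, Optional, Tuple
from xml.parsers import expat


def declared_entities(xml_bytes: bytes) -> List[Tuple[str, Optional[str]]]:
    """Return (entity name, system identifier) pairs declared in the DOCTYPE."""
    found: List[Tuple[str, Optional[str]]] = []

    def on_entity_decl(name, is_parameter, value, base, system_id, public_id, notation):
        # Internal entities carry a value and no system_id; external ones the reverse.
        found.append((name, system_id))

    parser = expat.ParserCreate()
    parser.EntityDeclHandler = on_entity_decl
    parser.Parse(xml_bytes, True)  # Expat itself never opens the referenced URI
    return found


payload = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<root>
  <data>&xxe;</data>
</root>"""

for name, system_id in declared_entities(payload):
    print(name, "->", system_id)
```

An entity with a non-empty system identifier is an external entity. In documents from untrusted sources that is usually reason enough to reject the input.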

Three Attack Scenarios Worth Understanding

File Disclosure

The canonical attack reads sensitive local files. /etc/passwd is the proof-of-concept, but the real prizes are private keys, application configuration files, environment files containing credentials, and source code. On Windows targets, paths like file:///C:/Windows/win.ini or file:///C:/inetpub/wwwroot/web.config are equally viable.

File disclosure often works against endpoints you would not expect: document conversion APIs, XML-based configuration importers, even SVG upload handlers. SVG is XML; a malicious SVG submitted to an image processing endpoint is a valid XXE attack surface.

Server-Side Request Forgery via XXE

Point the external entity at an internal HTTP endpoint rather than a file, and XXE becomes SSRF. This is especially damaging in cloud environments where the instance metadata service sits at a well-known address. An attacker can craft a payload that reaches your cloud provider's metadata endpoint to retrieve instance credentials, the same class of attack covered in more depth in the guide on SSRF vulnerabilities and blocking metadata endpoint abuse.

<!ENTITY xxe SYSTEM "http://169.254.169.254/latest/meta-data/iam/security-credentials/">

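This path returns the name of the IAM role attached to the instance; a second request with that name appended returns the temporary AWS credentials. If the parser resolves both and your application reflects any part of the results, it is game over for that IAM role. Instances that enforce IMDSv2 are harder to hit this way, because the metadata service then requires a session token obtained with a PUT request, which an entity reference cannot issue.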
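When triaging payloads found in logs or test reports, it can help to sort entity targets by where they point. The sketch below is one way to do that with the standard library; the function name and the category labels are assumptions made for this example, and hostnames are deliberately left unresolved because a name may resolve to an internal address.

```python
import ipaddress
from urllib.parse import urlsplit


def classify_entity_target(system_id: str) -> str:
    """Roughly classify where an external entity's system identifier points."""
    parts = urlsplit(system_id)
    if parts.scheme in ("", "file"):
        return "local-file"
    host = parts.hostname
    if host is None:
        return "unknown"
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; DNS resolution is out of scope for this sketch.
        return "hostname"
    if addr.is_loopback or addr.is_link_local or addr.is_private:
        return "internal-address"
    return "public-address"


print(classify_entity_target("file:///etc/passwd"))
print(classify_entity_target("http://169.254.169.254/latest/meta-data/iam/security-credentials/"))
```

The metadata address 169.254.169.254 is link-local, so it falls into the internal bucket along with loopback and private ranges.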

Billion Laughs: Entity Expansion DoS

This variant does not fetch anything external. It nests entity references so that a small document expands to gigabytes of in-memory data during parsing. The classic form defines entities that reference each other multiplicatively, causing the parser to consume all available RAM and CPU.

<!DOCTYPE bomb [
  <!ENTITY a "AAAA...">
  <!ENTITY b "&a;&a;&a;&a;&a;&a;&a;&a;&a;&a;">
  <!ENTITY c "&b;&b;&b;&b;&b;&b;&b;&b;&b;&b;">
  <!ENTITY d "&c;&c;&c;&c;&c;&c;&c;&c;&c;&c;">
]>
<root>&d;</root>

This attack requires no outbound network access from the server, so egress filtering does nothing to stop it. A misconfigured parser will attempt to resolve the full expansion. A properly hardened parser rejects the document or aborts once the expansion passes a limit.
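The growth is easy to work out by hand: each layer multiplies the expanded size by the number of references it contains. The sketch below does that arithmetic. The base length of 4 characters is an assumption for illustration, since the snippet above elides the actual string.

```python
def expanded_length(base_len: int, fanout: int, levels: int) -> int:
    """Characters produced when `levels` layers each reference the previous one `fanout` times."""
    return base_len * fanout ** levels


# The snippet above has three layers (b, c, d) of ten references each.
for levels in (3, 6, 9):
    print(f"{levels} layers -> {expanded_length(4, 10, levels):,} characters")
```

Three layers give only a thousandfold blow-up. The classic payload keeps adding layers until a document of a few hundred bytes expands into billions of characters.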

Finding XXE Vulnerabilities Before Attackers Do

Start by mapping every place your application ingests XML. That includes more than obvious XML APIs: look for SOAP services, file importers (DOCX, XLSX, SVG, and RSS are all XML-based), configuration file parsers, and any library that accepts documents without explicitly validating their type.

In code review, grep for parser instantiation patterns and check whether security features are explicitly set:

grep -rn "DocumentBuilderFactory\|SAXParser\|XMLReader\|libxml\|etree\|lxml" src/

For dynamic testing, send the classic file:///etc/passwd payload to any endpoint that accepts XML and watch the response. For blind XXE, use a collaborator service (Burp Collaborator, interactsh) and observe out-of-band DNS or HTTP hits. A hit confirms entity resolution is active even if nothing appears in the HTTP response.

Keep in mind that XXE and insecure deserialization share a root cause: both exploit trust in structured data formats. The tactics for tracing exploitation paths covered in the article on insecure deserialization and gadget chain attacks can inform your threat modeling here.

Disabling External Entities by Parser and Language

The fix is almost always the same: configure your parser to reject external entity declarations and DOCTYPE definitions entirely. The specifics differ by language and library, so here are the configurations you actually need.

Python (lxml and xml.etree)

Python's built-in xml.etree.ElementTree does not fetch external entities, but it can be vulnerable to entity expansion attacks when it is linked against an older Expat release. The safer choice is to use defusedxml, a drop-in replacement for most stdlib XML modules that rejects the dangerous features.

import defusedxml.ElementTree as ET

# defusedxml rejects entity declarations and external references by default;
# forbid_dtd=True additionally rejects any document that carries a DOCTYPE
tree = ET.parse("input.xml", forbid_dtd=True)
root = tree.getroot()

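If you are using lxml directly, you must configure the parser explicitly:

from lxml import etree

parser = etree.XMLParser(
    resolve_entities=False,  # do not substitute entity references
    no_network=True,         # no network access when looking up related files
    load_dtd=False,          # do not load external DTDs
)
tree = etree.parse("input.xml", parser)

# lxml has no forbid_dtd option, so reject DOCTYPE-bearing documents after parsing
if tree.docinfo.doctype:
    raise ValueError("DOCTYPE declarations are not permitted")

Note that etree.XMLParser does not accept a forbid_dtd argument; that keyword belongs to defusedxml. Rejecting any document that carries a DOCTYPE, as the docinfo check does, is the most aggressive and recommended option. If your documents legitimately use internal DTDs, drop that check and use load_dtd=False and resolve_entities=False together as a minimum.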

Java (DocumentBuilderFactory and SAXParser)

Java's XML APIs are notoriously verbose, and the safe configuration requires setting several features explicitly. The defaults are insecure on every major JDK version.

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

// Disable DOCTYPE declarations entirely
dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

// Belt-and-suspenders: also disable external entities
dbf.setFeature("http://xml.org/sax/features/external-general-entities", false);
dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

dbf.setXIncludeAware(false);
dbf.setExpandEntityReferences(false);

DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(inputStream);

The disallow-doctype-decl feature is the most important line. If your documents never need a DOCTYPE, set that flag and stop there. The remaining flags cover edge cases where the parser falls through to a different processing path.

The same pattern applies to SAXParserFactory, which accepts the same setFeature strings shown above. XMLInputFactory (StAX) is configured through properties instead: set XMLInputFactory.SUPPORT_DTD and XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES to false with setProperty.

Node.js (libxmljs and fast-xml-parser)

Node.js does not ship a built-in XML parser. The two most common choices are libxmljs2 and fast-xml-parser.

const libxmljs = require('libxmljs2');

// Pass noent: false to prevent entity substitution
// Pass dtdload: false to prevent loading external DTDs
const doc = libxmljs.parseXml(xmlString, {
  noent: false,
  dtdload: false,
  dtdvalid: false,
  nonet: true,
});

For fast-xml-parser, external entity resolution is not supported by design, but you should still avoid processing raw DOCTYPE declarations from untrusted input. Strip or reject documents containing a DOCTYPE header before they reach the parser:

const { XMLParser } = require('fast-xml-parser');

if (/<!DOCTYPE/i.test(xmlInput)) {
  throw new Error('DOCTYPE declarations are not permitted');
}

const parser = new XMLParser({ processEntities: false });
const result = parser.parse(xmlInput);

PHP (libxml)

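PHP uses libxml under the hood for SimpleXML, DOMDocument, and related functions. Block external entity loading before any parse operation, and do not pass flags that switch entity substitution or DTD loading back on:

// Refuse every external entity load: a callback that returns null fails the lookup
libxml_set_external_entity_loader(static function () { return null; });
if (\PHP_VERSION_ID < 80000) {
    libxml_disable_entity_loader(true); // deprecated since PHP 8.0
}

$doc = new DOMDocument();
// LIBXML_NONET blocks network access; never pass LIBXML_NOENT or LIBXML_DTDLOAD here
$doc->loadXML($xmlInput, LIBXML_NONET);

Despite its name, LIBXML_NOENT turns entity substitution on, and LIBXML_DTDLOAD loads the external DTD subset, so passing either flag with untrusted input reintroduces the vulnerability. Passing null to libxml_set_external_entity_loader() restores the default loader rather than disabling it, which is why the example installs a callback instead.

Note that libxml_disable_entity_loader() was deprecated in PHP 8.0, which requires libxml 2.9.0 or later, where external entity loading is disabled by default. If you are still on PHP 7.x, call it explicitly every time before parsing untrusted input, not once at application startup, because some libraries reset it.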

Validating Schema and Content After Parsing

Disabling entity resolution prevents the injection, but it does not validate that the document structure matches what your application expects. Add schema validation as a second layer: define a strict XSD or RelaxNG schema for any XML format you accept, and reject documents that do not conform before processing their content.

from lxml import etree

with open("schema.xsd", "rb") as f:
    schema_doc = etree.parse(f)
schema = etree.XMLSchema(schema_doc)

try:
    schema.assertValid(tree)
except etree.DocumentInvalid as e:
    raise ValueError(f"XML schema validation failed: {e}")

Schema validation cuts the attack surface further by rejecting unexpected elements and attributes that your code was not designed to handle. It is a useful defense-in-depth measure against novel XML-based attacks, not just XXE.

If you are building APIs that could theoretically accept either JSON or XML, consider whether XML support is necessary at all. Stripping XML support and accepting only JSON eliminates the entire attack class. This is the same principle as reducing the attack surface by removing unused features, a concept that applies equally to locking down GraphQL introspection and batching in API security hardening.
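One way to enforce a JSON-only boundary is to reject XML media types before any body parsing happens. The sketch below shows only the header check; the function name is hypothetical, and wiring it into your framework's middleware is left out because that part is framework-specific.

```python
def is_xml_media_type(content_type: str) -> bool:
    """Return True when a Content-Type header value names an XML media type."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in {"application/xml", "text/xml"} or media_type.endswith("+xml")


for header in ("application/json", "application/xml; charset=utf-8", "image/svg+xml"):
    print(header, "->", "reject" if is_xml_media_type(header) else "allow")
```

Matching on the +xml suffix also catches types such as application/soap+xml and image/svg+xml, which are easy to overlook in a hand-written allowlist.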

Common Pitfalls That Leave You Exposed

Fixing only one parser in the codebase. Teams often fix the obvious endpoint but miss XML parsing happening inside dependencies such as PDF generators, Office document converters, or email libraries. Audit every dependency that touches structured document formats.

Applying the fix at the wrong scope. On PHP 7.x, call libxml_disable_entity_loader(true) immediately before each untrusted parse, because some third-party libraries toggle it. A global call at bootstrap is not enough. Wrap every untrusted parse operation.

Overlooking XSLT and XInclude. Disabling external entities does not automatically disable XInclude (xi:include) or XSLT transformations, both of which can fetch external resources through their own mechanisms. Set setXIncludeAware(false) explicitly in Java, and vet any XSLT stylesheets applied to user-supplied documents.

Assuming JSON-only APIs are safe. Some frameworks accept XML when the Content-Type header is set to application/xml, even if your handlers only read JSON. Check your framework's content negotiation and disable XML parsing at the middleware level if you do not use it. This pattern of trust assumptions about content type is similar to issues that arise in SQL injection via raw queries sneaking back into ORM codebases: the vulnerability hides in a code path the developer did not think was active.

Not testing after the fix. Run your XXE test payloads after applying the configuration changes. Some APIs fail loudly on an unknown option, but others (for example, a misspelled key in a JavaScript options object) ignore it silently and leave the parser vulnerable with no error at startup.
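A small regression check makes that last point concrete. The sketch below writes a canary file, points an external entity at it, and reports whether the canary comes back out of a parse function. The harness, the canary string, and the stdlib_text wrapper are all assumptions made for illustration; swap in a wrapper around whichever parser your application actually uses.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Callable

CANARY = "xxe-canary-7f3a"  # arbitrary marker, not a real secret


def resolves_external_entities(parse_to_text: Callable[[bytes], str]) -> bool:
    """Return True if parse_to_text leaks the contents of a local canary file."""
    with tempfile.TemporaryDirectory() as tmp:
        canary_file = Path(tmp) / "canary.txt"
        canary_file.write_text(CANARY, encoding="utf-8")
        payload = (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<!DOCTYPE foo [<!ENTITY xxe SYSTEM "{canary_file.as_uri()}">]>\n'
            "<root><data>&xxe;</data></root>"
        ).encode("utf-8")
        try:
            text = parse_to_text(payload)
        except Exception:
            return False  # the parser rejected the document outright
        return CANARY in text


def stdlib_text(data: bytes) -> str:
    """Example wrapper: parse with the standard library and return all text content."""
    return "".join(ET.fromstring(data).itertext())


print("stdlib ElementTree vulnerable:", resolves_external_entities(stdlib_text))
```

Running a check like this in CI against each parser wrapper means a dropped or misspelled option shows up as a failed test rather than as an incident.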

Wrapping Up: Next Steps

XXE is a well-understood vulnerability with reliable, low-effort fixes. There is no excuse for a production system to remain vulnerable once you know where the XML parsing happens. Here is what to do next:

  1. Inventory every XML parse point in your application and its dependencies. Use grep and dependency scanning together, because a library you did not write may be parsing XML on your behalf.
  2. Apply the parser hardening config for your language from the examples above. At minimum, disable DOCTYPE declarations. Prefer defusedxml in Python and the disallow-doctype-decl feature in Java.
  3. Add a DOCTYPE rejection check at the input boundary, before the document reaches any parser, as a belt-and-suspenders measure.
  4. Run XXE test payloads against every XML-accepting endpoint in your staging environment, including file upload endpoints that may accept SVG or DOCX.
  5. Add schema validation for any XML format you control, so only structurally valid documents proceed past the parser.
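Step 3 can be sketched in Python with the standard library's Expat binding, which reports the start of a DOCTYPE declaration through a callback. This is a minimal illustration under the assumption that you receive the document as bytes; the names DoctypeForbidden and reject_doctype are hypothetical.

```python
from xml.parsers import expat


class DoctypeForbidden(ValueError):
    """Raised when an incoming document contains a DOCTYPE declaration."""


def reject_doctype(xml_bytes: bytes) -> None:
    """Raise DoctypeForbidden if the document declares a DOCTYPE; otherwise return None."""

    def on_doctype(name, system_id, public_id, has_internal_subset):
        raise DoctypeForbidden(f"DOCTYPE {name!r} is not permitted")

    parser = expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    # Malformed XML raises expat.ExpatError, which the caller should also treat as a rejection.
    parser.Parse(xml_bytes, True)


try:
    reject_doctype(b'<!DOCTYPE foo [<!ENTITY xxe SYSTEM "file:///etc/passwd">]><root>&xxe;</root>')
except DoctypeForbidden as exc:
    print("rejected:", exc)
```

Letting a real tokenizer find the DOCTYPE, rather than matching on raw text, avoids mistakes with alternative encodings such as UTF-16, at the cost of parsing the document twice.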

XXE sits in a family of injection attacks where the root cause is trusting structured input too completely. If you are hardening systematically, also review how your application handles content security policies β€” the guide on CSP bypasses that render your policy useless covers the same theme of security controls that look correct on the surface but fail in practice.

Frequently Asked Questions

How can I tell if my XML parser is vulnerable to XXE injection?

Send a test payload that defines an external entity pointing to a local file like /etc/passwd and reference it in the document body. If the file contents appear in the response or in server logs, your parser is resolving external entities. For blind XXE, point the entity at an attacker-controlled server and watch for outbound DNS or HTTP requests.

Does switching from XML to JSON completely eliminate XXE risk?

Yes, XXE is specific to XML parsers. JSON parsers have no concept of external entities or DOCTYPE declarations, so switching eliminates the attack class entirely. The caveat is that you must also confirm no code path in your framework silently parses XML when given a specially crafted request, even if your handlers only read JSON.

Is XXE still a real threat now that most frameworks are updated?

Yes. While modern frameworks often ship with safer defaults, the vulnerability persists in legacy codebases, third-party libraries, and anywhere developers configure parsers manually. File formats like SVG, DOCX, and RSS are XML under the hood, making upload handlers and document processors common blind spots.

What is the difference between XXE and a Billion Laughs attack?

XXE in the classic sense fetches external content via a URI or file path, exfiltrating data or triggering SSRF. A Billion Laughs attack uses only internal, nested entity references to cause exponential memory expansion, exhausting the parser's memory. Both abuse the XML entity system, but Billion Laughs is a pure denial-of-service attack that requires no network access.

Can XXE be exploited in a REST API that uses XML only as a fallback format?

Absolutely. This is one of the most commonly missed surfaces. If your REST framework automatically parses the request body as XML when the Content-Type header is application/xml, an attacker can exploit XXE even if your application logic never intentionally supports XML. Disable XML content negotiation at the middleware level if you do not need it.
