Confluence XHTML Parsing

Version 5.1 by Raphaël Jakse on 2026/03/23 13:05

Overview

[Figure: components of the Confluence XHTML parsing pipeline]

  • ConfluenceXHTMLParser: converts Confluence XHTML to wikimodel events (handling specific to the Confluence XHTML syntax)
  • Standard XhtmlParser: converts XHTML to wikimodel events
  • javax.xml.parsers.SAXParser: parses the XML and generates SAX events
  • Confluence XML Filters: fix attributes and whitespace issues
  • Standard tag handlers and custom Confluence tag handlers: handle the XML elements
  • ConfluenceXWikiGeneratorListener: converts wikimodel events to XWiki syntax events
  • ConfluenceConverterListener: applies transforms common with the old Confluence syntax, only when using the Confluence XML Filter

The Confluence XHTML syntax (read our documentation of it) is parsed using an in-house parser. This parser is defined in the confluence-syntax-xhtml module, in the ConfluenceXHTMLParser class. It was formerly part of XWiki Standard, originally written by Thomas in 2013 (https://jira.xwiki.org/browse/XRENDERING-312), and was split out as a contrib module later, in 2017 (https://jira.xwiki.org/browse/CONFLUENCE-1).

ConfluenceXHTMLParser creates a wikimodel XHTML parser (org.xwiki.rendering.wikimodel.xhtml.XhtmlParser) and tweaks it:

  • it disables namespaces
  • it adds filters
  • it adds many tag handlers:
    • handlers for regular XHTML tags for which we need specific handling
    • handlers for Confluence specific tags

The resulting wikimodel events are then transformed into XWiki syntax events by the ConfluenceXWikiGeneratorListener class, which extends XHTMLXWikiGeneratorListener.

When the conversion happens during a Confluence XML package import, XWiki events are handled by ConfluenceConverterListener, which does some bookkeeping and further adjustments. ConfluenceConverterListener is also used with the old Confluence syntax.

If you need handling that is common to the old Confluence syntax and the Confluence XHTML syntax, ConfluenceConverterListener is probably the place for it. However, this is not ideal for syntax transformations: they will not be applied outside a Confluence XML import, which may or may not be acceptable depending on what you want to do.
For now, we have no means to do common handling between the two syntaxes outside a Confluence XML import.

We currently do things in ConfluenceConverterListener that probably belong in ConfluenceXWikiGeneratorListener:

  • removal of dummy "auto-cursor-target" paragraphs
  • heading anchor generation
  • macro conversion

Link and image conversion does belong to ConfluenceConverterListener, although some of this is also done in some tag handlers.

Wikimodel in a nutshell

We are not going to give a complete description of wikimodel here. One can read more about it on the original project's website, although note that XWiki has been using its own version for a long time now.

Wikimodel is a way to represent wiki documents in a syntax-independent way. It provides two representations:

  • Wiki Document Event Model (WEM): an event-based model (like SAX for XML), suitable for streaming things
  • Wiki Document Object Model (WOM): a tree-based model, suitable to manipulate entire documents directly

Here, we are only concerned with the WEM, because that's what's currently used in the Confluence XHTML module.

The basic idea is that we parse Confluence XHTML into wikimodel events, which are then translated to XWiki syntax. This allows supporting a lot of syntax conversions without having to implement a converter for each pair of syntaxes, in both directions.
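To make the event model more concrete, here is a minimal sketch of the idea. It is not the wikimodel API: the WikiEventListener interface, its methods and both renderers are hypothetical and heavily reduced. It only shows why one parser emitting events can feed any number of output syntaxes.

```java
public class WemSketch {
    /** Hypothetical, heavily reduced stand-in for a WEM listener (not the real wikimodel interface). */
    interface WikiEventListener {
        void beginParagraph();
        void endParagraph();
        void beginBold();
        void endBold();
        void onText(String text);
    }

    /** Renders the events to an XWiki-like syntax. */
    static class XWikiLikeRenderer implements WikiEventListener {
        final StringBuilder out = new StringBuilder();
        public void beginParagraph() { }
        public void endParagraph() { out.append("\n\n"); }
        public void beginBold() { out.append("**"); }
        public void endBold() { out.append("**"); }
        public void onText(String text) { out.append(text); }
    }

    /** Renders the very same events to HTML. */
    static class HtmlRenderer implements WikiEventListener {
        final StringBuilder out = new StringBuilder();
        public void beginParagraph() { out.append("<p>"); }
        public void endParagraph() { out.append("</p>"); }
        public void beginBold() { out.append("<strong>"); }
        public void endBold() { out.append("</strong>"); }
        public void onText(String text) { out.append(text); }
    }

    /** Stand-in for a parser: emits the events for a paragraph containing "Hello" and a bold "world". */
    static void parse(WikiEventListener listener) {
        listener.beginParagraph();
        listener.onText("Hello ");
        listener.beginBold();
        listener.onText("world");
        listener.endBold();
        listener.endParagraph();
    }

    public static void main(String[] args) {
        XWikiLikeRenderer xwiki = new XWikiLikeRenderer();
        HtmlRenderer html = new HtmlRenderer();
        // The parser knows nothing about the target syntax.
        parse(xwiki);
        parse(html);
        System.out.println(xwiki.out);
        System.out.println(html.out);
    }
}
```

With N input syntaxes and M output syntaxes, this design needs N parsers and M renderers rather than N × M dedicated converters.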

Interactions with the Confluence XML Filter

The basic idea is that when the Confluence XML filter needs a conversion (because it is filtering a document and needs to send its body converted to the XWiki syntax), it calls the Confluence XHTML parser and gets the result. But some interactions happen the other way around: the Confluence XHTML parser needs some information from the Confluence XML filter, or provides it with information about the parsing through a side channel.

These side channels are:

  • defined as interfaces in the Confluence XHTML syntax module
  • implemented on the Confluence XML Filter side
  • provided through setters (setXXX methods) in ConfluenceXHTMLParser

ConfluenceXHTMLParser implements fallbacks for each of these side channels because it also needs to work standalone, outside a Confluence package import. In particular, it can be used as a syntax in its own right in XWiki and needs to work in this setting, at least for documents that we failed to convert to the XWiki syntax for some reason. When that happens, the document is unlikely to render well, unless the bug that prevented the conversion has since been fixed or the bug is in the XWiki syntax renderer. These cases are far from theoretical.

Here are such mechanisms:

  • ConfluenceReferenceConverter: provides a way to convert links to Confluence things (documents, spaces, attachments, anchors, users) to XWiki links. The Confluence XML filter has the information needed to perform this conversion.

  • ConfluenceURLConverter: provides a way to convert absolute links to Confluence things. The Confluence XML filter has the information needed to perform this conversion, mainly the baseURLs parameter, which says which URLs are Confluence URLs belonging to the wiki that the export being handled comes from. It also provides an extension mechanism through which outside extensions can provide URL converters. This matches the architecture of Confluence, which allows plugins to define endpoints.

  • ConfluenceMacroSupport: provides a way to know whether a given macro should be handled as an inline macro or as a block macro. The implementation of this interface in the Confluence XML filter determines this in two ways:

    • It asks Macro Converters

    • Failing this, it asks target macro implementations present in the wiki

These mechanisms are available inside the tag handlers.
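The pattern shared by these side channels (an interface on the syntax side, a fallback in the parser, and a setter for the filter to inject its own implementation) can be sketched as follows. All names here are hypothetical: MacroSupport, isInline and the Parser class only illustrate the pattern and are not the actual ConfluenceMacroSupport or ConfluenceXHTMLParser signatures. The fallback policy (treat everything as a block macro) is also an assumption made for the sketch.

```java
public class SideChannelSketch {
    /** Hypothetical side-channel interface, declared on the syntax module side. */
    interface MacroSupport {
        boolean isInline(String macroName);
    }

    /** Hypothetical parser holding the side channel. */
    static class Parser {
        // Fallback used when nothing is injected (standalone use as a syntax).
        // Assumption for this sketch: every macro is treated as a block macro.
        private MacroSupport macroSupport = macroName -> false;

        /** Called by the filter side to inject its own, better informed implementation. */
        void setMacroSupport(MacroSupport macroSupport) {
            this.macroSupport = macroSupport;
        }

        /** Stand-in for what a tag handler would do with the side channel. */
        String describe(String macroName) {
            return macroName + ":" + (macroSupport.isInline(macroName) ? "inline" : "block");
        }
    }

    public static void main(String[] args) {
        Parser parser = new Parser();
        System.out.println(parser.describe("status")); // fallback in use

        // The filter side knows more and injects its implementation.
        parser.setMacroSupport(name -> name.equals("status"));
        System.out.println(parser.describe("status"));
    }
}
```

Keeping the interface on the syntax side means the syntax module never depends on the filter module, while the filter remains free to plug in whatever knowledge it has.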

In addition to the side channels, the listener that actually receives the wiki events is provided the same way. When Confluence XHTML is used as a syntax, the events are sent to the code that renders the page in XWiki. During a Confluence package import, the listener is the one provided by the Confluence XML Filter, which sends the events to the Filter Stream API.

Filters

XhtmlParser is a streamed, SAX-like parser: it produces events as the XML code is parsed. This theoretically allows efficient parsing, with behavior akin to what you have with bash pipes:

  • several modules can be connected to each other, and modules at the end don't have to wait for the previous modules to finish their handling completely: they can handle the events as soon as they are produced
  • memory consumption doesn't depend on the size of the document, which makes things more robust and allows the conversion of very large documents

Of course, in practice this is only partially true: we cheat a lot with QueueListener objects, in which we store events so that we can alter and replay them to work around many issues. More on that later.
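The record-and-replay trick can be illustrated with a small sketch. This is not the QueueListener class itself: the Listener interface and the RecordingListener class below are hypothetical and reduced to two events. It shows the principle (buffer events, alter the buffer, replay to the real listener) and also why streaming is lost: the buffer grows with the content being recorded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class QueueReplaySketch {
    /** Hypothetical two-event listener. */
    interface Listener {
        void onText(String text);
        void onNewLine();
    }

    /** Records events instead of forwarding them, so they can be altered and replayed later. */
    static class RecordingListener implements Listener {
        private final List<Consumer<Listener>> events = new ArrayList<>();

        public void onText(String text) { events.add(l -> l.onText(text)); }
        public void onNewLine() { events.add(Listener::onNewLine); }

        /** Example of an alteration: drop the last recorded event. */
        void dropLast() {
            if (!events.isEmpty()) {
                events.remove(events.size() - 1);
            }
        }

        /** Sends the (possibly altered) events to the real listener. */
        void replayTo(Listener target) {
            events.forEach(event -> event.accept(target));
        }
    }

    /** Final consumer, collecting plain text. */
    static class TextCollector implements Listener {
        final StringBuilder out = new StringBuilder();
        public void onText(String text) { out.append(text); }
        public void onNewLine() { out.append('\n'); }
    }

    public static void main(String[] args) {
        RecordingListener queue = new RecordingListener();
        queue.onText("a");
        queue.onNewLine();
        queue.onText("b");
        queue.onNewLine();
        queue.dropLast(); // remove the trailing new line before anything is emitted
        TextCollector collector = new TextCollector();
        queue.replayTo(collector);
        System.out.println(collector.out);
    }
}
```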

We have some custom filters sitting in between the raw XML parsing and the tag handling:

  • ConfluenceAttributeXMLFilter: this does a few things:
    • It converts data-highlight-colour attributes into style attributes
    • It removes rowspan=1 and colspan=1 attributes, which would needlessly clutter the XWiki syntax
  • ConfluenceXHTMLWhitespaceXMLFilter: it tweaks the whitespace handling. In particular, it makes sure whitespace between a few specific empty elements is not eaten.
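As an illustration of what such a filter looks like, here is a minimal sketch built on the standard org.xml.sax.helpers.XMLFilterImpl class. It is not the code of ConfluenceAttributeXMLFilter: the CleanupFilter class is hypothetical, and the exact style produced from data-highlight-colour (a background-color declaration) is an assumption made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class AttributeFilterSketch {
    /** Hypothetical filter sitting between the SAX parser and the content handler. */
    static class CleanupFilter extends XMLFilterImpl {
        CleanupFilter(XMLReader parent) {
            super(parent);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
            AttributesImpl cleaned = new AttributesImpl();
            for (int i = 0; i < atts.getLength(); i++) {
                String name = atts.getQName(i);
                String value = atts.getValue(i);
                if (("rowspan".equals(name) || "colspan".equals(name)) && "1".equals(value)) {
                    continue; // a span of 1 is the default, drop it
                }
                if ("data-highlight-colour".equals(name)) {
                    // Assumption: the colour becomes a background-color style.
                    // This sketch ignores any style attribute already present.
                    cleaned.addAttribute("", "style", "style", "CDATA", "background-color: " + value);
                    continue;
                }
                cleaned.addAttribute(atts.getURI(i), atts.getLocalName(i), name, atts.getType(i), value);
            }
            super.startElement(uri, localName, qName, cleaned);
        }
    }

    /** Parses the XML through the filter and returns the start tags as seen downstream. */
    static String startTagsAfterFiltering(String xml) {
        StringBuilder out = new StringBuilder();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            CleanupFilter filter = new CleanupFilter(reader);
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    out.append('<').append(qName);
                    for (int i = 0; i < atts.getLength(); i++) {
                        out.append(' ').append(atts.getQName(i))
                            .append("=\"").append(atts.getValue(i)).append('"');
                    }
                    out.append('>');
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(startTagsAfterFiltering(
            "<td rowspan=\"1\" colspan=\"2\" data-highlight-colour=\"red\">x</td>"));
    }
}
```

Because the filter only rewrites the attributes of the event passing through, it keeps the streaming property described above: nothing is buffered.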

Tag handlers

Tag handlers extend org.xwiki.rendering.wikimodel.xhtml.handler.TagHandler and are registered for a particular tag name.

When the XML parser encounters an element, it checks whether there is a tag handler registered for this element and calls it. Many handlers are registered for the common HTML elements by default in the XHTML parser. ConfluenceXHTMLParser overrides some of these with custom handlers, and also registers custom handlers for the Confluence-specific elements. Handlers registered by the Confluence XHTML parser, as one would expect, take precedence over the standard ones.

Warning

If no handler is registered for an element, it will be parsed as if the open and close tags of this element were not present. Its content will be parsed normally, and the open and close tags of this element will be completely ignored. This happens silently: no warning can be produced for it. This is a current limitation of XhtmlParser.
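The lookup behavior, including the silent pass-through for unregistered elements, can be reproduced with a small self-contained sketch on top of the JDK SAX parser. The Handler interface and the registry map below are hypothetical simplifications and do not reflect how XhtmlParser or its tag handlers are actually implemented.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerLookupSketch {
    /** Hypothetical minimal tag handler: says what to emit when an element begins and ends. */
    interface Handler {
        String begin();
        String end();
    }

    static Handler wrap(String begin, String end) {
        return new Handler() {
            public String begin() { return begin; }
            public String end() { return end; }
        };
    }

    /** Registry with "standard" handlers, one of which is then overridden by a custom one. */
    static Map<String, Handler> sampleHandlers() {
        Map<String, Handler> handlers = new HashMap<>();
        handlers.put("p", wrap("", ""));
        handlers.put("strong", wrap("<b>", "</b>"));
        // A handler registered later for the same tag name takes precedence.
        handlers.put("strong", wrap("**", "**"));
        return handlers;
    }

    static String convert(String xml, Map<String, Handler> handlers) {
        StringBuilder out = new StringBuilder();
        DefaultHandler sax = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                Handler handler = handlers.get(qName);
                if (handler != null) {
                    out.append(handler.begin());
                }
                // No handler: nothing is emitted and nothing is logged.
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                Handler handler = handlers.get(qName);
                if (handler != null) {
                    out.append(handler.end());
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Content is always processed, whether or not the enclosing tag is known.
                out.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), sax);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("<p>a <strong>b</strong> <unknown>c</unknown></p>", sampleHandlers()));
    }
}
```

In this sketch the content of the unknown element comes out as plain text with no trace of its tags. When a Confluence element seems to vanish from a conversion, a missing handler registration is therefore worth checking.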

List of tag handlers

The above heading is deceptive. Listing the handlers here would risk ending up with out-of-date information, and the Java code is already quite readable for this specific purpose.

The supported tags with the corresponding tag handler can be found at these two places, in this order:

Below, we comment on some interesting behavior related to some of the handlers.

Macro conversion

A few XML elements in Confluence represent macros:

  • ac:macro
  • ac:structured-macro
  • ac:adf-node

Related tags define the macro parameters and the macro content. The differences are abstracted away: all these tag handlers generate the same macro events.

When the syntax is used as part of a Confluence XML package import, the macro events are then caught in ConfluenceConverterListener, which calls macro converters.

Information

This means that when using Confluence XHTML as a syntax, we don't convert the macros. This is intentional: in XWiki, which macros are available doesn't depend on the specific syntax being used. Converting macros while using one of the Confluence syntaxes on the wiki would not be consistent with the behavior of the other syntaxes.

We are bypassing wikimodel when it comes to lists

To import complex Confluence content involving lists, we completely bypass wikimodel. Handlers for li, ul and ol send macro events in their beginElement and endElement methods. These macro events are then turned into proper lists in ConfluenceXWikiGeneratorListener.
