Wiki source code of Confluence XHTML Parsing
Last modified by Raphaël Jakse on 2026/03/23 13:11
| author | version | line-number | content |
|---|---|---|---|
| |
1.1 | 1 | {{toc/}} |
| 2 | |||
| |
7.1 | 3 | == Overview == |
| |
1.1 | 4 | |
| 5 | {{diagram/}} | ||
| 6 | |||
| |
8.2 | 7 | The Confluence XHTML syntax (read our [[documentation of it>>xwiki:documentation.extensions.dev.confluence.xhtml-syntax.WebHome]]) is parsed using an in-house parser. This parser is defined in the confluence-syntax-xhtml module, in the [[ConfluenceXHTMLParser class>>https://github.com/xwiki-contrib/confluence/tree/master/confluence-syntax-xhtml/src/main/java/org/xwiki/contrib/confluence/parser/xhtml/internal/ConfluenceXHTMLParser.java]]. It was formerly part of XWiki Standard, originally written by Thomas in 2013 ([[https:~~/~~/jira.xwiki.org/browse/XRENDERING-312>>https://jira.xwiki.org/browse/XRENDERING-312]]), and was split off as a contrib module in 2017 ([[https:~~/~~/jira.xwiki.org/browse/CONFLUENCE-1>>https://jira.xwiki.org/browse/CONFLUENCE-1]]). |
| |
1.1 | 8 | |
| 9 | ConfluenceXHTMLParser creates a wikimodel XHTML parser (org.xwiki.rendering.wikimodel.xhtml.XhtmlParser) and tweaks it: | ||
| 10 | |||
| 11 | * it disables namespaces | ||
| 12 | * it adds filters | ||
| 13 | * it adds many tag handlers: | ||
| 14 | ** handlers for regular XHTML tags for which we need specific handling | ||
| 15 | ** handlers for Confluence specific tags | ||
| 16 | |||
| 17 | The resulting wikimodel events are then transformed into XWiki syntax events by the ConfluenceXWikiGeneratorListener class, which extends XHTMLXWikiGeneratorListener. | ||
| 18 | |||
| 19 | When the conversion happens during a Confluence XML package import, XWiki events are handled by ConfluenceConverterListener, which does some bookkeeping and further adjustments. ConfluenceConverterListener is also used with the old Confluence syntax. | ||
| 20 | |||
| 21 | (% class="box infomessage" %) | ||
| 22 | ((( | ||
| 23 | If you need handling that is common to the old Confluence syntax and the Confluence XHTML syntax, you should probably implement it in ConfluenceConverterListener. However, this is a poor fit for syntax transformations: they will not be applied outside a Confluence XML import, which may or may not be acceptable depending on what you want to do. | ||
| 24 | For now, we have no means to do common handling between the two syntaxes outside a Confluence XML import. | ||
| 25 | \\We currently do things in ConfluenceConverterListener that probably belong in ConfluenceXWikiGeneratorListener: | ||
| 26 | ~* removal of dummy "auto-cursor-target" paragraphs | ||
| 27 | ~* heading anchor generation | ||
| 28 | ~* macro conversion | ||
| 29 | \\Link and image conversion does belong to ConfluenceConverterListener, although some of this is also done in some tag handlers. | ||
| 30 | ))) | ||
| 31 | |||
| |
7.1 | 32 | === Wikimodel in a nutshell === |
| |
1.1 | 33 | |
| 34 | We are not going to give a complete description of wikimodel here. One can read more about it on the [[original project's website>>https://wikimodel.sourceforge.net/architecture.html]], although note that XWiki has been using its own version for a long time now. | ||
| 35 | |||
| 36 | Wikimodel is a way to represent wiki documents in a syntax-independent way. It provides two representations: | ||
| 37 | |||
| 38 | * Wiki Document Event Model (WEM): an event-based model (like SAX for XML), suitable for streaming things | ||
| 39 | * Wiki Document Object Model (WOM): a tree-based model, suitable to manipulate entire documents directly | ||
| 40 | |||
| 41 | Here, we are only concerned with the WEM, because that's what's currently used in the Confluence XHTML module. | ||
| 42 | |||
| 43 | The basic idea is that we parse Confluence XHTML into wikimodel events, which are then translated to XWiki syntax. This allows supporting a lot of syntax conversions without having to implement a converter for each pair of syntaxes, in both directions. | ||
| 44 | |||
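To make the event-based idea concrete, here is a minimal sketch. It does not use the real wikimodel API: the DocListener interface, both renderers and the hard-coded "parser" are hypothetical names invented for this illustration. The point is only that one stream of events can feed any number of renderers.

```java
/** Hypothetical, heavily simplified stand-in for a wikimodel event listener. */
interface DocListener {
    void beginParagraph();
    void endParagraph();
    void beginBold();
    void endBold();
    void onText(String text);
}

/** Renders the events with XWiki-like markers. */
final class XWikiLikeRenderer implements DocListener {
    private final StringBuilder out = new StringBuilder();
    @Override public void beginParagraph() { }
    @Override public void endParagraph() { out.append("\n\n"); }
    @Override public void beginBold() { out.append("**"); }
    @Override public void endBold() { out.append("**"); }
    @Override public void onText(String text) { out.append(text); }
    String result() { return out.toString(); }
}

/** Renders the very same events as HTML. */
final class HtmlLikeRenderer implements DocListener {
    private final StringBuilder out = new StringBuilder();
    @Override public void beginParagraph() { out.append("<p>"); }
    @Override public void endParagraph() { out.append("</p>"); }
    @Override public void beginBold() { out.append("<strong>"); }
    @Override public void endBold() { out.append("</strong>"); }
    @Override public void onText(String text) { out.append(text); }
    String result() { return out.toString(); }
}

final class WemSketch {
    /** Toy "parser": a real one would read source text instead of hard-coding events. */
    static void emitSample(DocListener listener) {
        listener.beginParagraph();
        listener.onText("Hello ");
        listener.beginBold();
        listener.onText("world");
        listener.endBold();
        listener.endParagraph();
    }
}
```

With N parsers and M renderers written against such a shared event interface, N + M components are enough to cover all N × M conversions.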
| |
7.1 | 45 | == Interactions with the Confluence XML Filter == |
| |
1.1 | 46 | |
| 47 | The basic idea is that when the Confluence XML filter needs a conversion (because it is filtering a document and needs to send its body converted to XWiki syntax), it calls the Confluence XHTML parser and gets the result. But //some// interactions happen the other way around, where the Confluence XHTML parser needs some information from the Confluence XML filter, or provides it with information about the parsing through a side channel. | ||
| 48 | |||
| 49 | These side channels are: | ||
| 50 | |||
| 51 | * defined as interfaces in the Confluence XHTML syntax module | ||
| 52 | * implemented on the Confluence XML Filter side | ||
| 53 | * provided through setters (setXXX methods) in ConfluenceXHTMLParser | ||
| 54 | |||
| 55 | ConfluenceXHTMLParser implements fallbacks for each of these side channels because it also needs to work standalone, outside a Confluence package import. In particular, it can be used as a syntax in its own right in XWiki and needs to work in this setting, at least for documents that we failed to convert to XWiki syntax for some reason. When that happens, the document is unlikely to render well, unless we have since fixed the bug that prevented the conversion, or the bug is in the XWiki syntax renderer; these cases are far from theoretical. | ||
| 56 | |||
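The pattern described above (interface on the parser side, implementation injected through a setter, built-in fallback for standalone use) can be sketched as follows. This is an illustration only: ReferenceConverterSketch, ParserSketch and their methods are hypothetical names, not the actual signatures of ConfluenceXHTMLParser or of its side-channel interfaces.

```java
/** Hypothetical side-channel interface, defined on the parser side. */
interface ReferenceConverterSketch {
    String convertDocumentReference(String confluenceSpace, String confluenceTitle);
}

/** Hypothetical parser that works standalone but accepts a richer implementation. */
final class ParserSketch {
    // Fallback used when nobody injects an implementation (standalone use).
    // Assumption for this sketch: a naive "space.title" mapping.
    private ReferenceConverterSketch referenceConverter =
        (space, title) -> space + "." + title;

    /** Called by the filter side to plug in its own, better informed converter. */
    void setReferenceConverter(ReferenceConverterSketch converter) {
        if (converter != null) {
            this.referenceConverter = converter;
        }
    }

    /** What a tag handler would call when it meets a link to a Confluence page. */
    String convertLink(String space, String title) {
        return referenceConverter.convertDocumentReference(space, title);
    }
}
```

The parser never needs to know whether it runs inside an import: it always calls the same interface, and only the injected implementation changes.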
| 57 | Here are such mechanisms: | ||
| 58 | |||
| 59 | * ((( | ||
| 60 | ConfluenceReferenceConverter: provides a way to convert links to Confluence things (documents, spaces, attachments, anchors, users) to XWiki links. The Confluence XML filter has information to perform this conversion: | ||
| 61 | |||
| 62 | * ((( | ||
| 63 | Through the index built when we parsed the package we are currently importing | ||
| 64 | ))) | ||
| 65 | * ((( | ||
| 66 | Through Confluence resolvers, which can query the wiki to find previously imported documents ([[https:~~/~~/jira.xwiki.org/browse/CONFLUENCE-312>>https://jira.xwiki.org/browse/CONFLUENCE-312]], [[https:~~/~~/jira.xwiki.org/projects/CONFLUENCE/issues/CONFLUENCE-329>>https://jira.xwiki.org/projects/CONFLUENCE/issues/CONFLUENCE-329]], [[https:~~/~~/jira.xwiki.org/projects/CONFLUENCE/issues/CONFLUENCE-283>>https://jira.xwiki.org/projects/CONFLUENCE/issues/CONFLUENCE-283]]) | ||
| 67 | ))) | ||
| 68 | ))) | ||
| 69 | * ((( | ||
| 70 | ConfluenceURLConverter: provides a way to convert absolute links to Confluence things. The Confluence XML filter has information to perform this conversion, mainly the baseURLs parameter, which lists the URLs belonging to the Confluence instance that the export being handled comes from. It also provides an extension mechanism through which outside extensions can provide URL converters. This matches the architecture of Confluence, which allows plugins to define endpoints. | ||
| 71 | ))) | ||
| 72 | * ((( | ||
| 73 | ConfluenceMacroSupport: provides a way to know whether a given macro should be handled as an inline macro or as a block macro. The implementation of this interface in the Confluence XML filter determines this in two ways: | ||
| 74 | |||
| 75 | * ((( | ||
| 76 | It asks macro converters | ||
| 77 | ))) | ||
| 78 | * ((( | ||
| 79 | Failing this, it asks the target macro implementations present in the wiki | ||
| 80 | ))) | ||
| 81 | ))) | ||
| 82 | |||
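The two-step lookup described for ConfluenceMacroSupport is a chain of fallbacks. Below is a minimal sketch of that chain, under loud assumptions: MacroKindSource and MacroSupportSketch are hypothetical names, and treating unknown macros as block macros is a choice made for this illustration, not a documented behavior.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical source of knowledge about a macro; empty means "I don't know this macro". */
interface MacroKindSource {
    Optional<Boolean> isInline(String macroName);
}

/** Asks each source in order and keeps the first definite answer. */
final class MacroSupportSketch {
    private final List<MacroKindSource> sources;

    MacroSupportSketch(List<MacroKindSource> sources) {
        this.sources = sources;
    }

    boolean isInline(String macroName) {
        for (MacroKindSource source : sources) {
            Optional<Boolean> answer = source.isInline(macroName);
            if (answer.isPresent()) {
                return answer.get();
            }
        }
        // Assumption of this sketch: macros unknown to every source are block macros.
        return false;
    }
}
```

Putting the macro converters first in the list and the macros installed in the wiki second reproduces the order given above: the second source is only consulted when the first has no answer.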
| 83 | These mechanisms are available inside the [[tag handlers>>||anchor="HTaghandlers"]]. | ||
| 84 | |||
| 85 | In addition to the side channels, the listener that actually receives the wiki events is provided the same way. When Confluence XHTML is used as a syntax, the events are sent to the code that renders the page in XWiki. When it is used as part of a Confluence package import, the events go to the listener provided by the Confluence XML Filter, which sends them to the Filter Stream API. | ||
| 86 | |||
| |
7.1 | 87 | == Filters == |
| |
1.1 | 88 | |
| 89 | XhtmlParser is a streaming, SAX-like parser: it produces events as the XML code is parsed. In theory, this allows efficient parsing, with behavior akin to bash pipes: | ||
| 90 | |||
| 91 | * several modules can be connected to each other, and modules at the end don't have to wait for the previous modules to finish completely: they can handle events as soon as they are produced | ||
| 92 | * memory usage stays minimal because it doesn't depend on the size of the document, which makes the conversion of very large documents more robust | ||
| 93 | |||
| 94 | Of course, in practice this is only partially true: we cheat a lot with QueueListener objects, in which we store events so that we can replay and alter them to work around many issues. More on that later. | ||
| 95 | |||
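To show why buffering events breaks pure streaming, here is a small sketch of the record-then-replay idea. It does not use the real QueueListener class: events are plain strings, EmptyParagraphDropper is a hypothetical name, and dropping empty paragraphs is just an example of a decision that can only be taken once the end of the element has been seen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Buffers each paragraph and replays it downstream only if it is not empty. */
final class EmptyParagraphDropper {
    private final Consumer<String> downstream;
    private final List<String> buffer = new ArrayList<>();
    private boolean buffering;

    EmptyParagraphDropper(Consumer<String> downstream) {
        this.downstream = downstream;
    }

    /** Assumption of this sketch: paragraphs are never nested. */
    void onEvent(String event) {
        if (event.equals("beginParagraph")) {
            buffering = true;
            buffer.add(event);
            return;
        }
        if (!buffering) {
            // Outside a paragraph, events stream through immediately.
            downstream.accept(event);
            return;
        }
        buffer.add(event);
        if (event.equals("endParagraph")) {
            // Only begin + end were recorded: the paragraph is empty, drop it.
            if (buffer.size() > 2) {
                buffer.forEach(downstream);
            }
            buffer.clear();
            buffering = false;
        }
    }
}
```

The cost is visible in the buffer field: while a paragraph is open, memory grows with its content, and nothing reaches the downstream listener until the closing event arrives.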
| 96 | We have some custom filters sitting in between the raw XML parsing and the tag handling: | ||
| 97 | |||
| 98 | * ConfluenceAttributeXMLFilter: this does a few things: | ||
| 99 | ** it converts data-highlight-colour attributes into style attributes | ||
| 100 | ** it removes redundant rowspan=1 and colspan=1 attributes that would needlessly clutter the XWiki syntax | ||
| 101 | * ConfluenceXHTMLWhitespaceXMLFilter: it tweaks the whitespace handling. In particular, it makes sure whitespace between a few specific empty elements is not eaten. | ||
| 102 | |||
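An attribute filter of this kind can be sketched with the standard SAX XMLFilterImpl class. This is not the code of ConfluenceAttributeXMLFilter: AttributeFilterSketch and its describe helper are hypothetical, and mapping the highlight colour to a CSS background-color is an assumption made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Sits between the raw XML parser and the next handler, rewriting attributes on the fly. */
final class AttributeFilterSketch extends XMLFilterImpl {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
        throws SAXException {
        AttributesImpl out = new AttributesImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            String name = atts.getQName(i);
            String value = atts.getValue(i);
            if ((name.equals("rowspan") || name.equals("colspan")) && value.trim().equals("1")) {
                continue; // 1 is the default span, so the attribute carries no information
            }
            if (name.equals("data-highlight-colour")) {
                // Assumption: the colour becomes a CSS background colour. A fuller version
                // would merge it with an existing style attribute instead of adding a new one.
                out.addAttribute("", "style", "style", "CDATA", "background-color: " + value);
                continue;
            }
            out.addAttribute(atts.getURI(i), atts.getLocalName(i), name, atts.getType(i), value);
        }
        super.startElement(uri, localName, qName, out);
    }

    /** Parses the XML through the filter and returns the start tags seen downstream. */
    static String describe(String xml) {
        StringBuilder seen = new StringBuilder();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            AttributeFilterSketch filter = new AttributeFilterSketch();
            filter.setParent(reader);
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String u, String l, String q, Attributes a) {
                    seen.append('<').append(q);
                    for (int i = 0; i < a.getLength(); i++) {
                        seen.append(' ').append(a.getQName(i))
                            .append("=\"").append(a.getValue(i)).append('"');
                    }
                    seen.append('>');
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
        return seen.toString();
    }
}
```

Because the filter only touches the attributes of the element currently being opened, it keeps the streaming property: nothing is buffered.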
| |
7.1 | 103 | == Tag handlers == |
| |
1.1 | 104 | |
| 105 | Tag handlers extend org.xwiki.rendering.wikimodel.xhtml.handler.TagHandler and are registered for a particular tag name. | ||
| 106 | |||
| 107 | When the XML parser encounters an element, it checks whether there is a tag handler registered for this element and calls it. Many handlers are registered for the common HTML elements by default in the XHTML parser. ConfluenceXHTMLParser overrides some of these with custom handlers, and also registers custom handlers for the Confluence-specific elements. Handlers registered by ConfluenceXHTMLParser, as one could expect, take precedence over the standard ones. | ||
| 108 | |||
| 109 | {{warning}} | ||
| 110 | If no handler is registered for an element, it is parsed as if its open and close tags were not present: its content is parsed normally, and the tags themselves are completely ignored. This happens silently: no warning can be produced. This is a current limitation of XhtmlParser. | ||
| 111 | {{/warning}} | ||
| 112 | |||
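The registration rules and the silent pass-through described above can be summed up in a small sketch. HandlerRegistrySketch is a hypothetical class that works on already rendered strings; the real handlers receive a tag context and emit wikimodel events instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Maps tag names to handlers; a handler turns rendered content into the element's output. */
final class HandlerRegistrySketch {
    private final Map<String, UnaryOperator<String>> handlers = new HashMap<>();

    /** Registering a tag name again replaces the previous handler, so custom handlers win. */
    void register(String tagName, UnaryOperator<String> handler) {
        handlers.put(tagName, handler);
    }

    /** Renders one element whose content has already been rendered. */
    String render(String tagName, String renderedContent) {
        UnaryOperator<String> handler = handlers.get(tagName);
        if (handler == null) {
            // No handler: the tags vanish silently and only the content survives.
            return renderedContent;
        }
        return handler.apply(renderedContent);
    }
}
```

For example, registering a default handler for strong and then a custom one for the same name leaves only the custom one active, while rendering an element with no registered handler returns its content unchanged, without any warning.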
| |
7.1 | 113 | === List of tag handlers === |
| |
1.1 | 114 | |
| 115 | The above heading is deceptive. Listing the handlers here would risk ending up with out-of-date information, and the Java code is already quite readable for this specific purpose. | ||
| 116 | |||
| 117 | The supported tags and their corresponding tag handlers can be found in these two places, in this order: | ||
| 118 | |||
| 119 | * Confluence specific handlers: [[https:~~/~~/github.com/xwiki-contrib/confluence/tree/master/confluence-syntax-xhtml/src/main/java/org/xwiki/contrib/confluence/parser/xhtml/internal/ConfluenceXHTMLParser.java>>https://github.com/xwiki-contrib/confluence/tree/master/confluence-syntax-xhtml/src/main/java/org/xwiki/contrib/confluence/parser/xhtml/internal/ConfluenceXHTMLParser.java]] | ||
| 120 | * Standard handlers: [[https:~~/~~/github.com/xwiki/xwiki-rendering/tree/master/xwiki-rendering-wikimodel/src/main/java/org/xwiki/rendering/wikimodel/xhtml/impl/XhtmlHandler.java>>https://github.com/xwiki/xwiki-rendering/tree/master/xwiki-rendering-wikimodel/src/main/java/org/xwiki/rendering/wikimodel/xhtml/impl/XhtmlHandler.java]] | ||
| 121 | |||
| 122 | Below, we comment on some interesting behavior of a few handlers. | ||
| 123 | |||
| |
7.1 | 124 | === Macro conversion === |
| |
1.1 | 125 | |
| 126 | A few XML elements in Confluence represent macros: | ||
| 127 | |||
| 128 | * ac:macro | ||
| 129 | * ac:structured-macro | ||
| 130 | * ac:adf-node | ||
| 131 | |||
| 132 | Related tags define the macro parameters and the macro content. The differences are abstracted away: all these tag handlers generate the same macro events. | ||
| 133 | |||
| 134 | When the syntax is used as part of a Confluence XML package import, the macro events are then caught in ConfluenceConverterListener, which calls macro converters. | ||
| 135 | |||
| 136 | {{info}} | ||
| 137 | This means that when using Confluence XHTML as a syntax, we don't convert the macros. This is intentional: in XWiki, which macros are available doesn't depend on the specific syntax being used. Converting macros while using one of the Confluence syntaxes on the wiki would not be consistent with the behavior of the other syntaxes. | ||
| 138 | {{/info}} | ||
| 139 | |||
| |
7.1 | 140 | === We are bypassing wikimodel when it comes to lists === |
| |
1.1 | 141 | |
| 142 | To import complex Confluence content involving lists, we are completely bypassing wikimodel. Handlers for li, ul and ol send macro events in their beginElement and endElement methods. These macro events are then turned into proper lists in ConfluenceXWikiGeneratorListener. |