Confluence Import Process

Last modified by Vincent Massol on 2026/04/08 13:49

Explanation

Confluence XML is implemented as an input filter stream. ConfluenceInputFilterStream is instantiated with input properties describing what to import and how.

ConfluenceInputFilterStream then sets up ConfluenceXMLPackage, which extracts the Confluence package and indexes it.

ConfluenceXMLPackage is built so that it can handle huge export packages:

  • the XML parsing is streamed
  • instead of keeping everything in RAM, each object is written to its own Apache Commons Configuration properties file in a temporary directory (a database engine such as SQLite could probably be used for this instead, and might be even more efficient)
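
To illustrate the second point, here is a minimal sketch of the one-file-per-object idea. It is not the actual ConfluenceXMLPackage code: the class name ObjectStoreSketch, its methods and the directory layout are hypothetical, and java.util.Properties is used in place of Apache Commons Configuration only to keep the example free of dependencies.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical on-disk object store: one properties file per parsed object. */
final class ObjectStoreSketch {
    private final Path root;

    ObjectStoreSketch() {
        try {
            // All objects live under a temporary directory, not in memory.
            this.root = Files.createTempDirectory("confluence-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes one object's fields to <root>/<type>/<id>.properties (assumed layout). */
    void write(String type, long id, Properties fields) {
        try {
            Path dir = Files.createDirectories(root.resolve(type));
            try (OutputStream out = Files.newOutputStream(dir.resolve(id + ".properties"))) {
                fields.store(out, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Loads a single object back; only that object is held in memory. */
    Properties read(String type, long id) {
        Properties fields = new Properties();
        try (InputStream in = Files.newInputStream(root.resolve(type).resolve(id + ".properties"))) {
            fields.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fields;
    }
}
```

With such a store, the streamed XML parser can call write for each object as soon as it has been parsed, and the import phase can later call read for just the object it is currently sending.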

ConfluenceInputFilterStream is built in the same spirit: it browses objects from the package and sends them, streamed, through the filter stream API. If the output filter in use is also streamed, which is normally the case, the whole process is a pipeline that can handle huge imports.

More precisely, ConfluenceInputFilterStream:

  • Imports users
  • Imports groups
  • Browses spaces

For each space:

  • imports the home page
  • imports the orphans (pages that have no parent and are not the home page)
  • imports the space blog
  • imports the space templates
  • imports permissions from the space permissions and the home page permissions

For each page:

  • imports revisions
    • imports page metadata (dates, author, title, ...)
    • imports page permissions
    • imports content by instantiating the corresponding syntax (see details about the Confluence XHTML Parser); a page can also be in the old Confluence syntax, or contain plain text that is not to be converted (for pages that are used as data storage)
    • imports comments
    • imports attachments
    • imports labels as tags
  • imports children (recursive operation)
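
The per-page steps above can be sketched as a recursive walk that pushes events to a listener instead of building a tree in memory. This is only an illustration: PageWalkSketch, its Listener interface and the two lookup maps are hypothetical stand-ins for the indexed package and the output filter, and the ordering of events (revisions first, then children) is an assumption based on the list above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical recursive page walk that streams events to a listener. */
final class PageWalkSketch {
    /** Stand-in for the output filter that receives the events. */
    interface Listener {
        void onPage(String title);
        void onRevision(String title, int number);
    }

    // Stand-ins for lookups in the indexed package.
    private final Map<String, Integer> revisionCounts;
    private final Map<String, List<String>> children;

    PageWalkSketch(Map<String, Integer> revisionCounts, Map<String, List<String>> children) {
        this.revisionCounts = revisionCounts;
        this.children = children;
    }

    /** Sends one page, its revisions, then its children (recursively). */
    void sendPage(String title, Listener listener) {
        listener.onPage(title);
        // Pages missing from the map are assumed to have a single revision.
        int revisions = revisionCounts.getOrDefault(title, 1);
        for (int number = 1; number <= revisions; number++) {
            listener.onRevision(title, number);
        }
        for (String child : children.getOrDefault(title, List.of())) {
            sendPage(child, listener);
        }
    }

    /** Convenience for experiments: records the events as strings. */
    static List<String> trace(PageWalkSketch walk, String root) {
        List<String> events = new ArrayList<>();
        walk.sendPage(root, new Listener() {
            @Override
            public void onPage(String title) {
                events.add("page:" + title);
            }

            @Override
            public void onRevision(String title, int number) {
                events.add("revision:" + title + "#" + number);
            }
        });
        return events;
    }
}
```

Because each event is handed to the listener as soon as it is produced, memory use depends on the depth of the page tree rather than on the total number of pages.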

Finally, the Confluence package is closed and the property files are removed. By default this happens synchronously, but a parameter can make the cleanup asynchronous so that the job does not have to wait for it to end. Another parameter skips the cleanup altogether. This is mostly useful for debugging: it lets you inspect the extracted properties, and skip the parsing phase if you run several imports of the same package.
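
The three cleanup behaviours can be sketched as follows. The names CleanupSketch, Mode and close are hypothetical and do not reflect the actual parameter names of the filter; the sketch only shows how synchronous, asynchronous and skipped cleanup of a temporary directory might differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Stream;

/** Hypothetical cleanup of the extracted package directory. */
final class CleanupSketch {
    /** Assumed modes: delete now, delete in the background, or keep the files. */
    enum Mode { SYNC, ASYNC, NONE }

    /** Returns a future that completes once the cleanup (if any) is done. */
    static CompletableFuture<Void> close(Path directory, Mode mode) {
        switch (mode) {
            case NONE:
                // Keep the files, e.g. to inspect them or reuse them later.
                return CompletableFuture.completedFuture(null);
            case ASYNC:
                // The caller can return immediately; deletion runs on another thread.
                return CompletableFuture.runAsync(() -> deleteRecursively(directory));
            default:
                deleteRecursively(directory);
                return CompletableFuture.completedFuture(null);
        }
    }

    private static void deleteRecursively(Path directory) {
        try (Stream<Path> paths = Files.walk(directory)) {
            // Reverse order so that files are deleted before their directories.
            paths.sorted(Comparator.reverseOrder()).forEach(path -> {
                try {
                    Files.delete(path);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the asynchronous case the caller simply ignores the returned future, which is what lets the job end before the files are gone.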

The starting point is the read method in ConfluenceInputFilterStream. This is where any new developer should probably go to start hacking on Confluence XML.
