Explore Confluence Exports
Steps
You might need to explore a Confluence package to understand a bug or to figure out how to implement a new feature. For instance, when we needed to implement space template conversion, exploring a package showed us that templates are actually PageTemplate objects that look like Page objects.
Exploring a Confluence package can be challenging, mainly because of its size: it can be huge. We also don't have much dedicated tooling for exploring Confluence packages, so we'll have to make do with standard tools.
Transferring data
Your Confluence package could be living on a remote server, and transferring a huge file might be your first challenge. Here are a few strategies to reduce the amount of data you need to transfer.
First, the Confluence package is a zip file whose size is often mostly taken up by attachments that don't compress well. Often enough, you are only interested in the entities.xml file (or, more rarely, the exportDescriptor.properties file). You can see their sizes with unzip -l (name the file you want, or else you'll be spammed with info about all the files in the archive):
unzip -l 5.9.zip entities.xml
Archive: 5.9.zip
Length Date Time Name
--------- ---------- ----- ----
288084 2016-05-10 17:04 entities.xml
--------- -------
288084 1 file
unzip -l 5.9.zip exportDescriptor.properties
Archive: 5.9.zip
Length Date Time Name
--------- ---------- ----- ----
569 2016-05-10 17:04 exportDescriptor.properties
--------- -------
569 1 file
If small enough, you can print the whole file with unzip -p:
unzip -p 5.9.zip exportDescriptor.properties
#Tue May 10 17:04:43 CEST 2016
buildNumber=6169
ao.data.version.com.atlassian.mywork.mywork-confluence-host-plugin=4.0.1
ao.data.version.min.com.atlassian.confluence.plugins.confluence-space-ia=5.0
exportType=all
createdByBuildNumber=6442
backupAttachments=true
defaultUsersGroup=confluence-users
ao.data.list=com.atlassian.mywork.mywork-confluence-host-plugin, com.atlassian.confluence.plugins.confluence-space-ia
ao.data.version.min.com.atlassian.mywork.mywork-confluence-host-plugin=1.1.30
ao.data.version.com.atlassian.confluence.plugins.confluence-space-ia=13.0.4
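When you need these values in a script rather than on screen, a minimal Python sketch like the following can read the descriptor straight from the archive. The function name read_descriptor is hypothetical, and the parsing is deliberately simplified: it assumes plain key=value lines as in the output above and ignores the escapes and line continuations that the Java properties format allows.

```python
import zipfile


def read_descriptor(package_path, member="exportDescriptor.properties"):
    """Return the export descriptor as a dict, read directly from the zip."""
    with zipfile.ZipFile(package_path) as package:
        # Java .properties files are ISO-8859-1 encoded by default.
        text = package.read(member).decode("iso-8859-1")
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments such as the leading timestamp line.
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties
```

For instance, read_descriptor("5.9.zip").get("buildNumber") would give the build number of the exporting instance as a string.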
The entities.xml file can be huge, though. Here, you have a few options:
- grep the relevant parts, especially if you are on a restricted connection, or if you are using Admin Tools to run commands (see next section)
- create an archive containing only entities.xml
The entities.xml, although huge, should compress very well and this might allow you to get the full file:
unzip -p 5.9.zip entities.xml | gzip > 5.9.entities.xml.gz
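If unzip is not available where the package lives but Python is, the same two steps (checking sizes, then repacking only entities.xml) can be sketched with the standard library. The function names entry_sizes and repack_entities are hypothetical; the sketch assumes the package is a regular zip file with entities.xml at its root, as in the listings above.

```python
import gzip
import shutil
import zipfile


def entry_sizes(package_path, names=("entities.xml", "exportDescriptor.properties")):
    """Return {name: uncompressed size in bytes} for the entries that exist."""
    with zipfile.ZipFile(package_path) as package:
        present = set(package.namelist())
        return {n: package.getinfo(n).file_size for n in names if n in present}


def repack_entities(package_path, output_path, member="entities.xml"):
    """Stream one zip member into a .gz file without extracting attachments."""
    with zipfile.ZipFile(package_path) as package:
        with package.open(member) as source, gzip.open(output_path, "wb") as target:
            # copyfileobj works in chunks, so a huge entities.xml never
            # has to fit in memory.
            shutil.copyfileobj(source, target)
```

Calling repack_entities("5.9.zip", "5.9.entities.xml.gz") is meant to produce the same kind of file as the pipeline above.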
Grep the relevant parts
If you have access to a Confluence package from the command line (locally, on a remote ssh server, or through the Admin tools extension), grepping the relevant parts can be a very efficient way to get what you need.
For instance, if you need the body content of a page, or other things related to a particular page, you can grep its page id (which you can find from a migrated document in its ConfluencePageClass object):
unzip -p 5.9.zip entities.xml | grep -C5 196616
<property name="bodyType">2</property>
</object>
<object class="BodyContent" package="com.atlassian.confluence.core">
<id name="id">458755</id>
<property name="body"><![CDATA[<p><br />You can start a discussion by simply leaving a comment on a page, like this one.</p><p>Why not give it a try?</p><p>Go to the bottom of this page and start typing in the comment area. When you're finished just press save! </p><p>Don't just confine your comments to the bottom of the page - highlight some text on the page to add an inline comment like this:</p><p><ac:image><ri:attachment ri:filename="Step8-01.png" /></ac:image></p><p><strong>Hint:</strong> You can mention another user in a page or comment by typing @ and then the user's name. <br />The user will be notified that you mentioned them.</p><h1 style="text-align: center;"><span style="color: rgb(51,51,51);"><br /></span></h1><h1 style="text-align: center;"><span style="color: rgb(51,51,51);"><br /></span></h1><h1 style="text-align: center;"><ac:link><ri:page ri:content-title="Learn the wonders of autoconvert (step 7 of 9)" /><ac:link-body><ac:image ac:height="40" ac:width="106"><ri:attachment ri:filename="prev.jpg"><ri:page ri:content-title="Let's edit this page (step 3 of 9)" /></ri:attachment></ac:image></ac:link-body></ac:link> <ac:link><ri:page ri:content-title="Welcome to Confluence" /><ac:link-body><ac:image><ri:attachment ri:filename="home.jpg"><ri:page ri:content-title="Let's edit this page (step 3 of 9)" /></ri:attachment></ac:image></ac:link-body></ac:link> <ac:link><ri:page ri:content-title="Share your page with a team member (step 9 of 9)" /><ac:link-body><ac:image><ri:attachment ri:filename="next.jpg"><ri:page ri:content-title="Let's edit this page (step 3 of 9)" /></ri:attachment></ac:image></ac:link-body></ac:link></h1><p><span style="color: rgb(51,51,51);"><br /></span></p>]]></property>
<property name="content" class="Page" package="com.atlassian.confluence.pages"><id name="id">196616</id>
</property>
<property name="bodyType">2</property>
</object>
<object class="BodyContent" package="com.atlassian.confluence.core">
<id name="id">458753</id>
--
<id name="id">426027</id>
<property name="destinationPageTitle"><![CDATA[Learn the wonders of autoconvert (step 7 of 9)]]></property>
<property name="lowerDestinationPageTitle"><![CDATA[learn the wonders of autoconvert (step 7 of 9)]]></property>
<property name="destinationSpaceKey"><![CDATA[ds]]></property>
<property name="lowerDestinationSpaceKey"><![CDATA[ds]]></property>
<property name="sourceContent" class="Page" package="com.atlassian.confluence.pages"><id name="id">196616</id>
</property>
<property name="creationDate">2015-10-20 11:05:06.966</property>
<property name="lastModificationDate">2016-05-10 15:00:04.075</property>
</object>
<object class="OutgoingLink" package="com.atlassian.confluence.links">
--
<id name="id">426032</id>
<property name="destinationPageTitle"><![CDATA[Share your page with a team member (step 9 of 9)]]></property>
<property name="lowerDestinationPageTitle"><![CDATA[share your page with a team member (step 9 of 9)]]></property>
<property name="destinationSpaceKey"><![CDATA[ds]]></property>
<property name="lowerDestinationSpaceKey"><![CDATA[ds]]></property>
<property name="sourceContent" class="Page" package="com.atlassian.confluence.pages"><id name="id">196616</id>
</property>
<property name="creationDate">2015-10-20 11:05:06.966</property>
<property name="lastModificationDate">2016-05-10 15:00:04.075</property>
</object>
<object class="OutgoingLink" package="com.atlassian.confluence.links">
<id name="id">426033</id>
<property name="destinationPageTitle"><![CDATA[Welcome to Confluence]]></property>
<property name="lowerDestinationPageTitle"><![CDATA[welcome to confluence]]></property>
<property name="destinationSpaceKey"><![CDATA[ds]]></property>
<property name="lowerDestinationSpaceKey"><![CDATA[ds]]></property>
<property name="sourceContent" class="Page" package="com.atlassian.confluence.pages"><id name="id">196616</id>
</property>
<property name="creationDate">2015-10-20 11:05:06.966</property>
<property name="lastModificationDate">2016-05-10 15:00:04.075</property>
</object>
<object class="OutgoingLink" package="com.atlassian.confluence.links">
<id name="id">426034</id>
<property name="destinationPageTitle"><![CDATA[Tell people what you think in a comment (step 8 of 9)]]></property>
<property name="lowerDestinationPageTitle"><![CDATA[tell people what you think in a comment (step 8 of 9)]]></property>
<property name="destinationSpaceKey"><![CDATA[ds]]></property>
<property name="lowerDestinationSpaceKey"><![CDATA[ds]]></property>
<property name="sourceContent" class="Page" package="com.atlassian.confluence.pages"><id name="id">196616</id>
</property>
<property name="creationDate">2015-10-20 11:05:06.966</property>
<property name="lastModificationDate">2016-05-10 15:00:04.075</property>
</object>
<object class="OutgoingLink" package="com.atlassian.confluence.links">
--
<id name="id">426023</id>
<property name="destinationPageTitle"><![CDATA[Let's edit this page (step 3 of 9)]]></property>
<property name="lowerDestinationPageTitle"><![CDATA[let's edit this page (step 3 of 9)]]></property>
<property name="destinationSpaceKey"><![CDATA[ds]]></property>
<property name="lowerDestinationSpaceKey"><![CDATA[ds]]></property>
<property name="sourceContent" class="Page" package="com.atlassian.confluence.pages"><id name="id">196616</id>
</property>
<property name="creationDate">2015-10-20 11:05:06.966</property>
<property name="lastModificationDate">2016-05-10 15:00:04.075</property>
</object>
<object class="OutgoingLink" package="com.atlassian.confluence.links">
--
<property name="version">1</property>
<property name="creationDate">2015-10-20 11:05:06.966</property>
<property name="lastModificationDate">2016-05-10 15:00:04.075</property>
<property name="versionComment"><![CDATA[]]></property>
<property name="originalVersionId"/><property name="contentStatus"><![CDATA[current]]></property>
<property name="containerContent" class="Page" package="com.atlassian.confluence.pages"><id name="id">196616</id>
</property>
</object>
<object class="Attachment" package="com.atlassian.confluence.pages">
<id name="id">196624</id>
<property name="space" class="Space" package="com.atlassian.confluence.spaces"><id name="id">360449</id>
--
</element>
<element class="Page" package="com.atlassian.confluence.pages"><id name="id">196618</id>
</element>
<element class="Page" package="com.atlassian.confluence.pages"><id name="id">196617</id>
</element>
<element class="Page" package="com.atlassian.confluence.pages"><id name="id">196616</id>
</element>
<element class="Page" package="com.atlassian.confluence.pages"><id name="id">196620</id>
</element>
<element class="Page" package="com.atlassian.confluence.pages"><id name="id">196611</id>
</element>
--
<element class="Attachment" package="com.atlassian.confluence.pages"><id name="id">196634</id>
</element>
</collection>
</object>
<object class="Page" package="com.atlassian.confluence.pages">
<id name="id">196616</id>
<property name="position">7</property>
<property name="parent" class="Page" package="com.atlassian.confluence.pages"><id name="id">196614</id>
</property>
<collection name="ancestors" class="java.util.List"><element class="Page" package="com.atlassian.confluence.pages"><id name="id">196614</id>
</element>
The -C (Context) flag of grep sets how many lines before and after each match are output. In this example we get a lot of noise, but our body content is there. -C5 is usually good enough because the body content is usually all on one line, and a BodyContent object has few properties and usually references the owning content in its content property.
Other grep parameters of interest:
- -A for specifying the number of lines you want After the match
- -B for specifying the number of lines you want Before the match
- -F to treat the pattern as a fixed string instead of the basic regular expression syntax grep uses by default
- -E or -P for using more powerful regex engines (-P is specific to GNU grep)
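When the grep context is too noisy, a streaming parse can return only the BodyContent objects owned by a given page. The following is a minimal sketch using Python's standard library; the function name body_contents is hypothetical, and it assumes the layout visible in the grep output above (top-level object elements with a class attribute, an id child, and a content property holding the owning page id).

```python
import xml.etree.ElementTree as ET


def body_contents(xml_stream, page_id):
    """Yield (BodyContent id, body text) for bodies owned by page_id."""
    for _event, element in ET.iterparse(xml_stream, events=("end",)):
        if element.tag != "object":
            continue
        if element.get("class") == "BodyContent":
            owner = element.findtext("property[@name='content']/id")
            if owner == page_id:
                yield (element.findtext("id"),
                       element.findtext("property[@name='body']"))
        # Drop the children of each finished object to keep memory use low.
        element.clear()
```

The stream can come straight from the archive, for example from zipfile.ZipFile("5.9.zip").open("entities.xml"), so nothing needs to be extracted to disk first.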
Open entities.xml in a text editor that doesn't break on huge content
Many editors will crash when trying to open a huge entities.xml file. We've successfully opened multi-hundred-megabyte files with KWrite, the simple version of Kate, the text editor from KDE. This makes it quite comfortable to work with the Confluence export. Of course, some operations are very slow. A few tips to survive:
- You'll be performing a lot of searches. If your editor searches as you type, don't type your search string in the search box; type it elsewhere and then copy-paste it so that only one search is performed.
- Search and replace can work, including regex-based search and replace, but some operations may freeze your editor. Be extra careful not to use regexes that are expensive to evaluate.
- Some replace operations might be very expensive as well depending on how your editor is implemented. For example, inserting or removing lines many times may freeze your editor.
- Replace All might be far less painful than validating each replacement and repeatedly waiting for the UI to respond.
Open entities.xml in a Web browser
If your entities.xml is big (from a few kilobytes to a few megabytes) but not huge (empirically, under 20 MB), you can consider opening it in Firefox.
This will allow you to query or clean up the file using the JavaScript DOM API, which is quite comfortable as well. For instance:
Get all Page objects:
document.querySelectorAll("object.Page")
Remove all ConfluenceBandanaRecord objects:
[].forEach.call(document.querySelectorAll("object.ConfluenceBandanaRecord"), o => o.remove())
Print the titles of all current pages:
[].filter.call(document.querySelectorAll("object.Page"), o => o.querySelector("[name='contentStatus']").textContent == 'current').map(o => o.querySelector("[name='title']").textContent).join("\n")
Use XQuery (and XSL)
With tools like Xidel (written in Pascal, likely packaged in your distribution's repositories if you are using one of the big Linux distros) or Xee (written in Rust, which you'll be able to download from its releases page), if your entities.xml is big but not huge, you can use XQuery, a query language tailored for XML, to get information out of your package quite efficiently.
If you are indeed able to use XQuery, it is nice because it allows you to script and document your operations and make them reproducible. It lets you automate a lot of things that you would otherwise need to do manually and repeatedly, which can be tedious.
We will not explain XQuery here; a dedicated tutorial is more suitable for that. However, the following examples should get you started with XQuery on Confluence packages.
Get the object of a page with a specific id:
(
for $x in doc("entities.xml")/hibernate-generic/object
where $x/id = "71150445"
return $x
)[position() = 1]
Get the historical versions of a page:
for $x in doc("entities.xml")/hibernate-generic/object
where $x/id = "71150445"
return $x/collection[@name='historicalVersions']//id/text()
Or:
for $x in doc("entities.xml")/hibernate-generic/object[id/text()='71150445']/collection[@name='historicalVersions']//id/text()
return $x
Or:
for $revId in doc("entities.xml")/hibernate-generic/object[id/text()='71150445']/collection[@name='historicalVersions']//id/text()
return (
for $x in doc("entities.xml")/hibernate-generic/object[id/text()=$revId]
return data($x/property[@name="version"]/text())
)
You can structure the output a bit:
for $revId in doc("entities.xml")/hibernate-generic/object[id/text()='71150445']/collection[@name='historicalVersions']//id/text()
return (
for $x in doc("entities.xml")/hibernate-generic/object[id/text()=$revId]
return <page><version>{data($x/property[@name="version"]/text())}</version><id>{data($x/id/text())}</id></page>
)
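If no XQuery processor is at hand, the same lookup can be sketched with Python's standard library. This is only an illustration of the query above, not a replacement for it: the function name historical_versions is hypothetical, the whole file is loaded in memory (so the "big but not huge" limit still applies), and when several objects share an id the first one is kept, mirroring the [position() = 1] filter used earlier.

```python
import xml.etree.ElementTree as ET


def historical_versions(xml_source, page_id):
    """Return [(version, id)] for the historical versions of page_id."""
    root = ET.parse(xml_source).getroot()
    objects = {}
    for obj in root.iterfind("object"):
        # Keep the first object seen for each id.
        objects.setdefault(obj.findtext("id"), obj)
    page = objects.get(page_id)
    if page is None:
        return []
    result = []
    for rev_id in page.iterfind("collection[@name='historicalVersions']//id"):
        revision = objects.get(rev_id.text)
        if revision is not None:
            version = revision.findtext("property[@name='version']")
            result.append((version, rev_id.text))
    return result
```

Calling historical_versions("entities.xml", "71150445") is meant to return the same version and id pairs that the structured XQuery output lists as page elements.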