XML Module

Last modified by Michael Hamann on 2024/07/05 18:01

Offers XML and HTML/XHTML manipulation and cleaning APIs

Type: JAR
Developed by: XWiki Development Team
License: GNU Lesser General Public License 2.1
Bundled With: XWiki Standard

Description

Features

  • XML Utility methods
  • HTML Utility methods
  • HTML Cleaner: cleans HTML and produces valid XHTML 1.1 or, since XWiki 14.0-rc-1 and if configured, XHTML 5 content.
  • Factory to create optimised XMLReader instances. This adds a level of indirection over using javax.xml.parsers.SAXParserFactory directly. For example, it is used to check whether Xerces is the parser in use and, if so, to configure it to cache parsed DTD grammars for better performance.
  • XMLAttributeValue class (XWiki 12.8-rc-1+) to help add values to an XML attribute.
  • HTMLElementSanitizer component (XWiki 14.6-rc-1+) to check whether HTML elements and attributes/attribute values are considered safe for user-generated content.
  • HTMLScriptService (XWiki 14.10.4+, 15.0-rc-1+) to use the HTMLElementSanitizer in scripts.

HTML Cleaning

The HTML Cleaner works in two steps: it uses the HTMLCleaner library to produce valid XML, then applies a series of transformations to turn the resulting XML into valid XHTML 1.1 content (see the test suite).

Example:

import java.io.StringReader;

import org.junit.Assert;
import org.xwiki.component.embed.EmbeddableComponentManager;
import org.xwiki.xml.html.HTMLCleaner;
import org.xwiki.xml.html.HTMLUtils;

// Initialize the component manager so that component instances can be looked up
EmbeddableComponentManager componentManager = new EmbeddableComponentManager();
componentManager.initialize(this.getClass().getClassLoader());

// Clean an HTML fragment and serialize the resulting DOM document to a string
HTMLCleaner cleaner = componentManager.getInstance(HTMLCleaner.class);
String xhtml = HTMLUtils.toString(cleaner.clean(new StringReader("this <b>is</b> bold")));
Assert.assertEquals("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
    + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
    + "<html><head></head><body>"
    + "<p>this <strong>is</strong> bold</p>"
    + "</body></html>\n", xhtml);

Since XWiki 14.6-rc-1, restricted mode enables a sanitizer filter that uses HTMLElementSanitizer to validate that all elements and attributes used are safe.

HTMLElementSanitizer can also be used directly: its method isElementAllowed(String) checks whether an HTML element is safe, and isAttributeAllowed(String elementName, String attributeName, String value) does the same for attributes and their values. The sanitizer's behavior can be customized with several configuration options in xwiki.properties:

#-------------------------------------------------------------------------------------
# HTML Sanitization
#-------------------------------------------------------------------------------------

#-# [Since 14.6RC1]
#-# The HTML sanitization strategy to use for user-generated content to avoid JavaScript injection. The following
#-# strategies are available by default:
#-# - secure (default): Only allows known elements and attributes that are considered safe. The following options
#-#                     allow customizing its behavior.
#-# - insecure:         Allows everything including JavaScript. Use this only if you absolutely trust everybody who can
#-#                     write wiki syntax (in particular, all users, but also anonymous users commenting when enabled).
# xml.htmlElementSanitizer = secure

#-# [Since 14.6RC1]
#-# Comma-separated list of additional tags that should be allowed by the HTML sanitizer. These tags will be allowed
#-# in addition to the already extensive built-in list of tags that are considered safe. Use with care to avoid
#-# introducing security issues. By default, the following tags are allowed:
#-# HTML tags: https://github.com/xwiki/xwiki-commons/blob/99484d48e899a68a1b6e33d457825b776c6fe8c3/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/HTMLDefinitions.java#L63-L74
#-# SVG tags: https://github.com/xwiki/xwiki-commons/blob/b11eae9d82cb53f32962056b5faa73f3720c6182/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/SVGDefinitions.java#L91-L102
#-# MathML tags: https://github.com/xwiki/xwiki-commons/blob/b11eae9d82cb53f32962056b5faa73f3720c6182/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/MathMLDefinitions.java#L62-L64
# xml.htmlElementSanitizer.extraAllowedTags =

#-# [Since 14.6RC1]
#-# Comma-separated list of additional attributes that should be allowed by the HTML sanitizer. These attributes will
#-# be allowed in addition to the already extensive built-in list of attributes that are considered safe. This option
#-# is useful if your content uses attributes that are invalid in HTML. Use with care to avoid introducing security
#-# issues. By default, the following attributes are allowed:
#-# HTML attributes: https://github.com/xwiki/xwiki-commons/blob/99484d48e899a68a1b6e33d457825b776c6fe8c3/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/HTMLDefinitions.java#L76-L91
#-# SVG attributes: https://github.com/xwiki/xwiki-commons/blob/b11eae9d82cb53f32962056b5faa73f3720c6182/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/SVGDefinitions.java#L66-L89
#-# MathML attributes: https://github.com/xwiki/xwiki-commons/blob/b11eae9d82cb53f32962056b5faa73f3720c6182/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/MathMLDefinitions.java#L73-L79
#-# XML attributes: https://github.com/xwiki/xwiki-commons/blob/b11eae9d82cb53f32962056b5faa73f3720c6182/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/SecureHTMLElementSanitizer.java#L135
# xml.htmlElementSanitizer.extraAllowedAttributes =

#-# [Since 14.6RC1]
#-# Comma-separated list of tags that should be forbidden. This takes precedence over any tags allowed by default or
#-# configured above. This can be used, for example, to forbid video or audio elements.
# xml.htmlElementSanitizer.forbidTags =

#-# [Since 14.6RC1]
#-# Comma-separated list of attributes that should be forbidden. This takes precedence over any attributes allowed by
#-# default or configured above. This can be used, for example, to forbid inline styles by forbidding the "style"
#-# attribute.
# xml.htmlElementSanitizer.forbidAttributes =

#-# [Since 14.6RC1]
#-# If unknown protocols shall be allowed. This means all protocols like "xwiki:" will be allowed in links, however,
#-# script and data-URIs will still be forbidden (for data-URIs see also below).  By default, unknown protocols are
#-# allowed.
# xml.htmlElementSanitizer.allowUnknownProtocols = true

#-# [Since 14.6RC1]
#-# If unknown protocols are disallowed (see above), the (Java) regular expression URIs are matched against.
#-# The default value is ^(?:(?:f|ht)tps?|mailto|tel|callto|cid|xmpp):
# xml.htmlElementSanitizer.allowedUriRegexp = ^(?:(?:f|ht)tps?|mailto|tel|callto|cid|xmpp):

#-# [Since 14.6RC1]
#-# Comma-separated list of additional tags on which data-URIs should be allowed in "src", "xlink:href" or "href".
#-# Adding "a" here, for example, would allow linking to data-URIs which is disabled by default due to the potential of
#-# security issues. Modern browsers should mitigate them, though, see for example
#-# https://blog.mozilla.org/security/2017/11/27/blocking-top-level-navigations-data-urls-firefox-59/ so you could
#-# use this to allow defining images, PDF files or files to be downloaded inline as data-URIs in links.
# xml.htmlElementSanitizer.extraDataUriTags =

#-# [Since 14.6RC1]
#-# Comma-separated list of additional attributes that are considered safe for arbitrary content including
#-# script-URIs, on these attributes the above-mentioned URI-checks aren't used. Use with care to avoid introducing
#-# security issues.
# xml.htmlElementSanitizer.extraURISafeAttributes =
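
To get a feel for what the default xml.htmlElementSanitizer.allowedUriRegexp accepts once allowUnknownProtocols is set to false, the expression can be tried on its own with java.util.regex. This is only a minimal sketch: the class and method names are hypothetical, and how the sanitizer compiles and applies the expression internally (flags, normalization of the value, handling of relative URLs) is not described here, so the use of find() on the raw string is an assumption.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the XML module.
class AllowedUriRegexpDemo {
    // Default value of xml.htmlElementSanitizer.allowedUriRegexp, copied from the configuration above.
    static final Pattern ALLOWED_URI =
        Pattern.compile("^(?:(?:f|ht)tps?|mailto|tel|callto|cid|xmpp):");

    // Assumption: the expression only needs to match the scheme prefix,
    // so find() is used instead of matches().
    static boolean hasAllowedScheme(String uri) {
        return ALLOWED_URI.matcher(uri).find();
    }

    public static void main(String[] args) {
        List<String> samples = List.of(
            "https://www.xwiki.org",
            "ftp://example.org/file",
            "mailto:user@example.org",
            "xwiki:Main.WebHome",
            "javascript:alert(1)");
        for (String uri : samples) {
            System.out.println(uri + " -> " + hasAllowedScheme(uri));
        }
    }
}
```

With the default expression, the first three samples match and the last two do not. A custom protocol such as "xwiki:" is therefore only accepted if it is added to the expression or if allowUnknownProtocols stays at its default of true.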

Since XWiki 15.10, a special filter named restrictedFilterDetector can be added to the list of filters to detect whether restricted mode would affect the outcome of HTML cleaning by removing elements, attributes or comments. Its result is available as a boolean attribute (value true or false) named restrictedFiltering on the document element of the filtered content. This makes it possible to determine automatically whether HTML content could also be cleaned in restricted mode without risking any breakage. This filter cannot be used in restricted mode.
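
Reading the detector's result only needs the standard DOM API, since the cleaner returns an org.w3c.dom.Document. The sketch below is a rough illustration under several assumptions: the class and method names are hypothetical, the document is parsed from a hand-written string instead of coming from the cleaner, a value of true is taken to mean that restricted mode would have removed something, and a missing attribute is treated as false. How the filter is added to the cleaner configuration is not shown.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the XML module.
class RestrictedFilteringCheck {
    // Attribute name taken from the description above.
    static final String ATTRIBUTE = "restrictedFiltering";

    // Reads the boolean attribute from the document element.
    // getAttribute() returns "" when the attribute is absent, which parses to false.
    static boolean wouldBeFiltered(Document cleaned) {
        return Boolean.parseBoolean(cleaned.getDocumentElement().getAttribute(ATTRIBUTE));
    }

    // Stand-in for the cleaner output: parses an XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Failed to parse sample document", e);
        }
    }

    public static void main(String[] args) {
        Document document = parse(
            "<html restrictedFiltering=\"true\"><head/><body><p>text</p></body></html>");
        System.out.println(wouldBeFiltered(document));
    }
}
```

In real use, the Document passed to wouldBeFiltered would be the one returned by the cleaner after running it in non-restricted mode with the detector filter enabled.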

To use the HTML Cleaner, you need the following dependency in your Maven pom.xml (available in Maven's Central Repository):

<dependency>
 <groupId>org.xwiki.commons</groupId>
 <artifactId>xwiki-commons-xml</artifactId>
 <version>3.2-milestone-3</version>
</dependency>
