Core

Last modified by Marius Dumitru Florea on 2022/11/25 10:52

Schema
Encoding Dynamic Field Names
Dynamic Field Aliases
Faceting on Object Properties
Sorting on Object Properties
Manipulate links

Schema

Shared Fields

Some fields need to be shared by all indexed entities. The wiki, space and name information is shared because each indexed entity is in our case either a document or held by a document.

Name	Description
id	A unique identifier of the entity among all the indexed entities
type	The type of entity that is indexed. E.g. document, attachment, object, object property etc.
wiki
space	Deprecated since 7.2, use the spaces multivalued field instead. The local space reference. For a document A.B.C.Page the value of this field is A.B.C. This field is analyzed and thus used for free text search.
spaces	The space names. E.g. for a document A.B.C.Page the value is ['A', 'B', 'C']. This field is analyzed and thus mostly used for free text search.
space_exact	We index the local space reference (e.g. A.B\.1.C) verbatim for exact matching.
space_facet	We also need a dedicated field for hierarchical faceting on nested spaces. This field is used to implement a 'facet.prefix'-based drill down. E.g. for a document A.B.C.Page this field will hold ['0/A.', '1/A.B.', '2/A.B.C.'].
space_prefix	This field is used to match descendant documents. A query such as space_prefix:A.B will match the documents from space A.B and all its descendants (like A.B.C). This is possible because this field holds the local references of all the ancestor spaces of a document (i.e. all the prefixes of the space reference). E.g. for a document A.B.C.Page this field will hold ['A', 'A.B', 'A.B.C']. As a consequence, searching for space_prefix:A.B will match A.B.C.Page. NOTE: We don't use the PathHierarchyTokenizer because it doesn't support specifying an escaping character. We compute the values ourselves at index time as a workaround.
name	The document name. This field is analyzed and thus mostly used for free text search.
name_exact	We also need to store the document name verbatim for faceting (exact matching). This facet is useful for attachments and objects for instance.
locale
locales	The list of locales covered by this entry. Dynamically determined from the list of enabled locales and the various locales of the document.
language	Contains only the language part of the locale
hidden	Whether the entity is hidden or not. Only documents can be made hidden explicitly. The attachments, objects and object properties are hidden if the document that holds them is hidden.
links	XWiki 14.8+. The references of the resources targeted by the links found in the entity.
links_extended	XWiki 14.8+. Contains the values of links plus all their parent references, to make it easier to search or clean up links at any entity level (per wiki, per space, per document, etc.).
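
The space-related values described above could be computed at index time roughly as follows. This is only a minimal sketch: the function names are hypothetical, and it assumes that the dot is the space separator and the backslash is the escape character in a local space reference.

```python
def escape_space_name(name):
    """Escape the separator and the escape character inside one space name."""
    # Assumption: '\' escapes and '.' separates, as in the A.B\.1.C example.
    return name.replace("\\", "\\\\").replace(".", "\\.")


def space_fields(space_names):
    """Compute the space-related field values from the list of space names."""
    prefixes = []
    facets = []
    current = ""
    for depth, name in enumerate(space_names):
        escaped = escape_space_name(name)
        current = escaped if not current else current + "." + escaped
        # One entry per ancestor space, used to match descendants.
        prefixes.append(current)
        # Depth-prefixed entry, used for 'facet.prefix'-based drill down.
        facets.append(f"{depth}/{current}.")
    return {
        "spaces": list(space_names),
        "space_exact": current,
        "space_prefix": prefixes,
        "space_facet": facets,
    }


fields = space_fields(["A", "B", "C"])
print(fields["space_prefix"])
print(fields["space_facet"])
```

For the document A.B.C.Page this produces the space_prefix and space_facet values listed in the table.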

Document Fields

First of all we need to index the document title, content and metadata.

Name	Description
fullname
title_*	The localized title, indexed based on the document locale. E.g. title_ro
title_sort	We need a dedicated field for sort because analyzed fields cannot be used for sort.
doccontent_*	The rendered document content (transformations are not executed). E.g. doccontent_pt_BR. NOTE: We added the 'doc' prefix instead of keeping just 'content' because we wanted to be able to use a different boost value for the document content than for the object content (objcontent) and the attachment content (attcontent); see the 'qf' parameter in solrconfig.xml.
doccontentraw_*
version	We need to index the document version (revision, e.g. '2.4') to be able to detect when the index is not up to date (not in sync with the database). This check is performed at XWiki startup for instance (see IndexerJob#addMissing).
comment_*	The localized version summary. A brief description of the changes made in the latest version. E.g. comment_en
doclocale	Contains the technical locale of the document (i.e. empty for default entry)
author	The last author. This field is used for faceting (exact matching).
author_display	The last author, this time analyzed and thus used for free text search.
author_display_sort
creator	The document creator, stored verbatim for faceting (exact matching)
creator_display	The document creator, this time analyzed and thus used for free text search.
date
creationDate

Then, in order to avoid joins, we need to index the objects. We try to make the structured data flat using dynamic fields.

Name	Description
class/object	The type of objects stored by this document. E.g. [Blog.BlogPostClass, XWiki.TagClass, ..]
objcontent_*	This field collects the values from all the properties of all the objects found on the indexed document. It uses the "propertyName : propertyValue" format. This field is analyzed based on the document locale. E.g. objcontent_ro
object.aSpace.aClass_*	Dynamic multiValued field indexing the entire content of the objects of the specified type. All values are indexed as localized text, using the document locale. E.g. object.XWiki.TagClass_fr
property.aSpace.aClass.aPropertyName_*	Dynamic multiValued field indexing the value of the specified property. For static lists, we index both the raw value (what is saved in the database) and the display value (what the user sees, which is specified in the XClass). Property values are indexed based on their type. E.g. property.Blog.BlogPostClass.published_boolean, property.Blog.BlogPostClass.publishDate_date, property.Blog.BlogPostClass.category_string, property.Blog.BlogPostClass.summary_en
property.aSpace.aClass.aPropertyName_sort*	Dedicated field for sorting on property values. We need this because Solr doesn't support sorting on multiValued fields. E.g. property.Blog.BlogPostClass.publishDate_sortDate

Notice that we index the property name only in the objcontent field (mixed with the property value). We don't have a dedicated field for this, i.e. the object property names appear only in the field names and not in the field values. Do we need to index the property names? We index the class names because we want to filter documents of a given type. Is there a real use case where we need to find documents that have objects with a given property?

Non-string XObject properties should be indexed based on their type. This means we'll be able to write type-specific constraints in a Solr query (e.g. ranges) for Boolean, Number (int, long, float, double) and Date properties:

property.Blog.BlogPostClass.publishDate:[NOW-1MONTH TO NOW]

We can achieve this by suffixing the field name with the type name: property.Blog.BlogPostClass.publishDate_date. But in order to use just the field name in the Solr query we need Dynamic Field Aliases.

Note that only the String and the TextArea properties should be indexed as localized text (depending on the document locale). For the rest of the string-based properties (Access Right, List of Users, List of Groups, DBList, etc.) we should use the "string" Solr field type to index/store the property value verbatim in order to be able to perform exact matches on these properties. For StaticList, we need to index the raw value (what is saved in the database) as string (so verbatim), and the display value (what is specified in the XClass) as localized text (so analysed).
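
The suffixing rules above can be summarised with a small sketch. The type names used as dictionary keys and the function name are hypothetical, and the numeric suffixes (int, long, float, double) are assumed from the list of number types mentioned earlier. Only the boolean, date, string and locale suffixes appear in the examples on this page.

```python
# Hypothetical mapping from an XClass property type to a Solr field suffix.
TYPE_SUFFIXES = {
    "Boolean": "boolean",
    "Date": "date",
    "Integer": "int",
    "Long": "long",
    "Float": "float",
    "Double": "double",
}

# Only these property types are indexed as localized (analyzed) text.
LOCALIZED_TYPES = {"String", "TextArea"}


def property_field_names(class_name, property_name, property_type, locale):
    """Return the dynamic field names used to index one property."""
    base = f"property.{class_name}.{property_name}"
    if property_type in LOCALIZED_TYPES:
        return [f"{base}_{locale}"]
    if property_type == "StaticList":
        # Raw value stored verbatim, display value indexed as localized text.
        return [f"{base}_string", f"{base}_{locale}"]
    # Every other string-based property falls back to the verbatim 'string' type.
    return [f"{base}_{TYPE_SUFFIXES.get(property_type, 'string')}"]


print(property_field_names("Blog.BlogPostClass", "publishDate", "Date", "en"))
print(property_field_names("Blog.BlogPostClass", "summary", "TextArea", "en"))
```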

A problem with dynamic fields is that we can get invalid field names. The documentation says:

field names should consist of alphanumeric or underscore characters only and not start with a digit. This is not currently strictly enforced, but other field names will not have first class support from all components and back compatibility is not guaranteed. Names with both leading and trailing underscores (e.g. _version_) are reserved.

The class name will surely have a dot and it can also contain other invalid characters (if it's not a standard XWiki class).

Another problem is that Solr doesn't support dynamic fields as default fields, i.e. as fields that are matched when you search for free text (without field:value in the query). This is not a problem for the search results, as dynamic fields like object.* and property.* are copied and aggregated in objcontent, which is a default field. The issue is that we can't know exactly which XClass property was matched; we just know that the searched free text was found inside an object.

When searching for documents we should also take the attachments into account.

Name	Description
filename	The name of files attached to this document. E.g. ['todo.txt', 'image.png']
filename_exact	We also need to store the file names verbatim (without analysing them) for exact/prefix matching.
mimetype	The list of attachment media types. E.g. ['text/plain', 'image/png']
attauthor	The absolute references of the users that uploaded the last version of each of the document attachments. Used for faceting (exact matching). E.g. ['math:XWiki.mflorea', 'gang:XWiki.vmassol', 'xwiki:XWiki.evalica']
attauthor_display	Same as attauthor but indexes the real user name instead of the reference (alias) and it is used for free text search. E.g. ['Thomas Mortagne', 'Florea Marius Dumitru']
attdate	The dates when the attachments have been uploaded (their last version).
attcontent_*	The content of each attachment, indexed based on the document locale. E.g. attcontent_en : ['content of first attachment', 'content of second attachment']
attsize	The size of each attachment in bytes.

All the attachment fields listed above are multivalued. The problem with this solution is that the relation between the fields of the same attachment is kept only in the form of the value index (e.g. the 3rd attachment size corresponds to the 3rd attachment name), which can't be used in queries. In other words, we won't be able to query for documents that have a text/plain file which contains a given word. We will only be able to query for documents that have a text/plain file and a file (not necessarily the same one!) which contains the given word.

Another problem is that Solr / Lucene doesn't tell us the index of the value that has been matched from a multivalued field like attcontent so we won't know which attachment has been matched (e.g. if Solr would tell us that the 2nd value from attcontent is matched then we would know the 2nd attachment is matched).
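
The loss of correlation between parallel multivalued fields can be illustrated outside Solr with plain lists. This is not Solr code, just a sketch with made-up data and hypothetical function names: the first function mimics what a query such as mimetype:X AND attcontent_en:word can check on a document row, while the second checks what we would actually like to express.

```python
# A document row with two attachments, stored as parallel multivalued fields.
document = {
    "filename": ["todo.txt", "report.pdf"],
    "mimetype": ["text/plain", "application/pdf"],
    "attcontent_en": ["buy milk", "quarterly invoice"],
}


def matches_anywhere(doc, mimetype, word):
    """Each condition is checked against all values, independently."""
    has_type = mimetype in doc["mimetype"]
    has_word = any(word in content.split() for content in doc["attcontent_en"])
    return has_type and has_word


def matches_same_attachment(doc, mimetype, word):
    """Both conditions must hold at the same value index."""
    pairs = zip(doc["mimetype"], doc["attcontent_en"])
    return any(m == mimetype and word in c.split() for m, c in pairs)


# 'invoice' only occurs in the PDF, yet the document-level check succeeds.
print(matches_anywhere(document, "text/plain", "invoice"))
print(matches_same_attachment(document, "text/plain", "invoice"))
```

The first call reports a match although no text/plain attachment contains the word, which is exactly the false positive described above.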

Other solutions for indexing the attachments inside the document rows are:

- use a dynamic field, e.g. attachment.image.png_*, but we'll hit invalid field names immediately because the file name can contain almost any character
- aggregate all the information from each attachment in a static multivalued field (e.g. attachment: ['data of 1st attachment', 'data of 2nd attachment', ..])
- aggregate the information from all attachments in a static single valued field

None of these solutions fixes the problems mentioned above. When the relation between the attachment fields is important (i.e. we're looking for an attachment that must have 2 or more fields matching some criteria), we can search for attachments only (type:ATTACHMENT).

Object Fields

Name	Description
class	The object type. E.g. Blog.BlogPostClass
number	The object number, identifies an object when there are multiple objects of the same type on a document.
objcontent_*	This field collects the values from all the properties of the indexed object. It uses the "propertyName : propertyValue" format. This field is analyzed based on the document locale. E.g. objcontent_ro
property.aName_*	Dynamic multiValued field indexing the value of the specified property. For static lists, we index both the raw value (what is saved in the database) and the display value (what the user sees, which is specified in the XClass). Property values are indexed based on their type. E.g. property.published_boolean, property.publishDate_date, property.category_string, property.summary_en

Attachment Fields

Name	Description
filename	The attachment file name. E.g. ['todo.txt']
filename_exact	We also need to store the attachment file name verbatim (without analysing it) for exact/prefix matching.
filename_sort	The attachment file name used for sorting
mimetype	The attachment media type. E.g. ['text/plain']
attversion	We need to index the attachment version (revision) to be able to detect when the Solr index is out of date (not in sync with the database). E.g. 1.2
attauthor	The absolute reference of the user that uploaded the last version of the attachment. Used for faceting (exact matching). E.g. ['gang:XWiki.vmassol']
attauthor_display	The real name of the user that uploaded the last version of the attachment. Used for free text search. E.g. ['Ecaterina Moraru']
attauthor_display_sort	Same as attauthor_display but used for sorting (single valued).
attdate	The date when the last version of the attachment was uploaded.
attdate_sort	We need a dedicated field for sort because the corresponding field is multiValued (attdate is reused on document rows, see above, and a document can have multiple attachments) and Solr doesn't support sorting on multiValued fields
attcontent_*	The content of the last version of the attachment, indexed based on the document locale. E.g. attcontent_en
attsize	The size, in bytes, of the last version of the attachment
attsize_sort	Needed for sort because attsize is multiValued. See attdate_sort.

Encoding Dynamic Field Names

We need to support special characters in dynamic field names. One solution is to use an encoding scheme similar to URL-encoding. We cannot use URL-encoding directly because '+' (plus) and '%' (percent) have special meaning in the Solr query syntax. Also, we don't want to encode Unicode letters.

E.g. "Somé Spâce.Bob's Claß" would be encoded as "Somé$20Spâce.Bob$27s$20Claß"
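
A minimal sketch of such an encoding is shown below. It assumes that '$' followed by two uppercase hexadecimal digits replaces each UTF-8 byte of a special character, and that letters, digits, '.' and '_' are left untouched. The function names are hypothetical, and the handling of characters outside the example (for instance '$' itself) is an assumption.

```python
import re


def encode_field_name(name):
    """Encode special characters as $XX, keeping Unicode letters and digits."""
    encoded = []
    for char in name:
        if char.isalnum() or char in "._":
            encoded.append(char)
        else:
            # One $XX group per UTF-8 byte, so '$' itself becomes $24.
            encoded.extend("$%02X" % byte for byte in char.encode("utf-8"))
    return "".join(encoded)


def decode_field_name(encoded):
    """Reverse encode_field_name by decoding each run of $XX groups."""
    def decode_run(match):
        return bytes.fromhex(match.group(0).replace("$", "")).decode("utf-8")

    return re.sub(r"(?:\$[0-9A-F]{2})+", decode_run, encoded)


print(encode_field_name("Somé Spâce.Bob's Claß"))
```

Because '$' is always encoded, every '$' in an encoded name starts an escape sequence, which keeps the scheme reversible.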

Also, it would be nice to be able to extract the class and property reference from a field name in order to display the location where the search text has been found. We can't use the default class / property reference serialization syntax because '\' and '^' have special meaning in the Solr query syntax. One solution is to implement a simple serialization syntax that uses only '.' as entity separator, where the dot is escaped by repeating it.

E.g. "wiki:Some\.Space.My\.Class^color" would be serialized as "wiki.Some..Space.My..Class.color"
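
The dot-doubling syntax could look like the following sketch. It works on a list of already-split entity names (wiki, space, class, property) rather than parsing the default reference syntax, and the function names are hypothetical.

```python
def serialize_reference(names):
    """Join entity names with '.', escaping inner dots by doubling them."""
    return ".".join(name.replace(".", "..") for name in names)


def parse_reference(serialized):
    """Split on single dots, reading a doubled dot as a literal dot."""
    names = []
    current = []
    index = 0
    while index < len(serialized):
        if serialized.startswith("..", index):
            current.append(".")
            index += 2
        elif serialized[index] == ".":
            names.append("".join(current))
            current = []
            index += 1
        else:
            current.append(serialized[index])
            index += 1
    names.append("".join(current))
    return names


print(serialize_reference(["wiki", "Some.Space", "My.Class", "color"]))
```

One limitation of this syntax is worth keeping in mind: names that start or end with a dot make the result ambiguous, since ['a.', 'b'] and ['a', '.b'] both serialize to 'a...b'.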

Dynamic Field Aliases

We have a few dynamic fields, such as object.* and property.*, that are multilingual fields so they are indexed in multiple languages. We need support for dynamic aliases (for dynamic fields) so that we can write:

object:Blog.BlogPostClass AND property.Blog.BlogPostClass.title:text AND object.XWiki.TagClass:news

and it will be expanded into

object:Blog.BlogPostClass AND
(property.Blog.BlogPostClass.title_en:text OR property.Blog.BlogPostClass.title_fr:text OR ...) AND
(object.XWiki.TagClass_en:news OR object.XWiki.TagClass_fr:news OR ...)
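
The expansion could be sketched as a simple query rewrite. This is only an illustration under strong assumptions: it splits the query on whitespace instead of using a real query parser, it only handles the locale suffixes (not typed suffixes such as _string or _date), and the function names and the list of locales are hypothetical.

```python
# Prefixes of the dynamic fields that are indexed once per locale.
MULTILINGUAL_PREFIXES = ("object.", "property.")


def expand_term(field, value, locales):
    """Expand one alias into an OR group over the localized fields."""
    alternatives = [f"{field}_{locale}:{value}" for locale in locales]
    return "(" + " OR ".join(alternatives) + ")"


def expand_query(query, locales):
    """Rewrite every field:value term that uses a multilingual alias."""
    expanded = []
    for token in query.split():
        field, separator, value = token.partition(":")
        if separator and field.startswith(MULTILINGUAL_PREFIXES):
            expanded.append(expand_term(field, value, locales))
        else:
            # Operators and static fields (e.g. object:...) are kept as-is.
            expanded.append(token)
    return " ".join(expanded)


query = ("object:Blog.BlogPostClass AND "
         "property.Blog.BlogPostClass.title:text AND "
         "object.XWiki.TagClass:news")
print(expand_query(query, ["en", "fr"]))
```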

Faceting on Object Properties

We need to be able to add facets on an XObject property using the Query Module API:

#set ($discard = $query.bindValue('facet.field', ['someOtherField', 'property.Test.TestClass.staticList1_string']))

The 'string' suffix means the property was indexed/stored verbatim (without being analysed). Read above to understand why we suffix the field name with the data type. The facet can be triggered with this query:

object:Test.TestClass

Sorting on Object Properties

We should also be able to sort the document search results based on a property value using the Query Module API:

#set ($discard = $query.bindValue('sort', 'property.Test.TestClass.staticList1_sortString asc'))

The 'sortString' suffix is the dynamic type that is used for sorting. Other types are 'sortBoolean', 'sortInt', 'sortLong', 'sortDouble', 'sortFloat' and 'sortDate'. Note that Solr doesn't support sorting on multivalued fields. The documentation says:

Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single Term (ie: uses the KeywordTokenizer).

If you try to sort on a multivalued field you'll get:

Caused by: org.apache.solr.common.SolrException: can not sort on multivalued field: property.Test.TestClass.staticList1_string
at org.apache.solr.schema.SchemaField.checkSortability(SchemaField.java:155)

That's why we need dedicated 'sortXXX' fields that are single valued. The consequence is that only the last value of a property is used for sorting (you can have multiple values either because the property supports multiple selection or because there are multiple objects of the same type on the indexed document). Note that XObject properties are indexed using multivalued dynamic fields (we cannot know beforehand what properties a user-defined XClass will have and if a property supports multiple selection or if a document can have multiple objects of a given type).
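
The "last value wins" behaviour can be illustrated with a plain dictionary standing in for the indexed fields. This is a sketch, not the actual indexing code: the function name is hypothetical and the sort suffix is simply derived from the type name (string gives sortString, date gives sortDate).

```python
def index_property(fields, name, value_type, values):
    """Add property values to a field map, keeping a single-valued sort field."""
    # Multivalued field: values from all objects and selections accumulate.
    fields.setdefault(f"{name}_{value_type}", []).extend(values)
    if values:
        # Single-valued sort field: each call overwrites the previous value.
        sort_type = "sort" + value_type[0].upper() + value_type[1:]
        fields[f"{name}_{sort_type}"] = values[-1]
    return fields


fields = {}
# First object has two selected values, second object has one.
index_property(fields, "property.Test.TestClass.staticList1", "string", ["b", "a"])
index_property(fields, "property.Test.TestClass.staticList1", "string", ["c"])
print(fields)
```

After both objects are indexed, the multivalued field holds all three values while the sort field only holds the value indexed last.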

Another option for sorting on fields that have multiple values could be to use a function but I can't find one that returns a single value from a multiValued field.

#set ($discard = $query.bindValue('sort', 'aFunctionThatSelectsOneValue(property.Test.TestClass.staticList1_string) asc'))

Manipulate links

Here are some examples of indexed links.

Example:

Wiki content in document `xwiki:Main.WebHome`:

[[doc:Space.Document]]
[[attach:Space.OtherDocument@Attachment]]
[[page:Page1/Page2]]

Resulting index:

links:
- document:xwiki:Space.Document
- attachment:xwiki:Space.OtherDocument@Attachment
- page:xwiki:Page1/Page2
links_extended:
- document:xwiki:Space.Document
- attachment:xwiki:Space.OtherDocument@Attachment
- page:xwiki:Page1/Page2
- wiki:xwiki
- space:xwiki:Space
- document:xwiki:Space.OtherDocument
- page:xwiki:Page1
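
The way links_extended is derived from links in this example can be sketched as follows. The parsing is deliberately naive (no support for escaped separators) and the function names are hypothetical. The handling of nested spaces is an assumption, since the example only contains a single-level space.

```python
def parent_references(link):
    """Return the ancestors of a typed link such as 'document:xwiki:Space.Document'."""
    entity_type, wiki, local = link.split(":", 2)
    parents = []
    if entity_type == "attachment":
        # The parent of an attachment is the document that holds it.
        local = local.rsplit("@", 1)[0]
        entity_type = "document"
        parents.append(f"document:{wiki}:{local}")
    if entity_type == "document":
        # The parent of a document is its space.
        local = local.rsplit(".", 1)[0]
        entity_type = "space"
        parents.append(f"space:{wiki}:{local}")
    if entity_type == "space":
        # Assumption: nested spaces contribute all their ancestor spaces.
        names = local.split(".")
        for depth in range(len(names) - 1, 0, -1):
            parents.append(f"space:{wiki}:{'.'.join(names[:depth])}")
    if entity_type == "page":
        segments = local.split("/")
        for depth in range(len(segments) - 1, 0, -1):
            parents.append(f"page:{wiki}:{'/'.join(segments[:depth])}")
    parents.append(f"wiki:{wiki}")
    return parents


def extended_links(links):
    """Return the links followed by their parents, without duplicates."""
    extended = list(links)
    for link in links:
        for parent in parent_references(link):
            if parent not in extended:
                extended.append(parent)
    return extended


links = [
    "document:xwiki:Space.Document",
    "attachment:xwiki:Space.OtherDocument@Attachment",
    "page:xwiki:Page1/Page2",
]
print(extended_links(links))
```

For the three links above this yields the same set of values as the links_extended listing, although the order of the entries may differ.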
