Core

Last modified by Marius Dumitru Florea on 2022/11/25 10:52

Schema

Shared Fields

Some fields need to be shared by all indexed entities. The wiki, space and name information is shared because each indexed entity is in our case either a document or held by a document.

Name | Description
id | A unique identifier of the entity among all the indexed entities.
type | The type of entity that is indexed. E.g. document, attachment, object, object property etc.
wiki |
space | Deprecated since 7.2, use the spaces multivalued field instead. The local space reference. For a document A.B.C.Page the value of this field is A.B.C. This field is analyzed and thus used for free text search.
spaces | The space names. E.g. for a document A.B.C.Page the value is ['A', 'B', 'C']. This field is analyzed and thus mostly used for free text search.
space_exact | We index the local space reference (e.g. A.B\.1.C) verbatim for exact matching.
space_facet | We also need a dedicated field for hierarchical faceting on nested spaces. This field is used to implement a 'facet.prefix'-based drill down. E.g. for a document A.B.C.Page this field will hold ['0/A.', '1/A.B.', '2/A.B.C.'].
space_prefix | This field is used to match descendant documents. A query such as space_prefix:A.B will match the documents from space A.B and all its descendants (like A.B.C). This is possible because this field holds the local references of all the ancestor spaces of a document (i.e. all the prefixes of the space reference). E.g. for a document A.B.C.Page this field will hold ['A', 'A.B', 'A.B.C']. As a consequence, searching for space_prefix:A.B will match A.B.C.Page. NOTE: We don't use the PathHierarchyTokenizer because it doesn't support specifying an escaping character. We compute the values ourselves at index time as a workaround.
name | The document name. This field is analyzed and thus mostly used for free text search.
name_exact | We also need to store the document name verbatim for faceting (exact matching). This facet is useful for attachments and objects for instance.
locale |
locales | The list of locales covered by this entry. Dynamically determined from the list of enabled locales and the various locales of the document.
language | Contains only the language part of the locale.
hidden | Whether the entity is hidden or not. Only documents can be made hidden explicitly. The attachments, objects and object properties are hidden if the document that holds them is hidden.
links | (XWiki 14.8+) The references of the resources targeted by the links found in the entity.
links_extended | (XWiki 14.8+) Contains the values of links plus all their parent references, to make it easier to search or clean up links at any entity level (per wiki, per space, per document, etc.).
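
The space-related values described above can be computed at index time along the following lines. This is only a minimal sketch, not the actual XWiki implementation: the function names are hypothetical, and the escaping rule (a backslash before each dot or backslash inside a space name) is an assumption inferred from the A.B\.1.C example.

```python
def escape_space_name(name: str) -> str:
    """Escape backslashes and dots so names can be joined with '.' unambiguously.

    Assumption: this mirrors the 'A.B\\.1.C' example; the real escaping may differ.
    """
    return name.replace("\\", "\\\\").replace(".", "\\.")


def space_field_values(space_names: list[str]) -> dict[str, object]:
    """Compute the space-related field values from a document's space names."""
    escaped = [escape_space_name(name) for name in space_names]
    # Every prefix of the space reference: 'A', 'A.B', 'A.B.C', ...
    prefixes = [".".join(escaped[: depth + 1]) for depth in range(len(escaped))]
    return {
        "spaces": list(space_names),
        "space_exact": prefixes[-1] if prefixes else "",
        "space_prefix": prefixes,
        # '<depth>/<prefix>.' is the shape used for the facet.prefix drill down.
        "space_facet": [f"{depth}/{prefix}." for depth, prefix in enumerate(prefixes)],
    }
```

Because the prefixes are computed before indexing, escaped dots never get split, which is the limitation of PathHierarchyTokenizer mentioned in the space_prefix description.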

Document Fields

First of all we need to index the document title, content and metadata.

Name | Description
fullname |
title_* | The localized title, indexed based on the document locale. E.g. title_ro
title_sort | We need a dedicated field for sort because analyzed fields cannot be used for sort.
doccontent_* | The rendered document content (transformations are not executed). E.g. doccontent_pt_BR. NOTE: The reason we added the 'doc' prefix instead of keeping just 'content' is because we wanted to be able to use a different boost value for the document content than for the object (objcontent) and the attachment content (attcontent, see the 'qf' parameter in solrconfig.xml).
doccontentraw_* |
version | We need to index the document version (revision, e.g. '2.4') to be able to detect when the index is not up to date (not in sync with the database). This check is performed at XWiki startup for instance (see IndexerJob#addMissing).
comment_* | The localized version summary. A brief description of the changes made in the latest version. E.g. comment_en
doclocale | Contains the technical locale of the document (i.e. empty for default entry)
author | The last author. This field is used for faceting (exact matching).
author_display | The last author, this time analyzed and thus used for free text search.
author_display_sort |
creator | The document creator, stored verbatim for faceting (exact matching)
creator_display | The document creator, this time analyzed and thus used for free text search.
date |
creationDate |

Then, in order to avoid joins, we need to index the objects. We try to make the structured data flat using dynamic fields.

Name | Description
class/object | The type of objects stored by this document. E.g. [Blog.BlogPostClass, XWiki.TagClass, ..]
objcontent_* | This field collects the values from all the properties of all the objects found on the indexed document. It uses the "propertyName : propertyValue" format. This field is analyzed based on the document locale. E.g. objcontent_ro
object.aSpace.aClass_* | Dynamic multiValued field indexing the entire content of the objects of the specified type. All values are indexed as localized text, using the document locale. E.g. object.XWiki.TagClass_fr
property.aSpace.aClass.aPropertyName_* | Dynamic multiValued field indexing the value of the specified property. For static lists, we index both the raw value (what is saved in the database) and the display value (what the user sees, which is specified in the XClass). Property values are indexed based on their type. E.g. property.Blog.BlogPostClass.published_boolean, property.Blog.BlogPostClass.publishDate_date, property.Blog.BlogPostClass.category_string, property.Blog.BlogPostClass.summary_en
property.aSpace.aClass.aPropertyName_sort* | Dedicated field for sorting on property values. We need this because Solr doesn't support sorting on multiValued fields. E.g. property.Blog.BlogPostClass.publishDate_sortDate

Notice that we index the property name only in the objcontent field (mixed with the property value). We don't have a dedicated field for this, i.e. the object property names appear only on the field names and not on the field values. Do we need to index the property names? We index the class names because we want to filter documents of a given type. Is there a real use case when we need to find documents that have objects with a given property?

Non-string XObject properties should be indexed based on their type. This means we'll be able to write type-specific constraints in a Solr query (e.g. ranges) for Boolean, Number (int, long, float, double) and Date properties:

property.Blog.BlogPostClass.publishDate:[NOW-1MONTH TO NOW]

We can achieve this by suffixing the field name with the type name: property.Blog.BlogPostClass.publishDate_date. But in order to use just the field name in the Solr query we need Dynamic Field Aliases.

Note that only the String and the TextArea properties should be indexed as localized text (depending on the document locale). For the rest of the string-based properties (Access Right, List of Users, List of Groups, DBList, etc.) we should use the "string" Solr field type to index/store the property value verbatim in order to be able to perform exact matches on these properties. For StaticList, we need to index the raw value (what is saved in the database) as string (so verbatim), and the display value (what is specified in the XClass) as localized text (so analysed).
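
The suffixing rules above can be illustrated with a small sketch. It is not the real indexer code: property_field_name is a hypothetical helper, the caller is assumed to pass a locale only for String and TextArea properties, and the 'long' and 'double' suffixes chosen for Python numbers are assumptions (the text above only names boolean, date, string and locale suffixes explicitly).

```python
from datetime import date, datetime
from typing import Optional


def property_field_name(class_name: str, prop: str, value: object,
                        locale: Optional[str] = None) -> str:
    """Return the dynamic field name used to index one property value.

    Pass `locale` only for String/TextArea properties (localized text);
    other string-based properties are indexed verbatim as 'string'.
    """
    base = f"property.{class_name}.{prop}"
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        suffix = "boolean"
    elif isinstance(value, int):
        suffix = "long"    # assumption: suffix name for integer numbers
    elif isinstance(value, float):
        suffix = "double"  # assumption: suffix name for floating point numbers
    elif isinstance(value, (date, datetime)):
        suffix = "date"
    elif locale:
        suffix = locale    # analyzed, localized text
    else:
        suffix = "string"  # verbatim, for exact matching
    return f"{base}_{suffix}"
```

For a StaticList property, such a helper would be called twice: once without a locale for the raw value and once with the document locale for the display value.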

A problem with dynamic fields is that we can get invalid field names. The documentation says:

 field names should consist of alphanumeric or underscore characters only and not start with a digit.  This is not currently strictly enforced, but other field names will not have first class support from all components and back compatibility is not guaranteed.  Names with both leading and trailing underscores (e.g. _version_) are reserved.

The class name will surely have a dot and it can also contain other invalid characters (if it's not a standard XWiki class).

Another problem is that Solr doesn't support dynamic fields as default fields, i.e. as fields that are matched when you search for free text (without field:value in the query). This is not a problem for the search results, as dynamic fields like object.* and property.* are copied and aggregated in objcontent, which is a default field. The issue is that we can't know exactly which XClass property was matched; we just know that the searched free text was found inside an object.

When searching for documents we should also take the attachments into account.

Name | Description
filename | The names of the files attached to this document. E.g. ['todo.txt', 'image.png']
filename_exact | We also need to store the file names verbatim (without analysing them) for exact/prefix matching.
mimetype | The list of attachment media types. E.g. ['text/plain', 'image/png']
attauthor | The absolute references of the users that uploaded the last version of each of the document attachments. Used for faceting (exact matching). E.g. ['math:XWiki.mflorea', 'gang:XWiki.vmassol', 'xwiki:XWiki.evalica']
attauthor_display | Same as attauthor but indexes the real user name instead of the reference (alias) and it is used for free text search. E.g. ['Thomas Mortagne', 'Florea Marius Dumitru']
attdate | The dates when the attachments have been uploaded (their last version).
attcontent_* | The content of each attachment, indexed based on the document locale. E.g. attcontent_en : ['content of first attachment', 'content of second attachment']
attsize | The size of each attachment in bytes.

All the attachment fields I listed are multivalued. The problem we have with this solution is that the relation between the fields of the same attachment is kept only in the form of the value index (e.g. the 3rd attachment size corresponds to the 3rd attachment name), which can't be used in queries. In other words, we won't be able to query for documents that have a text/plain file which contains a given word. We will only be able to query for documents that have a text/plain file and a file (not necessarily the same one!) which contains the given word.
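
The lost correlation can be made concrete with a toy model. The sketch below does not call Solr; it only imitates how a conjunction over two multivalued fields is evaluated on a flat document row (function names and sample data are made up for illustration).

```python
def matches_document_row(doc: dict, mimetype: str, word: str) -> bool:
    """Imitate 'mimetype:X AND attcontent_en:word' evaluated on a flat document row."""
    has_type = mimetype in doc["mimetype"]
    has_word = any(word in content.split() for content in doc["attcontent_en"])
    return has_type and has_word


def matches_single_attachment(doc: dict, mimetype: str, word: str) -> bool:
    """What we would like instead: one attachment satisfies both conditions."""
    return any(
        media == mimetype and word in content.split()
        for media, content in zip(doc["mimetype"], doc["attcontent_en"])
    )


doc = {
    "filename": ["todo.txt", "image.png"],
    "mimetype": ["text/plain", "image/png"],
    "attcontent_en": ["buy milk", "budget chart"],
}
```

Here matches_document_row(doc, 'text/plain', 'budget') is true even though the word only occurs in the PNG attachment, while matches_single_attachment returns false. The second function relies on the value index, which is exactly the information a Solr query cannot use.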

Another problem is that Solr / Lucene doesn't tell us the index of the value that has been matched from a multivalued field like attcontent, so we won't know which attachment has been matched (e.g. if Solr told us that the 2nd value from attcontent is matched then we would know the 2nd attachment is matched).

Other solutions for indexing the attachments inside the document rows are:

  • use a dynamic field, e.g. attachment.image.png_*, but we'll hit invalid field names immediately because the file name can contain almost any character
  • aggregate all the information from each attachment in a static multivalued field (e.g. attachment: ['data of 1st attachment', 'data of 2nd attachment', ..])
  • aggregate the information from all attachments in a static single valued field

None of these solutions fixes the problems we mentioned above. When the relation between the attachment fields is important (i.e. we're looking for an attachment that must have 2 or more fields matching some criteria), we can search for attachments only (type:ATTACHMENT).

Object Fields

Name | Description
class | The object type. E.g. Blog.BlogPostClass
number | The object number, identifies an object when there are multiple objects of the same type on a document.
objcontent_* | This field collects the values from all the properties of the indexed object. It uses the "propertyName : propertyValue" format. This field is analyzed based on the document locale. E.g. objcontent_ro
property.aName_* | Dynamic multiValued field indexing the value of the specified property. For static lists, we index both the raw value (what is saved in the database) and the display value (what the user sees, which is specified in the XClass). Property values are indexed based on their type. E.g. property.published_boolean, property.publishDate_date, property.category_string, property.summary_en

Attachment Fields

Name | Description
filename | The attachment file name. E.g. ['todo.txt']
filename_exact | We also need to store the attachment file name verbatim (without analysing it) for exact/prefix matching.
filename_sort | The attachment file name used for sorting.
mimetype | The attachment media type. E.g. ['text/plain']
attversion | We need to index the attachment version (revision) to be able to detect when the Solr index is out of date (not in sync with the database). E.g. 1.2
attauthor | The absolute reference of the user that uploaded the last version of the attachment. Used for faceting (exact matching). E.g. ['gang:XWiki.vmassol']
attauthor_display | The real name of the user that uploaded the last version of the attachment. Used for free text search. E.g. ['Ecaterina Moraru']
attauthor_display_sort | Same as attauthor_display but used for sorting (single valued).
attdate | The date when the last version of the attachment was uploaded.
attdate_sort | We need a dedicated field for sort because the corresponding field is multiValued (attdate is reused on document rows, see above, and a document can have multiple attachments) and Solr doesn't support sorting on multiValued fields.
attcontent_* | The content of the last version of the attachment, indexed based on the document locale. E.g. attcontent_en
attsize | The size, in bytes, of the last version of the attachment.
attsize_sort | Needed for sort because attsize is multiValued. See attdate_sort.

Encoding Dynamic Field Names

We need to support special characters in dynamic field names. One solution is to use an encoding scheme similar to URL-encoding. We cannot use URL-encoding directly because '+' (plus) and '%' (percent) have special meaning in the Solr query syntax. Also, we don't want to encode Unicode letters.

E.g. "Somé Spâce.Bob's Claß" would be encoded as "Somé$20Spâce.Bob$27s$20Claß"
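
Such an encoding could look like the sketch below. It is an illustration under assumptions, not the scheme actually shipped: the helper names are hypothetical, and the choices to leave digits, '_' and '.' untouched and to encode the UTF-8 bytes of every other character (including '$' itself) are guesses that happen to reproduce the example above.

```python
import re


def encode_field_name(name: str) -> str:
    """Encode every character except letters, digits, '_' and '.' as $XX per UTF-8 byte."""
    encoded = []
    for char in name:
        if char.isalnum() or char in "._":
            # Unicode letters and digits are kept as they are.
            encoded.append(char)
        else:
            # '$' is not alphanumeric, so it is encoded too ($24), keeping decoding unambiguous.
            encoded.extend(f"${byte:02X}" for byte in char.encode("utf-8"))
    return "".join(encoded)


def decode_field_name(encoded: str) -> str:
    """Reverse encode_field_name by decoding each run of $XX sequences as UTF-8."""
    def decode_run(match: re.Match) -> str:
        hex_digits = match.group(0).replace("$", "")
        return bytes.fromhex(hex_digits).decode("utf-8")

    return re.sub(r"(?:\$[0-9A-F]{2})+", decode_run, encoded)
```

Decoding whole runs of $XX at once matters for characters whose UTF-8 form spans several bytes, since a single byte of such a character cannot be decoded on its own.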

Also, it would be nice to be able to extract the class and property reference from a field name in order to display the location where the search text has been found. We can't use the default class / property reference serialization syntax because '\' and '^' have special meaning in the Solr query syntax. One solution is to implement a simple serialization syntax that uses only '.' as entity separator, where a literal dot is escaped by repeating it.

E.g. "wiki:Some\.Space.My\.Class^color" would be serialized as "wiki.Some..Space.My..Class.color"
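
A possible reading of this syntax is sketched below; the function names are hypothetical and the parser is only one way to interpret doubled dots (scanning left to right).

```python
def serialize_reference(parts: list[str]) -> str:
    """Join entity names with '.', doubling any dot that belongs to a name."""
    return ".".join(part.replace(".", "..") for part in parts)


def parse_reference(serialized: str) -> list[str]:
    """Split on single dots, turning each doubled dot back into a literal dot."""
    parts, current, i = [], [], 0
    while i < len(serialized):
        if serialized[i] == ".":
            if serialized[i + 1 : i + 2] == ".":
                # '..' is an escaped dot inside the current name.
                current.append(".")
                i += 2
                continue
            # A single '.' ends the current name.
            parts.append("".join(current))
            current = []
        else:
            current.append(serialized[i])
        i += 1
    parts.append("".join(current))
    return parts
```

One limitation of this sketch: a name that starts with a dot, or an empty name, makes the result ambiguous. For example ['a', '.b'] serializes to 'a...b', which is parsed back as ['a.', 'b'].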

Dynamic Field Aliases

We have a few dynamic fields, such as object.* and property.*, that are multilingual fields so they are indexed in multiple languages. We need support for dynamic aliases (for dynamic fields) so that we can write:

object:Blog.BlogPostClass AND property.Blog.BlogPostClass.title:text AND object.XWiki.TagClass:news

and it will be expanded into

object:Blog.BlogPostClass AND
(property.Blog.BlogPostClass.title_en:text OR property.Blog.BlogPostClass.title_fr:text OR ...) AND
(object.XWiki.TagClass_en:news OR object.XWiki.TagClass_fr:news OR ...)
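
The expansion can be sketched as a plain string rewrite. This is not how a real Solr query parser plugin would do it; expand_aliases and its regular expression are hypothetical, and the sketch only handles simple term values and locale suffixes (no phrases, ranges or typed suffixes such as _date).

```python
import re

# Matches 'object.<class>:<term>' and 'property.<class>.<name>:<term>',
# but not the static field 'object:<class>'.
ALIAS = re.compile(r"\b((?:object|property)\.[^\s:()]+):([^\s()]+)")


def expand_aliases(query: str, locales: list[str]) -> str:
    """Replace each dynamic field alias with an OR over its localized fields."""
    def expand(match: re.Match) -> str:
        field, value = match.group(1), match.group(2)
        clauses = [f"{field}_{locale}:{value}" for locale in locales]
        return "(" + " OR ".join(clauses) + ")"

    return ALIAS.sub(expand, query)
```

Calling expand_aliases with the query above and the locales ['en', 'fr'] yields the expanded form shown, without the trailing '...' placeholders.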

Faceting on Object Properties

We need to be able to add facets on an XObject property using the Query Module API:

#set ($discard = $query.bindValue('facet.field', ['someOtherField', 'property.Test.TestClass.staticList1_string']))

The 'string' suffix means the property was indexed/stored verbatim (without being analysed). Read above to understand why we suffix the field name with the data type. The facet can be triggered with this query:

object:Test.TestClass

Sorting on Object Properties

We should also be able to sort the document search results based on a property value using the Query Module API:

#set ($discard = $query.bindValue('sort', 'property.Test.TestClass.staticList1_sortString asc'))

The 'sortString' suffix is the dynamic type that is used for sorting. Other types are 'sortBoolean', 'sortInt', 'sortLong', 'sortDouble', 'sortFloat' and 'sortDate'. Note that Solr doesn't support sorting on multivalued fields. The documentation says:

 Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single Term (ie: uses the KeywordTokenizer).

If you try to sort on a multivalued field you'll get:

Caused by: org.apache.solr.common.SolrException: can not sort on multivalued field: property.Test.TestClass.staticList1_string
    at org.apache.solr.schema.SchemaField.checkSortability(SchemaField.java:155)

That's why we need dedicated 'sortXXX' fields that are single valued. The consequence is that only the last value of a property is used for sorting (you can have multiple values either because the property supports multiple selection or because there are multiple objects of the same type on the indexed document). Note that XObject properties are indexed using multivalued dynamic fields (we cannot know beforehand what properties a user-defined XClass will have and if a property supports multiple selection or if a document can have multiple objects of a given type).
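
The "last value wins" behaviour can be pictured with the following sketch, which fills a plain dictionary standing in for a Solr input document. index_property is a hypothetical helper, not the actual indexer code.

```python
def index_property(solr_doc: dict, field: str, type_name: str, values: list) -> None:
    """Index property values in a multivalued field plus a single valued sort field."""
    if not values:
        return
    # Multivalued dynamic field, e.g. property.Test.TestClass.staticList1_string
    solr_doc.setdefault(f"{field}_{type_name}", []).extend(values)
    # Single valued sort field, e.g. ..._sortString; each call overwrites the previous value.
    sort_suffix = "sort" + type_name[0].upper() + type_name[1:]
    solr_doc[f"{field}_{sort_suffix}"] = values[-1]
```

If a document has two Test.TestClass objects and the function is called once per object, the multivalued field accumulates all the values while the sort field keeps only the last value of the last object.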

Another option for sorting on fields that have multiple values could be to use a function but I can't find one that returns a single value from a multiValued field.

#set ($discard = $query.bindValue('sort', 'aFunctionThatSelectsOneValue(property.Test.TestClass.staticList1_string) asc'))

Manipulate links

Here are some examples of indexed links.

Example:

Wiki content in document `xwiki:Main.WebHome`:

[[doc:Space.Document]]
[[attach:Space.OtherDocument@Attachment]]
[[page:Page1/Page2]]

Resulting index:

  • links:
    • document:xwiki:Space.Document
    • attachment:xwiki:Space.OtherDocument@Attachment
    • page:xwiki:Page1/Page2
  • links_extended:
    • document:xwiki:Space.Document
    • attachment:xwiki:Space.OtherDocument@Attachment
    • page:xwiki:Page1/Page2
    • wiki:xwiki
    • space:xwiki:Space
    • document:xwiki:Space.OtherDocument
    • page:xwiki:Page1
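
The relation between links and links_extended in this example can be reproduced with a short sketch. It is an approximation under assumptions: extended_links is a hypothetical function, references are parsed by naive splitting (escaped separators are ignored), and only the document, attachment and page types from the example are handled.

```python
def extended_links(links: list[str]) -> list[str]:
    """Return the links followed by their parent references, without duplicates."""
    result = list(links)

    def add(reference: str) -> None:
        if reference not in result:
            result.append(reference)

    for link in links:
        # Assumed shape: '<type>:<wiki>:<local reference>'
        entity_type, wiki, local = link.split(":", 2)
        add(f"wiki:{wiki}")
        if entity_type == "page":
            segments = local.split("/")
            # Every ancestor page: Page1 for Page1/Page2
            for depth in range(1, len(segments)):
                add(f"page:{wiki}:{'/'.join(segments[:depth])}")
            continue
        if entity_type == "attachment":
            # The document holding the attachment is a parent too.
            local = local.rsplit("@", 1)[0]
            add(f"document:{wiki}:{local}")
        if entity_type in ("attachment", "document"):
            spaces = local.split(".")[:-1]
            for depth in range(1, len(spaces) + 1):
                add(f"space:{wiki}:{'.'.join(spaces[:depth])}")
    return result
```

With the three links of the example as input, the function returns the seven links_extended values listed above. Searching links_extended for space:xwiki:Space would then find every entity that links to something inside that space.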
