Wiki source code of Core

Last modified by Marius Dumitru Florea on 2022/11/25 10:52

1 {{toc/}}
2
3 = Schema =
4
5 == Shared Fields ==
6
7 Some fields need to be shared by all indexed entities. The //wiki//, //space// and //name// information is shared because each indexed entity is in our case either a document or held by a document.
8
9 |=Name|=Description
10 |id|A unique identifier of the entity among all the indexed entities
11 |type|The type of entity that is indexed. E.g. document, attachment, object, object property etc.
12 |wiki|\\
13 |space|{{warning}}Deprecated since 7.2, use the ##spaces## multivalued field instead.{{/warning}} The local space reference. For a document ##A.B.C.Page## the value of this field is ##A.B.C##. This field is analyzed and thus used for free text search.
14 |spaces|The space names. E.g. for a document ##A.B.C.Page## the value is ['A', 'B', 'C']. This field is analyzed and thus mostly used for free text search.
15 |space_exact|We index the local space reference (e.g. ##A.B\.1.C##) verbatim for exact matching.
16 |space_facet|We also need a dedicated field for [[hierarchical faceting>>https://wiki.apache.org/solr/HierarchicalFaceting]] on nested spaces. This field is used to implement a 'facet.prefix'-based drill down. E.g. for a document ##A.B.C.Page## this field will hold ['0/A.', '1/A.B.', '2/A.B.C.'].
17 |space_prefix|This field is used to match descendant documents. A query such as ##space_prefix:A.B## will match the documents from space ##A.B## and all its descendants (like ##A.B.C##). This is possible because this field holds the local references of all the ancestor spaces of a document (i.e. all the prefixes of the space reference). E.g. for a document ##A.B.C.Page## this field will hold ['A', 'A.B', 'A.B.C']. As a consequence, searching for ##space_prefix:A.B## will match ##A.B.C.Page##. NOTE: We don't use the ##PathHierarchyTokenizer## because it doesn't support specifying an escaping character. We compute the values ourselves at index time as a workaround.
18 |name|The document name. This field is analyzed and thus mostly used for free text search.
19 |name_exact|We also need to store the document name verbatim for faceting (exact matching). This facet is useful for attachments and objects for instance.
20 |locale|\\
21 |locales|The list of locales covered by this entry. Dynamically determined from the list of enabled locales and the various locales of the document.
22 |language|Contains only the language part of the locale
23 |hidden|Whether the entity is hidden or not. Only documents can be made hidden explicitly. The attachments, objects and object properties are hidden if the document that holds them is hidden.
24 |links|{{version since="14.8"}}The references of the resources that the links found in this entity point to.{{/version}}
25 |links_extended|{{version since="14.8"}}Contains the **links** values plus the references of all their parents, which makes it easier to search for or clean up links at any entity level (per wiki, per space, per document, etc.).{{/version}}
26
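The ##space_prefix## and ##space_facet## values described in the table above can be derived from the list of space names. The following is only a minimal sketch of that computation, not the actual indexer code: the function names are hypothetical, and we assume that dots and backslashes inside a space name are escaped with a backslash (as in the ##A.B\.1.C## example).

```python
def escape_space_name(name):
    """Escape backslashes and dots so that names can be joined with '.' (assumed escaping)."""
    return name.replace("\\", "\\\\").replace(".", "\\.")


def local_space_prefixes(space_names):
    """Return the local reference of every ancestor space, outermost first."""
    prefixes = []
    current = ""
    for name in space_names:
        escaped = escape_space_name(name)
        current = escaped if not current else current + "." + escaped
        prefixes.append(current)
    return prefixes


def space_prefix_values(space_names):
    """Values for the space_prefix field: all the prefixes of the space reference."""
    return local_space_prefixes(space_names)


def space_facet_values(space_names):
    """Values for the space_facet field: 'depth/prefix.' for each ancestor space."""
    return [
        f"{depth}/{prefix}."
        for depth, prefix in enumerate(local_space_prefixes(space_names))
    ]
```

For the document ##A.B.C.Page## the input would be the space names ['A', 'B', 'C']. Because the prefixes are computed from the unescaped names, a dot inside a space name never produces a spurious ancestor.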
27 == Document Fields ==
28
29 First of all we need to index the document title, content and metadata.
30
31 |=Name|=Description
32 |fullname|\\
33 |title_*|The localized title, indexed based on the document locale. E.g. title_ro
34 |title_sort|We need a dedicated field for sort because analyzed fields cannot be used for sort.
35 |doccontent_*|The rendered document content (transformations are not executed). E.g. doccontent_pt_BR. NOTE: We added the 'doc' prefix instead of keeping just 'content' because we wanted to be able to use a different boost value for the document content than for the object content (##objcontent##) and the attachment content (##attcontent##); see the 'qf' parameter in ##solrconfig.xml##.
36 |doccontentraw_*|\\
37 |version|We need to index the document version (revision, e.g. '2.4') to be able to detect when the index is not up to date (not in sync with the database). This check is performed at XWiki startup for instance (see IndexerJob#addMissing).
38 |comment_*|The localized version summary. A brief description of the changes made in the latest version. E.g. comment_en
39 |doclocale|Contains the technical locale of the document (i.e. empty for default entry)
40 |author|The last author. This field is used for faceting (exact matching).
41 |author_display|The last author, this time analyzed and thus used for free text search.
42 |author_display_sort|\\
43 |creator|The document creator, stored verbatim for faceting (exact matching)
44 |creator_display|The document creator, this time analyzed and thus used for free text search.
45 |date|\\
46 |creationDate|\\
47
48 Then, in order to avoid joins, we need to index the objects. We try to make the structured data flat using dynamic fields.
49
50 |=Name|=Description
51 |class/object|The type of objects stored by this document. E.g. [Blog.BlogPostClass, XWiki.TagClass, ..]
52 |objcontent_*|This field collects the values from all the properties of all the objects found on the indexed document. It uses the "propertyName : propertyValue" format. This field is analyzed based on the document locale. E.g. objcontent_ro
53 |object.aSpace.aClass_*|Dynamic multiValued field indexing the entire content of the objects of the specified type. All values are indexed as localized text, using the document locale. E.g. object.XWiki.TagClass_fr
54 |property.aSpace.aClass.aPropertyName_*|Dynamic multiValued field indexing the value of the specified property. For static lists, we index both the raw value (what is saved in the database) and the display value (what the user sees, which is specified in the XClass). Property values are indexed based on their type. E.g. property.Blog.BlogPostClass.published_boolean, property.Blog.BlogPostClass.publishDate_date, property.Blog.BlogPostClass.category_string, property.Blog.BlogPostClass.summary_en
55 |property.aSpace.aClass.aPropertyName_sort*|Dedicated field for sorting on property values. We need this because Solr doesn't support sorting on multiValued fields. E.g. property.Blog.BlogPostClass.publishDate_sortDate
56
57 Notice that we index the property name only in the ##objcontent## field (mixed with the property value). We don't have a dedicated field for this, i.e. the object property names appear only in the field names and not in the field values. Do we need to index the property names? We index the class names because we want to filter documents of a given type. Is there a real use case where we need to find documents that have objects with a given property?
58
59 Non-string XObject properties should be indexed based on their type. This means we'll be able to write type-specific constraints in Solr query (e.g. ranges) for Boolean, Number (int, long, float, double) and Date properties:
60
61 {{code language="none"}}
62 property.Blog.BlogPostClass.publishDate:[NOW-1MONTH TO NOW]
63 {{/code}}
64
65 We can achieve this by suffixing the field name with the type name: ##property.Blog.BlogPostClass.publishDate_date##. But in order to use just the field name in the Solr query we need [[Dynamic Field Aliases>>||anchor="HDynamicFieldAliases"]].
66
67 Note that only the String and the TextArea properties should be indexed as localized text (depending on the document locale). For the rest of the string-based properties (Access Right, List of Users, List of Groups, DBList, etc.) we should use the "string" Solr field type to index/store the property value verbatim in order to be able to perform exact matches on these properties. For StaticList, we need to index the raw value (what is saved in the database) as string (so verbatim), and the display value (what is specified in the XClass) as localized text (so analysed).
68
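The naming rule described above (localized text for String and TextArea, a type suffix for Boolean, Number and Date, and "string" for everything else) can be sketched as follows. This is an illustration only: the function and the mapping table are hypothetical, and the numeric suffixes (##int##, ##long##, ##float##, ##double##) are an assumption, since only ##boolean##, ##date##, ##string## and the locale suffixes appear in the examples of this page.

```python
# Property types indexed as localized text, based on the document locale.
LOCALIZED_TYPES = {"String", "TextArea"}

# Property types that get a type-specific suffix (numeric suffixes are assumed).
TYPED_SUFFIXES = {
    "Boolean": "boolean",
    "Date": "date",
    "Integer": "int",
    "Long": "long",
    "Float": "float",
    "Double": "double",
}


def property_field_name(class_name, property_name, property_type, locale):
    """Build the dynamic field name used on document rows for an XObject property."""
    if property_type in LOCALIZED_TYPES:
        suffix = locale
    else:
        # Everything else (DBList, List of Users, etc.) is stored verbatim.
        suffix = TYPED_SUFFIXES.get(property_type, "string")
    return f"property.{class_name}.{property_name}_{suffix}"
```

For example, ##property_field_name('Blog.BlogPostClass', 'publishDate', 'Date', 'en')## gives ##property.Blog.BlogPostClass.publishDate_date##. A StaticList property would need two calls, one for the raw value (string) and one for the display value (localized text).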
69 A problem with dynamic fields is that we can get invalid field names. The documentation says:
70
71 > field names should consist of alphanumeric or underscore characters only and not start with a digit. This is not currently strictly enforced, but other field names will not have first class support from all components and back compatibility is not guaranteed. Names with both leading and trailing underscores (e.g. _version_) are reserved.
72
73 The class name will surely have a dot and it can also contain other invalid characters (if it's not a standard XWiki class).
74
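To see which of our field names fall outside the rule quoted above, the rule can be written as a small check. This is a sketch with hypothetical helper names, and it assumes that "alphanumeric" means ASCII letters and digits.

```python
import re

# Alphanumeric or underscore characters only, not starting with a digit.
STRICT_FIELD_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_strictly_valid(field_name):
    """True if the field name follows the recommendation quoted above."""
    return STRICT_FIELD_NAME.fullmatch(field_name) is not None


def is_reserved(field_name):
    """True for names with both a leading and a trailing underscore."""
    return len(field_name) > 1 and field_name.startswith("_") and field_name.endswith("_")
```

With this check ##title_sort## passes, while ##property.Blog.BlogPostClass.title_en## fails because of the dots.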
75 Another problem is that Solr doesn't support dynamic fields as default fields, i.e. as fields that are matched when you search for free text (without field:value in the query). This is not a problem for the search results, as dynamic fields like ##object.*## and ##property.*## are copied and aggregated in ##objcontent## which is a default field. The issue is that we can't know what is exactly the XClass property that was matched, we just know that the searched free text was found inside an object.
76
77 When searching for documents we should also take the attachments into account.
78
79 |=Name|=Description
80 |filename|The name of files attached to this document. E.g. ['todo.txt', 'image.png']
81 |filename_exact|We also need to store the file names verbatim (without analysing them) for exact/prefix matching.
82 |mimetype|The list of attachment media types. E.g. ['text/plain', 'image/png']
83 |attauthor|The absolute references of the users that uploaded the last version of each of the document attachments. Used for faceting (exact matching). E.g. ['math:XWiki.mflorea', 'gang:XWiki.vmassol', 'xwiki:XWiki.evalica']
84 |attauthor_display|Same as ##attauthor## but indexes the real user name instead of the reference (alias) and it is used for free text search. E.g. ['Thomas Mortagne', 'Florea Marius Dumitru']
85 |attdate|The dates when the attachments have been uploaded (their last version).
86 |attcontent_*|The content of each attachment, indexed based on the document locale. E.g. attcontent_en : ['content of first attachment', 'content of second attachment']
87 |attsize|The size of each attachment in bytes.
88
89 All the attachment fields listed above are multivalued. The problem we have with this solution is that the relation between the fields of the same attachment is kept only in the form of the value index (e.g. the 3rd attachment size corresponds to the 3rd attachment name), which can't be used in queries. In other words, we won't be able to query for documents that have a text/plain file which contains a given word. We will only be able to query for documents that have a text/plain file and a file (not necessarily the same!) which contains the given word.
90
91 Another problem is that Solr / Lucene doesn't tell us the index of the value that has been matched from a multivalued field like ##attcontent##, so we won't know which attachment has been matched (e.g. if Solr told us that the 2nd value from ##attcontent## was matched then we would know that the 2nd attachment was matched).
92
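The cross-matching problem can be reproduced without Solr. The sketch below models a document row as a plain dictionary of multivalued fields (the data and the helper names are made up for illustration) and compares what a query on the document row can express with what we would actually like to ask.

```python
# A document row with two attachments, stored as parallel multivalued fields.
doc = {
    "mimetype": ["text/plain", "application/pdf"],
    "attcontent_en": ["shopping list", "quarterly report"],
}


def matches_document_row(doc, mimetype, word):
    """What a query on the document row can express: each field is matched independently."""
    has_type = mimetype in doc["mimetype"]
    has_word = any(word in content.split() for content in doc["attcontent_en"])
    return has_type and has_word


def matches_same_attachment(doc, mimetype, word):
    """What we would like to ask: both conditions hold for the same attachment."""
    pairs = zip(doc["mimetype"], doc["attcontent_en"])
    return any(m == mimetype and word in content.split() for m, content in pairs)
```

Searching for a text/plain file that contains "report" matches the document row, although the word is found in the PDF and not in the text file.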
93 Other solutions for indexing the attachments inside the document rows are:
94
95 * use a dynamic field, e.g. attachment.image.png_*, but we'll hit invalid field names immediately because the file name can contain almost any character
96 * aggregate all the information from each attachment in a static multivalued field (e.g. attachment: ['data of 1st attachment', 'data of 2nd attachment', ..])
97 * aggregate the information from all attachments in a static single valued field
98
99 None of these solutions fixes the problems mentioned above. When the relation between the attachment fields is important (i.e. we're looking for an attachment that must match criteria on 2 or more fields), we can restrict the search to attachments only (##type:ATTACHMENT##).
100
101 == Object Fields ==
102
103 |=Name|=Description
104 |class|The object type. E.g. ##Blog.BlogPostClass##
105 |number|The object number, identifies an object when there are multiple objects of the same type on a document.
106 |objcontent_*|This field collects the values from all the properties of the indexed object. It uses the "propertyName : propertyValue" format. This field is analyzed based on the document locale. E.g. objcontent_ro
107 |property.aName_*|Dynamic multiValued field indexing the value of the specified property. For static lists, we index both the raw value (what is saved in the database) and the display value (what the user sees, which is specified in the XClass). Property values are indexed based on their type. E.g. property.published_boolean, property.publishDate_date, property.category_string, property.summary_en
108
109 == Attachment Fields ==
110
111 |=Name|=Description
112 |filename|The attachment file name. E.g. ['todo.txt']
113 |filename_exact|We also need to store the attachment file name verbatim (without analysing it) for exact/prefix matching.
114 |filename_sort|The attachment file name used for sorting
115 |mimetype|The attachment media type. E.g. ['text/plain']
116 |attversion|We need to index the attachment version (revision) to be able to detect when the Solr index is out of date (not in sync with the database). E.g. 1.2
117 |attauthor|The absolute reference of the user that uploaded the last version of the attachment. Used for faceting (exact matching). E.g. ['gang:XWiki.vmassol']
118 |attauthor_display|The real name of the user that uploaded the last version of the attachment. Used for free text search. E.g. ['Ecaterina Moraru']
119 |attauthor_display_sort|Same as ##attauthor_display## but used for sorting (single valued).
120 |attdate|The date when the last version of the attachment was uploaded.
121 |attdate_sort|We need a dedicated field for sort because the corresponding field is multiValued (##attdate## is reused on document rows, see above, and a document can have multiple attachments) and Solr doesn't support sorting on multiValued fields
122 |attcontent_*|The content of the last version of the attachment, indexed based on the document locale. E.g. attcontent_en
123 |attsize|The size, in bytes, of the last version of the attachment
124 |attsize_sort|Needed for sort because ##attsize## is multiValued. See ##attdate_sort##.
125
126 = Encoding Dynamic Field Names =
127
128 We need to support special characters in dynamic field names. One solution is to use an encoding scheme similar to URL-encoding. We cannot use URL-encoding directly because '+' (plus) and '%' (percent) have special meaning in the Solr query syntax. Also, we don't want to encode Unicode letters.
129
130 {{code language="none"}}
131 E.g. "Somé Spâce.Bob's Claß" would be encoded as "Somé$20Spâce.Bob$27s$20Claß"
132 {{/code}}
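
A minimal sketch of such an encoding, under a few assumptions that are not stated above: '$' replaces '%' as the escape character, each escaped character is written as the upper-case hex value of its UTF-8 bytes, and letters, digits, '.' and '_' are left untouched. The function names are hypothetical.

```python
import re


def encode_field_name(name):
    """Encode everything except letters, digits, '.' and '_' as $XX (UTF-8 bytes, hex)."""
    encoded = []
    for ch in name:
        if ch.isalnum() or ch in "._":
            # Unicode letters and digits are kept as they are.
            encoded.append(ch)
        else:
            encoded.extend(f"${byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(encoded)


def decode_field_name(encoded):
    """Reverse encode_field_name by decoding each run of $XX escapes as UTF-8."""
    def _decode(match):
        return bytes.fromhex(match.group(0).replace("$", "")).decode("utf-8")

    return re.sub(r"(?:\$[0-9A-F]{2})+", _decode, encoded)
```

Since '$' itself is not a letter or digit it gets encoded too, so every '$' in an encoded name starts an escape sequence and decoding is unambiguous.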
133
134 Also, it would be nice to be able to extract the class and property reference from a field name in order to display the location where the search text has been found. We can't use the default class / property reference serialization syntax because '\' and '^' have special meaning in the Solr query syntax. One solution is to implement a simple serialization syntax that uses only '.' as entity separator, where a dot inside an entity name is escaped by repeating it.
135
136 {{code language="none"}}
137 E.g. "wiki:Some\.Space.My\.Class^color" would be serialized as "wiki.Some..Space.My..Class.color"
138 {{/code}}
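
The dot-doubling syntax can be sketched as follows. The helper names are hypothetical and the reference is passed as a plain list of entity names (wiki, space, class, property) rather than as a real reference object.

```python
def serialize_reference(names):
    """Join entity names with '.', escaping dots inside a name by doubling them."""
    return ".".join(name.replace(".", "..") for name in names)


def parse_reference(serialized):
    """Split on single dots; a doubled dot is read as a literal dot inside a name."""
    names, current = [], []
    i = 0
    while i < len(serialized):
        if serialized[i] == ".":
            if i + 1 < len(serialized) and serialized[i + 1] == ".":
                current.append(".")
                i += 2
                continue
            names.append("".join(current))
            current = []
        else:
            current.append(serialized[i])
        i += 1
    names.append("".join(current))
    return names
```

Note that this simple parser cannot distinguish a name that ends with a dot from a separator followed by an escaped dot (three dots in a row), so names that start or end with a dot would need extra care.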
139
140 = Dynamic Field Aliases =
141
142 We have a few dynamic fields, such as ##object.*## and ##property.*##, that are multilingual fields so they are indexed in multiple languages. We need support for dynamic aliases (for dynamic fields) so that we can write:
143
144 {{code language="none"}}
145 object:Blog.BlogPostClass AND property.Blog.BlogPostClass.title:text AND object.XWiki.TagClass:news
146 {{/code}}
147
148 and it will be expanded into
149
150 {{code language="none"}}
151 object:Blog.BlogPostClass AND
152 (property.Blog.BlogPostClass.title_en:text OR property.Blog.BlogPostClass.title_fr:text OR ...) AND
153 (object.XWiki.TagClass_en:news OR object.XWiki.TagClass_fr:news OR ...)
154 {{/code}}
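
The expansion can be sketched as a simple query rewrite. This is an illustration only, not how the alias support is implemented: the function names are hypothetical, the list of locales is passed explicitly, only ##property.*## and ##object.*## fields are rewritten, and typed suffixes such as ##_date## or ##_string## are ignored.

```python
import re

# A dynamic field (property.* or object.*) followed by ':' and a single term.
ALIAS = re.compile(r"\b((?:property|object)\.[^\s:()]+):([^\s()]+)")


def expand_alias(field, value, locales):
    """Expand one field:value pair into an OR over the localized fields."""
    clauses = [f"{field}_{locale}:{value}" for locale in locales]
    return "(" + " OR ".join(clauses) + ")"


def expand_query(query, locales):
    """Expand every property.* / object.* alias found in the query."""
    return ALIAS.sub(
        lambda match: expand_alias(match.group(1), match.group(2), locales),
        query,
    )
```

The plain ##object:Blog.BlogPostClass## clause is left untouched because ##object## is a regular field there, not a dynamic field prefix.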
155
156 = Faceting on Object Properties =
157
158 We need to be able to add facets on an XObject property using the [[Query Module>>extensions:Extension.Query Module]] API:
159
160 {{code language="none"}}
161 #set ($discard = $query.bindValue('facet.field', ['someOtherField', 'property.Test.TestClass.staticList1_string']))
162 {{/code}}
163
164 The 'string' suffix means the property was indexed/stored verbatim (without being analysed). Read above to understand why we suffix the field name with the data type. The facet can be triggered with this query:
165
166 {{code language="none"}}
167 object:Test.TestClass
168 {{/code}}
169
170 = Sorting on Object Properties =
171
172 We should also be able to sort the document search results based on a property value using the [[Query Module>>extensions:Extension.Query Module]] API:
173
174 {{code language="none"}}
175 #set ($discard = $query.bindValue('sort', 'property.Test.TestClass.staticList1_sortString asc'))
176 {{/code}}
177
178 The 'sortString' suffix is the dynamic type that is used for sorting. Other types are 'sortBoolean', 'sortInt', 'sortLong', 'sortDouble', 'sortFloat' and 'sortDate'. Note that Solr doesn't support sorting on multivalued fields. The [[documentation>>http://wiki.apache.org/solr/CommonQueryParameters#sort]] says:
179
180 > Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single Term (ie: uses the KeywordTokenizer).
181
182 If you try to sort on a multivalued field you'll get:
183
184 {{code language="none"}}
185 Caused by: org.apache.solr.common.SolrException: can not sort on multivalued field: property.Test.TestClass.staticList1_string
186 at org.apache.solr.schema.SchemaField.checkSortability(SchemaField.java:155)
187 {{/code}}
188
189 That's why we need dedicated 'sortXXX' fields that are single valued. The consequence is that only the last value of a property is used for sorting (you can have multiple values either because the property supports multiple selection or because there are multiple objects of the same type on the indexed document). Note that XObject properties are indexed using multivalued dynamic fields (we cannot know beforehand what properties a user-defined XClass will have and if a property supports multiple selection or if a document can have multiple objects of a given type).
190
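The "last value wins" behaviour follows directly from how the two fields are filled. The sketch below uses a plain dictionary in place of the Solr input document and a hypothetical helper name; it only illustrates the difference between the multivalued field and its single valued sort companion.

```python
def index_property(solr_doc, field, sort_field, values):
    """Add values to the multivalued field and keep only the latest one for sorting."""
    for value in values:
        # The multivalued field accumulates every value.
        solr_doc.setdefault(field, []).append(value)
        # The single valued sort field is overwritten each time.
        solr_doc[sort_field] = value


solr_doc = {}
field = "property.Test.TestClass.staticList1_string"
sort_field = "property.Test.TestClass.staticList1_sortString"

# First object: a multiple selection with two values.
index_property(solr_doc, field, sort_field, ["b", "a"])
# Second object of the same type on the same document.
index_property(solr_doc, field, sort_field, ["c"])
```

After both calls the multivalued field holds all three values, while the sort field holds only the value indexed last.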
191 Another option for sorting on fields that have multiple values could be to use a [[function>>http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function]] but I can't find one that returns a single value from a multiValued field.
192
193 {{code language="none"}}
194 #set ($discard = $query.bindValue('sort', 'aFunctionThatSelectsOneValue(property.Test.TestClass.staticList1_string) asc'))
195 {{/code}}
196
197 = Manipulate links =
198
199 Here are some examples of indexed links.
200
201 Example:
202
203 Wiki content in document ##xwiki:Main.WebHome##:
204
205 {{code language="none"}}
206 [[doc:Space.Document]]
207 [[attach:Space.OtherDocument@Attachment]]
208 [[page:Page1/Page2]]
209 {{/code}}
210
211 Resulting index:
212
213 * **links**:
214 ** ##document:xwiki:Space.Document##
215 ** ##attachment:xwiki:Space.OtherDocument@Attachment##
216 ** ##page:xwiki:Page1/Page2##
217 * **links_extended**:
218 ** ##document:xwiki:Space.Document##
219 ** ##attachment:xwiki:Space.OtherDocument@Attachment##
220 ** ##page:xwiki:Page1/Page2##
221 ** //##wiki:xwiki##//
222 ** //##space:xwiki:Space##//
223 ** //##document:xwiki:Space.OtherDocument##//
224 ** //##page:xwiki:Page1##//
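
The way ##links_extended## is derived from ##links## in the example above can be sketched as follows. This is only an illustration with hypothetical function names: it works on the serialized values shown above, splits them naively (escaped characters in names are not handled), and assumes that every ancestor space or page is added.

```python
def parent_links(link):
    """Return the references of all the parents of a serialized link."""
    entity_type, wiki, local = link.split(":", 2)
    parents = [f"wiki:{wiki}"]
    if entity_type == "attachment":
        # The document holding the attachment, then the parents of that document.
        document = f"document:{wiki}:{local.rsplit('@', 1)[0]}"
        parents.append(document)
        parents.extend(parent_links(document))
    elif entity_type == "document":
        spaces = local.split(".")[:-1]
        for depth in range(1, len(spaces) + 1):
            parents.append(f"space:{wiki}:{'.'.join(spaces[:depth])}")
    elif entity_type == "page":
        segments = local.split("/")[:-1]
        for depth in range(1, len(segments) + 1):
            parents.append(f"page:{wiki}:{'/'.join(segments[:depth])}")
    return parents


def extended_links(links):
    """The links themselves followed by their parents, without duplicates."""
    extended = list(links)
    for link in links:
        for parent in parent_links(link):
            if parent not in extended:
                extended.append(parent)
    return extended
```

Searching ##links_extended## for ##space:xwiki:Space## would then find every entity that links to something inside that space.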
