Index for the LLM Application (BETA)

Last modified by Michael Hamann on 2024/05/08 10:08

Provide a UI for configuring and managing the knowledge index.
Type: XAR
Category: Application
Developed by: Unknown
License: GNU Lesser General Public License 2.1
Compatibility: 16.2.0 and above

Description

The knowledge index provides a way to configure and manage an index of collections of documents that can provide additional context to chats in the LLM Application. A technique called Retrieval Augmented Generation (RAG) is used to provide context to the chat completion. Context chunks that are semantically similar to the chat message are retrieved from the index and provided to the completion model. This allows the completion model to use knowledge from the index to generate more factual completions.
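The retrieval step can be illustrated with a minimal Python sketch. It is not the extension's implementation: the three-dimensional vectors stand in for real embeddings, and cosine similarity is an assumed similarity measure, since the description above only states that semantically similar chunks are retrieved.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, indexed_chunks, limit=2):
    """Return the texts of the `limit` chunks most similar to the query."""
    ranked = sorted(
        indexed_chunks,
        key=lambda chunk: cosine_similarity(query_embedding, chunk["embedding"]),
        reverse=True,
    )
    return [chunk["text"] for chunk in ranked[:limit]]


# Toy index: in a real setup the embedding model produces these vectors.
index = [
    {"text": "XWiki stores pages in a database.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Bananas are rich in potassium.", "embedding": [0.0, 0.2, 0.9]},
    {"text": "XWiki pages can have attachments.", "embedding": [0.8, 0.3, 0.1]},
]

# Hypothetical embedding of the chat message "How does XWiki store pages?"
context = retrieve([1.0, 0.2, 0.0], index)

# The retrieved chunks are handed to the completion model together with the question.
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: How does XWiki store pages?"
```

The unrelated chunk never reaches the completion model, which is what keeps the generated answer grounded in the indexed content.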

There are two ways to manage collections and the documents in them: through the UI that is provided in this extension or through the REST API. Right now, the extension primarily supports indexing external content, but in the future, it will also support indexing existing content that is already in the wiki.

Managing Collections

The AI application provides a simple interface for managing collections and their documents. When opening the AI application as admin, a link is provided at the bottom to open the Collections overview.

Collections overview, showing two collections in a table

Here you can add a new collection, or edit or delete an existing one. In the view of a collection, you can edit all properties, see the documents in the collection and add new documents. The name of the page of a document is a random value that is derived from the name of the collection. The document name isn't used directly in order to support long document names, e.g., names coming from an external application that might not be valid document names in XWiki. Renaming documents isn't supported at the moment as the primary use case is indexing external content. In a future version, it will be possible to index existing content of the wiki instead of managing the content separately.

View of a single collection with various properties and two documents in a table below it

The following properties can be edited:

  • Title: The display name of the collection
  • Embedding Model: The reference to the embedding model that can be configured in the models section of the AI application. This model is used to generate embeddings of parts (chunks) of the documents in the collection in order to support similarity search. By embedding the chat message in the same way, the similarity of the chat message to the documents in the collection can be calculated and the most relevant documents can be found.
  • Chunking Method: The method used to split the documents into chunks. Currently, only the maxChars method is supported, which splits the document into chunks of a maximum size with an overlap.
  • Chunking Max Size: The maximum size of a chunk in characters
  • Chunking Overlap Offset: The overlap of chunks in characters
  • Allow Guests: If guests can query this collection (as part of a chat completion)
  • Query Groups: The list of groups that are allowed to query this collection (as part of a chat completion). This is only used when guests aren't allowed to query the collection.
  • Rights Check Method: The method used for checking access rights of individual documents during queries. Supported values: public, external. See also the Authorization section below.
  • URL of the external rights check method: The URL that is used to check rights for all found documents when the external rights check method is selected. This property is only displayed when the external rights check method is selected.
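How Chunking Max Size and Chunking Overlap Offset interact can be sketched in Python. This is an illustration under an assumption, not the extension's code: it assumes that each chunk starts (max size minus overlap) characters after the previous one, so that consecutive chunks share the configured number of characters.

```python
def chunk_max_chars(text, max_size, overlap):
    """Split text into chunks of at most max_size characters.

    Assumption: each chunk starts (max_size - overlap) characters after the
    previous one, so consecutive chunks share `overlap` characters.
    """
    if max_size <= 0 or not 0 <= overlap < max_size:
        raise ValueError("need max_size > 0 and 0 <= overlap < max_size")
    step = max_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_size])
        if start + max_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = chunk_max_chars("abcdefghij", max_size=4, overlap=1)
```

With these toy values the text is split into "abcd", "defg" and "ghij": every chunk repeats the last character of its predecessor, so a sentence cut at a chunk border is still partly visible in the next chunk.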

Managing Documents

In the view of a collection, you can see the documents in the collection and add new documents.

View of a single document "WAISE Intro"

You can configure the following properties of documents:

  • Title: The title of the document
  • Language: The language of the document (currently not used)
  • URL: The URL of the document, used to display a link to the original resource when the document is used as context in a chat
  • Mime Type: The MIME type of the document. It is currently not used but could be used in the future to select a chunking method that is specific to the MIME type
  • Content: The content of the document that is indexed

Additionally, you can attach files to a document and they will also be indexed if the content can be extracted by Apache Tika.

REST API

The REST API provides a convenient way to manage both collections and the documents inside them from external tools. Other applications can use it to add, update and delete content that is indexed in the LLM application whenever content is added, changed or deleted in that application.

Collections Resource

GET /wikis/{wikiName}/aiLLM/collections

Returns a list of all collection IDs that the current user can access as an array of strings.
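A minimal Python sketch of calling this resource from an external tool follows. The REST root URL and the Accept header are assumptions about a local test instance; only the resource path and the shape of the response (an array of strings) come from the description above. The request is built but not sent, because sending it requires a running wiki and credentials.

```python
import json
import urllib.request

# Assumption: the XWiki REST API is served under this root on your instance.
REST_ROOT = "http://localhost:8080/xwiki/rest"


def collections_request(wiki_name):
    """Build (but do not send) the GET request that lists collection IDs."""
    url = f"{REST_ROOT}/wikis/{wiki_name}/aiLLM/collections"
    return urllib.request.Request(
        url, headers={"Accept": "application/json"}, method="GET"
    )


def parse_collection_ids(body):
    """Decode the response body, which is a JSON array of strings."""
    return json.loads(body)


request = collections_request("xwiki")

# Sending the request against a running wiki would look like this:
# with urllib.request.urlopen(request) as response:
#     collection_ids = parse_collection_ids(response.read())
```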

Collection Resource

A collection object has the following properties. All properties are strings unless otherwise mentioned:

  • id: id of the collection
  • title: pretty name of the collection
  • embedding_model: the embedding model as reference of the page that defines the model
  • chunking_method: the chunking method, should be maxChars
  • chunking_max_size: (int) the maximum size of a chunk
  • chunking_overlap_offset: (int) the overlap of chunks
  • document_spaces: (list of strings) the list of XWiki spaces indexed by this collection, currently not used
  • allow_guests: (boolean) if guests can query this collection (as part of a chat completion)
  • query_groups: (list of strings) the list of groups that are allowed to query this collection (as part of a chat completion). This is only used when guests aren't allowed to query the collection.
  • rights_check_method: the method used for checking access rights of individual documents during queries. Supported values: public, external. See also the Authorization section below.
  • rights_check_method_configuration: (object) options of the rights check method, values depend on the selected rights check method. The "external" rights check method supports the url parameter which is the URL that is used to check rights for all found documents.

GET /wikis/{wikiName}/aiLLM/collections/{collectionName}

Returns the collection of the given name.

PUT /wikis/{wikiName}/aiLLM/collections/{collectionName}

Creates or updates the collection of the given name. The body of the request is a collection object. If not all properties are specified, only the specified properties are updated.
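A partial update could be prepared as in the following Python sketch. The REST root, the collection name "product-docs" and the JSON content type are assumptions for illustration; the property names are the ones listed for the collection object.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the XWiki REST API is served under this root on your instance.
REST_ROOT = "http://localhost:8080/xwiki/rest"


def put_collection_request(wiki_name, collection_name, properties):
    """Build (but do not send) a PUT request that creates or updates a collection."""
    wiki = urllib.parse.quote(wiki_name, safe="")
    collection = urllib.parse.quote(collection_name, safe="")
    url = f"{REST_ROOT}/wikis/{wiki}/aiLLM/collections/{collection}"
    return urllib.request.Request(
        url,
        data=json.dumps(properties).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Only the listed properties are sent, so only they are updated.
request = put_collection_request("xwiki", "product-docs", {
    "chunking_max_size": 1000,
    "chunking_overlap_offset": 100,
    "allow_guests": False,
})
```

Sending the same request for a collection that does not exist yet would create it, so a client does not need to distinguish between the two cases.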

DELETE /wikis/{wikiName}/aiLLM/collections/{collectionName}

Deletes the collection of the given name.

GET /wikis/{wikiName}/aiLLM/collections/{collectionName}/documents

Returns a list of document IDs that are contained in the collection of the given name. This resource supports the following query parameters:

  • start: the index of the first document to retrieve (default: 0)
  • number: the number of documents to retrieve (default: -1)
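The two query parameters allow retrieving the document IDs page by page. The following Python sketch only builds the URLs; the REST root and the collection name "faq" are assumptions.

```python
import urllib.parse

# Assumption: the XWiki REST API is served under this root on your instance.
REST_ROOT = "http://localhost:8080/xwiki/rest"


def documents_url(wiki_name, collection_name, start=0, number=-1):
    """URL that lists document IDs, using the documented defaults."""
    query = urllib.parse.urlencode({"start": start, "number": number})
    return (f"{REST_ROOT}/wikis/{wiki_name}/aiLLM/collections/"
            f"{collection_name}/documents?{query}")


def page_urls(wiki_name, collection_name, page_size, pages):
    """URLs for the first `pages` pages of `page_size` document IDs each."""
    return [
        documents_url(wiki_name, collection_name, start=i * page_size, number=page_size)
        for i in range(pages)
    ]


first_pages = page_urls("xwiki", "faq", page_size=50, pages=2)
```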

Document Resource

A document object has the following properties. All properties are strings unless otherwise mentioned:

  • id: id of the document
  • title: pretty name of the document
  • language: the language of the document (currently not used)
  • url: the URL of the document, used to display a link to the original resource when the document is used as context in a chat
  • collection: the name of the collection the document is part of
  • mimetype: the MIME type of the document. It is currently not used but could be used in the future to select a chunking method that is specific to the MIME type
  • content: the content of the document that is indexed

GET /wikis/{wikiName}/aiLLM/collections/{collectionName}/documents/{documentID}

Returns the document of the given name in the collection of the given name.

PUT /wikis/{wikiName}/aiLLM/collections/{collectionName}/documents/{documentID}

Creates or updates the document of the given name in the collection of the given name. The body of the request is a document object. If not all properties are specified, only the specified properties are updated.

DELETE /wikis/{wikiName}/aiLLM/collections/{collectionName}/documents/{documentID}

Deletes the document of the given name in the collection of the given name.
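An external application that mirrors its content into a collection could map its own change events to these requests, as in the following Python sketch. The REST root, the collection name, the document ID and the example URL are hypothetical; the requests are built but not sent.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the XWiki REST API is served under this root on your instance.
REST_ROOT = "http://localhost:8080/xwiki/rest"


def document_url(wiki_name, collection_name, document_id):
    """URL of a single document; every path segment is percent-encoded."""
    wiki, collection, document = (
        urllib.parse.quote(segment, safe="")
        for segment in (wiki_name, collection_name, document_id)
    )
    return f"{REST_ROOT}/wikis/{wiki}/aiLLM/collections/{collection}/documents/{document}"


def sync_request(wiki_name, collection_name, document_id, document=None):
    """PUT the document if it still exists in the source system, else DELETE it."""
    url = document_url(wiki_name, collection_name, document_id)
    if document is None:
        return urllib.request.Request(url, method="DELETE")
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# The page was added or changed in the external application.
upsert = sync_request("xwiki", "product-docs", "Performance Guide", {
    "title": "Performance Guide",
    "url": "https://example.org/docs/performance",
    "content": "Enable caching to reduce the load on the database.",
})

# The page was deleted in the external application.
delete = sync_request("xwiki", "product-docs", "Performance Guide")
```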

Authorization

At query time, regular XWiki access rights aren't checked. Instead, access to the indexed content can be controlled at two levels:

  • The collection: The configuration of a collection defines which groups can query it. A collection can also be opened to guests. In this case, no check is performed.
  • The document: On every collection, a method for checking rights for individual documents can be configured. After retrieving relevant chunks of context information, this rights checking method is asked for authorization for every retrieved document.

By default, two rights check methods are provided:

  • Public: this method just allows access to all documents. It is best suited when all users who have access to the collection should also be allowed to access all documents in it.
  • External: this method queries an external API via HTTP to check which documents can be accessed. It supports configuring a URL that is contacted on every query.
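The interplay of the two levels can be sketched in Python. The function names and the data layout are hypothetical and only mirror the description above; the actual checks are performed inside the LLM application.

```python
def may_query_collection(collection, user_groups):
    """Collection-level check: guests allowed, or the user is in a query group."""
    if collection["allow_guests"]:
        return True  # no further collection-level check is performed
    return any(group in collection["query_groups"] for group in user_groups)


def filter_chunks(chunks, check_documents):
    """Document-level check: keep chunks of documents the method authorizes."""
    document_ids = sorted({chunk["document_id"] for chunk in chunks})
    allowed = check_documents(document_ids)
    return [chunk for chunk in chunks if allowed.get(chunk["document_id"]) is True]


def public_check(document_ids):
    """Stand-in for the "public" method: every document is accessible."""
    return {document_id: True for document_id in document_ids}


collection = {"allow_guests": False, "query_groups": ["XWiki.SupportGroup"]}
chunks = [
    {"document_id": "Performance", "text": "Enable caching to reduce load."},
    {"document_id": "Roadmap", "text": "Planned features for next year."},
]

if may_query_collection(collection, ["XWiki.SupportGroup"]):
    visible = filter_chunks(chunks, public_check)
else:
    visible = []
```

A stricter rights check method would return false, or nothing at all, for some document IDs, and the corresponding chunks would then be dropped before they reach the completion model.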

External Authorization Checks

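On every query, a POST request is submitted to the configured URL. As the example below shows, its body is a JSON object with two properties: document_ids, the list of IDs of the retrieved documents, and xwiki_username, the name of the user who performs the query.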

If you need to pass additional parameters like the collection name you can specify them as URL parameters.

An example request for a single document for the standard admin user could look like this:

{"document_ids":["Performance"],"xwiki_username":"xwiki:XWiki.Admin"}

The response needs to be an object mapping document IDs to true or false, meaning that the user does or does not have access to the respective document. To grant access to the "Performance" document in the example, the endpoint would need to reply {"Performance":true}. Any missing values or errors are interpreted as no access.
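A minimal external endpoint could look like the following Python sketch, which uses only the standard library. The access table, the port and the function names are hypothetical; the request and response shapes follow the example above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical access table of the external application: user -> readable IDs.
READABLE = {"xwiki:XWiki.Admin": {"Performance"}}


def check_access(payload):
    """Map every requested document ID to True or False for the given user."""
    readable = READABLE.get(payload.get("xwiki_username"), set())
    return {doc_id: doc_id in readable for doc_id in payload.get("document_ids", [])}


class RightsCheckHandler(BaseHTTPRequestHandler):
    """Answers the POST request that is sent on every query."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(check_access(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the endpoint; call this from your own entry point."""
    HTTPServer(("localhost", port), RightsCheckHandler).serve_forever()


answer = check_access(
    {"document_ids": ["Performance"], "xwiki_username": "xwiki:XWiki.Admin"}
)
```

Because missing values count as no access, the endpoint may also omit documents it does not know instead of listing them with false.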

Custom Authorization Checks

The authorization system in the LLM application is designed to be fully extensible. To implement a custom authorization method, a component of type org.xwiki.contrib.llm.authorization.AuthorizationManagerBuilder needs to be implemented. If the authorization method needs any configuration, three additional parts need to be implemented:

  • A class to store the configuration that can be serialized and deserialized using Jackson. This class is mainly used to represent the configuration in the REST API for collections.
  • An XClass to store the configuration in the wiki page of the collection
  • A sheet to display the aforementioned XClass

The following shows an example component implemented in Groovy that can be used with the Script Component Extension:

package org.xwiki.contrib.llm.internal;

import javax.inject.Named
import javax.inject.Singleton
import org.xwiki.component.annotation.Component
import org.xwiki.contrib.llm.authorization.AuthorizationManager
import org.xwiki.contrib.llm.authorization.AuthorizationManagerBuilder
import org.xwiki.contrib.llm.Collection
import org.xwiki.model.reference.EntityReference
import com.xpn.xwiki.objects.BaseObject
import org.xwiki.model.reference.LocalDocumentReference

/**
 * A custom {@link AuthorizationManagerBuilder} that always returns true for all document ids.
 *
 * @version $Id$
 * @since 0.3
 */

@Component
@Named("custom")
@Singleton
class CustomAuthorizationManagerBuilder implements AuthorizationManagerBuilder {
    // Holds the configuration; used to represent it in the REST API for collections.
    static class ConfigurationType {
        String property

        ConfigurationType(String property) {
            this.property = property
        }
    }

    @Override
    AuthorizationManager build(BaseObject configurationObject) {
        // Grant access to every requested document ID.
        return { documentIds ->
            documentIds.collectEntries { id -> [id, true] }
        }
    }

    // XClass that stores the configuration in the wiki page of the collection.
    @Override
    EntityReference getConfigurationClassReference() {
        return new LocalDocumentReference('Custom Right Check', 'Class')
    }

    // Sheet that displays the XClass in the configuration of the collection.
    @Override
    EntityReference getConfigurationSheetReference() {
        return new LocalDocumentReference('Custom Right Check', 'Sheet')
    }

    @Override
    Class getConfigurationType() {
        return ConfigurationType.class
    }

    // Read the configuration from the XObject.
    @Override
    Object getConfiguration(BaseObject object) {
        return new ConfigurationType(object.getStringValue("property"))
    }

    // Write the configuration back to the XObject.
    @Override
    void setConfiguration(BaseObject object, Object configuration) {
        object.setStringValue("property", configuration.property)
    }
}

This code assumes an XClass Custom Right Check.Class with a string property "property" and a sheet Custom Right Check.Sheet to display this property.

The sheet should contribute one or several dt and dd elements that are displayed in the configuration of the collection. A possible sheet would be the following:

{{velocity}}
#set ($object = $doc.getObject('Custom Right Check.Class', true))
#set ($editing = $xcontext.action == 'edit')
#set ($discard = $doc.use($object))
{{html clean="false"}}<dt #if (!$editing && $hasEdit)
        class="editableProperty"
        data-property="$escapetool.xml($services.model.serialize($object.getPropertyReference('property')))"
        data-object-policy="updateOrCreate"
        data-property-type="object"#end>
      <label#if ($editing) for="Custom Right Check.Class_0_property"#end>
        $escapetool.xml($doc.displayPrettyName('property', false, false))
      </label>
    </dt>
{{/html}}

{{html clean="false"}}<dd>{{/html}}

$doc.display('property')

{{html clean="false"}}</dd>{{/html}}
{{/velocity}}

It is important to use the updateOrCreate object policy because there is no separate code for adding the object. Furthermore, the sheet is also applied to documents that don't contain the configuration object yet, so it is important that the sheet creates it (first line of the code). The sheet is loaded dynamically when the respective rights check method is selected. When another value is selected, the sheet may be hidden again, so the sheet shouldn't assume that it is visible whenever it is used. If the sheet performs any validation, it should first verify that it is actually visible.
