Hyland Connect

angelborroy · 09-13-2023

Alfresco Search Enterprise is the Search Engine available for Alfresco Enterprise deployments that uses an external Elasticsearch 7.x or OpenSearch 1.x service. This blog post covers implementation details for the Indexing and Reindexing components.

Indexing

The Repo Event Channel is a topic populated by the Repository that delivers a copy of every incoming message to every subscriber. Since a message represents a node event, the following scenarios need to be addressed:

a node has been created
- it includes content: metadata and content indexing components are both involved
- it doesn't include any content: only metadata indexing is involved
a node has been updated
- the update relates to one or more properties: only metadata indexing is involved
- specifically the content has been updated: only content indexing is involved
- the update relates to one or more properties and content, too: metadata and content indexing components are both involved
a node has been deleted
- only metadata indexing is involved, to remove the document from the index

If we define the indexing components (metadata, content and path) as direct subscribers of the event channel, it won't be possible to scale them up: if there are multiple instances of the content indexing component, each of them will receive a copy of the same node event for a given node id. That means each instance will start the same indexing workflow for the same node, resulting in a lot of useless duplicated processing.
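The scaling problem can be illustrated with a small simulation. This is only a sketch of generic topic versus queue delivery semantics, not of ActiveMQ itself; the function names and the round-robin choice of consumer are assumptions made for the example.

```python
from collections import deque


def deliver_via_topic(events, subscribers):
    """Topic semantics: every subscriber receives its own copy of each event."""
    processed = []
    for event in events:
        for name in subscribers:
            processed.append((name, event))
    return processed


def deliver_via_queue(events, consumers):
    """Queue semantics: each event is handed to exactly one competing consumer.

    Round-robin is used here only to keep the simulation deterministic.
    """
    queue = deque(events)
    processed = []
    turn = 0
    while queue:
        event = queue.popleft()
        processed.append((consumers[turn % len(consumers)], event))
        turn += 1
    return processed


events = ["node-1 created", "node-2 updated"]
instances = ["content-indexer-a", "content-indexer-b", "content-indexer-c"]

topic_work = deliver_via_topic(events, instances)   # 2 events x 3 instances
queue_work = deliver_via_queue(events, instances)   # 2 events, each handled once
```

With three instances subscribed directly to the topic, each event is processed three times; behind a queue, adding instances spreads the work instead of multiplying it.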

Note that permissions (ACLs) are indexed by metadata indexing, as they are part of the incoming message from the Repository. Documents in the search index include metadata, permissions, content and path together. Once the document has been created in the index by metadata indexing, content and path can be updated.

The Live Indexing Mediator is in charge of:

acting as a singleton subscriber instance of the event channel. Since it is the main entry point of the overall subsystem, we need to make sure every message is properly delivered to this component. For that reason, and in order to avoid a single point of failure,
- the associated topic is supposed to be durable
- the component can be deployed in cluster mode: the important thing, regardless of the deployment mode, is that the component must always act as a singleton instance. In other words, there cannot be multiple active mediator instances subscribing to the same topic.
filtering out attributes we don't want to index
skipping the content management pipeline when the content has been marked as not indexed
dispatching the node event message to one or both live indexing message channels
avoiding unnecessary processing: for example, if an incoming event relates to a node property change (no content), the mediator consumes the message from the event channel and forwards it only to the metadata indexing chain.
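The dispatching rules described so far can be summarized as a small routing function. This is a minimal sketch, not the mediator's actual code: the `NodeEvent` class, its field names and the `route` function are hypothetical, and path routing is left out because the scenarios above do not say when the path channel is used.

```python
from dataclasses import dataclass

METADATA, CONTENT = "metadata", "content"


@dataclass
class NodeEvent:
    # Hypothetical, simplified view of a repository node event.
    kind: str                         # "created", "updated" or "deleted"
    has_content: bool = False         # created nodes: does the node carry content?
    properties_changed: bool = False  # updated nodes: did any property change?
    content_changed: bool = False     # updated nodes: did the content change?
    content_indexed: bool = True      # False when content is marked as not indexed


def route(event):
    """Return the set of indexing channels that should receive the event."""
    channels = set()
    if event.kind == "deleted":
        # Only metadata indexing is involved, to remove the document.
        channels.add(METADATA)
    elif event.kind == "created":
        channels.add(METADATA)
        if event.has_content:
            channels.add(CONTENT)
    elif event.kind == "updated":
        if event.properties_changed:
            channels.add(METADATA)
        if event.content_changed:
            channels.add(CONTENT)
    if not event.content_indexed:
        # Content marked as not indexed never reaches the content pipeline.
        channels.discard(CONTENT)
    return channels
```

For example, `route(NodeEvent("updated", properties_changed=True))` yields only the metadata channel, which is the "avoid unnecessary processing" case from the list above.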

The mediator consumes events from the event2 topic in ActiveMQ. The default value can be customized using the following property:

alfresco.event.topic = activemq:topic:alfresco.repo.event2

The mediator places new messages on three different queues, one each for metadata, content, and path. The Live Indexing component consumes these messages to perform the required action, such as indexing new metadata or requesting a transformation to text so the content can be indexed. The default values can be customized using the following properties:

alfresco.metadata.event.channel = activemq:queue:org.alfresco.search.metadata.event
alfresco.content.event.channel = activemq:queue:org.alfresco.search.content.event
alfresco.path.event.channel = activemq:queue:org.alfresco.search.path.event

Blacklisted attributes

The Mediation component relies on a configuration file that acts as a blacklist containing:

The list of node types to be excluded from indexing
The list of node aspects to be excluded from indexing
The list of node types with content excluded from content indexing
The list of property names to be excluded from metadata indexing

The blacklist file path or reference can be specified through the usual Spring configuration capabilities. That means either:

a property called alfresco.mediation.filter-file in the module application.properties
a system property -Dalfresco.mediation.filter-file

The default value of that property is classpath:mediation-filter.yml, which points to a file included in the bundle that provides the following rules:

mediation:
  nodeTypes:
  contentNodeTypes:
  nodeAspects:
    - sys:hidden
  fields:
    - cmis:changeToken
    - alfcmis:nodeRef
    - cmis:isImmutable
    - cmis:isLatestVersion
    - cmis:isMajorVersion
    - cmis:isLatestMajorVersion
    - cmis:isVersionSeriesCheckedOut
    - cmis:versionSeriesCheckedOutBy
    - cmis:versionSeriesCheckedOutId
    - cmis:checkinComment
    - cmis:contentStreamId
    - cmis:isPrivateWorkingCopy
    - cmis:allowedChildObjectTypeIds
    - cmis:sourceId
    - cmis:targetId
    - cmis:policyText
    - trx:password
    - pub:publishingEventPayload

Regular expressions are not supported for any of these categories: every excluded type, aspect, or field must be listed individually.

There is no filtering property for path indexing, but the cm:indexControl aspect can be set to prevent a folder hierarchy from being indexed.

Note that, in addition to this filtering process on the Search Enterprise side, the Repository configuration ignores a set of types, aspects and associations defined by the following properties:

repo.event2.filter.nodeTypes=sys:*, fm:*, cm:thumbnail, cm:failedThumbnail, cm:rating, rma:rmsite include_subtypes
repo.event2.filter.nodeAspects=sys:*
repo.event2.filter.childAssocTypes=rn:rendition

Instead of pointing the property at a different blacklist file, you can keep the default value (classpath:mediation-filter.yml) and provide your own mediation-filter.yml by prepending it to the application classpath.

By means of that file, the admin can define a list of fields that won't be sent to Elasticsearch or OpenSearch.

A field matching an entry in the blacklist can be:

a metadata attribute, in which case the mediator removes it from the node event definition: as a consequence, the metadata indexer won't send it to Elasticsearch and the indexed document won't have the corresponding field
a content attribute, in which case the content processing chain will never be executed. In other words, no message will be dispatched to the Content Event Channel and no transformation will be invoked at all
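The filtering behaviour can be sketched as follows. This is an illustration under stated assumptions, not the mediator's implementation: the `apply_filter` function and the node dictionary layout are hypothetical, the blacklist is a shortened in-memory copy of the YAML defaults, and treating `cm:content` as the "content attribute" is an assumption made for the example.

```python
# Shortened in-memory stand-in for mediation-filter.yml (assumption: same four keys).
BLACKLIST = {
    "nodeTypes": set(),
    "contentNodeTypes": set(),
    "nodeAspects": {"sys:hidden"},
    "fields": {"cmis:changeToken", "trx:password"},
}

# Assumption: the content attribute is identified by this property name.
CONTENT_FIELD = "cm:content"


def apply_filter(node, blacklist=BLACKLIST):
    """Return (filtered_node, index_content); filtered_node is None if excluded.

    Matching is exact string comparison: no wildcards or regular expressions.
    """
    if node["type"] in blacklist["nodeTypes"]:
        return None, False
    if blacklist["nodeAspects"] & set(node.get("aspects", [])):
        return None, False
    # Blacklisted metadata attributes are dropped from the event definition.
    properties = {
        name: value
        for name, value in node.get("properties", {}).items()
        if name not in blacklist["fields"]
    }
    filtered = {**node, "properties": properties}
    # Content indexing is skipped for excluded content types or a blacklisted
    # content attribute, so nothing would reach the Content Event Channel.
    index_content = (
        node["type"] not in blacklist["contentNodeTypes"]
        and CONTENT_FIELD not in blacklist["fields"]
    )
    return filtered, index_content


node = {
    "type": "cm:content",
    "aspects": ["cm:titled"],
    "properties": {"cm:name": "a.txt", "trx:password": "secret"},
}
filtered, index_content = apply_filter(node)
```

Because matching is exact, an entry such as `cmis:*` in `fields` would not remove `cmis:changeToken`; each name has to be listed on its own.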

Reindexing

The Reindexing component is responsible for re-indexing the full repository or a subset of Alfresco nodes.

This may be useful when:

the search index has been corrupted/lost
a significant change happened in the content model
an ACS upgrade has been performed

The Reindexing app is built on top of the Spring Batch framework:

JDBC Item Reader gets a large number of records from Alfresco database
Async Item Processor transforms the data building Elasticsearch/OpenSearch documents and extracts the content through the transformer service
ES Item Writer sends data in a batched fashion to Elasticsearch/OpenSearch. If the document already exists on the index, it will be updated.
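The reader, processor and writer steps above can be sketched as a small pipeline. This is not Spring Batch and not the Reindexing app's code: the function names are hypothetical, a list of dictionaries stands in for the database, and a plain dictionary stands in for the Elasticsearch/OpenSearch index.

```python
from concurrent.futures import ThreadPoolExecutor


def read_pages(rows, page_size):
    """Reader: yield the source rows page by page (stands in for JDBC paging)."""
    for start in range(0, len(rows), page_size):
        yield rows[start:start + page_size]


def to_document(row):
    """Processor: turn one database row into an index document."""
    return {"id": row["id"], "cm:name": row["name"]}


def write_batch(index, documents):
    """Writer: upsert a batch; a document that already exists is updated."""
    for doc in documents:
        index.setdefault(doc["id"], {}).update(doc)


def reindex(rows, index, page_size=100, batch_size=100, concurrent_processors=10):
    """Read pages, process rows concurrently, write documents in batches."""
    with ThreadPoolExecutor(max_workers=concurrent_processors) as pool:
        pending = []
        for page in read_pages(rows, page_size):
            pending.extend(pool.map(to_document, page))
            while len(pending) >= batch_size:
                write_batch(index, pending[:batch_size])
                pending = pending[batch_size:]
        if pending:
            # Flush the final, possibly partial, batch.
            write_batch(index, pending)
    return index


rows = [{"id": i, "name": f"doc-{i}"} for i in range(250)]
index = {5: {"id": 5, "cm:name": "stale"}}
reindex(rows, index, page_size=100, batch_size=100, concurrent_processors=4)
```

The three keyword arguments mirror the roles of the page size, batch size and concurrent processors settings; their exact semantics in the real application may differ.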

Before running the Reindexing app, you must generate a JSON map of namespace to prefix. The project Alfresco Model Namespace-Prefix Mapping can be used to create this reindex.prefixes-file.json file. The location of this external file can be specified using the following property:

alfresco.reindex.prefixes-file=file:reindex.prefixes-file.json
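To show why such a map is useful, the sketch below converts a fully qualified name into its prefixed form. The exact layout of reindex.prefixes-file.json is not shown in this post, so the flat `{namespace URI: prefix}` object used here is an assumption, and `to_prefixed` is a hypothetical helper.

```python
import json

# Assumed layout: a flat JSON object mapping namespace URI to prefix.
PREFIXES_JSON = """
{
  "http://www.alfresco.org/model/content/1.0": "cm",
  "http://www.alfresco.org/model/system/1.0": "sys"
}
"""


def to_prefixed(qname, prefixes):
    """Convert '{namespace-uri}localName' into 'prefix:localName'."""
    if not qname.startswith("{"):
        return qname  # already in prefixed form
    uri, _, local_name = qname[1:].partition("}")
    return f"{prefixes[uri]}:{local_name}"


prefixes = json.loads(PREFIXES_JSON)
field_name = to_prefixed("{http://www.alfresco.org/model/content/1.0}name", prefixes)
```

Here `field_name` ends up as `cm:name`, the short form used for property names elsewhere in this post.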

Database settings are set using default Spring properties:

spring.datasource.url=jdbc:postgresql://localhost:5432/alfresco
spring.datasource.username=alfresco
spring.datasource.password=alfresco
spring.datasource.hikari.maximumPoolSize=20
# Based on the DataSource configuration an implementation for accessing the repo database is created.
# Sometimes it might not be possible to autodetect the correct database type.
# This optional property allows you to disable the auto-detection and to specify the database type directly.
# Supported values: postgresql, mysql, mariadb, sqlserver, oracle
alfresco.dbType=

Reindexing values can be specified using the following properties:

alfresco.reindex.jobName=reindexByIds
alfresco.reindex.batchSize=100
alfresco.reindex.pagesize=100
alfresco.reindex.concurrentProcessors=10
alfresco.reindex.fromId=0
alfresco.reindex.toId=20000000000
alfresco.reindex.fromTime=190001010000
alfresco.reindex.toTime=203012312359
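One way to use the id range properties is to give each of several reindexing runs its own slice of the range. The sketch below only computes the slices; `split_id_range` is a hypothetical helper, and it assumes that both fromId and toId are inclusive, which this post does not state.

```python
def split_id_range(from_id, to_id, instances):
    """Split [from_id, to_id] into contiguous, non-overlapping sub-ranges."""
    total = to_id - from_id + 1
    size, remainder = divmod(total, instances)
    ranges, start = [], from_id
    for i in range(instances):
        # Spread the remainder over the first sub-ranges.
        length = size + (1 if i < remainder else 0)
        if length == 0:
            break
        end = start + length - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


for low, high in split_id_range(0, 999, 4):
    print(f"alfresco.reindex.fromId={low} alfresco.reindex.toId={high}")
```

Each printed pair could then be passed to a separate run, so that no node id is covered twice.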

Features can also be enabled or disabled through properties:

alfresco.reindex.metadataIndexingEnabled = true
alfresco.reindex.contentIndexingEnabled = true
alfresco.reindex.pathIndexingEnabled = true

>> Additional instructions for scaling up the reindexing process are available in the official documentation.
