Index Version 1

resplin · ‎6 Jun 2015

The official documentation is at: http://docs.alfresco.com

Note: This design has been superseded by Index Version 2 as of Release 1.4.

Requirement

A mechanism is required to search against the properties, full text content
and semi-structured data in the repository. The structural data takes two forms:
the parent-child relationships between nodes, and the location of nodes within
the hierarchies used for categorisation.

The persistence of the data may be separate from the index used to search and locate data.

Examples include indexing external content and separating the storage of content from other information.

Why Lucene?

The intention is to use Lucene as the index and search engine.

It allows the production of an unstructured index with potentially repeating fields.
Each field in the index can be optionally:

indexed (available for search)
stored (available in the documents returned by the search)
tokenised (stored and indexed as is or tokenised)

Not all documents need to index the same fields.
This is a good match to the extensible content model.

It is not clear whether we should use Lucene to store document content as well as to index it.
If delayed indexing or non-storage of one attribute requires properties to be obtained via the node service, then all properties will be returned.

Lucene seems an obvious choice as it resolves the following issues:

Databases have varying support for full text search
- Databases have varying support for indexing foreign content for full text search. It may have to be stored in the database.
- Some databases have to store the content they index
Databases have varying support for hierarchical queries
- The implementation is not performant
- There is no support
- Bridge tables can be used but there may be no triggers to support management of the bridge tables.
It would be preferable to execute queries in one place and not have to merge result sets.
Ideally read access permissions would be applied during data access but it could be a post filter.

Lucene has the following disadvantages:

It does not support joins
It duplicates information held in the repository
- This should be controlled by the data dictionary

We should tokenise each field/attribute according to its type definition.
For example path should be treated in a special way.
We should map to the same analyser on the query side.
Integers etc. need to be stored and tokenised in a form that allows lexicographical ordering. The same applies to dates. Timestamps need to be indexed as dates and treated specially in queries.
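
The ordering requirement for integers can be illustrated with a minimal sketch. It assumes a fixed-width, zero-padded encoding with an offset so that negative values sort before positive ones. The class name SortableIntEncoder and its methods are hypothetical and are not part of Lucene or the repository; a date could be handled in a similar way with a fixed-width representation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: encodes ints so that string order matches numeric order. */
final class SortableIntEncoder {
    // Adding 2^31 maps Integer.MIN_VALUE..Integer.MAX_VALUE onto 0..4294967295.
    private static final long OFFSET = 1L << 31;

    static String encode(int value) {
        // Ten digits hold 4294967295; zero padding gives every token the same width.
        return String.format(Locale.ROOT, "%010d", (long) value + OFFSET);
    }

    static int decode(String token) {
        return (int) (Long.parseLong(token) - OFFSET);
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        for (int v : new int[] {10, -3, 2, 0}) {
            tokens.add(encode(v));
        }
        // A plain string sort now yields the values in numeric order.
        Collections.sort(tokens);
        for (String t : tokens) {
            System.out.println(t + " -> " + decode(t));
        }
    }
}
```

The same encoding would have to be applied on the query side, so that range and term queries compare against tokens in the same form.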

The data dictionary should control the indexing behaviour.

Issues

Search Issues

Prototypes

Implementation Plan

Search - Plan

Recovery

There are two scenarios

JTA
nonJTA

When we are not in the two-phase commit world we have to do more detailed error recovery.
With JTA we will know if we need to recover and just need to know what to do.

For each store we need to keep the following when we prepare a transaction

The things to delete
The delta to merge in
If we have managed recovery (JTA or nonJTA)

If we find a nonJTA TX that still has info we need to determine

did everything fail
did hibernate commit and the index update fail
did everything succeed

In the JTA world we are told what to recover.

To test the index state:

Compare deleted objects in the delta with the database and the index
The same for added objects
- This will decide one way or the other.
If we have only updates just recreate the delta and update
- This will roll back or update to the required state without checking
- We could scan the index and regenerate the entries until one is different
- We could rebuild everything as no change is valid
- We may well have to build the reverse in any case
- Just do for the first pass
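
The state test above can be sketched as follows. This is only an illustration under stated assumptions: nodes are identified by plain strings, the database and index are represented as in-memory sets of the ids they currently contain, and the class IndexRecoveryCheck and its Outcome values are hypothetical names, not the repository's code. It also relies on the non-JTA rule that the index is never committed before the database.

```java
import java.util.Set;

/** Hypothetical sketch of the non-JTA index state test; not the repository's actual code. */
final class IndexRecoveryCheck {

    enum Outcome {
        EVERYTHING_FAILED,
        DATABASE_COMMITTED_INDEX_FAILED,
        EVERYTHING_SUCCEEDED,
        UPDATES_ONLY_REAPPLY_DELTA
    }

    /** True when the store already reflects the delta: additions present, deletions gone. */
    static boolean reflectsDelta(Set<String> added, Set<String> deleted, Set<String> store) {
        return store.containsAll(added) && deleted.stream().noneMatch(store::contains);
    }

    static Outcome classify(Set<String> added, Set<String> deleted,
                            Set<String> database, Set<String> index) {
        if (added.isEmpty() && deleted.isEmpty()) {
            // Only updates: nothing to compare, so recreate the delta and apply it again.
            return Outcome.UPDATES_ONLY_REAPPLY_DELTA;
        }
        if (!reflectsDelta(added, deleted, database)) {
            // Assumes the index is never committed before the database.
            return Outcome.EVERYTHING_FAILED;
        }
        return reflectsDelta(added, deleted, index)
                ? Outcome.EVERYTHING_SUCCEEDED
                : Outcome.DATABASE_COMMITTED_INDEX_FAILED;
    }

    public static void main(String[] args) {
        Set<String> added = Set.of("node-3");
        Set<String> deleted = Set.of("node-1");
        Set<String> database = Set.of("node-2", "node-3"); // database shows the delta
        Set<String> index = Set.of("node-1", "node-2");    // index still shows the old state
        System.out.println(classify(added, deleted, database, index));
    }
}
```

In the sample run the database reflects the delta and the index does not, so the sketch reports that the database committed and the index update failed, which is the case where the delta has to be merged in again.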

If an index is absent we have to rebuild.

If an index is partially corrupted by deleting an index segment then the index will effectively be broken and should be rebuilt from scratch.

In the non-JTA world we would not commit the index before the database, so there is no need to back out a change from the index.

JTA

Support for JTA.

Should switch to the spring pattern for keeping transactional resources.

XAIndexer

Produced by all internal factories.

XAResource getXAResource()

Registration

Conditional on being a JTA or Hibernate transaction manager

JTA

Enlist as resource
Register a Spring synchronisation that does only a NodeService.save()
- beforeCommit()
  - NodeService.save()
- beforeCompletion()
- afterCompletion()

NodeService.save()

integrity first pass
rules
integrity second pass
index flush
optional hibernate flush

nonJTA

- beforeCommit()
  - NodeService.save()
  - indexer prepare();
- beforeCompletion()
- afterCompletion()
  - indexer post action - commit or rollback

This implies we have one synchronisation that optionally does the indexer stuff.

We should be done before the Spring synchronisation
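
The idea of one synchronisation that optionally drives the indexer can be sketched in plain Java. The NodeSaver and Indexer interfaces below are hypothetical stand-ins for NodeService.save() and the indexer, and the callback signatures are simplified: Spring's TransactionSynchronization passes a read-only flag to beforeCommit and an int status to afterCompletion, which is reduced here to a boolean. Treat it as a sketch of the call order only.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: one synchronisation that optionally drives the indexer (the non-JTA case). */
final class IndexSynchronisation {

    /** Hypothetical stand-in for NodeService.save(). */
    interface NodeSaver { void save(); }

    /** Hypothetical stand-in for the indexer's prepare, commit and rollback steps. */
    interface Indexer { void prepare(); void commit(); void rollback(); }

    private final NodeSaver nodeService;
    private final Indexer indexer; // null in the JTA case, where the enlisted XAResource does this work

    IndexSynchronisation(NodeSaver nodeService, Indexer indexer) {
        this.nodeService = nodeService;
        this.indexer = indexer;
    }

    void beforeCommit() {
        nodeService.save();
        if (indexer != null) {
            indexer.prepare();
        }
    }

    void beforeCompletion() {
        // Nothing to do in this sketch.
    }

    void afterCompletion(boolean committed) {
        if (indexer == null) {
            return;
        }
        if (committed) {
            indexer.commit();
        } else {
            indexer.rollback();
        }
    }

    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        Indexer recording = new Indexer() {
            public void prepare() { calls.add("prepare"); }
            public void commit() { calls.add("commit"); }
            public void rollback() { calls.add("rollback"); }
        };
        IndexSynchronisation sync = new IndexSynchronisation(() -> calls.add("save"), recording);
        sync.beforeCommit();
        sync.beforeCompletion();
        sync.afterCompletion(true);
        System.out.println(calls);
    }
}
```

Passing null for the indexer gives the JTA behaviour described above, where the synchronisation does only the save.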

Integration with Lucene

We have modified Lucene 1.4.3 to address a number of minor issues and enhancements.
These are described in Lucene Extensions and Issues.

