Indexer Changes

This page is obsolete.

The official documentation is at: http://docs.alfresco.com



Search Indexing


Proposals to speed up indexing


A quick summary of implementations


Solr


  • Apache Lucene subproject that includes replication of Lucene indexes.
  • They intend to create sub indexes and merge them into one main index in the background.
  • This is not done yet.
  • Command pattern driven
  • Mostly Lucene committers here
  • Lucene based

Compass


  • Manages index segments
  • Ugly interaction with Lucene
  • State about which segments belong to which TX is held in memory
  • Merge segments in the background
  • (Have not looked at it for a while)
  • Lucene based

Reference Implementation


  • Maintains sub indexes and merges them in the background.
  • Merges can fail because deletions may occur in the meantime, wasting the merge work.
  • Volatile in-memory index written to disk periodically.
  • Not clear when the volatile index is available or when any background indexing of content happens.
  • Command pattern driven
  • Search is affected by having multiple readers, but they do not need to be opened all the time.
  • Not clear how they could support in-TX searching.
  • Lucene based




Our current implementation


  • Optimises on-the-fly additions to one index.
  • Does not support sharing index readers (which get more expensive to create as the index gets bigger).
  • We have to discard readers as the index changes.
  • We have no control over how segments are merged, etc.
  • Tied to Lucene 1.4.3
  • Our own Lucene fixes
  • Our own Lucene extensions
  • Lucene based

Sub index pattern in general


  • Using sub indexes and background merging seems to be the preferred approach.
  • All intend to use shared index readers with ref counting (see the sketch after this list).
  • No mention of TX support, isolation, or in-TX search.
  • All manage and share index readers.
  • No one has addressed the delete issues.
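
All of these converge on shared, reference-counted readers. A minimal sketch of what that could look like for us; the class and method names are illustrative, not from any existing codebase:

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;

/**
 * Illustrative reference-counted holder for a shared IndexReader.
 * Searches acquire() before use and release() afterwards; the reader is
 * only closed once it has been retired AND the last user has released it.
 */
class SharedReader
{
    private final IndexReader reader;
    private int refCount = 1;        // the registry itself holds one reference
    private boolean retired = false;

    SharedReader(IndexReader reader)
    {
        this.reader = reader;
    }

    synchronized IndexReader acquire()
    {
        if (retired)
        {
            throw new IllegalStateException("Reader has been replaced; get the current one from the registry");
        }
        refCount++;
        return reader;
    }

    synchronized void release() throws IOException
    {
        refCount--;
        closeIfDone();
    }

    /** Called when a newer reader replaces this one (e.g. after a commit). */
    synchronized void retire() throws IOException
    {
        retired = true;
        refCount--;                  // drop the registry's own reference
        closeIfDone();
    }

    private void closeIfDone() throws IOException
    {
        if (retired && refCount == 0)
        {
            reader.close();          // safe: no searches are using it any more
        }
    }
}
```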




Why would we write it?


  • Better performance for our indexing compared with what we have now
  • Better/faster recovery by supporting prepare, etc.
    • We would know what was committing when we restart.
    • Could check whether prepared TXs exist in the DB and commit them (see the recovery sketch below).
    • Could add XA Support if required
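
A sketch of that recovery idea, assuming a persistent info file that records per-TX status and some way of asking the DB whether a TX committed; both interfaces here are hypothetical:

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical view of the on-disk info file entries. */
interface IndexInfoFile
{
    List<String> entriesWithStatus(String status) throws IOException;
    void setStatus(String txId, String status) throws IOException;
    void remove(String txId) throws IOException;
}

/** Hypothetical check against the repository database. */
interface TransactionLog
{
    boolean isCommitted(String txId);
}

class IndexRecovery
{
    private final IndexInfoFile info;
    private final TransactionLog db;

    IndexRecovery(IndexInfoFile info, TransactionLog db)
    {
        this.info = info;
        this.db = db;
    }

    /** On restart: anything mid-prepare is discarded, anything prepared is resolved against the DB. */
    void recover() throws IOException
    {
        for (String txId : info.entriesWithStatus("PREPARING"))
        {
            info.remove(txId);                     // crash during prepare: the DB never committed it
        }
        for (String txId : info.entriesWithStatus("PREPARED"))
        {
            if (db.isCommitted(txId))
            {
                info.setStatus(txId, "COMMITTED"); // finish the index side of the two-phase commit
            }
            else
            {
                info.remove(txId);                 // the DB rolled back, so the overlay is dropped
            }
        }
    }
}
```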




  • Design so we can share one index (then a cluster could use the same DB and one index)
    • Avoids some Lucene issues
    • Avoids replication in some set ups




  • Overlay support
    • An overlay is an index plus a deletion list (sketched after this list).
    • We already do this to overlay an index delta.
    • Avoids the delete issue.
    • Need to sort this out with the WCM team.
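
A minimal sketch of the overlay shape described above; this is not an actual Alfresco data structure, just the "index plus deletion list" idea:

```java
import java.io.File;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative overlay: a small, self-contained Lucene index on disk plus the
 * set of node references it deletes from everything committed before it.
 * Deletions are not applied immediately; searches honour them by filtering
 * hits, and they are applied physically during background merging.
 */
class Overlay
{
    private final String txId;
    private final File indexLocation;      // the delta written at prepare time
    private final Set<String> deletions;   // node refs deleted by this TX

    Overlay(String txId, File indexLocation, Set<String> deletions)
    {
        this.txId = txId;
        this.indexLocation = indexLocation;
        this.deletions = new HashSet<String>(deletions);
    }

    String getTxId()           { return txId; }
    File getIndexLocation()    { return indexLocation; }
    Set<String> getDeletions() { return Collections.unmodifiableSet(deletions); }

    /** True if a hit from an older index/overlay should be hidden by this overlay. */
    boolean hides(String nodeRef)
    {
        return deletions.contains(nodeRef);
    }
}
```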




  • Scale and clustering
    • We could possibly turn off Lucene locking completely if we have a TX/status file and use NIO locking (sketched below).
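
A sketch of NIO locking on the info/status file; FileChannel and FileLock are standard java.nio APIs, and the file name and location are illustrative:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

class InfoFileLockExample
{
    public static void main(String[] args) throws IOException
    {
        File infoFile = new File("index/IndexInfo");   // illustrative location
        infoFile.getParentFile().mkdirs();

        RandomAccessFile raf = new RandomAccessFile(infoFile, "rw");
        FileChannel channel = raf.getChannel();

        // tryLock() returns null if another process (e.g. another cluster node)
        // already holds the exclusive lock on the info file.
        FileLock lock = channel.tryLock();
        if (lock == null)
        {
            System.out.println("Another server is updating the index status");
        }
        else
        {
            try
            {
                // ... read/update TX statuses in the info file here ...
            }
            finally
            {
                lock.release();                        // always give the lock back
            }
        }
        channel.close();
        raf.close();
    }
}
```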




  • We can use the Lucene standard jar
    • Our fixes are steadily going into Lucene (and the issues are being hit by other people, including reference implementation users!).




  • Could better support backup (and we would have somewhere to lock out changes while doing it).




  • Joins
    • We need to manage multiple readers as a MultiReader to do in-query joins (PATH); see the sketch after this list.
    • Would also be required for other joins.
    • Reusing/ref-counting index searchers should improve this and other searching.
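
The joins point is essentially about searching the main index, sub indexes, and overlays as one logical index. A sketch using Lucene's MultiReader, written against a recent Lucene API (DirectoryReader, IndexSearcher) rather than the 1.4.3 API discussed here; the paths and the field name are illustrative:

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

class MultiReaderSearchExample
{
    public static void main(String[] args) throws IOException
    {
        // One reader per physical index: the main index, a sub index, and an overlay.
        IndexReader main    = DirectoryReader.open(FSDirectory.open(Paths.get("index/main")));
        IndexReader sub     = DirectoryReader.open(FSDirectory.open(Paths.get("index/sub-0001")));
        IndexReader overlay = DirectoryReader.open(FSDirectory.open(Paths.get("index/tx-42")));

        // MultiReader presents them as a single logical index, so PATH joins and
        // other queries see one consistent document space.
        MultiReader logicalIndex = new MultiReader(main, sub, overlay);
        IndexSearcher searcher = new IndexSearcher(logicalIndex);

        TopDocs hits = searcher.search(new TermQuery(new Term("PATH", "/app:company_home")), 10);
        System.out.println("hits: " + hits.totalHits);

        logicalIndex.close();   // with this constructor, closing the MultiReader also closes the sub readers
    }
}
```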




  • Support in-TX search
  • Better management of isolation levels
  • Can do cross-store searching at the same time

Overview Design


An index (for a store) is made up of the following (a data-model sketch follows below):


  • One info file which can be used for locking (NIO)
  • One object instance to support indexing per server per store




  • One main index
    • Always optimised
  • Many sub indexes
    • Always optimised
  • Many overlays
    • Overlays have a deletion list
    • Some are real overlays
      • Not optimised
    • Some use overlays to commit an index
    • Some have the delayed (FTS) indexing done, some do not
      • Optimised
    • Deltas
      • TX change set
      • Not optimised
      • Mix of in-memory and on-disk
      • Must be on disk with its deletions at commit time
      • Changes as the TX indexes


Prepare


  • Set preparing in lock file
  • Flush delta to disk
  • Flush delete list to disk
  • Set prepared

The delta is now an overlay.
Prepare is the most costly step.
Lucene is most efficient when it only has to write small files.
The time this takes only affects the preparing thread.
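
A sketch of prepare as described above; the interfaces are hypothetical stand-ins for the info/lock file and the per-TX delta:

```java
import java.io.IOException;

/** Hypothetical handle on the info/lock file (persists each status change durably). */
interface StatusFile
{
    void setStatus(String txId, String status) throws IOException;
}

/** Hypothetical handle on the in-flight TX delta. */
interface Delta
{
    String getTxId();
    void flushIndexToDisk() throws IOException;     // write the small per-TX Lucene index
    void flushDeletionsToDisk() throws IOException; // write the TX deletion list alongside it
}

class PrepareStep
{
    private final StatusFile statusFile;

    PrepareStep(StatusFile statusFile)
    {
        this.statusFile = statusFile;
    }

    /** After this returns, the delta is a disk-based overlay and commit is cheap. */
    void prepare(Delta delta) throws IOException
    {
        statusFile.setStatus(delta.getTxId(), "PREPARING"); // 1. record intent (recoverable on crash)
        delta.flushIndexToDisk();                           // 2. the costly bit: writing the small index
        delta.flushDeletionsToDisk();                       // 3. deletions travel with the overlay
        statusFile.setStatus(delta.getTxId(), "PREPARED");  // 4. now it is an overlay
    }
}
```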



Commit


  • Commit in the info file
  • Rebuild readers list

Commit is then very fast (a sketch covering commit and rollback follows the Rollback section).



Rollback


  • Remove the entry from the info file
  • Tidy up (could be done in the background, as we always know what is of interest)

Rollback is fast
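
Commit and rollback then reduce to info-file updates, sketched together here in the same hypothetical style as the prepare sketch; the interfaces are illustrative:

```java
import java.io.IOException;

/** Hypothetical info-file operations used by commit and rollback. */
interface TxStatusFile
{
    void setStatus(String txId, String status) throws IOException;
    void remove(String txId) throws IOException;
}

/** Hypothetical registry of shared index readers. */
interface ReaderRegistry
{
    void rebuild() throws IOException;   // new searches pick up the committed overlay
}

/** Hypothetical background cleaner for abandoned delta folders. */
interface Cleaner
{
    void scheduleDelete(String txId);
}

class CommitRollbackStep
{
    private final TxStatusFile statusFile;
    private final ReaderRegistry readers;
    private final Cleaner cleaner;

    CommitRollbackStep(TxStatusFile statusFile, ReaderRegistry readers, Cleaner cleaner)
    {
        this.statusFile = statusFile;
        this.readers = readers;
        this.cleaner = cleaner;
    }

    /** Commit: one durable status flip plus a reader-list rebuild. */
    void commit(String txId) throws IOException
    {
        statusFile.setStatus(txId, "COMMITTED");
        readers.rebuild();                 // old readers drain away via their ref counts
    }

    /** Rollback: drop the entry and tidy the on-disk delta in the background. */
    void rollback(String txId) throws IOException
    {
        statusFile.remove(txId);
        cleaner.scheduleDelete(txId);      // we always know exactly which folder is of interest
    }
}
```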



Could duplicate each entry's status in its sub folder so the info file can be rebuilt if required.



FTS


  • Do FTS when moving from an overlay to a sub index



Background


  • FTS
    • Add FTS delayed index attributes to overlays
  • Merge
    • Overlays to sub indexes
      • Do deletes (affects the main index and sub indexes)
        • Do in overlay order
      • Build a new optimised sub index
      • Commit
      • Swap over index readers
      • Build a new reader to share and reuse
    • Sub indexes to the main index

Deletes done in transactions do not disrupt merging; they are applied at merge time (see the merge sketch below).
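
A sketch of the overlay-to-sub-index merge step, written against a recent Lucene API for illustration. IndexWriter.deleteDocuments, addIndexes, and forceMerge are real Lucene calls; the surrounding structure, the "ID" field name, and the CommittedOverlay type are assumptions:

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

class OverlayMergeSketch
{
    /** Illustrative view of a committed overlay: where its index lives and what it deletes. */
    static class CommittedOverlay
    {
        final Directory indexDir;
        final List<String> deletedNodeRefs;

        CommittedOverlay(Directory indexDir, List<String> deletedNodeRefs)
        {
            this.indexDir = indexDir;
            this.deletedNodeRefs = deletedNodeRefs;
        }
    }

    /**
     * Merge committed overlays into a new, optimised sub index.
     * Deletions are applied to the existing indexes first, in overlay order,
     * then the overlay contents are combined and optimised.
     */
    static void mergeOverlays(List<Directory> existingIndexes,   // main index + current sub indexes
                              List<CommittedOverlay> overlays,
                              Directory newSubIndex) throws IOException
    {
        // 1. Apply each overlay's deletion list to the main index and sub indexes, in order.
        for (Directory existing : existingIndexes)
        {
            IndexWriter writer = new IndexWriter(existing, new IndexWriterConfig(new StandardAnalyzer()));
            for (CommittedOverlay overlay : overlays)
            {
                for (String nodeRef : overlay.deletedNodeRefs)
                {
                    writer.deleteDocuments(new Term("ID", nodeRef));   // "ID" field name is an assumption
                }
            }
            writer.commit();
            writer.close();
        }

        // 2. Build a single optimised sub index from the overlay contents.
        IndexWriter subWriter = new IndexWriter(newSubIndex, new IndexWriterConfig(new StandardAnalyzer()));
        for (CommittedOverlay overlay : overlays)
        {
            subWriter.addIndexes(overlay.indexDir);   // copy the overlay's segments in
        }
        subWriter.forceMerge(1);                      // keep the sub index "always optimised"
        subWriter.commit();
        subWriter.close();

        // 3. At this point the info file would be updated and the shared readers swapped over.
    }
}
```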



Isolation levels


  • Serializable - repeatable read + prepare checks
  • Read committed - current main index reader + in-TX overlay
  • Repeatable read - fixed main index reader + in-TX overlay
  • Icky (dirty) read - include other transactions' changes
    • Only support read committed at the start; the mapping to readers is sketched below
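
The isolation levels above mostly determine which readers a search sees. A rough sketch of that mapping, using hypothetical types; Lucene itself knows nothing about these levels:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative mapping from isolation level to the set of readers a search uses. */
class IsolationSketch
{
    enum IsolationLevel { READ_COMMITTED, REPEATABLE_READ, SERIALIZABLE, READ_UNCOMMITTED }

    /** Hypothetical stand-in for a ref-counted reader over one index/overlay. */
    interface ReaderHandle { }

    interface ReaderSource
    {
        List<ReaderHandle> currentCommittedReaders();  // main index + committed sub indexes/overlays, now
        List<ReaderHandle> readersAtTxStart();         // the set fixed when this TX started
        ReaderHandle inTxDeltaReader();                // this TX's own uncommitted changes
        List<ReaderHandle> otherTxDeltaReaders();      // other TXs' uncommitted deltas
    }

    static List<ReaderHandle> readersFor(IsolationLevel level, ReaderSource source)
    {
        List<ReaderHandle> readers = new ArrayList<ReaderHandle>();
        switch (level)
        {
            case READ_COMMITTED:                       // current committed state + own delta
                readers.addAll(source.currentCommittedReaders());
                readers.add(source.inTxDeltaReader());
                break;
            case REPEATABLE_READ:                      // state fixed at TX start + own delta
            case SERIALIZABLE:                         // same view; conflicts caught at prepare
                readers.addAll(source.readersAtTxStart());
                readers.add(source.inTxDeltaReader());
                break;
            case READ_UNCOMMITTED:                     // the "icky read": include other TXs' deltas
                readers.addAll(source.currentCommittedReaders());
                readers.add(source.inTxDeltaReader());
                readers.addAll(source.otherTxDeltaReaders());
                break;
        }
        return readers;
    }
}
```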



Could have small indexes (such as the TX delta) in memory for performance (sketched below).
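
For the in-memory case, the delta could live in a RAM-based Lucene Directory until prepare flushes it to disk. A minimal sketch against a recent Lucene API (ByteBuffersDirectory replaced the RAMDirectory of the 1.4.3 era); the field names and values are illustrative:

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

class InMemoryDeltaExample
{
    public static void main(String[] args) throws IOException
    {
        // The TX delta is built in memory; prepare would flush/copy it to disk.
        ByteBuffersDirectory deltaDir = new ByteBuffersDirectory();
        IndexWriter writer = new IndexWriter(deltaDir, new IndexWriterConfig(new StandardAnalyzer()));

        Document doc = new Document();
        doc.add(new StringField("ID", "workspace://SpacesStore/abc123", Field.Store.YES));
        doc.add(new TextField("TEXT", "quick brown fox", Field.Store.NO));
        writer.addDocument(doc);

        writer.commit();
        writer.close();
        deltaDir.close();
    }
}
```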