Indexer Changes

This page is obsolete.

The official documentation is at: http://docs.alfresco.com



Search Indexing


Proposals to speed up indexing


A quick summary of implementations


Solr


  • Apache Lucene subproject that includes replication of Lucene indexes.
  • They intend to create sub indexes and merge them into one main index in the background.
  • This is not done yet.
  • Command pattern driven
  • Mostly Lucene committers here
  • Lucene based

Compass


  • Manages index segments
  • Ugly interaction with Lucene
  • State about which segments belong to which TX is held in memory
  • Merge segments in the background
  • (Have not looked at it for a while)
  • Lucene based

Reference Implementation


  • Maintains sub indexes and merges them in the background.
  • Merges can fail because deletions may occur in the meantime, wasting the merge work.
  • Volatile in-memory index written to disk periodically.
  • Not clear when the volatile index is available or when any background indexing of content happens.
  • Command pattern driven
  • Search is affected by having multiple readers, but they do not need to be opened all the time.
  • Not clear how they could support in-TX searching.
  • Lucene based




Our current implementation


  • Optimises on-the-fly additions to one index.
  • Does not support sharing index readers (which get more expensive to create as the index gets bigger).
  • We have to discard readers as the index changes.
  • We have no control over how segments are merged, etc.
  • Tied to Lucene 1.4.3
  • Our own Lucene fixes
  • Our own Lucene extensions
  • Lucene based

Sub index pattern in general


  • Using sub indexes and background merging seems to be the preferred approach.
  • All intend to use shared index readers with ref counting (see the sketch after this list).
  • No mention of TX support, isolation, or in-TX search.
  • All manage and share index readers.
  • No one has addressed the delete issues.
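
All of these converge on shared, reference-counted readers. A minimal sketch of what that could look like for us; the class and method names are illustrative, not from any existing codebase:

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;

/**
 * Illustrative reference-counted holder for a shared IndexReader.
 * Searches acquire() before use and release() afterwards; the reader is
 * only closed once it has been retired AND the last user has released it.
 */
class SharedReader
{
    private final IndexReader reader;
    private int refCount = 1;        // the registry itself holds one reference
    private boolean retired = false;

    SharedReader(IndexReader reader)
    {
        this.reader = reader;
    }

    synchronized IndexReader acquire()
    {
        if (retired)
        {
            throw new IllegalStateException("Reader has been replaced; get the current one from the registry");
        }
        refCount++;
        return reader;
    }

    synchronized void release() throws IOException
    {
        refCount--;
        closeIfDone();
    }

    /** Called when a newer reader replaces this one (e.g. after a commit). */
    synchronized void retire() throws IOException
    {
        retired = true;
        refCount--;                  // drop the registry's own reference
        closeIfDone();
    }

    private void closeIfDone() throws IOException
    {
        if (retired && refCount == 0)
        {
            reader.close();          // safe: no searches are using it any more
        }
    }
}
```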




Why would we write it?


  • Better performance for our indexing compared with what we have now
  • Better/faster recovery by supporting prepare, etc.
    • We would know what was committing when we restart.
    • Could check whether prepared TXs exist in the DB and commit them (see the recovery sketch below).
    • Could add XA Support if required
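
A sketch of that recovery idea, assuming a persistent info file that records per-TX status and some way of asking the DB whether a TX committed; both interfaces here are hypothetical:

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical view of the on-disk info file entries. */
interface IndexInfoFile
{
    List<String> entriesWithStatus(String status) throws IOException;
    void setStatus(String txId, String status) throws IOException;
    void remove(String txId) throws IOException;
}

/** Hypothetical check against the repository database. */
interface TransactionLog
{
    boolean isCommitted(String txId);
}

class IndexRecovery
{
    private final IndexInfoFile info;
    private final TransactionLog db;

    IndexRecovery(IndexInfoFile info, TransactionLog db)
    {
        this.info = info;
        this.db = db;
    }

    /** On restart: anything mid-prepare is discarded, anything prepared is resolved against the DB. */
    void recover() throws IOException
    {
        for (String txId : info.entriesWithStatus("PREPARING"))
        {
            info.remove(txId);                     // crash during prepare: the DB never committed it
        }
        for (String txId : info.entriesWithStatus("PREPARED"))
        {
            if (db.isCommitted(txId))
            {
                info.setStatus(txId, "COMMITTED"); // finish the index side of the two-phase commit
            }
            else
            {
                info.remove(txId);                 // the DB rolled back, so the overlay is dropped
            }
        }
    }
}
```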




  • Design so we can share one index (then a cluster could use the same DB and one index)
    • Avoids some Lucene issues
    • Avoids replication in some set ups




  • Overlay support
    • An overlay is an index plus a deletion list (sketched after this list).
    • We already do this to overlay an index delta.
    • Avoids the delete issue.
    • Need to sort this out with the WCM team.
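
A minimal sketch of the overlay shape described above; this is not an actual Alfresco data structure, just the "index plus deletion list" idea:

```java
import java.io.File;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative overlay: a small, self-contained Lucene index on disk plus the
 * set of node references it deletes from everything committed before it.
 * Deletions are not applied immediately; searches honour them by filtering
 * hits, and they are applied physically during background merging.
 */
class Overlay
{
    private final String txId;
    private final File indexLocation;      // the delta written at prepare time
    private final Set<String> deletions;   // node refs deleted by this TX

    Overlay(String txId, File indexLocation, Set<String> deletions)
    {
        this.txId = txId;
        this.indexLocation = indexLocation;
        this.deletions = new HashSet<String>(deletions);
    }

    String getTxId()           { return txId; }
    File getIndexLocation()    { return indexLocation; }
    Set<String> getDeletions() { return Collections.unmodifiableSet(deletions); }

    /** True if a hit from an older index/overlay should be hidden by this overlay. */
    boolean hides(String nodeRef)
    {
        return deletions.contains(nodeRef);
    }
}
```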




  • Scale and clustering
    • We could possibly turn off Lucene locking completely if we have a TX/status file and use NIO locking (sketched below).
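
A sketch of NIO locking on the info/status file; FileChannel and FileLock are standard java.nio APIs, and the file name and location are illustrative:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

class InfoFileLockExample
{
    public static void main(String[] args) throws IOException
    {
        File infoFile = new File("index/IndexInfo");   // illustrative location
        infoFile.getParentFile().mkdirs();

        RandomAccessFile raf = new RandomAccessFile(infoFile, "rw");
        FileChannel channel = raf.getChannel();

        // tryLock() returns null if another process (e.g. another cluster node)
        // already holds the exclusive lock on the info file.
        FileLock lock = channel.tryLock();
        if (lock == null)
        {
            System.out.println("Another server is updating the index status");
        }
        else
        {
            try
            {
                // ... read/update TX statuses in the info file here ...
            }
            finally
            {
                lock.release();                        // always give the lock back
            }
        }
        channel.close();
        raf.close();
    }
}
```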




  • We can use the Lucene standard jar
    • Our fixes are steadily going into Lucene (and the issues are being hit by other people, including reference implementation users!).




  • Could better support backup (and we would have somewhere to lock out changes while doing it).




  • Joins
    • We need to manage multiple readers as a MultiReader to do in-query joins (PATH); see the sketch after this list.
    • Would also be required for other joins.
    • Reusing/ref-counting index searchers should improve this and other searching.
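
The joins point is essentially about searching the main index, sub indexes, and overlays as one logical index. A sketch using Lucene's MultiReader, written against a recent Lucene API (DirectoryReader, IndexSearcher) rather than the 1.4.3 API discussed here; the paths and the field name are illustrative:

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

class MultiReaderSearchExample
{
    public static void main(String[] args) throws IOException
    {
        // One reader per physical index: the main index, a sub index, and an overlay.
        IndexReader main    = DirectoryReader.open(FSDirectory.open(Paths.get("index/main")));
        IndexReader sub     = DirectoryReader.open(FSDirectory.open(Paths.get("index/sub-0001")));
        IndexReader overlay = DirectoryReader.open(FSDirectory.open(Paths.get("index/tx-42")));

        // MultiReader presents them as a single logical index, so PATH joins and
        // other queries see one consistent document space.
        MultiReader logicalIndex = new MultiReader(main, sub, overlay);
        IndexSearcher searcher = new IndexSearcher(logicalIndex);

        TopDocs hits = searcher.search(new TermQuery(new Term("PATH", "/app:company_home")), 10);
        System.out.println("hits: " + hits.totalHits);

        logicalIndex.close();   // with this constructor, closing the MultiReader also closes the sub readers
    }
}
```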




  • Support in-TX search
  • Better management of isolation levels
  • Can do cross-store searching at the same time

Overview Design


An index (for a store) is made up of the following (a data-model sketch follows below):


  • One info file which can be used for locking (NIO)
  • One object instance to support indexing per server per store




  • One main index
    • Always optimised
  • Many sub indexes
    • Always optimised
  • Many overlays
    • Overlays have a deletion list
    • Some are real overlays
      • Not optimised
    • Some use overlays to commit an index
    • Some have the delayed (FTS) indexing done, some do not
      • Optimised
    • Deltas
      • TX change set
      • Not optimised
      • Mix of in-memory and on-disk
      • Must be on disk with its deletions at commit time
      • Changes as the TX indexes


Prepare


  • Set preparing in lock file
  • Flush delta to disk
  • Flush delete list to disk
  • Set prepared

The delta is now an overlay.
Prepare is the most costly step.
Lucene is most efficient when it only has to write small files.
The time this takes only affects the preparing thread.
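
A sketch of prepare as described above; the interfaces are hypothetical stand-ins for the info/lock file and the per-TX delta:

```java
import java.io.IOException;

/** Hypothetical handle on the info/lock file (persists each status change durably). */
interface StatusFile
{
    void setStatus(String txId, String status) throws IOException;
}

/** Hypothetical handle on the in-flight TX delta. */
interface Delta
{
    String getTxId();
    void flushIndexToDisk() throws IOException;     // write the small per-TX Lucene index
    void flushDeletionsToDisk() throws IOException; // write the TX deletion list alongside it
}

class PrepareStep
{
    private final StatusFile statusFile;

    PrepareStep(StatusFile statusFile)
    {
        this.statusFile = statusFile;
    }

    /** After this returns, the delta is a disk-based overlay and commit is cheap. */
    void prepare(Delta delta) throws IOException
    {
        statusFile.setStatus(delta.getTxId(), "PREPARING"); // 1. record intent (recoverable on crash)
        delta.flushIndexToDisk();                           // 2. the costly bit: writing the small index
        delta.flushDeletionsToDisk();                       // 3. deletions travel with the overlay
        statusFile.setStatus(delta.getTxId(), "PREPARED");  // 4. now it is an overlay
    }
}
```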



Commit


  • Commit in the info file
  • Rebuild readers list

Commit is then very fast (a sketch covering commit and rollback follows the Rollback section).



Rollback


  • Remove the entry from the info file
  • Tidy up (could be done in the background, as we always know what is of interest)

Rollback is fast
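
Commit and rollback then reduce to info-file updates, sketched together here in the same hypothetical style as the prepare sketch; the interfaces are illustrative:

```java
import java.io.IOException;

/** Hypothetical info-file operations used by commit and rollback. */
interface TxStatusFile
{
    void setStatus(String txId, String status) throws IOException;
    void remove(String txId) throws IOException;
}

/** Hypothetical registry of shared index readers. */
interface ReaderRegistry
{
    void rebuild() throws IOException;   // new searches pick up the committed overlay
}

/** Hypothetical background cleaner for abandoned delta folders. */
interface Cleaner
{
    void scheduleDelete(String txId);
}

class CommitRollbackStep
{
    private final TxStatusFile statusFile;
    private final ReaderRegistry readers;
    private final Cleaner cleaner;

    CommitRollbackStep(TxStatusFile statusFile, ReaderRegistry readers, Cleaner cleaner)
    {
        this.statusFile = statusFile;
        this.readers = readers;
        this.cleaner = cleaner;
    }

    /** Commit: one durable status flip plus a reader-list rebuild. */
    void commit(String txId) throws IOException
    {
        statusFile.setStatus(txId, "COMMITTED");
        readers.rebuild();                 // old readers drain away via their ref counts
    }

    /** Rollback: drop the entry and tidy the on-disk delta in the background. */
    void rollback(String txId) throws IOException
    {
        statusFile.remove(txId);
        cleaner.scheduleDelete(txId);      // we always know exactly which folder is of interest
    }
}
```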



Could duplicate each entry's status in its sub folder so the info file can be rebuilt if required.



FTS


  • Do FTS when moving from an overlay to a sub index



Background


  • FTS
    • Add FTS delayed index attributes to overlays
  • Merge
    • Overlays to sub indexes
      • Do deletes (affects the main index and sub indexes)
        • Do in overlay order
      • Build a new optimised sub index
      • Commit
      • Swap over index readers
      • Build a new reader to share and reuse
    • Sub indexes to the main index

Deletes done in transactions do not disrupt merging; they are applied at merge time (see the merge sketch below).
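
A sketch of the overlay-to-sub-index merge step, written against a recent Lucene API for illustration. IndexWriter.deleteDocuments, addIndexes, and forceMerge are real Lucene calls; the surrounding structure, the "ID" field name, and the CommittedOverlay type are assumptions:

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

class OverlayMergeSketch
{
    /** Illustrative view of a committed overlay: where its index lives and what it deletes. */
    static class CommittedOverlay
    {
        final Directory indexDir;
        final List<String> deletedNodeRefs;

        CommittedOverlay(Directory indexDir, List<String> deletedNodeRefs)
        {
            this.indexDir = indexDir;
            this.deletedNodeRefs = deletedNodeRefs;
        }
    }

    /**
     * Merge committed overlays into a new, optimised sub index.
     * Deletions are applied to the existing indexes first, in overlay order,
     * then the overlay contents are combined and optimised.
     */
    static void mergeOverlays(List<Directory> existingIndexes,   // main index + current sub indexes
                              List<CommittedOverlay> overlays,
                              Directory newSubIndex) throws IOException
    {
        // 1. Apply each overlay's deletion list to the main index and sub indexes, in order.
        for (Directory existing : existingIndexes)
        {
            IndexWriter writer = new IndexWriter(existing, new IndexWriterConfig(new StandardAnalyzer()));
            for (CommittedOverlay overlay : overlays)
            {
                for (String nodeRef : overlay.deletedNodeRefs)
                {
                    writer.deleteDocuments(new Term("ID", nodeRef));   // "ID" field name is an assumption
                }
            }
            writer.commit();
            writer.close();
        }

        // 2. Build a single optimised sub index from the overlay contents.
        IndexWriter subWriter = new IndexWriter(newSubIndex, new IndexWriterConfig(new StandardAnalyzer()));
        for (CommittedOverlay overlay : overlays)
        {
            subWriter.addIndexes(overlay.indexDir);   // copy the overlay's segments in
        }
        subWriter.forceMerge(1);                      // keep the sub index "always optimised"
        subWriter.commit();
        subWriter.close();

        // 3. At this point the info file would be updated and the shared readers swapped over.
    }
}
```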



Isolation levels


  • Serializable - repeatable read + prepare checks
  • Read committed - current main index reader + in-TX overlay
  • Repeatable read - fixed main index reader + in-TX overlay
  • Icky (dirty) read - include other transactions' changes
    • Only support read committed at the start; the mapping to readers is sketched below
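
The isolation levels above mostly determine which readers a search sees. A rough sketch of that mapping, using hypothetical types; Lucene itself knows nothing about these levels:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative mapping from isolation level to the set of readers a search uses. */
class IsolationSketch
{
    enum IsolationLevel { READ_COMMITTED, REPEATABLE_READ, SERIALIZABLE, READ_UNCOMMITTED }

    /** Hypothetical stand-in for a ref-counted reader over one index/overlay. */
    interface ReaderHandle { }

    interface ReaderSource
    {
        List<ReaderHandle> currentCommittedReaders();  // main index + committed sub indexes/overlays, now
        List<ReaderHandle> readersAtTxStart();         // the set fixed when this TX started
        ReaderHandle inTxDeltaReader();                // this TX's own uncommitted changes
        List<ReaderHandle> otherTxDeltaReaders();      // other TXs' uncommitted deltas
    }

    static List<ReaderHandle> readersFor(IsolationLevel level, ReaderSource source)
    {
        List<ReaderHandle> readers = new ArrayList<ReaderHandle>();
        switch (level)
        {
            case READ_COMMITTED:                       // current committed state + own delta
                readers.addAll(source.currentCommittedReaders());
                readers.add(source.inTxDeltaReader());
                break;
            case REPEATABLE_READ:                      // state fixed at TX start + own delta
            case SERIALIZABLE:                         // same view; conflicts caught at prepare
                readers.addAll(source.readersAtTxStart());
                readers.add(source.inTxDeltaReader());
                break;
            case READ_UNCOMMITTED:                     // the "icky read": include other TXs' deltas
                readers.addAll(source.currentCommittedReaders());
                readers.add(source.inTxDeltaReader());
                readers.addAll(source.otherTxDeltaReaders());
                break;
        }
        return readers;
    }
}
```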



Could have small indexes (such as the TX delta) in memory for performance (sketched below).
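
For the in-memory case, the delta could live in a RAM-based Lucene Directory until prepare flushes it to disk. A minimal sketch against a recent Lucene API (ByteBuffersDirectory replaced the RAMDirectory of the 1.4.3 era); the field names and values are illustrative:

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

class InMemoryDeltaExample
{
    public static void main(String[] args) throws IOException
    {
        // The TX delta is built in memory; prepare would flush/copy it to disk.
        ByteBuffersDirectory deltaDir = new ByteBuffersDirectory();
        IndexWriter writer = new IndexWriter(deltaDir, new IndexWriterConfig(new StandardAnalyzer()));

        Document doc = new Document();
        doc.add(new StringField("ID", "workspace://SpacesStore/abc123", Field.Store.YES));
        doc.add(new TextField("TEXT", "quick brown fox", Field.Store.NO));
        writer.addDocument(doc);

        writer.commit();
        writer.close();
        deltaDir.close();
    }
}
```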