Index Version 2

cancel
Showing results for 
Search instead for 
Did you mean: 

Index Version 2

resplin
Intermediate
0 0 1,414

Obsolete Pages{{Obsolete}}

The official documentation is at: http://docs.alfresco.com



Search Indexing

This page explains the design of Alfresco's Lucene indexes and how they relate to each other.  Version 2 has been introduced in Alfresco Version 1.4. 


Design


The index structure is made up indexes and deltas.


  • Indexes are simply Lucene indexes that are concatenated in order.
  • Deltas Indexes are lucene indexes plus a list of nodes that should be deleted from previously indexed information.
  • IndexInfo file - the type, transactional status and order of these elements is stored in the indexInfo file along with other auxiliary information.

All the information from a transaction is stored as an index delta. The deletions should be applied to any previous indexes or deltas and then the index concatenated to the result.


Example


For example, if the first transaction to an index committed nodes 1, 2, 3 and 4 it would contain a deletion list for 1, 2, 3 and 4 and then the index containing the information for nodes 1, 2, 3 and 4

Index 

Delta 1:
Deletions:    1  2  3  4 
Index:        1  2  3  4 

The overall index is 1, 2, 3 and 4;

If the next transaction deletes node 1, updates node 2 and adds node 5.

Index

Delta 1:
Deletions:    1  2  3  4 
Index:        1  2  3  4 

Delta 2:
Deletions:    1  2        5
Index:           2        5 


The overall Index is  3, 4, 2 and 5

Lets say the next transaction updates node 4 deletes node 5 and adds node 6. This creates Delta Index 3.

Index

Delta 1:
Deletions:    1  2  3  4 
Index:        1  2  3  4 

Delta 2:
Deletions:    1  2        5
Index:           2        5 

Delta 3:
Deletions:             4  5  6
Index:                 4     6

The overall Index is 3, 2, 4 and 6


Merging the Delta Indexes


This index could be simplified in two ways
- applying the deletions
- merging into a bigger index

The first stage would be to apply the deletion defined in Delta 1 to give:

Index

Index 1:
Index:        1  2  3  4 

Delta 2:
Deletions:    1  2        5
Index:           2        5 

Delta 3:
Deletions:             4  5  6
Index:                 4     6

Then to apply the deletions from Delta 2 to give:

Index 

Index 1':
Index:              3  4 

Index 2:   
Index:           2        5 

Delta 3:
Deletions:             4  5  6
Index:                 4     6



Then to apply the deletions from Delta 3 to give :

Index

Index 1:
Index:              3   

Index 2':   
Index:           2         

Index 3:
Index:                 4     6

These indexes could then be merged to produce

Index

Index 4:
Index            3  2  4  6

The order of additions to the index being preserved.

The application of deletes and merging of indexes can be done in the background.


Other Notes


In this design, prepare has to position a delta in the list. Once done, its position is fixed. Creating the index delta and storing the deletes is done by the transaction thread. Commit is then just a status change which needs to be persisted to the IndexInfo file.

For recovery a duplicate index info file is created to allow for failure while writing one or the other.

A Lucene reader is required that applies deletes to one reader and appends another.

The components of the index remain unchanged unless they have deletions applied or are merged together.
IndexReaders are valid until these actions occur. They can be cached and reused until they go out of date.
This is managed via reference counting and an out of date flag.

After merges unused index information is cleaned up in the background.

The merge and deletion algorithm aims for a fixed number of index files waiting to be merged or deltas waiting to have deletions applied. 

NIO is used for file locking, RandomAccessFile buffering and flushing to disk. This is sufficient to provide locking for sharing indexes between mutiple repositories. Additional work is required to verify and update index state as altered by other readers and to remove pending operations that hang because the JVM dies or looses communication.


To do


  • Shared Indexer
    • Heartbeat
    • recovery when one indexer is lost
      • marking
      • restarting
      • reconnecting (may have been rolled back by another indexer ....commit will just fail)
      • Reporting and warnings
    • Blocked merging of index deltas




  • API
    • Split out keys for delete etc
    • Contribute to lucene




  • XA




  • Fix against unit tests




  • Index backup
    • Don't stop indexing write/reads




  • Tidy up unknown directories




  • Expose lock for external actions?




  • Federated Search
    • overlays for federated seraches?




  • In memory merges for small merge operations.




  • No need to force the delta index to disk to do the search?




Why version 2?


Version 1 has several issues which are addressed by version 2


  • IPC index lock and holding of index locks
    • Version 2 users nio locking and disables lucene locking. This gives correct IPC and does not leave outstanding locks
  • Commit performance
    • V1 does index merging at commit time
    • V2 commits a delta and deals with bulk merging etc in the background. Commit is very fast. Improves concurrency.
  • Recovery, state and potential support for XA
    • V2 holds index and transaction state allowing recovery and potential support for XA
  • Shared indexer
    • V1 two repositories can not use the same index
      • well it is possible but there are lock issues and no way to clear confused locks automatically
    • V2 Supports shared indexing and uses nio to do this.




  • Performance
    • V1 mutipe merges on commits
    • V2 background merges can merge more - which is more efficient
  • Use of reference counting searches
    • V1 re-reads index info too often
    • V2 manages what can be reused and supports reference counting for auto closing when unused and out of date
      • Improved search performance
  • Overlays
    • V2 specifically supports the idea of transactions as overlays.
    • There is no reason why we can not have long running overlays.