Search - Prototype 2

resplin · 6 Jun 2015

The official documentation is at: http://docs.alfresco.com

Revised Index structure

UUID (The unique identifier for the node)
FTS (The full text search entry for the node)
PATH (The full path to the node. If there are multiple paths this can be repeated)
QNAME (The fully qualified name of the node)
ANCESTORID (A repeated field containing the IDs of all the node's ancestors, including itself)
LEVEL (The depth of this node relative to the root - may be repeated)
WORKSPACEID (The id of the workspace)
Attributes as (Name=@ns:name Value=value and also Name=@ns: and Name=name)
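
As a rough illustration of the structure above, the sketch below builds the list of (field, value) pairs for one node. The `Node` class, the `index_fields` helper and the sample values are hypothetical. The reading of the attribute rule (each attribute value indexed under `@ns:name`, `@ns:` and `name`) and the way LEVEL is derived from the path are assumptions, and the FTS entry is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a repository node."""
    uuid: str
    qname: str                      # e.g. "cm:report"
    workspace_id: str
    paths: list                     # one entry per path to the node
    ancestor_ids: list              # ancestors, root first, excluding the node
    attributes: dict = field(default_factory=dict)  # {(ns, name): value}


def index_fields(node):
    """Return (field, value) pairs; repeated fields appear more than once."""
    fields = [
        ("UUID", node.uuid),
        ("QNAME", node.qname),
        ("WORKSPACEID", node.workspace_id),
    ]
    for path in node.paths:
        fields.append(("PATH", path))
        # Assumed: depth = number of path elements below the root.
        depth = len([p for p in path.split("/") if p])
        fields.append(("LEVEL", str(depth)))
    # ANCESTORID includes the node itself.
    for ancestor in node.ancestor_ids + [node.uuid]:
        fields.append(("ANCESTORID", ancestor))
    for (ns, name), value in node.attributes.items():
        # Assumed reading: the same value is indexed under three names.
        for field_name in (f"@{ns}:{name}", f"@{ns}:", name):
            fields.append((field_name, value))
    return fields


node = Node(
    uuid="n3",
    qname="cm:report",
    workspace_id="ws1",
    paths=["/cm:company/cm:docs/cm:report"],
    ancestor_ids=["n1", "n2"],
    attributes={("cm", "title"): "Q1"},
)
fields = index_fields(node)
```

With ANCESTORID repeated like this, "everything under n1" can be expressed as a single term match on ANCESTORID instead of a prefix match on PATH.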

To Add

These property types will require special indexing and tokenisation

QName
Path
Category
Security

To execute complex structure expressions a new type of query is required.

Simple tokeniser
- Path is
  - Depth
  - NameSpace + Name (repeating)
  - Optional end marker followed by other paths
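
One possible reading of the token layout above is sketched here. The `tokenise_paths` function, the token tuples and the `<END>` marker string are hypothetical names chosen for illustration; the notes do not say how the depth or the end marker are actually encoded.

```python
END_MARKER = "<END>"  # hypothetical marker separating repeated paths


def tokenise_paths(paths):
    """Turn a list of '/ns:name/ns:name' style paths into a flat token stream.

    Per path: one depth token, then a namespace token and a name token for
    each element. An end marker is emitted only between paths.
    """
    tokens = []
    for i, path in enumerate(paths):
        elements = [e for e in path.split("/") if e]
        if i > 0:
            tokens.append(("END", END_MARKER))
        tokens.append(("DEPTH", str(len(elements))))
        for element in elements:
            # Elements without a prefix get an empty namespace.
            ns, _, name = element.rpartition(":")
            tokens.append(("NS", ns))
            tokens.append(("NAME", name))
    return tokens


tokens = tokenise_paths(["/cm:docs/cm:report", "/cm:archive"])
```

Putting the depth first lets a query reject paths of the wrong length before looking at any element, which is one reason a simple tokeniser might lead with it.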

Impact of renaming
Impact of restructuring
A bridge table does not make much sense
- Still have a big update problem - better to split the path in a different way ...
Could use indirection for top level hierarchies
Cannot search across two indexes that index the same docs without pulling out all docs and joining on the primary key.
Do not see a sensible way of partitioning below the store level
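
The cost of searching across two indexes can be pictured with a toy model. In the sketch below each index is a plain dictionary from a term to the set of primary keys that match it; the index contents, term strings and function names are all hypothetical and are not taken from the prototype.

```python
def search(index, term):
    """Return the primary keys of all docs matching a term in one index."""
    return index.get(term, set())


def cross_index_and(path_index, attr_index, path_term, attr_term):
    """AND two conditions held in different indexes.

    Neither index can see the other's postings, so both full match sets
    are pulled out and joined on the primary key.
    """
    path_hits = search(path_index, path_term)
    attr_hits = search(attr_index, attr_term)
    pulled = len(path_hits) + len(attr_hits)  # keys materialised for the join
    return path_hits & attr_hits, pulled


path_index = {"/cm:docs": {"n1", "n2", "n3", "n4"}}
attr_index = {"@cm:title=Q1": {"n3", "n9"}}
matches, pulled = cross_index_and(
    path_index, attr_index, "/cm:docs", "@cm:title=Q1"
)
```

Here six keys are pulled out to find a single match. With path queries that return millions of hits, the same join would have to materialise millions of keys before it could discard any of them.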

5 Million Paths on my laptop
- Returning result sets of 1 or 2 million on simple paths in 1-3 seconds

The indexing performance could be due to the very common terms in attributes and similar paths.

Different machines, all with the same Java and command line options

My laptop
- Write to in-memory index up to 20000 times (CPU limited): 2.06 ms/doc
- Merge 10 times to make 200000 (IO limited): 1.11 ms/doc
Same laptop spec + Mandrake 10.1
- Write: 5.13 ms/doc
- Merge: 0.67 ms/doc
Modo
- Write: 2.03 ms/doc
- Merge: 0.39 ms/doc
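
To put the per-document figures in perspective, the small helper below converts them into throughput and total time. The function names are hypothetical, and the total assumes that both the 2.06 ms and the 1.11 ms figures for "My laptop" are averages over all 200000 documents, which the notes do not state explicitly.

```python
def docs_per_second(ms_per_doc):
    """Convert a per-document cost in milliseconds to a throughput."""
    return 1000.0 / ms_per_doc


def seconds_for(doc_count, ms_per_doc):
    """Total time in seconds to process doc_count documents."""
    return doc_count * ms_per_doc / 1000.0


# Figures quoted above for "My laptop".
write_rate = docs_per_second(2.06)   # in-memory write phase
merge_rate = docs_per_second(1.11)   # merge phase
# Assumed: each phase touches all 200000 documents once.
total = seconds_for(200000, 2.06) + seconds_for(200000, 1.11)
```

Under that assumption the write phase accounts for about 412 seconds and the merge phase for about 222 seconds, a little over ten minutes in total for 200000 documents.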

Getting a document out of the above index

Todo: