PATH (The full path to the node. If there are multiple paths this can be repeated)
QNAME (The fully qualified name of the node)
NAME (The local name of the node, without its namespace)
ANCESTORID (A repeated field containing the IDs of all the node's ancestors, including itself)
A_PATH_NAME (The tokenised path with only names; includes [depth] at the start)
A_PATH_NS (The tokenised path with only namespaces)
A_PATH_QNAME (The tokenised path with the fully qualified node names)
REL_PATH_NAME (The tokenised path with only names)
REL_PATH_NS (The tokenised path with only namespaces)
REL_PATH_QNAME (The tokenised path with the fully qualified node names)
LEVEL (The depth of this node relative to the root - may be repeated)
WORKSPACEID (The id of the workspace)
Attributes as (Name=@ns:name Value=value and also Name=@ns: and Name=name)
Role based read access control
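As a rough illustration of how the path-related fields above might be populated for a single node, the sketch below derives them from a list of (namespace, name) path elements. The function name `build_path_fields`, the `{namespace}name` rendering of a qualified name, the space-separated tokens and the exact `[depth]` prefix format are all assumptions made for this example, not the prototype's actual code. NAME is assumed here to hold the local name. Only the absolute-path fields are shown.

```python
from typing import Dict, List, Tuple

# Hypothetical representation of one path element: (namespace, local name).
PathElement = Tuple[str, str]


def build_path_fields(elements: List[PathElement], workspace_id: str) -> Dict[str, str]:
    """Derive index field values for one node from its (non-empty) path."""
    names = [name for _, name in elements]
    namespaces = [ns for ns, _ in elements]
    qnames = ["{%s}%s" % (ns, name) for ns, name in elements]
    depth = len(elements)
    return {
        "PATH": "/" + "/".join(qnames),
        "QNAME": qnames[-1],
        "NAME": names[-1],  # assumption: local name only
        # Assumed layout: a [depth] token, then one token per path element.
        "A_PATH_NAME": " ".join(["[%d]" % depth] + names),
        "A_PATH_NS": " ".join(namespaces),
        "A_PATH_QNAME": " ".join(qnames),
        "LEVEL": str(depth),
        "WORKSPACEID": workspace_id,
    }


fields = build_path_fields([("ns1", "a"), ("ns2", "b")], "default")
for key in sorted(fields):
    print(key, "=", fields[key])
```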
Yes, these are required, as you cannot do wild-card searches inside phrase queries or at the start of a term using the standard query parser (for example, to find all namespaces).
Full-path queries are slow, and the * matching is greedy, which would also make them difficult to use.
420,000 docs with 50 attributes: ~320 MB
Attribute indexing takes about 3 ms/doc
Full-text indexing takes about 45 ms/doc
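To put the two rates side by side, the short calculation below scales them to the 420,000-document test set. It reads the "FTs" figure as full-text indexing, and it assumes both rates hold across the whole set and that every document has full-text content, neither of which the measurements above state. On those assumptions, attribute indexing would take about 21 minutes and full-text indexing about 5.25 hours.

```python
DOCS = 420_000
ATTRIBUTE_MS_PER_DOC = 3
FULL_TEXT_MS_PER_DOC = 45


def total_seconds(docs, ms_per_doc):
    """Total indexing time in seconds at a fixed per-document cost."""
    return docs * ms_per_doc / 1000.0


attribute_seconds = total_seconds(DOCS, ATTRIBUTE_MS_PER_DOC)
full_text_seconds = total_seconds(DOCS, FULL_TEXT_MS_PER_DOC)

print("attributes: %.0f min" % (attribute_seconds / 60))
print("full text:  %.2f h" % (full_text_seconds / 3600))
```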
I suggest we split indexing for full text search and indexing for attributes.
It would make sense to have an index per workspace to keep concurrent access to a minimum. This would still have to be managed, as would the behaviour at transaction boundaries and the atomicity of indexing.
We need to control Reader delete and Writer.*() operations to avoid concurrent and inconsistent operation. We should not rely on the Lucene lock mechanism.
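One way to keep reader deletes and writer operations from interleaving is to route every index mutation through an application-level lock held per workspace. The sketch below is only an illustration of that idea: `WorkspaceIndexGate` and `reindex` are hypothetical names, and the two `log.append` calls stand in for the real delete and add calls. It does not address transaction boundaries.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager


class WorkspaceIndexGate:
    """Hands out one lock per workspace so index mutations never interleave."""

    def __init__(self):
        self._registry_lock = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    @contextmanager
    def exclusive(self, workspace_id):
        # Guard the registry so two threads cannot create two different
        # locks for the same workspace.
        with self._registry_lock:
            lock = self._locks[workspace_id]
        with lock:
            yield


gate = WorkspaceIndexGate()
log = []


def reindex(workspace_id, node_id):
    """Replace a node's index entry as one uninterrupted delete-then-add."""
    with gate.exclusive(workspace_id):
        log.append(("delete", node_id))  # stands in for the reader delete
        log.append(("add", node_id))     # stands in for the writer add
```

Because the lock is keyed by workspace, updates to different workspaces can still proceed in parallel, which fits the index-per-workspace suggestion.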
As Lucene stores positional information, we should only need one index, laid out as: Namespace Name Namespace Name. Namespace information sits at odd positions and names at even positions, so depth is implied: the maximum term position gives us the depth. We could also include the depth as a separate entry. Absolute path elements could then be expressed as ANDs, followed by a sequence of 'thing at +n after the last token', 'anywhere greater than the last token' and finally 'no next token'. This should give us a powerful way to query absolute and relative paths.
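The positional scheme can be tried out without Lucene. The sketch below interleaves namespace and name tokens and implements the four building blocks named above as plain Python functions over that token list. All function names are hypothetical, matching is done on names only, and nothing here reflects how the new query element would actually be implemented.

```python
from typing import List, Optional, Tuple

Element = Tuple[str, str]  # (namespace, name)


def tokenise(elements: List[Element]) -> List[str]:
    """Interleave tokens: 1-based odd positions are namespaces, even are names."""
    tokens = []
    for ns, name in elements:
        tokens.extend([ns, name])
    return tokens


def depth(tokens: List[str]) -> int:
    """Depth is implied by the highest token position."""
    return len(tokens) // 2


def name_at(tokens: List[str], level: int) -> str:
    """Name at a 1-based level, stored at 1-based position 2 * level."""
    return tokens[2 * level - 1]


def matches_prefix(tokens: List[str], names: List[str]) -> Optional[int]:
    """Absolute path elements ANDed together; returns the last matched level."""
    if depth(tokens) < len(names):
        return None
    for level, expected in enumerate(names, start=1):
        if name_at(tokens, level) != expected:
            return None
    return len(names)


def at_offset(tokens: List[str], last: int, n: int, name: str) -> Optional[int]:
    """'Thing at +n' after the last matched level."""
    level = last + n
    if level <= depth(tokens) and name_at(tokens, level) == name:
        return level
    return None


def levels_after(tokens: List[str], last: int, name: str) -> List[int]:
    """'Anywhere greater than the last token': every level where name occurs."""
    return [lvl for lvl in range(last + 1, depth(tokens) + 1)
            if name_at(tokens, lvl) == name]


def is_last(tokens: List[str], last: int) -> bool:
    """'No next token': the last matched level is the end of the path."""
    return depth(tokens) == last


path = tokenise([("ns", "a"), ("ns", "b"), ("ns", "x"), ("ns", "c"), ("ns", "d")])
last = matches_prefix(path, ["a", "b"])
# 'd' anywhere below /a/b, with nothing after it
print(any(is_last(path, lvl) for lvl in levels_after(path, last, "d")))
```

`levels_after` returns every candidate level instead of the first one, so a later constraint such as 'no next token' can be checked against each candidate.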
Example XPath queries and Lucene translations against the prototype index
+A_PATH_NAME:'a b' +REL_PATH_NAME:'c d' +WORKSPACEID:'default'
+A_PATH_NAME:'a b' +REL_PATH_NAME:'c d' +WORKSPACEID:'default' + post filter on the results
We need to remove /a/b/c/d as it is too short. We could also add LEVEL > 5, but there are still use cases that require a filter and cannot be handled using level alone (an absolute path plus a relative path, all with wild-card elements). A path filter is something we would probably do anyway. The next proposed index structure and new query element solve this.
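A post filter of the kind mentioned above could look like the following sketch. It assumes the XPath being translated requires at least one element between the absolute prefix and the relative suffix, which is why /a/b/c/d would have to be dropped. The function name `path_post_filter` and the representation of each hit as a list of path names are hypothetical.

```python
def path_post_filter(paths, prefix, suffix, min_gap=1):
    """Keep paths that start with prefix, end with suffix (both non-empty),
    and have at least min_gap elements in between."""
    kept = []
    for names in paths:
        long_enough = len(names) >= len(prefix) + min_gap + len(suffix)
        if (long_enough
                and names[:len(prefix)] == prefix
                and names[-len(suffix):] == suffix):
            kept.append(names)
    return kept


# Hits as they might come back from the prefix-and-suffix query.
hits = [
    ["a", "b", "c", "d"],            # too short: nothing between b and c
    ["a", "b", "x", "c", "d"],
    ["a", "b", "x", "y", "c", "d"],
]
for names in path_post_filter(hits, ["a", "b"], ["c", "d"]):
    print("/" + "/".join(names))
```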