A mechanism is required to search against the property, full text, content and semi-structured data in the repository. The structural data takes two forms: the parent-child relationship between nodes, and the location within hierarchies used for categorisation.
The persistence of the data may be separate from the index used to search and locate it - for example, when indexing external content, or when separating the storage of content from other information.
The intention is to use Lucene as the index and search engine.
It allows the production of an unstructured index with potentially repeating fields. Each field in the index can optionally be:
indexed (available for search)
stored (available in the documents returned by the search)
tokenised (analysed into tokens, rather than stored and indexed as is)
Not all documents need to index the same fields. This is a good match to the extensible content model.
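As a minimal illustration of these ideas - plain Java, not the Lucene API itself, with toy field and index structures invented for the sketch - the following shows an index where documents need not share the same fields, only fields marked as indexed are searchable, and only fields marked as stored come back with a hit:

```java
import java.util.*;

// Sketch (plain Java, not Lucene) of an unstructured index where documents
// need not share the same fields, and each field can independently be
// indexed (searchable) and/or stored (returned with hits).
public class SparseIndexSketch {
    record FieldDef(String value, boolean indexed, boolean stored) {}

    // "field:term" -> doc ids: a toy inverted index over indexed fields only
    static Map<String, Set<Integer>> inverted = new HashMap<>();
    static List<Map<String, FieldDef>> docs = new ArrayList<>();

    static int addDocument(Map<String, FieldDef> fields) {
        int id = docs.size();
        docs.add(fields);
        for (var e : fields.entrySet()) {
            if (!e.getValue().indexed()) continue;
            // naive whitespace tokenisation stands in for a real analyser
            for (String token : e.getValue().value().toLowerCase().split("\\s+")) {
                inverted.computeIfAbsent(e.getKey() + ":" + token,
                                         k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    static Set<Integer> search(String field, String token) {
        return inverted.getOrDefault(field + ":" + token.toLowerCase(), Set.of());
    }

    // only stored fields are available on the returned document
    static Map<String, String> load(int id) {
        Map<String, String> out = new LinkedHashMap<>();
        docs.get(id).forEach((name, f) -> { if (f.stored()) out.put(name, f.value()); });
        return out;
    }

    public static void main(String[] args) {
        // two documents with different field sets - a sparse index
        int a = addDocument(Map.of(
            "title",   new FieldDef("Annual Report", true, true),
            "content", new FieldDef("full text of the report", true, false)));
        int b = addDocument(Map.of(
            "author",  new FieldDef("Smith", true, true)));

        System.out.println(search("content", "report"));  // hits the first doc only
        System.out.println(load(a));                      // stored fields only
        System.out.println(search("author", "smith"));    // hits the second doc
    }
}
```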
It is not clear whether we should use Lucene to store document content as well as to index it. If delayed indexing, or the non-storage of one attribute, requires properties to be obtained via the node service, then all properties will be returned.
Lucene seems an obvious choice as it resolves the following issues:
Databases have varying support for full text search
Databases have varying support for indexing foreign content for full text search; such content may have to be stored in the database.
Some databases have to store the content they index
Databases have varying support for hierarchical queries: some implementations are not performant, and some offer no support at all.
Bridge tables can be used, but there may be no triggers to support management of these tables.
It would be preferable to execute queries in one place and not have to merge result sets.
Ideally, read access permissions would be applied during data access, but they could be applied as a post filter.
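The post-filter alternative can be sketched as follows; the permission check here is a stand-in predicate over node ids invented for the example, where the real system would consult the repository's access control:

```java
import java.util.*;
import java.util.function.Predicate;

// Sketch of applying read permissions as a post filter on search results.
// The Predicate is a hypothetical stand-in for the repository's real
// permission check.
public class PostFilterSketch {
    static List<String> postFilter(List<String> hits, Predicate<String> canRead) {
        List<String> visible = new ArrayList<>();
        for (String nodeId : hits) {
            if (canRead.test(nodeId)) visible.add(nodeId);  // drop unreadable hits
        }
        return visible;
    }

    public static void main(String[] args) {
        List<String> hits = List.of("node-1", "node-2", "node-3");
        Set<String> readable = Set.of("node-1", "node-3");  // hypothetical ACL result
        System.out.println(postFilter(hits, readable::contains));  // prints [node-1, node-3]
    }
}
```

The drawback, as noted above, is that filtering after the query means result counts and paging are only known after the permission check has run over the raw hits.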
Lucene has some disadvantages:
It does not support joins
It duplicates information held in the repository; this duplication should be controlled by the data dictionary
We should tokenise each field/attribute according to its type definition. For example, paths should be treated in a special way, and we should map to the same analyser on the query side. Integers and similar numeric types need to be stored and tokenised in a form that allows lexicographic ordering, and similarly for dates. Timestamps need to be indexed as dates and treated specially in queries.
The data dictionary should control the indexing behaviour.
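The lexicographic-ordering requirement above can be sketched as follows; the exact widths and date format are assumptions for illustration, not the dictionary's actual rules (and negative integers, which would need a sign prefix or offset, are omitted):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch of encoding typed values so that plain string (lexicographic)
// comparison in the index matches the natural ordering of the type.
// Widths and formats here are assumptions, not the data dictionary's rules.
public class TypedEncodingSketch {
    // Zero-pad non-negative integers to a fixed width so "2" sorts before "10".
    // Negative values would need a sign prefix or offset; omitted here.
    static String encodeInt(long value) {
        return String.format("%019d", value);  // 19 digits covers Long.MAX_VALUE
    }

    // Encode timestamps as fixed-width date strings; lexicographic order
    // then equals chronological order.
    static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");
    static String encodeTimestamp(LocalDateTime t) {
        return t.format(TS);
    }

    public static void main(String[] args) {
        // "2" < "10" numerically, but not as raw strings; padding fixes that
        System.out.println("2".compareTo("10") < 0);                        // false
        System.out.println(encodeInt(2).compareTo(encodeInt(10)) < 0);      // true

        LocalDateTime a = LocalDateTime.of(2005, 1, 31, 23, 59, 0);
        LocalDateTime b = LocalDateTime.of(2005, 2, 1, 0, 0, 0);
        System.out.println(encodeTimestamp(a).compareTo(encodeTimestamp(b)) < 0);  // true
    }
}
```

The same principle applies to any type the dictionary declares: choose an encoding whose string order agrees with the type's order, and use the same encoding on the query side when parsing range terms.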