Review Of Full Text Search Query Syntax


This page is obsolete.

The official documentation is at: http://docs.alfresco.com



Categories: 3.0, Search
Reviewed for 3.3

Design discussion for the FTS in 3.0. Refer to Full Text Search Query Syntax for the implemented FTS.


Background


Currently we expose the default Lucene query parser syntax for full text search support.
This excludes some advanced Lucene features, such as span queries, which we would like to expose. It also ties us to the Lucene query syntax, which may not embed well if we were to expose a SQL query language. It is also additional work to upgrade, as we have our own customisations to the query parser.
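As a rough illustration of the gap, a minimal sketch assuming the Lucene 2.4-style API bundled with Alfresco 3.x and an illustrative TEXT field name: the parser handles the exposed syntax, while a span query has to be built programmatically.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class ParserVersusSpans
{
    public static void main(String[] args) throws ParseException
    {
        // What we expose today: the stock query parser syntax against a field.
        QueryParser parser = new QueryParser("TEXT", new StandardAnalyzer());
        Query parsed = parser.parse("big AND yellow AND banana");

        // What the parser syntax cannot express: a span query built in code,
        // here "yellow" within two positions of "banana", in order.
        SpanQuery[] clauses = new SpanQuery[] {
            new SpanTermQuery(new Term("TEXT", "yellow")),
            new SpanTermQuery(new Term("TEXT", "banana"))
        };
        Query span = new SpanNearQuery(clauses, 2, true);

        System.out.println(parsed);
        System.out.println(span);
    }
}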

The Data Dictionary (DD) defines indexing behaviour by binding a Lucene tokeniser to index types. So there is only basic control of indexing per property (indexing on/off, tokenisation on/off).


Possible DD extensions for indexing a property


This covers features found across many implementations, which of those features we may support, and how we might do it. In particular, what we would have to store in the index to support them.


Indexed
as it is now

Index priority
FTS importance - not really of much use, as Lucene cannot do an update and requires a delete and add. Indexing twice will be enough! We could prioritise documents to be FTS indexed in preference to others.

Pluggable indexers in addition to the core
Add an API to allow configurable index extensions. Add your own fields to the Lucene index.
Will require support to do a cascade update for the most common use case (e.g. tagging that a file is in some path)
Performance improvements and caching for path searches may work just as well
The other use case is XML metadata extraction without populating Alfresco properties

Orderable
Support for sorting, which may overlap with use as an identifier and with FTS support
Included with tokenisation

Tokenised
We should be able to support both FTS and SQL-like pattern matching. The tokenisation requirements are different, so an attribute may need indexing twice. Identifier-like indexing may also be more appropriate for ordering.
This will be a comma-separated, case-insensitive list of what is required of tokenisation:
ID, FTS, SORT
BOTH -> FTS, SORT
TRUE -> FTS (backward compatibility and default)
FALSE -> ID, SORT
ID and SORT are distinguished to separate IDs that do not support sort.
Currently we support BOTH, TRUE and FALSE.
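A minimal sketch of what "indexing twice" means at the Lucene level, assuming the 2.4-style Field API; the ".sort" suffix is illustrative only, not the actual Alfresco index layout.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class IndexTwice
{
    // Builds the index fields for one property value: a tokenised field for
    // FTS and an untokenised copy for identifier matching and sorting.
    public static Document forProperty(String propertyName, String value)
    {
        Document doc = new Document();
        doc.add(new Field(propertyName, value, Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field(propertyName + ".sort", value, Field.Store.NO, Field.Index.UN_TOKENIZED));
        return doc;
    }
}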

Boost
We do not set document or field boosts at index time. Changing a field boost would require a reindex so all documents with the field have the same boost setting.

case sensitivity
Depends on the analyzer

diacritics
Depends on the analyzer

stemming
Depends on the analyzer

thesaurus/synonyms
Depends on the analyzer

stop words
Depends on the analyzer

language/localisation
Localisation is driven by:
the locale of each value for a multi-lingual text property
the locale set on d:content types
If unset, then in order:
the locale set on the node
the locale of the user
the server locale
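A sketch of that fallback order; the method and parameter names are hypothetical and do not correspond to an existing Alfresco API.

import java.util.Locale;

public class LocaleFallback
{
    // Hypothetical helper illustrating the resolution order above.
    public static Locale resolve(Locale valueLocale, Locale contentLocale,
                                 Locale nodeLocale, Locale userLocale)
    {
        if (valueLocale != null) return valueLocale;     // multi-lingual text value
        if (contentLocale != null) return contentLocale; // locale on the d:content value
        // If unset, fall back in order: node, user, then server locale.
        if (nodeLocale != null) return nodeLocale;
        if (userLocale != null) return userLocale;
        return Locale.getDefault();
    }
}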

wildcards
Defined by the tokeniser and tokenisation properties

position/ordering
This information is held in Lucene as an offset from the previous token

window/range/distance
This information is available in the index

scope
sentence/paragraph/page/chapter
This information is not available in the index (it does not store the token type)
We would have to add this information somehow

anchoring
start/end/...
start is easy - end is not unless we add special support
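For the "start" case, Lucene already provides SpanFirstQuery; a minimal sketch, with an illustrative TEXT field name:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class StartAnchor
{
    // Matches documents where "woof" occurs within the first position of the
    // field, i.e. anchored at the start. There is no end-anchored equivalent.
    public static Query startsWithWoof()
    {
        return new SpanFirstQuery(new SpanTermQuery(new Term("TEXT", "woof")), 1);
    }
}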

cardinality (number of occurrences)
This information is held in the index (and is used for scoring)

exclusion
Documents are excluded via permissions




documents like this
We do not store term vectors. This could be an option for 'more like this'
Doing this analysis on the fly would be expensive.
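If we did store term vectors, the contrib MoreLikeThis class could build the query; a minimal sketch, assuming the Lucene 2.4 contrib API and an illustrative TEXT field:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.similar.MoreLikeThis;

public class DocumentsLikeThis
{
    // Builds a 'more like this' query from an existing document. Without
    // stored term vectors the content has to be re-analysed, which is the
    // expensive on-the-fly case mentioned above.
    public static Query likeDocument(IndexReader reader, int docNumber) throws IOException
    {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] { "TEXT" });
        mlt.setMinTermFreq(1);
        mlt.setMinDocFreq(1);
        return mlt.like(docNumber);
    }
}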




cross language support
We would have to tokenise again without stop words to make this sensible
This is a lot of extra work (we just use the tokens each tokeniser generates at the moment)
We could use the exact text rather than the token and put these through the standard analyser with no stop words. Each language would then add the words it considers meaningful in some common form without stemming etc.

Index time versus search time
We should expose this to our analyser wrappers (so synonym generation, if we had it, could be index or search side only, and not both). FTS token generation already does this in a weak way ...
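A sketch of what search-side-only expansion could look like: a hand-built disjunction rather than an analyser wrapper. The field name and synonym list are illustrative, not an existing API.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SearchTimeSynonyms
{
    // Expands one term into a disjunction over its synonyms at query time,
    // so nothing extra is written into the index at index time.
    public static Query expand(String field, String term, String... synonyms)
    {
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term(field, term)), BooleanClause.Occur.SHOULD);
        for (String synonym : synonyms)
        {
            query.add(new TermQuery(new Term(field, synonym)), BooleanClause.Occur.SHOULD);
        }
        return query;
    }
}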

Tokenisation bundle.
On a DD property specify the name of a tokenisation bundle to use
Will pick up the tokeniser it defines by locale and property type
Allows mixed tokenisation, property specific tokenisation etc

Query time options


This covers features found across many implementations, which of those features we may support, and how we might do it. In particular, what we can do at search time to support them.


boost
can be set at query time for each individual query
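A minimal sketch of a per-clause query-time boost (Lucene 2.x/3.x-style API, illustrative TEXT field), the programmatic equivalent of "yellow AND banana^4":

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryTimeBoost
{
    // Only the banana clause is boosted; the index is untouched.
    public static Query boostedBanana()
    {
        Query yellow = new TermQuery(new Term("TEXT", "yellow"));
        Query banana = new TermQuery(new Term("TEXT", "banana"));
        banana.setBoost(4.0f);

        BooleanQuery query = new BooleanQuery();
        query.add(yellow, BooleanClause.Occur.MUST);
        query.add(banana, BooleanClause.Occur.MUST);
        return query;
    }
}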

thesaurus/synonyms
dependent upon the analyzer

languages
As selected; more languages generate more language-specific tokens

position/ordering
Supported within Lucene (phrases, proximity, span)

window/range/distance
Supported within Lucene (proximity, span)
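A minimal sketch of a sloppy phrase query, the equivalent of "yellow banana"~2, with an illustrative TEXT field:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

public class Proximity
{
    // The terms must occur within two positions of each other.
    public static Query yellowNearBanana()
    {
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("TEXT", "yellow"));
        phrase.add(new Term("TEXT", "banana"));
        phrase.setSlop(2);
        return phrase;
    }
}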




scope sentence/paragraph/page/chapter
not supported

anchoring start/end/...
start is possible, end would need special index support

cardinality (number of occurrences)
Included in the scoring
We could expose as a specific part of the query language

exclusion
via ACL

documents like this
See indexer support

cross language support
See indexer support

Pluggable indexing


Support to add customer defined indexing and search behaviour

Allow additional, user defined fields in the index.

e.g. indexing of XML content via extraction based on DTD definitions (which could be done when content or metadata is indexed)

Requires node context (path etc) to be available.


FTS Syntax


Based on Google with Lucene extensions.


Google like (Part 1)


Search for a single term
banana

Search for conjunctions (the default)
big yellow banana

Search for disjunctions
big OR yellow OR banana

Search for phrases
'Boris the monkey eating a banana'

Not
yellow banana -big

+
the term is used as is
no plurals
no synonyms
no stemming or tokenisation
the word is not treated as a stop word

Synonym expansion for a term
~big yellow banana

Specify the field to search
Google advanced operators
field:term
field:phrase
TYPE:'cm:content'
direct or some other exposure of Lucene fields via property QNames etc
path, aspect support

Proximity
Google style: terms separated by one or more words
big * banana

Range
[#]..[#]

Control
order
limit/paging

Notes:

  • To support Google + we would have to index stop words but mostly ignore them at search time.
  • Google + conflicts with the Lucene use of the same token for required (AND should be sufficient)
  • - is not allowed on its own (or reports no matches)
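A sketch of how the Google-like default conjunction could be mapped onto the stock parser (Lucene 2.4-style API, illustrative TEXT field); this does not address the + / stop word issue above.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class DefaultConjunction
{
    // Bare terms are ANDed together, Google style, instead of Lucene's
    // default OR.
    public static Query parse(String userQuery) throws ParseException
    {
        QueryParser parser = new QueryParser("TEXT", new StandardAnalyzer());
        parser.setDefaultOperator(QueryParser.AND_OPERATOR);
        return parser.parse(userQuery); // e.g. "big yellow banana"
    }
}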

Lucene Extensions (Part 2)


Support AND for explicit conjunctions
big AND yellow AND banana

Wild cards for terms and within phrases

Fuzzy matches
term~

Phrase proximity
phrase~proximity

Range queries (inclusive and exclusive)
{# TO #}
[# TO #]

Query time boosts
term^boost

Not
Also include ! and NOT

grouping of query elements
general
(big OR large) AND banana
field
title:(big OR large) AND banana
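Most of the above maps directly onto the stock Lucene query parser; a minimal sketch that simply parses the syntax (Lucene 2.4-style API; the TEXT, created and title field names are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class LuceneSyntax
{
    public static void main(String[] args) throws ParseException
    {
        QueryParser parser = new QueryParser("TEXT", new StandardAnalyzer());

        Query fuzzy   = parser.parse("banana~");                         // fuzzy match
        Query slop    = parser.parse("\"yellow banana\"~3");             // phrase proximity
        Query range   = parser.parse("created:[20080101 TO 20081231]");  // inclusive range
        Query boosted = parser.parse("banana^4");                        // query time boost
        Query grouped = parser.parse("title:(big OR large) AND banana"); // grouping on a field

        System.out.println(fuzzy + "\n" + slop + "\n" + range + "\n" + boosted + "\n" + grouped);
    }
}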




Further extensions (Part 3)


Explicit spans/positions
start
woof[^]

end
woof[$]

separation
yellow banana[0..2]


Occurrences
banana{2}
banana{,2}
banana{2,}
banana{2,4}


Positions
Phrase (??)
Sentence (s)
yellow[^S] banana[S]
Yellow at the start of a sentence that also contains the word banana

Paragraph (p)




Support to specify languages, tokeniser and thesaurus to use for given terms
field:banana
field:<en_uk>banana




Notes:
End will require special support. The most common requirement is to find files based on a name-ending pattern. This can in fact be done (and is perhaps better) against the content mimetype, which is already in the index.
Positions look like a pain

Alfresco FTS


See Full_Text_Search_Query_Syntax


Alfresco FTS Query Builder


Register query languages with a query builder that generates Alfresco FTS.
The search service will allow queries with languages like 'ui', 'rm', 'opensearch', 'google', 'share'.
The query will be processed by the query builder and the appropriate definition (a minimal usage sketch follows the list below).

This definition includes:


  • Components to expose
  • Macros for term generation
    • macro expansion to complex queries
    • simple field mapping
  • constraints
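A minimal usage sketch of running a query through the search service by language name, assuming the SearchService API and the fts-alfresco language identifier from later 3.x releases; the query string is illustrative.

import org.alfresco.service.cmr.repository.NodeRef;
import org.alfresco.service.cmr.repository.StoreRef;
import org.alfresco.service.cmr.search.ResultSet;
import org.alfresco.service.cmr.search.SearchService;

public class FtsQueryExample
{
    // Runs an Alfresco FTS query through the search service and prints the
    // matching nodes.
    public static void run(SearchService searchService)
    {
        ResultSet results = searchService.query(
                StoreRef.STORE_REF_WORKSPACE_SPACESSTORE,
                SearchService.LANGUAGE_FTS_ALFRESCO,
                "banana AND TYPE:\"cm:content\"");
        try
        {
            for (int i = 0; i < results.length(); i++)
            {
                NodeRef nodeRef = results.getNodeRef(i);
                System.out.println(nodeRef);
            }
        }
        finally
        {
            results.close();
        }
    }
}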

To be resolved ...


  • support for well known namespaces (usability)
    • name does not need to be prefixed 
    • name:text  =  cm_name:text




  • support for property mappings to simple aliases
    • name -> cm_name
    • status -> my_aspect.my_property




  • system wide property mappings
    • persistable in queries
  • user mappings (which cannot be persisted in saved queries)
  • Persisted queries
    • Remove and add user preferences for field mappings (out of scope here)




  • TODO:
    • mappings and where they are defined
    • Date format handling + date functions (not included in CMIS)
    • Locale handling
    • Query constraints and functions e.g. TODAY + 2w




  • FTS vs ID
    • If both are available when to use
    • Exact match
    • FTS match
    • FTS pattern match
    • SQL pattern match




  • Embed in CMIS




  • Expose direct (not embedded)

FTS vs Embedded vs RM


  • Selector
    • Embedded -> selector, implied single selector or error
    • UI -> No selector
    • RM -> No selector
  • Fields
    • Embedded - CMIS style (cm_content:'woof')
    • UI - can use mappings to avoid namespacing
    • RM - RM mappings?
  • Field collision (see context above)
    • Embedded - fully specified - no issue
    • UI - fully specified - no issue
    • UI - 'well known' or mapped - no issue
    • UI - no prefix, matches local name in more than one namespace
      • Error
      • OR together
      • Could distinguish by case
      • Namespace search order
    • RM - as UI - specific mappings?
  • Default Field (part of context)
    • Required for RM
    • Already have this idea - contextual in some way?
  • Simple search (part of context - could have a consistent SIMPLE field)
    • A set of default fields
      • Set on the query?
  • Advanced
    • Simple + specific

Resources


http://www.google.com/support/websearch/bin/answer.py?answer=136861
http://www.blackbeltcoder.com/Articles/data/easy-full-text-search-queries