Review Of Full Text Search Query Syntax



This page is obsolete.

The official documentation is at:

Reviewed for 3.3

Design discussion for the FTS in 3.0. Refer to Full Text Search Query Syntax for the implemented FTS.


Currently we expose the default Lucene query parser syntax for full text search support.
This excludes some advanced Lucene features, such as span queries, which we would like to expose. It also ties us to the Lucene query syntax, which may not embed well if we were to expose a SQL query language. It is also additional work to upgrade, as we have our own customisations to the query parser.

The Data Dictionary (DD) defines indexing behaviour by binding a Lucene tokeniser to index types. So there is only basic control of indexing per property (indexing on/off, tokenisation on/off).

Possible DD extensions for indexing a property

This covers features found across many implementations and which of those features we may support and how we might do it. In particular, what we would have to store in the index to support this.

as it is now

Index priority
FTS importance - not really of much use, as Lucene cannot do an in-place update and requires a delete and add. Indexing twice will be enough. We could prioritise documents to be FTS indexed in preference to others.

Pluggable indexers in addition to the core
Add an API to allow configurable index extensions. Add your own fields to the Lucene index.
Will require support to do a cascade update for the most common use case (tagging that a file is in some path)
Performance improvements and caching for path searches may work just as well
The other use case is XML metadata extraction without populating alfresco properties

Orderable
Support for sorting, which may overlap with use as an identifier and with FTS support
Included with tokenisation

We should be able to support both FTS and SQL-like pattern matching. The tokenisation requirements are different, so an attribute may need indexing twice. Identifier-like indexing may also be more appropriate for ordering.
This will be a comma-separated, case-insensitive list of what is required of tokenisation.
TRUE -> FTS (backward compatibility and the default)
ID and SORT are distinguished to separate IDs that do not support sort
Currently we support BOTH, TRUE and FALSE
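As a rough illustration of the proposal above, the following sketch models the comma-separated, case-insensitive tokenisation setting and the "indexed twice" consequence. The enum names mirror the values in the notes; the class and method names are hypothetical, not part of any Alfresco API.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

public class TokenisationMode {
    // Values from the notes: TRUE (FTS, the default), FALSE, BOTH,
    // plus the proposed ID and SORT variants.
    enum Mode { TRUE, FALSE, BOTH, ID, SORT }

    // Parse the comma-separated, case-insensitive setting.
    static Set<Mode> parse(String value) {
        Set<Mode> modes = EnumSet.noneOf(Mode.class);
        for (String part : value.split(",")) {
            modes.add(Mode.valueOf(part.trim().toUpperCase(Locale.ROOT)));
        }
        return modes;
    }

    // A property needing both FTS tokenisation and identifier-like
    // indexing ends up in the index twice.
    static boolean indexedTwice(Set<Mode> modes) {
        return modes.contains(Mode.BOTH)
            || (modes.contains(Mode.TRUE)
                && (modes.contains(Mode.ID) || modes.contains(Mode.SORT)));
    }

    public static void main(String[] args) {
        System.out.println(parse("true, sort"));         // [TRUE, SORT]
        System.out.println(indexedTwice(parse("both"))); // true
    }
}
```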

We do not set document or field boosts at index time. Changing a field boost would require a reindex so all documents with the field have the same boost setting.

case sensitivity
Depends on the analyzer

Depends on the analyzer

Depends on the analyzer

Depends on the analyzer

stop words
Depends on the analyzer

localisation is driven by:
the locale of each value for a multi-lingual text property
the locale set on d:content types
the locale set on the node
the locale of the user
the server locale

Defined by the tokeniser and tokenisation properties

This information is held in Lucene as offset from the previous token
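Lucene stores each token's position as a delta from the previous token (a position increment) rather than as an absolute position. A minimal stdlib sketch of that encoding, with hypothetical method names:

```java
import java.util.ArrayList;
import java.util.List;

public class PositionIncrements {
    // Convert absolute token positions into the stored deltas.
    static List<Integer> toIncrements(List<Integer> positions) {
        List<Integer> increments = new ArrayList<>();
        int previous = 0;
        for (int pos : positions) {
            increments.add(pos - previous);
            previous = pos;
        }
        return increments;
    }

    // Recover absolute positions by summing the deltas.
    static List<Integer> toPositions(List<Integer> increments) {
        List<Integer> positions = new ArrayList<>();
        int current = 0;
        for (int inc : increments) {
            current += inc;
            positions.add(current);
        }
        return positions;
    }

    public static void main(String[] args) {
        List<Integer> positions = List.of(1, 2, 5, 6);
        System.out.println(toIncrements(positions));              // [1, 1, 3, 1]
        System.out.println(toPositions(List.of(1, 1, 3, 1)));     // [1, 2, 5, 6]
    }
}
```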

This information is available in the index

This information is not available in the index (it does not store the token type)
We would have to add this information somehow

start is easy - end is not unless we add special support

cardinality (number of occurrences)
This information is held in the index (and is used for scoring)

Documents are excluded via permissions

documents like this
We do not store term vectors. This could be an option for 'more like this'
Doing this analysis on the fly would be expensive.

cross language support
We would have to tokenise again without stop words to make this sensible
This is a lot of extra work (we just use the tokens each tokeniser generates at the moment)
We could use the exact text rather than the token and put these through the standard analyser with no stop words. Each language would then add the words it considers meaningful in some common form without stemming etc.

Index time versus search time
We should expose this to our analyser wrappers (so synonym generation, if we had it, could be index or search side only - and not both) FTS token generation already does this in a weak way ...
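One way to picture search-side-only expansion: rewrite a query term into a disjunction of synonyms at query time, leaving the index untouched. The synonym table and names below are hypothetical, purely to illustrate the index-time/search-time distinction:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SearchTimeSynonyms {
    // Hypothetical synonym table; in a real system this would be
    // supplied by the analyser configuration.
    static final Map<String, List<String>> SYNONYMS =
        Map.of("big", List.of("large", "huge"));

    // Expand a single query term into a disjunction at search time.
    static String expand(String term) {
        List<String> alternatives = new ArrayList<>();
        alternatives.add(term);
        alternatives.addAll(SYNONYMS.getOrDefault(term, List.of()));
        return alternatives.size() == 1
            ? term
            : "(" + String.join(" OR ", alternatives) + ")";
    }

    public static void main(String[] args) {
        System.out.println(expand("big"));    // (big OR large OR huge)
        System.out.println(expand("banana")); // banana
    }
}
```

Doing the same expansion at index time instead would store the extra tokens in the index; doing it on both sides would double-count, which is why the notes say "and not both".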

Tokenisation bundle.
On a DD property specify the name of a tokenisation bundle to use
Will pick up the tokeniser it defines by locale and property type
Allows mixed tokenisation, property specific tokenisation etc

Query time options

This covers features found across many implementations and which of those features we may support and how we might do it. In particular, what we can do at search time to support this.

can be set at query time for each individual query

dependent upon the analyzer

as selected - more languages -> generates more language specific tokens

Supported within Lucene (phrases, proximity, span)

Supported within Lucene (proximity, span)

scope sentence/paragraph/page/chapter
not supported

anchoring start/end/...
start is possible, end would need special index support

cardinality (number of occurrences)
Included in the scoring
We could expose as a specific part of the query language

via ACL

documents like this
See indexer support

cross language support
See indexer support

Pluggable indexing

Support to add customer defined indexing and search behaviour

Allow additional, user defined fields in the index.

e.g. indexing of XML content via extraction based on DTD definitions (which could be done when content or metadata is indexed)

Requires node context (path etc) to be available.

FTS Syntax

Based on Google with Lucene extensions.

Google like (Part 1)

Search for a single term

Search for conjunctions (the default)
big yellow banana

Search for disjunctions
big OR yellow OR banana

Search for phrases
'Boris the monkey eating a banana'

Exclude a term
yellow banana -big

Exact match for a term:
the term is used as is
no plurals
no synonyms
no stemming or tokenisation
the word is not treated as a stop word

Synonym expansion for a term
~big yellow banana

Specify the field to search
Google advanced operators
field:phrase
direct or some other exposure of Lucene fields via property QNames etc
path, aspect support

Google *: terms separated by one or more words
big * banana




  • To support Google + we would have to index stop words but mostly ignore them at search time.
  • Google + conflicts with the Lucene use of the same token for required (AND should be sufficient)
  • - is not allowed on its own (or reports no matches)
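The Google-like semantics above (default conjunction, OR for disjunction, - for exclusion) can be sketched as a toy matcher over a document's term set. This is a deliberately simplified evaluator with hypothetical names, not the proposed parser: it treats OR as the top-level split and ANDs the terms within each alternative.

```java
import java.util.Set;

public class GoogleLikeMatcher {
    // Evaluate a simplified query against a document's term set:
    // terms are ANDed by default, "OR" joins alternatives, "-term" excludes.
    static boolean matches(Set<String> docTerms, String query) {
        // The query matches if any OR-separated alternative matches.
        for (String alternative : query.split("\\bOR\\b")) {
            boolean ok = true;
            for (String token : alternative.trim().split("\\s+")) {
                if (token.isEmpty()) continue;
                if (token.startsWith("-")) {
                    if (docTerms.contains(token.substring(1))) { ok = false; break; }
                } else if (!docTerms.contains(token)) {
                    ok = false; break;
                }
            }
            if (ok) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> doc = Set.of("big", "yellow", "banana");
        System.out.println(matches(doc, "big yellow banana"));  // true
        System.out.println(matches(doc, "small OR yellow"));    // true
        System.out.println(matches(doc, "yellow banana -big")); // false
    }
}
```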

Lucene Extensions (Part 2)

Support AND for explicit conjunctions
big AND yellow AND banana

Wild cards for terms and within phrases

Fuzzy matches

Phrase proximity

Range queries (inclusive and exclusive)
[# TO #] (inclusive)
{# TO #} (exclusive)
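The inclusive/exclusive distinction comes down to whether the bounds themselves match. A minimal sketch over lexically compared string terms (Lucene range queries on untyped fields compare terms this way); the class and method names are hypothetical:

```java
public class RangeCheck {
    // [lower TO upper] when inclusive, {lower TO upper} when not.
    static boolean inRange(String term, String lower, String upper,
                           boolean inclusive) {
        int lo = term.compareTo(lower);
        int hi = term.compareTo(upper);
        return inclusive ? (lo >= 0 && hi <= 0) : (lo > 0 && hi < 0);
    }

    public static void main(String[] args) {
        System.out.println(inRange("apple", "apple", "banana", true));  // true
        System.out.println(inRange("apple", "apple", "banana", false)); // false
    }
}
```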

Query time boosts

Also include ! and NOT

grouping of query elements
(big OR large) AND banana
title:(big OR large) AND banana

Further extensions (Part 3)

Explicit spans/positions


yellow banana[0..2]


Phrase (??)
Sentence (s)
yellow[^S] banana[S]
Yellow at the start of a sentence that also contains the word banana
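One reading of the explicit span syntax above (e.g. yellow banana[0..2]) is "banana occurs between 0 and 2 words after yellow". A stdlib sketch of that check over the index's position information, under that assumed interpretation and with hypothetical names:

```java
import java.util.List;

public class SpanWithin {
    // True if 'second' occurs with between min and max intervening
    // words after 'first', e.g. yellow banana[0..2].
    static boolean spanMatch(List<String> tokens, String first, String second,
                             int min, int max) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) continue;
            for (int j = i + 1; j < tokens.size(); j++) {
                int gap = j - i - 1; // words between the two terms
                if (gap > max) break;
                if (tokens.get(j).equals(second) && gap >= min) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = List.of("yellow", "ripe", "banana");
        System.out.println(spanMatch(doc, "yellow", "banana", 0, 2)); // true
        System.out.println(spanMatch(doc, "yellow", "banana", 0, 0)); // false
    }
}
```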

Paragraph (p)

Support to specify languages, tokeniser and thesaurus to use for given terms

End will require special support. The most common requirement is to find files based on the name ending pattern. This can in fact be done (and is perhaps better) against the content mimetype which is already in the index.
Positions look like a pain

Alfresco FTS

See Full_Text_Search_Query_Syntax

Alfresco FTS Query Builder

Register query languages with a query builder that generates Alfresco FTS.
The search service will allow queries with languages like 'ui', 'rm', 'opensearch', 'google', 'share'.
The query will be processed by the query builder and the appropriate definition.

This definition includes:

  • Components to expose
  • Macros for term generation
    • macro expansion to complex queries
    • simple field mapping
  • constraints

To be resolved ...

  • support for well known namespaces (usability)
    • name does not need to be prefixed 
    • name:text  =  cm_name:text

  • support for property mappings to simple aliases
    • name -> cm_name
    • status -> my_aspect.my_property
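The alias idea above amounts to a rewrite of field prefixes before the query reaches the parser. A sketch using the example mappings from the notes (name -> cm_name, status -> my_aspect.my_property); the class and the colon-splitting heuristic are hypothetical:

```java
import java.util.Map;

public class FieldAliases {
    // Alias table taken from the examples in the notes.
    static final Map<String, String> ALIASES = Map.of(
        "name", "cm_name",
        "status", "my_aspect.my_property");

    static String resolveField(String field) {
        return ALIASES.getOrDefault(field, field);
    }

    // Rewrite "field:term" clauses in a whitespace-separated query.
    static String rewrite(String query) {
        StringBuilder out = new StringBuilder();
        for (String clause : query.split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            int colon = clause.indexOf(':');
            if (colon > 0) {
                out.append(resolveField(clause.substring(0, colon)))
                   .append(clause.substring(colon));
            } else {
                out.append(clause);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("name:report status:draft big"));
        // cm_name:report my_aspect.my_property:draft big
    }
}
```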

  • system wide property mappings
    • persistable in queries
    • user mappings (which cannot be persisted in saved queries)
  • Persisted queries
    • Remove and add user preferences for field mappings (out of scope here)

  • TODO:
    • mappings and where they are defined
    • Date format handling + date functions (not included in CMIS)
    • Locale handling
    • Query constraints and functions e.g. TODAY + 2w

  • FTS vs ID
    • If both are available when to use
    • Exact match
    • FTS match
    • FTS pattern match
    • SQL pattern match

  • Embed in CMIS

  • Expose direct (not embedded)

FTS vs Embedded vs RM

  • Selector
    • Embedded -> selector, implied single selector or error
    • UI -> No selector
    • RM -> No selector
  • Fields
    • Embedded - CMIS style (cm_content:'woof')
    • UI - can use mappings to avoid namespacing
    • RM - RM mappings?
  • Field collision (see context above)
    • Embedded - fully specified - no issue
    • UI - fully specified - no issue
    • UI - 'well known' or mapped - no issue
    • UI - no prefix, matches local name in more than one namespace
      • Error
      • OR together
      • Could distinguish by case
      • Namespace search order
    • RM - as UI - specific mappings?
  • Default Field (part of context)
    • Required for RM
    • Already have this idea - contextual in some way?
  • Simple search (part of context - could have a consistent SIMPLE field)
    • A set of default fields
      • Set on the query?
  • Advanced
    • Simple + specific
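The "set of default fields" idea for simple search can be pictured as expanding an unfielded term into a disjunction over the configured fields. The field names and class are hypothetical, only illustrating the shape of the expansion:

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleSearch {
    // Expand an unfielded term over a configurable set of default fields.
    static String expand(String term, List<String> defaultFields) {
        List<String> clauses = new ArrayList<>();
        for (String field : defaultFields) {
            clauses.add(field + ":" + term);
        }
        return "(" + String.join(" OR ", clauses) + ")";
    }

    public static void main(String[] args) {
        System.out.println(expand("banana", List.of("cm_name", "cm_content")));
        // (cm_name:banana OR cm_content:banana)
    }
}
```

"Advanced" search (simple + specific) would then AND this expansion with the explicitly fielded clauses.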