Meaning of "tokenized search"

thor_a · ‎8 Nov 2021

Hello,

Sorry if the question is too basic, but I searched for hours for an answer.

I don't understand what the docs mean by this sentence about the exact phrase search:

"The whole phrase will be tokenized"

https://docs.alfresco.com/search-services/latest/using/#:~:text=with%20these%20operators.-,Search%20...

Thanks for explaining "tokenized". I would also like to understand how this differs from the exact term search, which is not clear to me:

https://docs.alfresco.com/search-services/latest/using/#search-for-an-exact-term

Thanks for any help,

Thorsten

Search for a phrase

Phrases are enclosed in double quotes. Any embedded quotes can be escaped using `\`. If no field is specified then the default TEXT field will be used, as with searches for a single term.

The whole phrase will be tokenized before the search according to the appropriate data dictionary definition(s).

Search for an exact term

Note: =“multi term phrase” returns documents only with the exact phrase and terms in the exact order.

angelborroy · ‎9 Nov 2021

Solr uses tokenization when searching: https://solr.apache.org/guide/6_6/tokenizers.html

That means the search terms are not exactly what you type, but meaningful parts of the sentence.

When searching for "Running is a sport", the real query is expanded to "run, run_is, is, is_a, a, a_sport, sport". So you get all the results that include those tokens.

However, when using ="Running is a sport", the query returns the fields that include exactly those terms in the order specified: "Running, is, a, sport".
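To make the difference concrete, here is a toy Python sketch. It is not Solr's analyzer chain: the `analyze` function, the `TOY_STEMS` table, and the sample strings are assumptions made up for illustration. It only shows that two differently typed sentences can become identical once they are analyzed into tokens, while a comparison of the words as typed keeps them apart. It compares whole strings and makes no claim about how Alfresco handles case or word order.

```python
import re

# Hypothetical stand-in for a stemming filter. A real Solr analyzer chain
# is configured per field type and is far more elaborate.
TOY_STEMS = {"running": "run", "runs": "run", "sports": "sport"}


def analyze(text):
    """Lowercase, split on non-letters, and apply the toy stem table."""
    words = re.findall(r"[a-z]+", text.lower())
    return [TOY_STEMS.get(word, word) for word in words]


def matches_tokenized(query, document):
    """Compare the analyzed tokens of the query and the document."""
    return analyze(query) == analyze(document)


def matches_exact(query, document):
    """Compare the words as typed, with no stemming applied."""
    return query.split() == document.split()


query = "Running is a sport"
print(analyze(query))                                # ['run', 'is', 'a', 'sport']
print(matches_tokenized(query, "runs is a sport"))   # True
print(matches_exact(query, "runs is a sport"))       # False
print(matches_exact(query, "Running is a sport"))    # True
```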

Hyland Developer Evangelist

thor_a · ‎9 Nov 2021

Thank you very much for the clarification of "tokenization"!

@angelborroy wrote:
When searching for "Running is a sport", the real query is expanded to "run, run_is, is, is_a, a, a_sport, sport".

I did not find anything in the Solr 6 tokenization doc saying that "is_a" or "a_sport" also counts as a token. I expected that only the individual words are tokens, not every pair of adjacent words. (Just to be sure: the underscore in your example stands for a single space, doesn't it?)

@angelborroy wrote:
So you get all the results that include those tokens.

Does this mean that every token you mentioned has to appear in every result document, but the order of the tokens does not matter? Then documents with the following content would also be found: "Is sport a running game". And no documents would be found with this content: "Is this game a sport". Is this correct?

BTW, if this is true, I don't understand why this search is called a "phrase" search. Normally a phrase search implies a certain order. It's more like a "set search"...

@angelborroy wrote:
However, when using ="Running is a sport", the query returns the fields that include exactly those terms in the order specified: "Running, is, a, sport".

I am glad that I interpreted this syntax correctly. Can it be used in a JSON query without problems? I could not work out right away how to add the equals sign to the following syntax:

{
  "query": {
    "query":"cm:content:('*Running is a sport*')"
  }
}

IMO the equals sign does not fit well with cm:content. But perhaps I should omit cm:content and replace it with TEXT?

Thorsten

angelborroy · ‎9 Nov 2021

When using "=" with content (TEXT) fields, the match is not made against the whole field value. It also returns content that contains that sentence.

thor_a · ‎11 Nov 2021

@angelborroy wrote:
When using "=" with content (TEXT) fields, the match is not made against the whole field value. It also returns content that contains that sentence.

I am not sure I understand you. Are you referring to the wildcards in my example above?

Regarding the field type TEXT: is the following definition of TEXT correct?
TEXT virtual field (I ask because the link refers to Alfresco Search Enterprise, and I did not find any other doc.)

BTW, the syntax for an exact term search with JSON is clear now. The following works:

{
  "query": {
    "query":"=cm:content:'Running is a sport'"
  }
}
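For anyone who wants to send this body from a script, here is a minimal Python sketch using only the standard library. The host, port, and the helper name `build_search_request` are assumptions for illustration, and the path is the Search REST API endpoint as I understand it, so check it against your installation. The sketch only builds the request and prints it; nothing is sent.

```python
import json
import urllib.request

# Assumptions: host and port are placeholders for your own server.
BASE_URL = "http://localhost:8080"
SEARCH_PATH = "/alfresco/api/-default-/public/search/versions/1/search"


def build_search_request(fts_query, base_url=BASE_URL):
    """Build (but do not send) a POST request for the given FTS query."""
    body = {"query": {"query": fts_query}}
    return urllib.request.Request(
        base_url + SEARCH_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("=cm:content:'Running is a sport'")
print(request.full_url)
print(request.data.decode("utf-8"))
```

To actually run the search, add an Authorization header that suits your server and pass the request to `urllib.request.urlopen`.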

Thanks,

Thorsten

angelborroy · ‎12 Nov 2021

The uses of TEXT are described in https://docs.alfresco.com/search-services/latest/using/#search-in-fields

When using the "=" operator, you get two different behaviours:

- For non-content properties, it returns all the documents whose property has exactly that value.
- For a content (TEXT) property, it returns all the documents whose content contains the exact sentence somewhere.
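The two behaviours can be pictured with a toy Python model. This is plain string comparison, not Alfresco's implementation, and both function names and the sample values are made up for illustration.

```python
def equals_on_property(query_value, property_value):
    """Non-content property: the whole stored value must equal the query."""
    return query_value == property_value


def equals_on_content(query_sentence, content):
    """Content (TEXT): the sentence only has to occur somewhere in the content."""
    return query_sentence in content


# A property value with extra words does not match.
print(equals_on_property("Report", "Report 2021"))  # False

# Content that merely contains the sentence does match.
print(equals_on_content(
    "Running is a sport",
    "They say: Running is a sport for everyone.",
))  # True
```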
