Meaning of "tokenized search"

thor_a
Member II

Meaning of "tokenized search"

Hello,

Sorry if the question is too basic, but I have searched for hours for an answer.

I don't understand the meaning in the docs for the exact phrase search: 

"The whole phrase will be tokenized"

https://docs.alfresco.com/search-services/latest/using/#:~:text=with%20these%20operators.-,Search%20...

Thanks in advance for explaining "tokenized". I would also like to understand how this differs from the exact term search, which is not clear to me:

https://docs.alfresco.com/search-services/latest/using/#search-for-an-exact-term

Thanks for any help,

Thorsten

 

Search for a phrase

Phrases are enclosed in double quotes. Any embedded quotes can be escaped using `\`. If no field is specified then the default TEXT field will be used, as with searches for a single term.

The whole phrase will be tokenized before the search according to the appropriate data dictionary definition(s).

Search for an exact term

Note: =“multi term phrase” returns documents only with the exact phrase and terms in the exact order.

 

5 Replies
angelborroy
Alfresco Employee

Re: Meaning of "tokenized search"

SOLR is using tokenization when searching: https://solr.apache.org/guide/6_6/tokenizers.html

That means the term actually searched is not exactly what you typed, but the meaningful parts of the sentence.

When searching for "Running is a sport", the real query is expanded to "run, run_is, is, is_a, a, a_sport, sport". So you get all the results that include those tokens.

However, when using ="Running is a sport", the query returns the fields that contain exactly those terms in the specified order: "Running, is, a, sport".
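
To make the expansion above concrete, here is a toy sketch in Python. It is not Solr's real analysis chain: the one-entry stem table and the helper names `toy_tokens` and `with_shingles` are made up for illustration, and the actual tokens depend on the analyzers configured for the field. It only reproduces the token list quoted above for this one sentence.

```python
# Toy illustration only: this is NOT how Solr is implemented or configured.
STEMS = {"running": "run"}  # hypothetical one-entry stem table


def toy_tokens(text):
    """Lowercase, split on whitespace, and apply the toy stem table."""
    return [STEMS.get(word, word) for word in text.lower().split()]


def with_shingles(tokens):
    """Interleave each token with the two-word shingle that starts at it."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            out.append(tok + "_" + tokens[i + 1])
    return out


print(with_shingles(toy_tokens("Running is a sport")))
# → ['run', 'run_is', 'is', 'is_a', 'a', 'a_sport', 'sport']
```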

Hyland Developer Evangelist
thor_a
Member II

Re: Meaning of "tokenized search"

Thank you very much for clarification of "tokenization"!


@angelborroy wrote:

When searching for "Running is a sport", the real query is expanded to "run, run_is, is, is_a, a, a_sport, sport". 

I did not find anything in the Solr 6 tokenization doc saying that "is_a" or "a_sport" also counts as a token. I expected that only the individual words are tokens, not every pair of adjacent words. (Just to be sure: the underscore in your example stands for a single space, doesn't it?)


@angelborroy wrote:

So you get all the results that include those tokens.

Does this mean that every token you mentioned has to appear in every result document, but that the order of the tokens does not matter? In that case documents with the following content would also be found: 'Is sport a running game'. No documents would be found with this content: "Is this game a sport". Is this correct?

BTW, if this is true, I don't understand why this search is called a "phrase" search. Normally a phrase search implies a certain order. It's more like a "set search"...


@angelborroy wrote:

However, when using ="Running is a sport", the query returns the fields that contain exactly those terms in the specified order: "Running, is, a, sport".


I am glad that I interpreted this syntax correctly. Is it possible to use it in a JSON query without problems? I could not immediately work out how to integrate the equals sign into the following syntax:

{
  "query": {
    "query":"cm:content:('*Running is a sport*')"
  }
}

IMO the equals sign does not fit well with cm:content. But perhaps I should omit cm:content and replace it with TEXT?

Thorsten

 

angelborroy
Alfresco Employee

Re: Meaning of "tokenized search"

When using "=" with content (TEXT) fields, the match is not against the whole field value. Content that merely includes that sentence is also returned.

Hyland Developer Evangelist
thor_a
Member II

Re: Meaning of "tokenized search"


@angelborroy wrote:

When using "=" with content (TEXT) fields, the match is not against the whole field value. Content that merely includes that sentence is also returned.


I am not sure if I understand you. Do you refer to my wildcards in the example above?

Regarding the field type TEXT: Is the following definition of TEXT correct?
TEXT virtual field (I ask because the link refers to Alfresco Search Enterprise; I did not find any other doc.)

BTW The syntax for an exact term search with JSON is clear now. The following works:

{
  "query": {
    "query":"=cm:content:'Running is a sport'"
  }
}
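
For anyone who builds this request body in code, here is a minimal Python sketch of the same JSON. The helper name `build_exact_query` is made up for this example, and it does nothing more than prefix the field with "=" and wrap the phrase in single quotes, exactly as in the body above. It does not escape quotes inside the phrase and it does not send anything to the server.

```python
import json


def build_exact_query(field, phrase):
    """Build the request body for an exact search on one field (hypothetical helper)."""
    # "=" in front of the field requests the exact match discussed in this thread.
    return {"query": {"query": "={}:'{}'".format(field, phrase)}}


body = build_exact_query("cm:content", "Running is a sport")
print(json.dumps(body, indent=2))
```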

Thanks,

Thorsten

 

angelborroy
Alfresco Employee

Re: Meaning of "tokenized search"

The uses of TEXT are described in https://docs.alfresco.com/search-services/latest/using/#search-in-fields

When using "=" operator you have two different behaviours:

  • Return all the documents having the exact value for non-content properties
  • Return all the documents having the exact sentence in part of a Content (TEXT) property
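
The two behaviours can be sketched in plain Python as a toy model. This is not Alfresco or Solr code: the function name `equals_operator_matches` and the `is_content` flag are invented for the illustration, and the sketch simply compares whitespace-separated words as typed, leaving aside case handling, punctuation and whatever analysis the index really applies.

```python
def equals_operator_matches(stored, query, is_content):
    """Toy model of the two "=" behaviours (hypothetical helper)."""
    stored_words = stored.split()
    query_words = query.split()
    if not is_content:
        # Non-content property: the whole value has to be the query.
        return stored_words == query_words
    # Content (TEXT) property: the words only have to appear
    # consecutively, in order, somewhere inside the content.
    n = len(query_words)
    return any(
        stored_words[i:i + n] == query_words
        for i in range(len(stored_words) - n + 1)
    )


print(equals_operator_matches("Running is a sport", "Running is a sport", False))
print(equals_operator_matches("I think Running is a sport today", "Running is a sport", True))
print(equals_operator_matches("Is sport a running game", "Running is a sport", True))
```

The first two calls print True and the last one prints False, because in the last case the words are present but not as one consecutive sentence.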
Hyland Developer Evangelist