Using Alfresco FullText Search as Case Insensitive

cyong · ‎9 Jul 2019

Greetings, my apologies in advance as I am more of a SQL Developer, and I am not that experienced with the Alfresco API. But I have been trying to solve this, and haven't gotten very far. We are on version 5.2.2.

We have some files with words in titles like "Estimating" and "Estimate". Currently the query to the REST API looks like this when we search for the phrase 'Estim', and returns 10 records. (Simplified to just the query section since the rest is not important for the question.)

"query": {
"query": "cm:title:'*Estim*'",
"language": "afts"
}

The problem is that if they search for 'Estima', no records are returned. (While 'Estimate' returns all 10...) After doing a lot of reading with the API, I thought that the = prefix would resolve the issue (demonstrated below), but that had the downside of making the search case sensitive. (As well as when searching for the phrase 'Estimate', the files with titles of "Estimating" no longer appear.)

"query": {
"query": "=cm:title:'*Estima*'",
"language": "afts"
}

Is there anyway to correct this behavior so that the 10 records are retrieved in all 3 scenarios and that be case-insensitive? (Ideally while using afts, if possible)

afaust · ‎10 Jul 2019

One thing to keep in mind with Alfresco searches is, that unless an FTS query is suitable to be directly transformed to and executed by an SQL statement, all queries will handled by SOLR / Lucene and be subject to its tokenisation / word stemming rules, which work differently then a LIKE condition in SQL. Another thing to keep in mind is that queries may be handled quite differently depending on the locale of the search request vs. the locale of the document on which the metadata resides - the metadta on the document will be tokenised / indexed using the locale stored as part of the metadata, while the query will be tokenised using the locale (implicitly) provided in the request. As a result, some fuzzy match conditions may not match (even if they should obviously do so in the user's eyes) because the locales have slightly different tokenisation / stemming rules in SOLR / Lucene.

So, using the = prefix turns off fuzzy matching and case insensitivity, turning the term more into an SQL LIKE condition with asterisk playing the part of the percent wildcard. A SQL LIKE is also always case sensitive and the only way to turn it into an insensitive query is by using value manipulation functions (like LOWER) in the condition, which you simply cannot do in FTS. If written even more restrictive (e.g. without a leading wildcard), a query with = prefix could actually be executed directly via SQL.

The query cm:title:Estima did not find any of the results you were looking for because the word stemming / tokenisation rules for English would not allow the term "Estima" to match either "Estimate" or "Estimating".

The query cm:title:Estimate finds both 'Estimate' and 'Estimating' because the word stemming / tokenisation rules for English cause the term "Estimate" to matche words based on "Estimate" using common English word derivation rules, for which "Estima" was insufficient.

If you would use the query cm:title:Estima* it should work like you want it to, as this will add "begins-with" condition rule to the basic stemming / tokenisation evaluations.

cyong · ‎11 Jul 2019

Hmmm, I think I understand what you are saying about the word stemming and tokenisation. However I confirmed that the documents are in english (as indicated by sys:locale property), as is the search API request. Yet, when I search for cm:title:estima* (lowercase) I am unable to find the documents that start with Estima.

Reading https://github.com/Alfresco/alfresco-repository/blob/af2e069b2eabcd5433cee39d83ec06bad6fc69a0/src/ma... it looks like the 5.2 cm:title field should be tokenized?