SOLR configuration for search tokenization

venur
Established Member

Hello all

I am looking for the right documentation or steps to deal with a request we have.

We want to disable tokenization on special characters.

I tried searching this forum and the documentation but found no pointers on how to proceed. If anyone has done this or knows how to proceed, please guide me.

9 Replies
venur
Established Member

Re: SOLR configuration for search tokenization

Any guidance or pointers would be really helpful.

Hi @angelborroy @abhinavmishra14 @afaust, if you have any suggestions, please share. I am stuck right now.

angelborroy
Alfresco Employee

Re: SOLR configuration for search tokenization

Can you provide additional details on your requirement?

Hyland Developer Evangelist
abhinavmishra14
Advanced

Re: SOLR configuration for search tokenization

Hi @venur, I have not dealt with such scenarios, so I will have to check. @angelborroy may be able to provide some guidance. As Angel mentioned, please share what exactly you want to achieve so we can try the scenario.

I found a couple of links on the internet, but I am not sure whether they fit your requirement:

https://prowave.io/indexing-special-terms-using-solr

https://soft29.ru/blog/entry/alfresco-solr-enable-search-of

https://stackoverflow.com/questions/18277609/search-in-solr-with-special-characters

~Abhinav
(ACSCE, AWS SAA, GAIQ)
venur
Established Member

Re: SOLR configuration for search tokenization

Thank you for responding @angelborroy 

We are importing image and video files from a third-party DAM into the Alfresco repository. Several files have special characters in their names, and these are there on purpose for some business use cases.

Some examples of these special characters: $, -, _ and !

By default, Solr tokenizes a name whenever it contains these special characters and treats them as whitespace. I read in some documentation that this is the default behavior. But in our case we get a lot of search results when a user searches for one file name, because other files share the same prefix or suffix.
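To make the reported behavior concrete, here is a minimal sketch of what splitting on special characters does to such names. It is a simplified stand-in, not Solr's actual analyzer chain: the `tokenize` and `matches` helpers, the "all query tokens must be present" rule, and the file names are assumptions made for illustration only.

```python
import re

def tokenize(text):
    """Split on any run of non-alphanumeric characters and lowercase the pieces.

    Assumption: a crude approximation of a tokenizer that treats
    characters such as $, -, _ and ! like whitespace.
    """
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def matches(query, name):
    """A name matches when it contains every token of the query."""
    return set(tokenize(query)) <= set(tokenize(name))

# Hypothetical file names that differ only in their special characters.
names = [
    "restored$image.png",
    "restored-image.png",
    "restored!image.png",
    "other$image.png",
]

tokens = tokenize("restored$image.png")
hits = [n for n in names if matches("restored$image.png", n)]

print(tokens)  # ['restored', 'image', 'png']
print(hits)    # the first three names; only 'other$image.png' is excluded
```

Under these assumptions the first three names produce exactly the same token sequence, so quoting the query cannot tell them apart either.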

For testing, I searched for the following name to show you the results I am getting:

4A7258A0-0B5B-48B4-8785-634B5243A7D2.jpeg

The results include all the files that I don't need.
I also tried wrapping the name in double quotes (""), but the results remain the same.

Could you please guide me on how to change this default behavior?

venur
Established Member

Re: SOLR configuration for search tokenization

Thank you @abhinavmishra14, I will check those as well.

angelborroy
Alfresco Employee

Re: SOLR configuration for search tokenization

I guess you can't change that behaviour, since they are special SOLR characters.

You may try escaping those characters in your search string:

https://solr.apache.org/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-EscapingSpec...

Apart from that, I don't see any other alternative.
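As a rough sketch of the escaping suggested above: the function below prefixes each character that the Solr standard query parser treats as syntax with a backslash. The helper name `escape_solr_term` is hypothetical, and escaping single `&` and `|` characters (rather than only the `&&` and `||` operators) is a simplification. Note that `$` and `_` are not query-syntax characters, so they are left untouched.

```python
# Characters the Solr standard query parser treats as query syntax,
# plus the backslash itself. Single & and | are escaped for simplicity.
SOLR_SPECIAL = set('+-&|!(){}[]^"~*?:/\\')

def escape_solr_term(term):
    """Return the term with every query-syntax character backslash-escaped."""
    return "".join("\\" + ch if ch in SOLR_SPECIAL else ch for ch in term)

escaped = escape_solr_term("4A7258A0-0B5B-48B4-8785-634B5243A7D2.jpeg")
print(escaped)  # 4A7258A0\-0B5B\-48B4\-8785\-634B5243A7D2.jpeg
```

Escaping only affects how the query string is parsed. It does not change how the field was analyzed at index time.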

Hyland Developer Evangelist
venur
Established Member

Re: SOLR configuration for search tokenization

Hi @angelborroy, thanks for the response.

We also considered this option, but we can't rely on escaping characters now, right? Solr has already built its indexes by skipping the special characters and treating them all as whitespace. Based on what I have read so far, the index will not contain any term that includes those special characters, e.g.:

restored$image.png

Do you mean that Solr would still have one index entry for the whole name, including the special characters I mentioned? Or am I misunderstanding something?
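The concern can be illustrated with a toy inverted index. This is only a sketch under assumptions: `tokenize` and `build_index` are hypothetical helpers, the file names are made up, and a real Solr index is built by the analyzers configured for the field. It compares a tokenized field with an untokenized one that stores the whole name as a single term.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Assumed analyzer: split on non-alphanumeric characters, lowercase."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def build_index(names, analyzer):
    """Map each term produced by the analyzer to the names containing it."""
    index = defaultdict(set)
    for name in names:
        for term in analyzer(name):
            index[term].add(name)
    return dict(index)

names = ["restored$image.png", "restored-image.png"]

# Tokenized field: the special characters never reach the index.
tokenized = build_index(names, tokenize)
# Untokenized field: the whole lowercased name is kept as one term.
untokenized = build_index(names, lambda name: [name.lower()])

in_tokenized = "restored$image.png" in tokenized
in_untokenized = "restored$image.png" in untokenized

print(sorted(tokenized))   # ['image', 'png', 'restored']
print(in_tokenized)        # False
print(in_untokenized)      # True
```

In this sketch, no amount of query-side escaping can find the full name in the tokenized index, because that term was never stored there.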

 

angelborroy
Alfresco Employee

Re: SOLR configuration for search tokenization

I guess you're right. I don't see any out-of-the-box alternative for getting those results including special characters.

Hyland Developer Evangelist
venur
Established Member

Re: SOLR configuration for search tokenization

Thanks @angelborroy for the response. Yes, we know it's not possible by default, and that is what we are looking to extend.
We are aware of the default behavior and are looking for steps to change it, either from Solr or from Alfresco.

Your inputs or directions will be helpful