What file types can be indexed by Solr4

egor
Active Member II

What file types can be indexed by Solr4

Hi,

Can anyone tell me which file types can be indexed by Solr4, which is currently used in Alfresco Community 5.2? Where can we get a full list of supported file types?

Best regards,

Egor

6 Replies
afaust
Master

Re: What file types can be indexed by Solr4

Technically speaking, all files can be indexed by SOLR 4 as long as they can be represented as text. The limitation is not in SOLR 4, but in what the Repository can transform to text. E.g. by default Alfresco cannot transform video / audio to text, but community members / experts have in the past implemented addons that can transcribe spoken words in audio/video files to text, and as such you could full-text search even those types of files. For video you could also use any sort of embedded closed captioning information for indexing.

In a default Alfresco system, with LibreOffice installed and running correctly, you should expect all "typical" document file formats to be indexed, e.g. Microsoft Excel, Word, PowerPoint, Open Document formats, PDF, plain text, HTML, XML ... Also, anything that can indirectly be transformed to PDF can be indexed.

egor
Active Member II

Re: What file types can be indexed by Solr4

Dear Axel, 

Is it possible to provide a link to the full list of supported file types?

I want to know which types of CAD files Solr4 can currently index out of the box. At the company where I work, my bosses plan to deploy SharePoint, but I think that is a bad idea. I know that Alfresco is much easier to deploy, has a lot of functions, is very customizable and is based on standardized Open Source technologies. So I need strong arguments to defend my proposal to deploy Alfresco Enterprise/Community.

Thanks!

afaust
Master

Re: What file types can be indexed by Solr4

There is no (documented) full list of supported file types that I can link to, as it varies from system to system, like I mentioned.

By default, Alfresco cannot convert / transform CAD files to text for indexing, though there are many easy options to add this in a custom use case, e.g. via dwg2pdf. Alfresco cannot provide transformations for various formats out-of-the-box because the existing tools for such may be proprietary / under a conflicting license, and in some cases development of a custom transformation might be too expensive for too little economic benefit.

egor
Active Member II

Re: What file types can be indexed by Solr4

Thank you very much for the link!

There is another one: Technical Tips & Tricks: Rendering AutoCAD drawings in Alfresco 

mehe
Senior Member II

Re: What file types can be indexed by Solr4

Hi Axel (Axel Faust),

do you think the "Transformable from" section in the result of .../alfresco/service/mimetypes?mimetype=text/plain#text/plain could serve as a list of all content-indexable mimetypes?

andy1
Senior Member

Re: What file types can be indexed by Solr4

Hi

Anything that has a transformation route to text/plain can be indexed. If there are alternative routes, some may be better than others in terms of quality, time and reliability. Some transforms do not support streaming and require more memory, others have some format nuances they do not support, and occasionally you can just hit a nasty case that breaks the transformer. In any of these cases, the fact that you can add and control the transform behaviour can really help.

The index also keeps track of why any content did not get transformed - no transform, failed transform, timed out etc. So if you have an issue and fix it you can try again.

If you have found a list of all routes to text/plain, that sounds like the correct thing to me ... so long as it includes multi-stage routes.

Andy
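
To illustrate the point about multi-stage routes to text/plain, here is a minimal sketch in Python. It is not Alfresco code: the `DIRECT` table is a made-up set of single-step transformations (the DWG to PDF entry assumes a custom transformer such as the dwg2pdf option mentioned earlier has been added), and `indexable_mimetypes` is a hypothetical helper name. It walks the transformation graph backwards from text/plain, so a type reachable only via an intermediate format is still counted as indexable.

```python
from collections import deque

# Hypothetical single-step transformations: source mimetype -> target mimetypes.
# Illustrative only; a real system's list depends on the installed transformers.
DIRECT = {
    "application/msword": {"application/pdf", "text/plain"},
    "application/pdf": {"text/plain"},
    "text/html": {"text/plain"},
    "image/vnd.dwg": {"application/pdf"},  # assumes a custom DWG-to-PDF transformer
    "video/mp4": set(),                    # no route to text by default
}


def indexable_mimetypes(direct, target="text/plain"):
    """Return {mimetype: number_of_steps} for every type with a route to target."""
    # Invert the edges so we can search backwards from the target.
    reverse = {}
    for source, targets in direct.items():
        for t in targets:
            reverse.setdefault(t, set()).add(source)

    steps = {target: 0}
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for source in reverse.get(current, ()):
            if source not in steps:  # first visit is the shortest route (BFS)
                steps[source] = steps[current] + 1
                queue.append(source)
    return steps


routes = indexable_mimetypes(DIRECT)
for mimetype, hops in sorted(routes.items()):
    print(f"{mimetype}: {hops} step(s) to text/plain")
```

A list that only shows direct transformations would miss image/vnd.dwg in this example, because its only route goes through application/pdf. The sketch says nothing about which route is best in terms of quality, time or reliability; it only answers whether a route exists.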