Technically speaking, SOLR 4 can index any file as long as it can be represented as text. The limitation is not in SOLR 4, but in what the Repository can transform to text. E.g. by default Alfresco cannot transform video / audio to text, but community members / experts have in the past implemented addons that transcribe spoken words in audio/video files to text, so you could full-text search even those types of files. For video you could also use any embedded closed captioning information for indexing.
In a default Alfresco system, with LibreOffice installed and running correctly, you should expect all "typical" document file formats to be indexed, e.g. Microsoft Excel, Word, PowerPoint, Open Document formats, PDF, plain text, HTML, XML ... Anything that can indirectly be transformed to PDF can also be indexed.
Is it possible to provide a link to the full list of supported file types?
I want to know which types of CAD files Solr4 can currently index out of the box. At the company where I work now, my bosses plan to deploy SharePoint, but I think it is a bad idea. I know that Alfresco is much easier to deploy, has a lot of functions, is very customizable and is based on standardized Open Source technologies. So I need weighty arguments to defend my proposal to deploy Alfresco Enterprise/Community.
There is no (documented) full list of supported file types that I can link to, as it will vary depending on the transformations available, like I mentioned.
By default, Alfresco cannot convert / transform CAD files to text for indexing, though there are many easy options to add this in a custom use case, e.g. via dwg2pdf. Alfresco cannot provide transformations for various formats out-of-the-box because the existing tools may be proprietary / under a conflicting license, and in some cases development of a custom transformation might be too expensive for too little economic benefit.
Anything that has a transformation route to text/plain can be indexed. If there are alternative routes, some may be better than others in terms of quality, time and reliability. Some transforms do not support streaming and require more memory, others have some format nuances they do not support, and occasionally you can just hit a nasty case that breaks the transformer. In any of these cases, the fact that you can add and control the transform behaviour can really help.
The index also keeps track of why any content did not get transformed - no transform, failed transform, timed out etc. So if you have an issue and fix it you can try again.
- If you have found a list of all routes to text/plain, that sounds like the correct thing to me, so long as it includes multi-stage routes.
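To illustrate what "multi-stage routes to text/plain" means, here is a minimal Python sketch. It does not read anything from Alfresco: the `TRANSFORMERS` table, its entries and the function names `find_route` and `indexable_types` are all made up for the example. It only shows the idea that a type is indexable if some chain of single-step transformations ends in text/plain.

```python
from collections import deque

# Hypothetical single-step transformers: source MIME type -> target MIME types.
# Illustrative values only; a real installation will have a different set.
TRANSFORMERS = {
    "application/msword": ["application/pdf", "text/plain"],
    "application/vnd.oasis.opendocument.text": ["application/pdf"],
    "application/pdf": ["text/plain"],
    "text/html": ["text/plain"],
    "image/vnd.dwg": ["application/pdf"],  # assumes a DWG-to-PDF tool was added
    "video/mp4": [],                       # no transformer configured
}


def find_route(source, target="text/plain", transformers=TRANSFORMERS):
    """Return the shortest chain of MIME types from source to target, or None."""
    if source == target:
        return [source]
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        for nxt in transformers.get(path[-1], []):
            if nxt in seen:
                continue
            if nxt == target:
                return path + [nxt]
            seen.add(nxt)
            queue.append(path + [nxt])
    return None


def indexable_types(transformers=TRANSFORMERS, target="text/plain"):
    """Map every source type that can reach the target to its shortest route."""
    routes = {}
    for source in transformers:
        route = find_route(source, target, transformers)
        if route is not None:
            routes[source] = route
    return routes


if __name__ == "__main__":
    for mimetype, route in indexable_types().items():
        print(mimetype, "->", " -> ".join(route))
```

In this toy table the DWG entry is only indexable through the two-stage route via PDF, and the video entry has no route at all. The sketch picks the shortest route, which is a simplification: as said above, the best route in practice also depends on quality, time and reliability.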