Indexing XML on Alfresco 5.1.x

pcuvecle2
Active Member

Hi,

I am using Alfresco 5.1 and I have XML files to index. My XML contains elements such as:

<paragraph eId="id-00000967-2e30-ecab-ad49-685fecd94436">
   <content>
      <p>Some text</p>
   </content>
</paragraph>

I would like to discard XML attributes such as eId during indexing. At the moment, if I search for eca (a substring of the eId value), I get results.

I've seen that I could use <charFilter class="solr.HTMLStripCharFilterFactory"/> in the Solr schema.xml, but so far this has had no effect.


Does anyone know how to achieve this?

Thanks!

Accepted Solutions
pcuvecle2
Active Member

Re: Indexing XML on Alfresco 5.1.x

Answering my own question.

The issue does not actually come from the indexing but from the text extraction. It seems that the text/xml mimetype is handled by a String extractor that outputs exactly what it receives as input. As a result, the whole XML document, tags and attributes included, is passed on to indexing.

The solution was to create a custom extractor that strips out the XML syntax (similar to what is done for HTML extraction) and to use a custom application/xml mimetype to trigger it.
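
For anyone landing here later, the stripping step itself can be sketched with the JDK's SAX parser alone. This is only an illustration of the idea, not the extractor I actually deployed: the class name XmlTextStripper and its strip method are hypothetical, and the Alfresco wiring (registering the extractor and mapping it to the custom mimetype) is not shown. It keeps only the character data, so attribute values such as eId never reach the output.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: not part of Alfresco, uses only the JDK.
final class XmlTextStripper {

    private XmlTextStripper() {
    }

    // Returns the text content of the XML document, without tags or attributes.
    static String strip(String xml) {
        final StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Only character data is collected; startElement (where the
                // attributes arrive) is deliberately not overridden.
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Separate the text of adjacent elements so words do not merge.
                text.append(' ');
            }
        };

        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Enable the parser's secure processing limits for untrusted input.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }

        // Drop leading/trailing whitespace and collapse indentation and newlines.
        return text.toString().trim().replaceAll("\\s+", " ");
    }
}
```

Calling XmlTextStripper.strip on the paragraph example from my question gives just the text "Some text", so a search for eca no longer matches on the eId.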

