Retrieve content from a document in javascript

imanez1 · ‎29 Aug 2019

Hello,

I want to retrieve some information from the text of PDF files (scanned documents).

I started by using pdfsandwich OCR to extract the text from the images (the text is added to each page invisibly, "behind" the images). What I want to do now is search that text for the information I need. How can I do that? Is it done with a Lucene search? I'm new to this and don't know where to start, so an example would be a big help.

Thank you.

afaust · ‎30 Aug 2019

So, if you are considering writing scripts that run inside the Alfresco Repository application, you may want to look into the documentation of its JavaScript API, especially the part about accessing content-related attributes. But with JavaScript you will generally be limited to working with textual content files, i.e. not PDF files, which are more or less binary even when a text layer has been added.
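
To illustrate that limitation, here is a minimal sketch, assuming a repository script where `document` is a content node exposing the `mimetype` and `content` attributes. `readTextContent` is a hypothetical helper name, not part of the API, and the `text/` check is only a rough way to tell textual from binary files.

```javascript
// Sketch for a repository-side script (Rhino engine, so plain ES5 syntax).
// readTextContent is a hypothetical helper, not part of the Alfresco API.
function readTextContent(node) {
  var mimetype = String(node.mimetype || "");
  // Only text formats can be read meaningfully as a string;
  // a PDF (application/pdf) is binary and is skipped here.
  if (mimetype.indexOf("text/") !== 0) {
    return null;
  }
  return String(node.content);
}
```

In a script this would be called as `readTextContent(document)`. For a scanned PDF it returns `null`, which is why the indexed text has to be reached another way.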

BUT, if the text layer is added by OCR, Alfresco will be able to index the document using SOLR. You can then definitely use JavaScript to execute a search query against the content and find the document to process further via JavaScript. You just may not be able to search in the content of the PDF itself, only indirectly via its indexed text in SOLR.
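
As a rough sketch of such a search, assuming the script runs in the repository where the `search` root object is available and the OCR text has already been indexed: `buildContentQuery` and `findDocuments` are hypothetical helper names, and the query restricts results to `cm:content` nodes purely as an example.

```javascript
// Sketch for a repository-side script (Rhino engine, so plain ES5 syntax).
// buildContentQuery and findDocuments are hypothetical helpers.

// Build a full-text query that matches a phrase in the indexed content.
function buildContentQuery(phrase) {
  // Escape backslashes and double quotes so the phrase stays inside the quotes.
  var escaped = String(phrase).replace(/(["\\])/g, "\\$1");
  return 'TYPE:"cm:content" AND TEXT:"' + escaped + '"';
}

// searchService is expected to be the `search` root object of the script API.
function findDocuments(searchService, phrase) {
  var results = searchService.query({
    query: buildContentQuery(phrase),
    language: "fts-alfresco"
  });
  var found = [];
  for (var i = 0; i < results.length; i++) {
    // Keep only what is needed to locate each node for further processing.
    found.push({ name: results[i].name, nodeRef: String(results[i].nodeRef) });
  }
  return found;
}
```

Inside a script, `findDocuments(search, "invoice number")` would return the matching nodes' names and node references. Note that this only tells you which documents contain the phrase; it does not hand back the matched text itself.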
