I am facing one issue getting the nodeRef from bin file name for some further actions . I know the database SQL statement to find the nodeRef based on file path/name (.Bin) from content store which takes around 5 minutes to get one NodeRef.
Reason for NodeRef - I have a anti-virus software installed on content store which scan all the contents on the disk, if it finds any infected files, it will delete the files from the disk.
As per organisation policy, It must get deleted, but it will update the application/alfresco repo that this file have been deleted, I have developed the endpoint/web script which will be called from anti-virus software with *.Bin file name from content store.
Does anybody have any idea to do it in a better way or suggestions?
Deleting Alfresco files directly in the contetstore folder is very bad practice. You should handle this directory like a black box and access files thru the Alfresco API only or would you suggest to run a virus scanner directly on database files?
We developed an Alfresco module (ecm4u Antivirus) which hooks in Alfresco's IO to scan files on write and to prevent read access on content read until successful scan finished using virus scanner API (in most cases CLI). Infected nodes gets an aspect which handles quarantine function inside Alfresco and all the required processing (alarming, notifcation, batch API for cleanup, rescanning, handling false alarming).
Since you already may have deleted bin files you then have an inconsistent repository and need to fix that in an unsupported way. Please take into account that only binary content for nodes in workspace://Spacestore/ can be easily handled. For the other stores and for other binary data you may need to code low level java to get your data fixed (e.g. delete a single document version). Here is the sql we use in our consistency check tool to identify nodes in a specific workspace from bin file name. Since only the content_url_short column is indexed you need to cut the last 12 chars from the filename. Additionally we created an db index. The query shouldn't run longer than a second - even in big systems:
join alf_content_data cd ON (cd.content_url_id=cu.id)
join alf_node_properties as anp on ( anp.long_value=cd.id AND anp.qname_id = $CONTENT_QN_ID)
join alf_node an ON (anp.node_id = an.id AND an.store_id=$WORKSPACE_ID)
I try to run the SQL that you provided but having problem finding $CONTENT_QN_ID, Could you please help here.
select an.id, concat('$WORKSPACE_PROT',an.uuid) from alf_content_url cu join alf_content_data cd ON (cd.content_url_id=cu.id) join alf_node_properties as anp on ( anp.long_value=cd.id AND anp.qname_id = $CONTENT_QN_ID) join alf_node an ON (anp.node_id = an.id AND an.store_id=(select id from alf_store where protocol='workspace' AND identifier='SpacesStore')) where cu.content_url=(SELECT SUBSTRING ('91a6f583-5530-48e0-a14f-f04e1fb83dd5.bin', length('91a6f583-5530-48e0-a14f-f04e1fb83dd5.bin')-11))
But you or your AV solution should never ever directly modify/remove the binary files. Your "different business requirements" should be implemented in the mentioned policy code and/or be implemented using the Alfresco APIs.
BTW clamAV has a very bad detectionrate and you should rely on better AV scanners.