Find a NodeRef from File name

dharmendra_pan2 · ‎3 Feb 2020

I am facing one issue getting the nodeRef from bin file name for some further actions . I know the database SQL statement to find the nodeRef based on file path/name (.Bin) from content store which takes around 5 minutes to get one NodeRef.

Reason for NodeRef -
I have a anti-virus software installed on content store which scan all the contents on the disk, if it finds any infected files, it will delete the files from the disk.

As per organisation policy, It must get deleted, but it will update the application/alfresco repo that this file have been deleted, I have developed the endpoint/web script which will be called from anti-virus software with *.Bin file name from content store.

Does anybody have any idea to do it in a better way or suggestions?

-D

heiko_robert · ‎3 Feb 2020

Deleting Alfresco files directly in the contetstore folder is very bad practice. You should handle this directory like a black box and access files thru the Alfresco API only or would you suggest to run a virus scanner directly on database files?

We developed an Alfresco module (ecm4u Antivirus) which hooks in Alfresco's IO to scan files on write and to prevent read access on content read until successful scan finished using virus scanner API (in most cases CLI). Infected nodes gets an aspect which handles quarantine function inside Alfresco and all the required processing (alarming, notifcation, batch API for cleanup, rescanning, handling false alarming).

Since you already may have deleted bin files you then have an inconsistent repository and need to fix that in an unsupported way. Please take into account that only binary content for nodes in workspace://Spacestore/ can be easily handled. For the other stores and for other binary data you may need to code low level java to get your data fixed (e.g. delete a single document version). Here is the sql we use in our consistency check tool to identify nodes in a specific workspace from bin file name. Since only the content_url_short column is indexed you need to cut the last 12 chars from the filename. Additionally we created an db index. The query shouldn't run longer than a second - even in big systems:

select
    an.id, concat('$WORKSPACE_PROT',an.uuid)
from
    alf_content_url cu
    join alf_content_data cd ON (cd.content_url_id=cu.id)
    join alf_node_properties as anp on ( anp.long_value=cd.id AND anp.qname_id = $CONTENT_QN_ID)
    join alf_node an ON (anp.node_id = an.id AND an.store_id=$WORKSPACE_ID)
where cu.content_url_short='$CONTENTURL_SHORT'

Hope that helps

dharmendra_pan2 · ‎4 Feb 2020

Thanks for your quick reply and SQL,

We have already have the custom module developed for virus scan which trigger the policy on create or update of the contents.

https://github.com/Redpill-Linpro/alfresco-clamav

From policy, making call to anti-virus API to scan the contents of node and based on status you can remove/keep the noderef.

We have a different business requirements, just to empty the contents or do some more operations further.

-D

dharmendra_pan2 · ‎4 Feb 2020

Hello again,

I try to run the SQL that you provided but having problem finding $CONTENT_QN_ID, Could you please help here.

select an.id, concat('$WORKSPACE_PROT',an.uuid)
from alf_content_url cu
join alf_content_data cd ON (cd.content_url_id=cu.id)
join alf_node_properties as anp on ( anp.long_value=cd.id AND anp.qname_id = $CONTENT_QN_ID)
join alf_node an ON (anp.node_id = an.id AND an.store_id=(select id from alf_store where protocol='workspace' AND identifier='SpacesStore'))
where cu.content_url=(SELECT SUBSTRING ('91a6f583-5530-48e0-a14f-f04e1fb83dd5.bin', length('91a6f583-5530-48e0-a14f-f04e1fb83dd5.bin')-11))

heiko_robert · ‎5 Feb 2020

sorry - that part was missing:

We preselct the qname id for the content property to filter on. To speed up the query it is queried only once and is not part of the join

SQL_CONTENT_QN_ID="SELECT qn.id
FROM alf_qname qn
JOIN alf_namespace ns ON (ns.uri='http://www.alfresco.org/model/content/1.0' AND qn.ns_id=ns.id)
WHERE qn.local_name='content';"

heiko_robert · ‎5 Feb 2020

But you or your AV solution should never ever directly modify/remove the binary files. Your "different business requirements" should be implemented in the mentioned policy code and/or be implemented using the Alfresco APIs.

BTW clamAV has a very bad detection rate and you should rely on better AV scanners.