Improving the document retrieval performance?

sepgs2004 · ‎15 Oct 2018

We are using Alfresco Community 5.2. We did not do anything in addition to the configuration, indexing or whatever. It is the default repository as-is. We have a J2EE application that uses Apache Chemistry CMIS to access the documents from the repository.

We have a Site called abcd in there. In the document library of the site, we have several folders for each of our category of documents. There are about 10 categories. Under each category folder, there will be folders for each entity (that belongs to that category).

For example, there is a Person category, and underneath, there could be 1000s of folders, each representing a person entity record. Now inside each person entity folder, we store documents that belong to that specific person entity.

For a person entity, typically there will be less than 10 document objects under the person entity folder. The folder name is the person id. For example, the documents for person whose id is 234 would be in this path: /Sites/abcd/documentLibrary/Person/234/

From Share UI, the path to access the Person category folder: /Sites/abcd/documentLibrary/Person

Similarly we have a category called Bond

From Share UI, this is the path to access the category folder: /Sites/abcd/documentLibrary/Bond

Performance Issue

Now when we retrieve a person entity documents, the result comes in .02 or .03 seconds.

However, when we retrieve the a bond entity documents, the result takes 20 seconds to appear.

Above behavior (that is the time cost) is same when I access the documents through our J2EE application, and when I access it through a CMIS query in CMIS workbench.

In CMIS workbench, I tried running the provided groovy script to count the #of documents for each category.

For Person, there are about 30,000 documents.

For Bond category, the groovy script fails after counting about 100,000 objects. I am sure there are many more.

From this, one thing is sure. When we access the documents of a bond entity, we are dealing with millions of documents.

The CMIS query (that we use in our application and when I tested outside) is like this.

SELECT doc.*, slo.*

FROM cmis:document AS doc

JOIN slo:documentProperties AS SLOALIAS

ON doc.cmisbjectId = SLOALIAS.cmisbjectId

WHERE

IN_FOLDER(doc, 'workspace://SpacesStore/ed9866d4-c99e-4f3c-a12b-aa3f757d4198')

AND (slo.SLOALIAS:lo_link = 356310)

AND (slo.SLOALIAS:lo_category = 'Bond')

What is in the IN_FOLDER clause?

workspace://SpacesStore/ed9866d4-c99e-4f3c-a12b-aa3f757d4198

This is the folder id of the folder that correspond to one bond entity.

This folder will be underneath /Sites/abcd/documentLibrary/Bond/

Note: even if I remove the IN_FOLDER clause totally, it has the same 20 seconds time cost.

How can I go about increasing the performance, or debugging what is the cause of this 20 seconds time for accessing a bond entity documents?

Gnanasekaran Sakthivel

afaust · ‎16 Oct 2018

This performance issue has to do with your database and the general performance characteristics of a "transactional metadata query" (TMQ) in Alfresco. Your CMIS query is defined in such a simple way that Alfresco is able to execute it directly against the database as a TMQ. This means the performance is directly impacted on how long the underlying SQL query takes. Your CMIS query will result in a query with at least 4 joined tables (alf_node, alf_child_assoc, 2 x alf_node_properties) and a sub-select on at least one further table (alf_node_aspects). It is important to know that a TMQ will always query and retrieve all matches, even if you only need to the first X (i.e. 20) to display in a list and allow the user to paginate. This is due to how permission checks work with TMQs (as a post-filter step since they cannot be applied directly in the query).

Depending on the quality of your DB statistics, memory allotment and IO performance, a database may decide to run a TMQ as a very expensive table scan instead of an index-based query. This is often directly affected by the general data distribution.

In your case you say you may have 100.000 objects or more in the Bond category. How many other objects do you have in total in the entire system? E.g. I see that you have a simple property "slo:lo_category" that holds the name of the category ("Bond" in this case). Now having every every bond object carry that string value means the data distribution may be quite skewed (i.e. you have 5 different category values, and "Bond" is the most often used one), so the DB query planner ends up choosing a more expensive query plan for this part of the query.

What database are you using? Since you are using Community it must be either MySQL, PostgreSQL or MariaDB. If you are using PostgreSQL you can significantly optimise for such queries by applying partial indices to alf_node_properties. If you are using MySQL or MariaDB you should be looking into the Barracuda file format and using compressed tables / indices to affect cost estimates for IO.

Likely one of the most significant improvements might be gained by turning the slo:lo_category property from a d:text to a d:int type, and using it more like a enum-based constant. This would mean that, on a DB query planner level, the lookup would no longer have to consider all the other properties stored as d:text or transformed into a textual representation for storage (i.e. all d:date / d:datetime values). That way both the storage costs of the value is greatly reduced and the query cost will be reduced since the lookup no longer has to compete with all the other textual values (which tend to be the majority). Such a change though would be a massive change to your existing data architecture, so you might want to try out and exhaust all DB optimisation approaches first.

You may also want to check out my recorded talk and slides regaring TMQs, which includes quite a few slides on performance aspects as well.

sepgs2004 · ‎16 Oct 2018

I have been working towards your recommendations.

We use Community Edition 5.2. Yes it is PostgreSQL.

I have listened to most of your recorded talk.

I have never accessed the database directly.

I will see if I can find some brief details on the tables that you are referring to, and how to access them as an administrator. We have DBA in our team. I would have to work with them.

Can you elaborate a little bit on applying the partial indices to alf_node properties?

In your talk, also you are talking about two patches to Alfresco 5.1 and 5.2 that is needed to enable TMQ.

Also, you were suggesting some low level SQL queries to cut the down time on an already running server.

Where can I find more information on these patches or queries?

Meanwhile, I will update here if I find something towards solving the original issue

Gnanasekaran Sakthivel

afaust · ‎17 Oct 2018

The SQL patches / indices are documented. To apply them manually will require executing individual statements from patch 1 and 2 by hand.

Details about partial indices can be found in the PostgreSQL documentation. For the use case of transactional metadata queries you will typically want to duplicate any of the idx_alf_nprop_XX indices (using a different name), depending on the property value type, and using a condition on the qname_id and/or value. For the qname_id you would have to find out the ID for your slo:lo_category property first by executing a

select qn.id from alf_qname qn left join alf_namespace ns on ns.id = qn.ns_id where ns.url = '<modelUrl>' and qn.local_name = 'lo_category';

query. The resulting index might look something like

CREATE INDEX idx_alf_nprop_lo_category ON alf_node_properties (qname_id, string_value, node_id) WHERE qname_id = 1234;

sepgs2004 · ‎22 Oct 2018

I checked the existing indices on these tables. The SQL patches/indices 1 and 2 are all there. We have Alfresco community 5.2.

For example, I checked running the creation of the 2 of the indices in the patch 1. I get the following error

ERROR: relation "idx_alf_node_mdq" already exists SQL state: 42P07

ERROR: relation "idx_alf_nprop_d" already exists

However, I could not find anything related to applying this patch, in table alf_applied_patch

Does that mean we have these indices already in place, since we are on Alfresco community 5.2?

With these indices in place, I do not know what the following page is talking about...

Configuring an optional patch for upgrade | Alfresco Documentation

I will continue with adding indices for our own custom model's "key" properties.

Gnanasekaran Sakthivel

afaust · ‎22 Oct 2018

When you start with Alfresco 5.2 (i.e. this was the version when you installed Alfresco), then you already start with these indices. Only if one has upgraded from previous versions do these patches need to be applied.

sepgs2004 · ‎22 Oct 2018

If these indices are in place, the original performance issue that one of our client test site is experiencing can be improved by adding indices specific to our custom model. This is because our access queries are having these properties in the where clause of the CMIS query. Am I right?

Also, how would I know or do explain plan on the SQL for the CMIS SQL that I wrote?

Or in other words, can I see the Postgresql query that correspond to my CMIS query?

Thx Axel.

Gnanasekaran Sakthivel

afaust · ‎23 Oct 2018

If you check on my recorded talk / slides, I mentioned that you would need to perform SQL query logging in Alfresco to collect the generated SQL queries and then be able to use these to get/explain the query plan of the database. I use tools such as p6spy for this, and pipe that logging via SLF4J into a regular Alfresco Log4J logger.

sepgs2004 · ‎25 Oct 2018

From the guidance from another thread, I got the following in alfresco.log, i confirmed that our CMIS query is going to the database, as you mentioned before.

This is the log information (alfresco.log)

2018-10-25 10:25:12,062 DEBUG [org.alfresco.repo.search.impl.solr.DbOrIndexSwitchingQueryLanguage] [http-apr-8180-exec-10]

Trying db query for db-cmis for SearchParameters [

language=cmis-alfresco,

query=SELECT doc.*, SALO_ALIAS.* FROM cmis:document AS doc JOIN salo:documentProperties AS SALO_ALIAS ON doc.cmisbjectId = SALO_ALIAS.cmisbjectId WHERE IN_FOLDER(doc, 'workspace://SpacesStore/9315813f-ac89-4b05-9281-ce04eda1cf7d') AND (SALO_ALIAS.salo:lo_link = 1965) AND (SALO_ALIAS.salo:lo_category = 'Lic'),

...

2018-10-25 10:25:12,070 DEBUG [org.alfresco.repo.search.impl.solr.DbOrIndexSwitchingQueryLanguage] [http-apr-8180-exec-10] DB returned 7 results in 8ms

The above log (alfresco.log) is not the production log where exactly the performance issue is happening. But it is the same CMIS query as shown above except there will be a different node id and link value.

However, I would like to locate the postgresql query that correspond to the above CMIS query. This is to add the indices necessary to improve the performance.

I believe I found the query...

Is this the only query that correspond to the CMIS query?

2018-10-25 10:25:12 EDT [32016][5bd1d23c.7d10]: user=alfresco,app=[unknown],client=127.0.0.1 LOG: execute S_2/C_3:

select
node.id as id
from
alf_node node
where
node.type_qname_id <> $1
AND node.store_id = $2
AND node.id IN (
select aspect.node_id
from alf_node_aspects aspect
where aspect.qname_id IN ( $3 )
)
AND
(
node.type_qname_id IN ( $4 , $5 , $6 , $7 , $8 , $9 , $10 , $11 , $12 , $13 )
)
AND
(
node.id IN (
select PARENT.child_node_id from alf_child_assoc PARENT where
PARENT.parent_node_id = $14
)
AND
node.id IN (
select PROP.node_id from alf_node_properties PROP where ($15 = PROP.qname_id) AND
PROP.long_value = $16
)
AND
node.id IN (
select PROP.node_id from alf_node_properties PROP where ($17 = PROP.qname_id) AND
PROP.string_value = $18
)
)

2018-10-25 10:25:12 EDT [32016][5bd1d23c.7d10]: user=alfresco,app=[unknown],client=127.0.0.1
DETAIL: parameters: $1 = '149', $2 = '6', $3 = '243', $4 = '213', $5 = '206', $6 = '51', $7 = '318', $8 = '299', $9 = '199', $10 = '200', $11 = '242', $12 = '104', $13 = '308', $14 = '5667', $15 = '248', $16 = '1965', $17 = '246', $18 = 'Lic'

Gnanasekaran Sakthivel

afaust · ‎25 Oct 2018

This is the only query that corresponds to the core CMIS query. You may see additional SQLs for actually accessing the propertries / aspects of the results for rendering the CMIS response, but those are typically not that performance critical. As far as I can see, that SQL query matches the CMIS query perfectly, though qname_id and parent_node_id will be different between different systems. Again, to get a specific query from the production system, the easiest option may be to use p6spy. Though this blog post is a bit sparse on details, it is a good place to start.

Depending on the composition of your CMIS query, the SQL query can change quite a bit, i.e. additional joins / property conditions typically result in additional table joins...