poor TMQ peformances vs SQL

cancel
Showing results for 
Search instead for 
Did you mean: 
Highlighted
Established Member

poor TMQ peformances vs SQL

Hi all,

I'm working with TMQ on a large repo (about 10 millions of docs) on a community 5.1.g.
When running a batch of queries using SearchService API (with transactional queryConsistency), at a regular interval the response time is beyond 3 mn (sometimes 10mn or higher!).
All other queries return almost immediately.
All queries are strictly identical, just parameters (property values) change.
We tried to run SearchService queries starting each time a new RO transactions (retryingtransactionhelper) without any positive effect.

This behavior is easily reproductible, even with no other activity on the repo (just run a batch of queries, and every 5-6 queries one is 'stuck' during minutes).
We can see the long duration SELECT query in pg_stat_activity in postgres DB.

Tests done:
- run exactly the same batch of queries directly in SQL using psql client -> performances are very good. (SQL SELECT produced by SearchService have been captured and rerun with psql script)
- write an alf webscript that connect directly in SQL (instead of using SearchService), and run exactly the same batch of queries -> performances are very good.

It means that when using SearchService, SQL queries context have something that limit in anyway the query performances...

We did the basics maintenance operations on DB (index, vacuum, analyse...), and the native SQL query performances means that the DB itself is OK.

We checked this link, but I'm still out of ideas
http://docs.alfresco.com/community/concepts/db-config-properties.html

Any advice ?
Thanks,
Vincent

16 Replies
Highlighted
Master

Re: poor TMQ peformances vs SQL

Did you check my talk about TMQs at the last BeeCon? (Slides, video)

When you checked the performance of the native query, did you check the performance of executing the prepared statements or did you just execute the query with placeholders already replaced by the specific values?

Did those queries include any queries that are related to the post-search ACL checking that is done on the search results?

Did you also do a JVM sampling session to see if there may be inefficiencies on the Java-side? There could be overhead due to a custom dynamic authority or GC-overhead at work (the latter especially considering that the issue appears every 5-6 queries only, while I assume all queries are identical / similar).

Are you executing (potentially paginated) searches against reasonably sized (expected) result sets? Querying very large results sets (1000+ results on the SQL layer before permission checking) is quite inefficient due to the overheads of permission checking.

Is the duration of searches in Alfresco including any custom processing, e.g. rendering a response template, which may require that additional data (aspects, properties...) are loaded from the DB as needed?

I also wonder about the general configuration of Alfresco and PostgreSQL. Are cache sizes configured in a way that they can deal with the amount of search results expected? Does the DB server have sufficient memory / correct relative IO cost estimates to avoid resorting to table scans in case queries are executed on very skewed data sets?

10 million documents is not a high number (I have worked with such an amount during tests on my local laptop in some past projects) and as long as the data set / queries are reasonable and the configuration is fine, there should not be TMQs running for multiple minutes.

Highlighted
Established Member

Re: poor TMQ peformances vs SQL

Axel,

Thanks for your response.
Yes I read your (valuable) information about TMQ.
We already had some exchange about this topic (https://community.alfresco.com/thread/231968-tmq-indexes-and-performances).

No, any prepared statements or post-search ACL checking have not been included in the test using native SQL queries.
ResultSet are small (10-100 results).
Yes, the response template rendering is included in the test (the same webScript runs either native SQL query or TMQ query).

We dismiss any JVM configuration problem as we use the same platform (JVM/Alfresco/Postgres) for all tests.
Am'I wrong ?

We clearly see the queries on PostgreSQL spending minutes when run from TMQ queries (every 5-6 queries), but not when using a direct sql cnx.
There must be misconfiguration of jdbc driver on alfresco side ? Don't you think so ?

Here's the code use for testing. The SQL queries produced by both tests have been separately tested with comparable performances

 This works fine:

   Class.forName(this.dbDriver); //org.postgresql.Driver
   Connection connection = null;
   connection = DriverManager.getConnection(this.dbUrl, this.dbUsername, this.dbPassword);//"jdbcSmiley Tongueostgresql://localhost:5432/alfresco", "alfresco", "****");
   if (connection != null){

   // Get alf_qname ids for QNames (SQL query)
   Map<QName, Integer> qNameSQLIds = getQNameSQLIds(connection, new ArrayList (parameterValues.keySet()));
   
   // build SQL query
   String queryString = new StringBuilder()
   .append("-- select id, local_name from alf_qname limit 10;\n")
   .append("select node.uuid\n")
   .append("from \n")
   .append("alf_node node\n")
    .append("inner join alf_node_properties nodeProp1 on node.id = nodeProp1.node_id  -- joint node to property 1\n")
    .append("inner join alf_node_properties nodeProp2 on node.id = nodeProp2.node_id  -- joint node to property 2\n")
    .append("inner join alf_node_properties nodeProp3 on node.id = nodeProp3.node_id  -- joint node to property 3\n")
    .append("inner join alf_node_properties nodeProp4 on node.id = nodeProp4.node_id  -- joint node to property 4\n")
   .append("where \n")
    .append("node.store_id = 6 AND -- spaceStore\n")
    .append("node.type_qname_id = 51 -- type content\n").toString();
   Integer paramIndex = 0;
   for (Entry<QName,Integer>  qNameSQLId : qNameSQLIds.entrySet()){
    paramIndex++;
    queryString = queryString
      +("AND nodeProp" + paramIndex + ".qname_id = '" + qNameSQLId.getValue() + "' -- select id from alf_qname where local_name = '" + qNameSQLId.getKey().getLocalName() + "';\n")
      +("AND nodeProp" + paramIndex + ".string_value = '" + parameterValues.get(qNameSQLId.getKey()) + "'\n");
   }
   queryString = queryString
   + ("order by\n")
   + ("node.audit_modified DESC;\n").toString();
   
   // Run Query
   Statement statement = connection.createStatement();
   ResultSet rs = statement.executeQuery(queryString);
    
   nodeList = new ArrayList<NodeRef>();
   while ( rs.next() ) {
    nodeList.add(new NodeRef("workspace://SpacesStore/" + rs.getString("uuid")));
   }
   // -> send nodeList to ws template
       
 This get stuck every 5-6 queries
 
   RetryingTransactionCallback<List<NodeRef>> callback = new RetryingTransactionCallback<List<NodeRef>>() {
    @Override
    public List<NodeRef> execute() throws Throwable {    
            
     SearchParameters sp = new SearchParameters();
     sp.setLanguage(searchLanguage);
     sp.addStore(new StoreRef("workspace://SpacesStore"));
     sp.setQuery(query);
     sp.setMaxItems(maxItems);
     sp.setQueryConsistency(queryConsistency);
     sp.setSkipCount(skipCount);
     sp.excludeDataInTheCurrentTransaction(false);//should be the default
     SortDefinition sortDefinition = new SortDefinition(
       SearchParameters.SortDefinition.SortType.FIELD, "@" + sortField, sortAscending);
     sp.addSort(sortDefinition);
     logger.debug(" search - query: " + sp.getQuery());
     resultSet = this.services.getSearchService().query(sp);
     logger.debug("Results found: " + resultSet.getNumberFound());
     results = resultSet.getNodeRefs();
    }
   };
   // run transaction
   RetryingTransactionHelper txnHelper = this.services.getRetryingTransactionHelper();
   return txnHelper.doInTransaction(callback, true, true); // read-only trx, new transaction required
   
   // -> send returned nodeList to ws Template
 

Highlighted
Master

Re: poor TMQ peformances vs SQL

Right, thanks for reminding me about that exchange - it is hard sometimes to track the many conversations.

One thing I am wondering: Was the SQL query you are using really obtained from postgresql logs? Because the query looks different in some ways from what Alfresco would generate for TMQ.

  • the JOIN on alf_node_properties would include the property QName ID in the ON condition
  • the JOIN is never an explicit INNER JOIN and may even be a LEFT OUTER one
  • you are missing the "not equals" check on node.type_qname_id for the "deleted" type
  • the TYPE condition in FTS is translated into a "node.type_qname_ID IN (list-of-IDs)"-style check including the requested type and its sub-types

You can check the query template in GitHub, and it is used for all DBs with no dialect-specific variants.

If you were to perform some JDBC logging, i.e. via P6Spy, you should obtain the real, proper query as requested from Alfresco. I always use that as a reliable source, regardless of the DB used at a customers (there are some Enterprise-y DBs that are a huge pain in the butt for performance analysis tasks).

Your comparison of direct SQL vs. TMQ FTS is ignoring the additional work that Alfresco is doing for TMQs. This is what I meant with post-search ACL checking. Your direct SQL simply assumes all nodes are accessible by the current user, while the FTS query will always perform permission checks unless disabled / short-circuited. Granted, when your webscript template processes the results from the native SQL query, those permission checks will be performed "on-access", but you are still skewing the data by avoiding the first batch-check operation.

Note: excludeDataInTheCurrentTransaction is a very old option and only relevant AFAIK for legacy Lucene queries. It is not used for SOLR or TMQ.

Highlighted
Established Member

Re: poor TMQ peformances vs SQL

Axel,

Thks for your note (excludeDataInTheCurrentTransaction) I'll update my code.
You're absolutely right, the query has been modified in the code (but first tests have been done using the original captured query as added at the bottom of this post).
We excluded from the possible cause the additional work that Alfresco is doing for TMQs (ACL checking and do on) because we see clearly that the time is spend on the DB query: each time a TMQ query is running during minutes, the
corresponding SQL query is running during the same time on PostgresSQL.
The SQL query used to identify long duration queries on PostgresSQL:
 select (now() - xact_start), query, * from pg_stat_activity
  order by 1 desc;

Something important I ommit to say:
When a query is stuck during long time, we open a new browser session and run exactly the same query (using TMQ): the response is sent back immediatly.

It means that is some situation (linked to the user session context ?), PostgresSQL is waiting during minutes before returning query results (waiting for resources allocation ?).
And this behavior is always reproductible using TMQ in our environment.
When running direct SQL (/!\ with openning a new connection /!\), this neven appends.

The DB connection number is never rising higher than the limit.

Original SQL query captured from TMQ query (BTW is this query produced by hibernate ?)
 select
  node.id as id
 from
  alf_node node
 where
  node.type_qname_id <> $1
  AND node.store_id = $2
  AND
  (
   node.id IN (select aspect.node_id from alf_node_aspects aspect  where aspect.qname_id IN  (  $3 ) )
   AND node.id IN ( select PROP.node_id from alf_node_properties PROP where ($4 = PROP.qname_id) AND PROP.string_value = $5 )
   AND node.id IN ( select PROP.node_id from alf_node_properties PROP where ($6 = PROP.qname_id) AND PROP.string_value = $7 )
   AND node.id IN ( select PROP.node_id from alf_node_properties PROP where ($8 = PROP.qname_id) AND PROP.string_value = $9 )
   AND node.id IN ( select PROP.node_id from alf_node_properties PROP where ($10 = PROP.qname_id) AND PROP.string_value = $11)
  )
 order by
 node.audit_modified DESC",

Any suggestion are welcomes, I'll put here any interesting test results.
Vincent

Highlighted
Established Member

Re: poor TMQ peformances vs SQL

Example of PostgreSQL long duration queries:

Highlighted
Master

Re: poor TMQ peformances vs SQL

Thanks for the update. Did you run EXPLAIN on that prepared statement to check the execution plan for it - ideally before/after each of your test runs to see if the query optimiser - for some temporary reason - chooses a different query plan? I have in some instances observed an intermittent spike (not as regular as yours, but reproducible) due to changing "data-in-memory" state and cost estimations, which would tip a query that is already close to a cost limit for a table scan over the edge every so often. Most of these issues could be solved by either increasing statistics granularity for the alf_node_properties table (specifically node_id, qname_id and string_value columns) or using partial, composite indices for very important properties (using qname_id as selector, spanning columns node_id, qname_id, and string_value).

The queries in Alfresco are not generated by Hibernate. Alfresco used to use Hibernate back in 3.x, but is now using iBATIS as the OOM framework, as it provides way more granular control to optimise for the various databases that Alfresco supports. Hibernate is still included as a library, but it is relegated only to deal with dialect detection and some transaction management. If you are still running the jBPMN workflow engine, then that part will still use Hibernate.

Highlighted
Established Member

Re: poor TMQ peformances vs SQL

Hi Axel,
I observed strange things :-)
On server side (PostgreSQL server), log have been enabled with following options:
 log_checkpoints = on
 log_connections = on
 log_disconnections = on
 log_duration = on
 log_lock_waits = on
 log_temp_files = 0
 log_statement = 'all'
We can see that the query executes very quickly (< 1s), but the next one starts to execute 5mn later.
No clue of what appends during this period, but the first query appears to be running in pg_stat_activity (waiting = false) during 5mn, no lock on DB, and consume 100% of one CPU thread.

On client side (alfresco webapp), log have been captured using p6spy, and we cleary see that some queries are executing during 5mn or more.

Modifying the default value 'db.pool.abandoned.time=300' to 180 seems to have an effect (execution time fall to 3 mn).

I suspect some kind of congestion on DB server side (temp files, cache buffer..) but I still can't find the root cause of this issue.

I'll try to add following params to server log (as per your advise to check query plan) and see

 debug_print_parse = on
 debug_print_rewritten = on
 debug_print_plan = on
 debug_pretty_print = on

Any advice are welcomed (especially on how to monitor postgreSQL resources usage like temp files, memory...)

I'll post any interesting result.
Vincent

Highlighted
Established Member

Re: poor TMQ peformances vs SQL

Hi,

Pure PostgreSQL issue, moved here: https://stackoverflow.com/questions/48077136/postgresql-queries-randomly-takes-minutes-to-execute

Axel, I think you're right about the query optimiser that "for some temporary reason chooses a different query plan".

Strange thing: it does not appear when executing direct SQL (outside alfresco transaction).

I'll post here any interesting feedback.

Vincent

Highlighted
Established Member

Re: poor TMQ peformances vs SQL

I finally did the following test:
- Execute a search service sample query and capture SQL queries (from client side using P6SPY).
 The set of query captured are added below.

- In a batch file, execute this sample query 10 times (using curl to call the repo webservice)
 -> reproduce the problem described above: a query is stuck during minutes
- Execute 10 times the set of SQL queries using pgsql
 -> problem can never be reproduced

It lead me to think that the issue is not on server side.
BTW I followed Axel's advises about statistics granularity and noticed a real gain of performance, but issue remains.

Any idea of where to look at on application side ? (jdbc driver, spring transaction manager....?)

Thanks.

SQL queries captured from the sample search:

    <code>

BEGIN;
SELECT 1;
select * from alf_qname where ns_id = 6 and local_name = 'modified';
select node.id as id from alf_node node where node.type_qname_id <> 149 AND node.store_id = 6 AND ( node.id IN (select aspect.node_id from alf_node_aspects aspect where aspect.qname_id IN ( 260 ) ) AND node.id IN ( select PROP.node_id from alf_node_properties PROP where (249 = PROP.qname_id) AND PROP.string_value = 'Mandats' ) AND node.id IN ( select PROP.node_id from alf_node_properties PROP where (245 = PROP.qname_id) AND PROP.string_value = '1' ) AND node.id IN ( select PROP.node_id from alf_node_properties PROP where (247 = PROP.qname_id) AND PROP.string_value = '939473' ) AND node.id IN ( select PROP.node_id from alf_node_properties PROP where (248 = PROP.qname_id) AND PROP.string_value = 'AG6091' ) ) order by node.audit_modified DESC;
select node.id as id, node.version as version, store.id as store_id, store.protocol as protocol, store.identifier as identifier, node.uuid as uuid, node.type_qname_id as type_qname_id, node.locale_id as locale_id, node.acl_id as acl_id, txn.id as txn_id, txn.change_txn_id as txn_change_id, node.audit_creator as audit_creator, node.audit_created as audit_created, node.audit_modifier as audit_modifier, node.audit_modified as audit_modified, node.audit_accessed as audit_accessed from alf_node node join alf_store store on (store.id = node.store_id) join alf_transaction txn on (txn.id = node.transaction_id) where node.id = 1956578;
select node.id as id, node.version as version, store.id as store_id, store.protocol as protocol, store.identifier as identifier, node.uuid as uuid, node.type_qname_id as type_qname_id, node.locale_id as locale_id, node.acl_id as acl_id, txn.id as txn_id, txn.change_txn_id as txn_change_id, node.audit_creator as audit_creator, node.audit_created as audit_created, node.audit_modifier as audit_modifier, node.audit_modified as audit_modified, node.audit_accessed as audit_accessed from alf_node node join alf_store store on (store.id = node.store_id) join alf_transaction txn on (txn.id = node.transaction_id) where node.id = 26982441;
select * from alf_access_control_list where id = 76;
select ace.allowed, ace.applies, authority.authority, ace.permission_id, aclmem.pos, aclmem.id as aclmem_id from alf_acl_member aclmem join alf_access_control_list acl on (aclmem.acl_id = acl.id) join alf_access_control_entry ace on (aclmem.ace_id = ace.id) join alf_authority authority on (ace.authority_id = authority.id) where acl.id = 76;
select * from alf_permission where id = 8;
select * from alf_permission where id = 9;
select * from alf_permission where id = 10;
select * from alf_permission where id = 11;
select assoc.id as id, parentNode.id as parentNodeId, parentNode.version as parentNodeVersion, parentStore.protocol as parentNodeProtocol, parentStore.identifier as parentNodeIdentifier, parentNode.uuid as parentNodeUuid, childNode.id as childNodeId, childNode.version as childNodeVersion, childStore.protocol as childNodeProtocol, childStore.identifier as childNodeIdentifier, childNode.uuid as childNodeUuid, assoc.type_qname_id as type_qname_id, assoc.child_node_name_crc as child_node_name_crc, assoc.child_node_name as child_node_name, assoc.qname_ns_id as qname_ns_id, assoc.qname_localname as qname_localname, assoc.is_primary as is_primary, assoc.assoc_index as assoc_index from alf_child_assoc assoc join alf_node parentNode on (parentNode.id = assoc.parent_node_id) join alf_store parentStore on (parentStore.id = parentNode.store_id) join alf_node childNode on (childNode.id = assoc.child_node_id) join alf_store childStore on (childStore.id = childNode.store_id) where parentNode.id = 477 and assoc.type_qname_id = 4 and assoc.qname_crc = 2226085765 and assoc.qname_ns_id = 6 and assoc.qname_localname = 'GROUP_site_mandats_SiteCollaborator' order by assoc.assoc_index ASC, assoc.id ASC;
select '' as child_name, parentName.string_value as parent_name from alf_child_assoc ca join alf_node childNode on ((childNode.id = ca.child_node_id) and (childNode.store_id = 6)) join alf_node parentNode on ((parentNode.id = ca.parent_node_id) and (parentNode.type_qname_id = 89) and (parentNode.store_id = 6)) join alf_node_properties parentName on (parentName.node_id = parentNode.id and parentName.qname_id = 90) where ca.type_qname_id = 93 and childNode.id = 32;
select assoc.id as id, parentNode.id as parentNodeId, parentNode.version as parentNodeVersion, parentStore.protocol as parentNodeProtocol, parentStore.identifier as parentNodeIdentifier, parentNode.uuid as parentNodeUuid, childNode.id as childNodeId, childNode.version as childNodeVersion, childStore.protocol as childNodeProtocol, childStore.identifier as childNodeIdentifier, childNode.uuid as childNodeUuid, assoc.type_qname_id as type_qname_id, assoc.child_node_name_crc as child_node_name_crc, assoc.child_node_name as child_node_name, assoc.qname_ns_id as qname_ns_id, assoc.qname_localname as qname_localname, assoc.is_primary as is_primary, assoc.assoc_index as assoc_index from alf_child_assoc assoc join alf_node parentNode on (parentNode.id = assoc.parent_node_id) join alf_store parentStore on (parentStore.id = parentNode.store_id) join alf_node childNode on (childNode.id = assoc.child_node_id) join alf_store childStore on (childStore.id = childNode.store_id) where parentNode.id = 477 and assoc.type_qname_id = 4 and assoc.qname_crc = 1767095639 and assoc.qname_ns_id = 6 and assoc.qname_localname = 'GROUP_site_mandats_SiteManager' order by assoc.assoc_index ASC, assoc.id ASC;
select node.id as id, node.version as version, store.id as store_id, store.protocol as protocol, store.identifier as identifier, node.uuid as uuid, node.type_qname_id as type_qname_id, node.locale_id as locale_id, node.acl_id as acl_id, txn.id as txn_id, txn.change_txn_id as txn_change_id, node.audit_creator as audit_creator, node.audit_created as audit_created, node.audit_modifier as audit_modifier, node.audit_modified as audit_modified, node.audit_accessed as audit_accessed from alf_node node join alf_store store on (store.id = node.store_id) join alf_transaction txn on (txn.id = node.transaction_id) where node.id = 26978488;
select node.id as id, node.version as version, store.id as store_id, store.protocol as protocol, store.identifier as identifier, node.uuid as uuid, node.type_qname_id as type_qname_id, node.locale_id as locale_id, node.acl_id as acl_id, txn.id as txn_id, txn.change_txn_id as txn_change_id, node.audit_creator as audit_creator, node.audit_created as audit_created, node.audit_modifier as audit_modifier, node.audit_modified as audit_modified, node.audit_accessed as audit_accessed from alf_node node join alf_store store on (store.id = node.store_id) join alf_transaction txn on (txn.id = node.transaction_id) where node.id = 26978486;
...
select node.id as id, node.version as version, store.id as store_id, store.protocol as protocol, store.identifier as identifier, node.uuid as uuid, node.type_qname_id as type_qname_id, node.locale_id as locale_id, node.acl_id as acl_id, txn.id as txn_id, txn.change_txn_id as txn_change_id, node.audit_creator as audit_creator, node.audit_created as audit_created, node.audit_modifier as audit_modifier, node.audit_modified as audit_modified, node.audit_accessed as audit_accessed from alf_node node join alf_store store on (store.id = node.store_id) join alf_transaction txn on (txn.id = node.transaction_id) where node.id = 1956574;
COMMIT;

</code>