Thread/Transaction-safe action rules for massive concurrency

cancel
Showing results for 
Search instead for 
Did you mean: 
iru
Active Member II

Thread/Transaction-safe action rules for massive concurrency

Hi there,

Got a task that seems easy but is getting us lots of headaches.
Functionally speaking, what we need is to create a custom rule in a folder, that given a file that is being uploaded, moves it into another folder, based on its creation date.

  • File enters: Inbox
  • File is moved to:  Folderized / yyyy / MM / dd / ... (in production up to minutes, for example)


Rule must implement two subtasks

  1. create the folderized structure if not existing
  2. move the file into the folderized structure

____


Now, because this must work thread-safely, on a concurrent batch of file upload, it may happen that multiple threads enter the rule at the same time, and here start the problems.

  • We started using the synchronized word, in the upper method (ActionExecuterAbstractBase#executeImpl), inner methods, or blocks of code, but this does not seem to work because even if we synchronize, the transaction may have not been committed, hence, no task1 output is available to other threads. Added a sleep but cannot ensure that it's done anyway and fails sometimes.

  • We also tried the RetryingTransactionHelper (RTH) workaround exposed by Alexey Vasyukov in the lightning talk about 'Alfresco repo under concurrent write load', of this year's beecon, but got several problems too.

I've already asked how this helper is supposed to work when the flag 'requiresNew' is set to false, because it seems not to retry, or we may be misunderstanding something.
https://community.alfresco.com/message/815007-re-retryingtransactionhelper-learnings?commentID=81500...

The problem with this utility is that it behaves differently on a postgres based server vs. mysql based server (we're using latest alfresco520,  mysql server-core 5.7, with 5.1.32 driver , Ubuntu 16; Updated  checked InnoDB for all tables for transacational management)

Updated Tested in SQLServer too, and gives different behaviour from previous ones too.

The approach that works fine in Postgres is having task 1 under a RetryingTransactionHelper (and new transaction), and don't go into task 2 until it's successfully performed.
Task 2 is not executed under a RTH.

However when we try this with mysql, fails. Guess this error is launched because task 1's output is still not fully operable (due to being made inside a transaction) to the parent transaction.

Updated- MySQL  (first task with RTH, second task not - NonRootNodeWithoutParentsException)
- src: https://pastebin.com/raw/DD0bWjW2
- action context: https://pastebin.com/CVWS9U9q
- error: https://pastebin.com/raw/gx2TkWxn

Updated-SQLServer (DataIntegrityViolationException)

- error: https://pastebin.com/raw/BZSTKDUX 

We also tried to insert the second task inside RTH.
With the requireNew flag set to true seems as if the node is not visible to the code within the transaction and when trying to move it, the node is not found.
Guess because it's a new transaction and does not know about the "working in progress node" (seems some AOP implementation)

Updated Clarify that this happens no matter the database provider, so it's nice to see it's coherent.

Updated (both tasks with RTH, with requiresNew - InvalidNodeRefException: Node does not exist)
- src: https://pastebin.com/CvrRSt3X
- action context: same as before

- error: https://pastebin.com/raw/yVNc8YWf

If we insert it with a requireNewFlag set to false, it gives the previous NonRootNodeWithoutParentsException error.

______

We've tried other approaches/combinations and got several kinds of errors, but tried to summarize everything as much as possible.

The only solution that has seemed to work (in any database provider, concurrently) is when rule is executed in background. However, this is not desired; we would like it to be sequential, so final user can know if the action has been correctly performed.

We could also have a cron-like task, that populates the folderized structure, but will lead to lots of empty folders if no data is uploaded and would have to purge it afterwards.

Any suggestions on how to approach this?


Would provide code, but its something quite simple using fileFolderService, both to create and move. We've  analysed the code that is in the default "MoveActionExecuter" action rule provided in the core of Alfresco, and its quite similar to ours.

Thanks in advance.

12 Replies
afaust
Master

Re: Thread-safe action rules for massive concurrency

There is a lot to unpack here.

First, creating new transactions within already running transactions will never work if the nested transaction needs to access data created in the outer one. This has nothing to do with AOP, just with transaction isolation on the DB between two uncommitted transactions.

Second, using the synchronized keyword is a very brutal way of dealing with concurrency issues and is rarely appropriate when trying to deal with data in concurrent transactions. Java-level synchronisation is primarily relevant when dealing with Java object-level state, not transactional data. In the worst case it can cause your entire solution to lock up for good.

Third, most (if not all) operations that are triggered by a user will already run in a retrying transaction, so normally there should be little need in a rule / action to use the retrying transaction helper. Only if you are dealing with custom web scripts, cron jobs or other custom code that - for some reason - is not already running in a transaction will you have to deal with retrying transaction helper directly.

Fourth, you should never use Thread.sleep or anything similar in business-level code. If you think you are dealing with timing issues, a sleep will usually just mask/hide the problem without really addressing it. Most of the time sleep is used as a brutal, poor-man's tool for synchronisation. Outside of technical utility libraries, sleep should be considered a "forbidden API".

I have implemented similar rules / behaviours for moving content in a specific structure upon upload and never had to deal with explicit transaction management for these. Default Alfresco retrying functionality took care of any conflicts and in the worst case, a file upload was processed two or three times on the server (resulting in a bit longer overall duration). It is essential that you do not try to deal with / work around / supress some specific exceptions that are triggered in concurrent situations (e.g. via try-catch). The retrying functionality relies on exceptions for concurrency issues to be propagated up the call chain and if you catch such an exception without rethrowing, Alfresco will not be able to deal with them properly.

iru
Active Member II

Re: Thread-safe action rules for massive concurrency

Hi Faust,

Thanks again for replying.

  • Where are new transactions supposed to be created if not under a running transaction? Meaning.. which are the uses cases prone to create new transactions? Understand that " not already running in a transaction" operations already run within an Alfresco transaction... or maybe I'm misunderstanding it with the JVM active thread concept.
  • The AOP thing was just to emphasize about the stacktrace error. I understand that the first one (the parent) is still uncommitted, and may occur a visibility error between parent-child ones.
  • Agree with you that using synchronized is more a java thing, and that is not a best-practice, but wanted to ensure that each thread enters the code sequentially, due to the requirement that task1 (folderized structe creation) must only run once per day/minute/...
    Don't think a well written synchronized method would have to block your entire application.
  • In this case, the actions are not triggered by a front-end user, but from an API exposed to third party applications through an upload operation vía CMIS endpoints. We tried all this scenario without using transactions and they were not retried.
  • Also agree with you that sleeps/synchronizations don't ensure that the transaction committed, it was just a proof-of-concept.
  • Also agree that exceptions should be propagated upwards, and we do so, but in this case Alfresco does not help us dealing with the problem.

From all your response, the workaround that may work, is to leave the task2 in the rule, but take task 1 and verify it ourselves in the custom code, simplifying things and hopefully our problem, but that won't give response to this matter itself. Still confused why it works in a postgres-based environment, but not in mysql. Have rechecked the compatibility matrix and alfresco version, mysql version, mysql driver and generated tables engine is InnoDB, so everything is within parameters.

afaust
Master

Re: Thread-safe action rules for massive concurrency

Unfortunately your description of how the functionality is implemented is a bit too abstract to fully comprehend the problem or be able to provide concrete recommendations.

"Where are new transactions supposed to be created if not under a running transaction?" - In 99.9% of cases there should be no need for nested transactions, so a "new" transaction will either be the top-level transaction created by Alfresco automatically when calling a standard API endpoint or the top-level transaction created by a developer writing some sort of asynchronous code (e.g. jobs). Calls to CMIS endpoints should always trigger a retrying transaction automatically which would be propagated to any action called from a rule (without "run in background"). Unfortunately your stack trace is too short to point out the responsible component / be aware of the context.

iru
Active Member II

Re: Thread-safe action rules for massive concurrency (MySQL)

Morning,

Have updated the discussion with the requested information, and the action-context.xml with the "extraExceptions" in order the RTH to retry the procedure, as if not, it won't do it (they're not considered retryable)
The client side it's a simple CMIS upload with Atompub binding, and ConcurrentestRunner for concurrency test.

Anyhow, we're going to test it on an SQLServer implementation too, as my suspicion is over some kind of bug... we'll see

afaust
Master

Re: Thread-safe action rules for massive concurrency (MySQL)

Ok - so you are not using the default retryingTransactionHelper but a custom instance with some custom extra exceptions. This is not something that I would recommend. For one, most of the exceptions included are not exceptions where a retry will have a chance to "fix" anything - some of them are indicators of serious developer errors, e.g. regarding not validating input / trigger conditions. Basically you are retrying to try and work around (potentially your own) developer errors...
Never have I encountered a situation where the default retrying transaction helper would not suffice.

I also see some potential issue with using the FileFolderUtil.makeFolders utility. This will also traverse and retrieve folder paths that the current user is not allowed to access, and the result may be unusable due to restricted permissions.

One problem you face in both of your constellations is that the FileFolderService.exists() check in line 90/91 will/should always fail on the first try of the action when the target folder does not exist. This is because the new folder will be created in a nested transaction and will not be visible to the outside transaction for the exists() check. As a result of that (I assume) the outer transaction will rollback and retry the entire action.

The issue with the NonRootNodeWithoutParentsException looks like it may have something to do with incorrect isolation levels on the DB tier and/or a potential bug in the parentAssocs cache inside the NodeDAO. Unfortunately, that cache is a custom implementation and not one of the regular caches, so it may be prone to issues not noticed in any of the other caches.
Is the MySQL server configured with the right default transaction level, and with the configuration parameters as recommended in Alfresco documentation? Normally, before the exception is thrown, Alfresco will check the database to load the parents of a node if they have not been cached yet, and for a node that has just been uploaded it should be able to find the parents (unless some other custom code has messed with the DB in the meantime or the isolation level is not working properly).

iru
Active Member II

Re: Thread-safe action rules for massive concurrency (MySQL)

Thanks for the recommendations regarding the RTH and FileFolderUtil.makeFolders usage, will take them into consideration.

Regarding the FileFolderService.exists, seems not to fail due to target folder being the object that is returned from the previous RTH task (plus we're not seeing the trace placed in the else statement).

As shown in the traces, the problem occurs when we're trying to move the just uploaded document into the just created folder.
__

NonRootNodeWithoutParentsException its an odd one and could understand it's due to some cache bug, but what about the second one; InvalidNodeRefException: Node does not exist?
This, if I understood it correctly, is referring to the fact that the transaction has no context of the node that is being uploaded itself.
__

We've followed the instructions within the configuration documentation http://docs.alfresco.com/5.2/tasks/mysql-config.html

The only aspect it talks about transaction levels, is that the database tables are to use InnoDB, and that's done correctly by the population script of Alfresco itself.

Have also read  http://docs.alfresco.com/4.0/concepts/mysql-config-settings.html

where it recommends to set the "innodb_locks_unsafe_for_binlog = 1"

I'm no DBA, but is there any specific "transaction level"  or "configuration parameters" we might be missing?

As read, the default transaction isolation level for mysql innodb is REPEATABLE-READ.
__

Been going back to previous simpler implementations, taking into consideration your recommendation regarding "CMIS endpoints should always trigger a retrying transaction automatically which would be propagated to any action called from a rule", but still don't see the correct way to implement it.

If multiple threads enter the rule, and only one is to create the folder structure... how should we develop the rule thread-safely?

FileFolderUtils.makeFolders, for instance, should work, as it states in the javadoc: checks for the presence of, and creates as necessary, the folder structure in the provided path

But "FileExistsException" is thrown for non first threads that arrive, and CMIS RTH does not contemplate it as retryable.

- src: https://pastebin.com/nJwQwTmG

- error: (4 thread execution in concurrency, 1 ok, 3 fail)  https://pastebin.com/raw/M23c5j8C

Could force a while statement, until node is created or resolved, but again we're going into deep mud.
It's silly.. but any recommendations on this? Will try anything x)


Best workaround up to the point, using 1 RTH for the folder creation, to be tested in sql server and hope for the best :rolleyes:

afaust
Master

Re: Thread-safe action rules for massive concurrency (MySQL)

If you look at your own error stacktrace (https://pastebin.com/raw/yVNc8YWf ) you can see that a couple of lines above "org.apache.chemistry.opencmis.server.impl.atompub.CmisAtomPubServlet.service" there is an instance of "org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction" triggered by the RetryingTransactionInterceptor. That is the retrying behaviour of the CMIS endpoint I am referring to - and this will apply to most (if not all) of the default Alfresco endpoints.

I wish you would stop referring to "thread-safety" - what you need/want is "transaction-safety", which is a completely separate technical concept, and by continuing to use "thread-safety" you risk confusing other people / community members that may stumble over this thread.

"If multiple threads enter the rule, and only one is to create the folder structure... how should we develop the rule thread-safely?"
Typically this would be very simple: Stop trying to manage transactions manually and rely on the retrying behaviour of Alfresco. As I already outlined in previous responses, multiple transactions may enter the rule and create the folder structure in parallel, but only one will be allowed to commit while all others will be rejected by the DB with update conflicts, causing the transactions to retry and then they will find the folder structure already created by the one transaction that succeeded.

"But "FileExistsException" is thrown for non first threads that arrive, and CMIS RTH does not contemplate it as retryable."
And that is correct - a FileExistsException is never retryable. If a file already exists with the same name that means there exists a node already committed to the DB, and there is no chance at all that simply retrying the current transactions will see the other node vanish (unless some user / code may be deleting it in a parallel transaction).

"NonRootNodeWithoutParentsException its an odd one and could understand it's due to some cache bug, but what about the second one; InvalidNodeRefException: Node does not exist?"

And that would be the correct exception when you create the folder structure in a nested transaction using a RTH, and then try to use that to move something in the outer transaction. Due to transaction isolation, the NodeRef does indeed not exist (yet) for the outer transaction, triggering an InvalidNodeRefException.

"FileFolderUtils.makeFolders, for instance, should work, as it states in the javadoc: checks for the presence of, and creates as necessary, the folder structure in the provided path"

Yes, but the JavaDoc does NOT mention that the lookup will be done with system privileges which means it can return existing folders that the user is not allowed to access and thus cause AccessDeniedException failures later on. Never take JavaDoc comments for 100% accurate without checking the implementation.

iru
Active Member II

Re: Thread-safe action rules for massive concurrency (MySQL)

Morning,

Some quick clarifications.

- Yep, it's clear to me the RTH involvement in the CMIS endpoint. Thanks for re-pointing it out

- Yep, you're right, maybe it's more a transaction issue, but from the client's point of view, it's a multiple-concurrent thread test, even if on the server side, as seen, it's leading to multiple-transactions.

- Related to the InvalidNodeRefException, the nodeRef ID the exception is pointing to its not of the just created folder, but of the just uploaded file.
To me, the problem is not in the just created folder, but in the fact that in an inner transaction snapshot, due to isolation, the just just to be uploaded node does not exist.
To reinforce my statement, this happens too when we have 1 RTH with both actions (create and move) within the same block of code.
Just for clarification, this happens in both our mysql and postgres setup.

- Yep, FileFolderUtils.makeFolders is done with system privileges, checked code besides javadoc.
However, no AccessDeniedException is raised in this tests and it's not what we're discussing, although thanks for re-pointing it out. What I was empathizing is how it's programmed, first check, then create if not existing (just what we want).

"If a file already exists with the same name that means there exists a node already committed to the DB"
That's what FileFolderUtils.makeFolders deals with;
1. first checks > node does not exists
2. then tries to create > node exists and gives the exception (created by another thread / transaction)

However, this method does throw "FileExistsException" internally, which, in first instance does not make sense if its job is to check first.

Guess database isolation again.
We do had some custom code in order to create the hierarchical structure of folders, using FileFolderService in a recursive check/creation loop, but wanted to simplify the test and that's why we opted for FileFolderUtils, thought it would be better a product provided solution than our own code.
__

Yesterday we tested it on an SQLServer based installation, and got other round of different exceptions, will update first post now.
__

To sum it up, we still don't know howto solve this within the rule implementation.
- Using 1 RTH for the folder creation, leads to different kind of exceptions, different depending on the database provider, like  mysql and sqlserver (although not in postgres)
- Not using RTH, leads to some threads not to be aware of the "initizliation" of the folderization structure that only one of the batch will create.
When multiple threads (with multiple transactions) try to create the folders FileExistsException is thrown on all but one. As this is not retryable, n-1 errors are thrown to the client-side, being batch size n.
Relying on the RTH of the cmis endpoint seems not to be a valid solution, as the exception is not handled as a retryable one.

We will try to move part of the code to our custom API, hoping CMIS deals ok with it and hope we got some clarifying results.
__

Even if we seem to be in a vicious replying circle, insight appreciated @Axel, I'm sure your highlights will help someone else.
pd: Feel free to change any misleading word/concept that may disrupt other community members. If not you, some moderator should be able to do so.

afaust
Master

Re: Thread/Transaction-safe action rules for massive concurrency

I am actually a moderator, so there should be no issues. Though I really don't want to go about editing any replies because it could be even more confusing.

Ok, the InvalidNodeRef error inside a nested transaction is of course an issue with the isolation / non-committed state.

Multiple threads in multiple transactions running through the makeFolders code should not run into FileExistsException (again, issue with isolation may be cause here) - there should be something along the line of DeadlockLoserException or ConcurrencyFailureException, which are retryable.

I wouldn't call the reply-cycle "vicious" - just long-running since we don't seem to come to a simple, clear-cut answer to what is going wrong.