Unable to rolling restart my cluster due to Hazelcast timeouts

josh_barrett
Active Member II

I am running Alfresco 5.1.1 and ran into an issue yesterday under peak load.

A couple of servers got into a bad state, so we tried to do a rolling restart of Alfresco.

The servers wouldn't start up because of a Hazelcast timeout, probably because the cluster was so busy.

We had to stop everything to clear the cluster and then start the servers up again. This fixed the problem but caused a full outage for our clients.

Looking back, I think I need to increase a timeout and/or get creative: restart one Alfresco node with clustering disabled and get it up and running alongside, say, another server outside the cluster. Then take down the servers that are still in the cluster, start them up with a fresh cache, and add the other servers back to the cluster.

Has anyone run into this situation before, or have other ideas?

Here is the error:

2017-10-18 16:38:32,379 ERROR [web.context.ContextLoader] [localhost-startStop-1] Context initialization failed 
com.hazelcast.core.OperationTimeoutException: [CONCURRENT_MAP_PUT] Operation Timeout (with no response!): 20000 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.waitAndGetResult(BaseManager.java:619) 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getRedoAwareResult(BaseManager.java:641) 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getResult(BaseManager.java:636) 
at com.hazelcast.impl.BaseManager$RequestBasedCall.getResultAsObject(BaseManager.java:464) 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getResultAsObject(BaseManager.java:555) 
at com.hazelcast.impl.BaseManager$RequestBasedCall.getResultAsObject(BaseManager.java:460) 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getResultAsObject(BaseManager.java:555) 
at com.hazelcast.impl.ConcurrentMapManager$MPut.txnalPut(ConcurrentMapManager.java:1894) 
at com.hazelcast.impl.ConcurrentMapManager$MPut.txnalPut(ConcurrentMapManager.java:1818) 
at com.hazelcast.impl.ConcurrentMapManager$MPut.put(ConcurrentMapManager.java:1682) 
at com.hazelcast.impl.MProxyImpl$MProxyReal.put(MProxyImpl.java:632) 
at com.hazelcast.impl.MProxyImpl$MProxyReal.put(MProxyImpl.java:606) 
at com.hazelcast.impl.MProxyImpl.put(MProxyImpl.java:173) 
at com.hazelcast.impl.MProxyImpl.put(MProxyImpl.java:124) 
at org.alfresco.enterprise.repo.cluster.cache.HazelcastSimpleCache.put(HazelcastSimpleCache.java:108) 
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory$CacheProxyInvoker.transferCollectedItems(ClusteredObjectProxyFactory.java:241) 
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory$CacheProxyInvoker.transferCollectedItems(ClusteredObjectProxyFactory.java:204) 
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory$AbstractClusteredObjectProxyInvoker.upgradeBackingObject(ClusteredObjectProxyFactory.java:326)
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory$CacheProxyInvoker.upgradeBackingObject(ClusteredObjectProxyFactory.java:204) 
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory.upgradeCaches(ClusteredObjectProxyFactory.java:108) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap.upgradeClusterObjects(ClusteringBootstrap.java:136) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap.bootstrapWork(ClusteringBootstrap.java:127) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap$1.execute(ClusteringBootstrap.java:69) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap$1.execute(ClusteringBootstrap.java:65) 
at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:464) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap$2.doWork(ClusteringBootstrap.java:78) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap$2.doWork(ClusteringBootstrap.java:75) 
at org.alfresco.repo.security.authentication.AuthenticationUtil.runAs(AuthenticationUtil.java:555) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap.onBootstrap(ClusteringBootstrap.java:74) 
at org.springframework.extensions.surf.util.AbstractLifecycleBean.onApplicationEvent(AbstractLifecycleBean.java:56) 
at org.alfresco.repo.management.SafeApplicationEventMulticaster.multicastEventInternal(SafeApplicationEventMulticaster.java:214) 
at org.alfresco.repo.management.SafeApplicationEventMulticaster.multicastEvent(SafeApplicationEventMulticaster.java:185) 
at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:334) 
at org.springframework.context.support.AbstractApplicationContext.finishRefresh(AbstractApplicationContext.java:950) 
at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:482) 
at org.springframework.web.context.ContextLoader.configureAndRefreshWebApplicationContext(ContextLoader.java:410) 
at org.springframework.web.context.ContextLoader.initWebApplicationContext(ContextLoader.java:306) 
at org.springframework.web.context.ContextLoaderListener.contextInitialized(ContextLoaderListener.java:112) 
at org.alfresco.web.app.ContextLoaderListener.contextInitialized(ContextLoaderListener.java:70) 
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:5016) 
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5524) 
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150) 
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901) 
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877) 
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:649) 
at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:1081) 
at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1877) 
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) 
Oct 18, 2017 4:38:32 PM org.apache.catalina.core.StandardContext listenerStart 
SEVERE: Exception sending context initialized event to listener instance of class org.alfresco.web.app.ContextLoaderListener 
com.hazelcast.core.OperationTimeoutException: [CONCURRENT_MAP_PUT] Operation Timeout (with no response!): 20000 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.waitAndGetResult(BaseManager.java:619) 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getRedoAwareResult(BaseManager.java:641) 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getResult(BaseManager.java:636) 
at com.hazelcast.impl.BaseManager$RequestBasedCall.getResultAsObject(BaseManager.java:464) 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getResultAsObject(BaseManager.java:555) 
at com.hazelcast.impl.BaseManager$RequestBasedCall.getResultAsObject(BaseManager.java:460) 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getResultAsObject(BaseManager.java:555) 
at com.hazelcast.impl.ConcurrentMapManager$MPut.txnalPut(ConcurrentMapManager.java:1894) 
at com.hazelcast.impl.ConcurrentMapManager$MPut.txnalPut(ConcurrentMapManager.java:1818) 
at com.hazelcast.impl.ConcurrentMapManager$MPut.put(ConcurrentMapManager.java:1682) 
at com.hazelcast.impl.MProxyImpl$MProxyReal.put(MProxyImpl.java:632) 
at com.hazelcast.impl.MProxyImpl$MProxyReal.put(MProxyImpl.java:606) 
at com.hazelcast.impl.MProxyImpl.put(MProxyImpl.java:173) 
at com.hazelcast.impl.MProxyImpl.put(MProxyImpl.java:124) 
at org.alfresco.enterprise.repo.cluster.cache.HazelcastSimpleCache.put(HazelcastSimpleCache.java:108) 
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory$CacheProxyInvoker.transferCollectedItems(ClusteredObjectProxyFactory.java:241) 
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory$CacheProxyInvoker.transferCollectedItems(ClusteredObjectProxyFactory.java:204) 
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory$AbstractClusteredObjectProxyInvoker.upgradeBackingObject(ClusteredObjectProxyFactory.java:326)
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory$CacheProxyInvoker.upgradeBackingObject(ClusteredObjectProxyFactory.java:204) 
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory.upgradeCaches(ClusteredObjectProxyFactory.java:108) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap.upgradeClusterObjects(ClusteringBootstrap.java:136) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap.bootstrapWork(ClusteringBootstrap.java:127) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap$1.execute(ClusteringBootstrap.java:69) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap$1.execute(ClusteringBootstrap.java:65) 
at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:464) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap$2.doWork(ClusteringBootstrap.java:78) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap$2.doWork(ClusteringBootstrap.java:75) 
at org.alfresco.repo.security.authentication.AuthenticationUtil.runAs(AuthenticationUtil.java:555) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap.onBootstrap(ClusteringBootstrap.java:74) 
at org.springframework.extensions.surf.util.AbstractLifecycleBean.onApplicationEvent(AbstractLifecycleBean.java:56) 
at org.alfresco.repo.management.SafeApplicationEventMulticaster.multicastEventInternal(SafeApplicationEventMulticaster.java:214) 
at org.alfresco.repo.management.SafeApplicationEventMulticaster.multicastEvent(SafeApplicationEventMulticaster.java:185) 
at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:334) 
at org.springframework.context.support.AbstractApplicationContext.finishRefresh(AbstractApplicationContext.java:950) 
at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:482) 
at org.springframework.web.context.ContextLoader.configureAndRefreshWebApplicationContext(ContextLoader.java:410) 
at org.springframework.web.context.ContextLoader.initWebApplicationContext(ContextLoader.java:306) 
at org.springframework.web.context.ContextLoaderListener.contextInitialized(ContextLoaderListener.java:112) 
at org.alfresco.web.app.ContextLoaderListener.contextInitialized(ContextLoaderListener.java:70) 
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:5016) 
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5524) 
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150) 
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901) 
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877) 
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:649) 
at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:1081) 
at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1877) 
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) 

Oct 18, 2017 4:38:32 PM org.apache.catalina.core.StandardContext listenerStart 
SEVERE: Exception sending context initialized event to listener instance of class org.alfresco.web.app.ContextListener 
com.hazelcast.core.OperationTimeoutException: [CONCURRENT_MAP_PUT] Operation Timeout (with no response!): 20000 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.waitAndGetResult(BaseManager.java:619) 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getRedoAwareResult(BaseManager.java:641) 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getResult(BaseManager.java:636) 
at com.hazelcast.impl.BaseManager$RequestBasedCall.getResultAsObject(BaseManager.java:464) 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getResultAsObject(BaseManager.java:555) 
at com.hazelcast.impl.BaseManager$RequestBasedCall.getResultAsObject(BaseManager.java:460) 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getResultAsObject(BaseManager.java:555) 
at com.hazelcast.impl.ConcurrentMapManager$MPut.txnalPut(ConcurrentMapManager.java:1894) 
at com.hazelcast.impl.ConcurrentMapManager$MPut.txnalPut(ConcurrentMapManager.java:1818) 
at com.hazelcast.impl.ConcurrentMapManager$MPut.put(ConcurrentMapManager.java:1682) 
at com.hazelcast.impl.MProxyImpl$MProxyReal.put(MProxyImpl.java:632) 
at com.hazelcast.impl.MProxyImpl$MProxyReal.put(MProxyImpl.java:606) 
at com.hazelcast.impl.MProxyImpl.put(MProxyImpl.java:173) 
at com.hazelcast.impl.MProxyImpl.put(MProxyImpl.java:124) 
at org.alfresco.enterprise.repo.cluster.cache.HazelcastSimpleCache.put(HazelcastSimpleCache.java:108) 
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory$CacheProxyInvoker.transferCollectedItems(ClusteredObjectProxyFactory.java:241) 
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory$CacheProxyInvoker.transferCollectedItems(ClusteredObjectProxyFactory.java:204) 
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory$AbstractClusteredObjectProxyInvoker.upgradeBackingObject(ClusteredObjectProxyFactory.java:326)
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory$CacheProxyInvoker.upgradeBackingObject(ClusteredObjectProxyFactory.java:204) 
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory.upgradeCaches(ClusteredObjectProxyFactory.java:108) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap.upgradeClusterObjects(ClusteringBootstrap.java:136) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap.bootstrapWork(ClusteringBootstrap.java:127) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap$1.execute(ClusteringBootstrap.java:69) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap$1.execute(ClusteringBootstrap.java:65) 
at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:464) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap$2.doWork(ClusteringBootstrap.java:78) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap$2.doWork(ClusteringBootstrap.java:75) 
at org.alfresco.repo.security.authentication.AuthenticationUtil.runAs(AuthenticationUtil.java:555) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap.onBootstrap(ClusteringBootstrap.java:74) 
at org.springframework.extensions.surf.util.AbstractLifecycleBean.onApplicationEvent(AbstractLifecycleBean.java:56) 
at org.alfresco.repo.management.SafeApplicationEventMulticaster.multicastEventInternal(SafeApplicationEventMulticaster.java:214) 
at org.alfresco.repo.management.SafeApplicationEventMulticaster.multicastEvent(SafeApplicationEventMulticaster.java:185) 
at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:334) 
at org.springframework.context.support.AbstractApplicationContext.finishRefresh(AbstractApplicationContext.java:950) 
at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:482) 
at org.springframework.web.context.ContextLoader.configureAndRefreshWebApplicationContext(ContextLoader.java:410) 
at org.springframework.web.context.ContextLoader.initWebApplicationContext(ContextLoader.java:306) 
at org.springframework.web.context.ContextLoaderListener.contextInitialized(ContextLoaderListener.java:112) 
at org.alfresco.web.app.ContextLoaderListener.contextInitialized(ContextLoaderListener.java:70) 
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:5016) 
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5524) 
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150) 
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901) 
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877) 
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:649) 
at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:1081) 
at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1877) 
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) 

Oct 18, 2017 4:38:32 PM org.apache.catalina.core.StandardContext listenerStart 
SEVERE: Exception sending context initialized event to listener instance of class org.alfresco.repo.webdav.WebDAVSessionListener 
com.hazelcast.core.OperationTimeoutException: [CONCURRENT_MAP_PUT] Operation Timeout (with no response!): 20000 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.waitAndGetResult(BaseManager.java:619) 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getRedoAwareResult(BaseManager.java:641) 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getResult(BaseManager.java:636) 
at com.hazelcast.impl.BaseManager$RequestBasedCall.getResultAsObject(BaseManager.java:464) 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getResultAsObject(BaseManager.java:555) 
at com.hazelcast.impl.BaseManager$RequestBasedCall.getResultAsObject(BaseManager.java:460) 
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getResultAsObject(BaseManager.java:555) 
at com.hazelcast.impl.ConcurrentMapManager$MPut.txnalPut(ConcurrentMapManager.java:1894) 
at com.hazelcast.impl.ConcurrentMapManager$MPut.txnalPut(ConcurrentMapManager.java:1818) 
at com.hazelcast.impl.ConcurrentMapManager$MPut.put(ConcurrentMapManager.java:1682) 
at com.hazelcast.impl.MProxyImpl$MProxyReal.put(MProxyImpl.java:632) 
at com.hazelcast.impl.MProxyImpl$MProxyReal.put(MProxyImpl.java:606) 
at com.hazelcast.impl.MProxyImpl.put(MProxyImpl.java:173) 
at com.hazelcast.impl.MProxyImpl.put(MProxyImpl.java:124) 
at org.alfresco.enterprise.repo.cluster.cache.HazelcastSimpleCache.put(HazelcastSimpleCache.java:108) 
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory$CacheProxyInvoker.transferCollectedItems(ClusteredObjectProxyFactory.java:241) 
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory$CacheProxyInvoker.transferCollectedItems(ClusteredObjectProxyFactory.java:204) 
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory$AbstractClusteredObjectProxyInvoker.upgradeBackingObject(ClusteredObjectProxyFactory.java:326)
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory$CacheProxyInvoker.upgradeBackingObject(ClusteredObjectProxyFactory.java:204) 
at org.alfresco.enterprise.repo.cluster.core.ClusteredObjectProxyFactory.upgradeCaches(ClusteredObjectProxyFactory.java:108) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap.upgradeClusterObjects(ClusteringBootstrap.java:136) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap.bootstrapWork(ClusteringBootstrap.java:127) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap$1.execute(ClusteringBootstrap.java:69) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap$1.execute(ClusteringBootstrap.java:65) 
at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:464) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap$2.doWork(ClusteringBootstrap.java:78) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap$2.doWork(ClusteringBootstrap.java:75) 
at org.alfresco.repo.security.authentication.AuthenticationUtil.runAs(AuthenticationUtil.java:555) 
at org.alfresco.enterprise.repo.cluster.core.ClusteringBootstrap.onBootstrap(ClusteringBootstrap.java:74) 
at org.springframework.extensions.surf.util.AbstractLifecycleBean.onApplicationEvent(AbstractLifecycleBean.java:56) 
at org.alfresco.repo.management.SafeApplicationEventMulticaster.multicastEventInternal(SafeApplicationEventMulticaster.java:214) 
at org.alfresco.repo.management.SafeApplicationEventMulticaster.multicastEvent(SafeApplicationEventMulticaster.java:185) 
at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:334) 
at org.springframework.context.support.AbstractApplicationContext.finishRefresh(AbstractApplicationContext.java:950) 
at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:482) 
at org.springframework.web.context.ContextLoader.configureAndRefreshWebApplicationContext(ContextLoader.java:410) 
at org.springframework.web.context.ContextLoader.initWebApplicationContext(ContextLoader.java:306) 
at org.springframework.web.context.ContextLoaderListener.contextInitialized(ContextLoaderListener.java:112) 
at org.alfresco.web.app.ContextLoaderListener.contextInitialized(ContextLoaderListener.java:70) 
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:5016) 
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5524) 
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150) 
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901) 
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877) 
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:649) 
at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:1081) 
at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1877) 
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) 

Oct 18, 2017 4:38:32 PM com.sun.xml.ws.transport.http.servlet.WSServletContextListener contextInitialized 
INFO: WSSERVLET12: JAX-WS context listener initializing 
Oct 18, 2017 4:38:40 PM com.sun.xml.ws.transport.http.servlet.WSServletDelegate <init> 
INFO: WSSERVLET14: JAX-WS servlet initializing 
Oct 18, 2017 4:38:42 PM org.apache.catalina.core.StandardContext startInternal 
SEVERE: Error listenerStart 
Oct 18, 2017 4:38:42 PM org.apache.catalina.core.StandardContext startInternal 
SEVERE: Context [/alfresco] startup failed due to previous errors 
Oct 18, 2017 4:38:42 PM com.sun.xml.ws.transport.http.servlet.WSServletDelegate destroy 
INFO: WSSERVLET15: JAX-WS servlet destroyed 
Oct 18, 2017 4:38:42 PM com.sun.xml.ws.transport.http.servlet.WSServletContextListener contextDestroyed 
INFO: WSSERVLET13: JAX-WS context listener destroyed 
Oct 18, 2017 4:38:42 PM org.apache.catalina.core.ApplicationContext log 
INFO: Closing Spring root WebApplicationContext 
2017-10-18 16:38:43,807 INFO [cluster.core.ClusteringBootstrap] [localhost-startStop-1] Clustering has shutdown. 
2017-10-18 16:38:43,808 WARN [context.support.XmlWebApplicationContext] [localhost-startStop-1] Exception thrown from ApplicationListener handling ContextClosedEvent 
java.lang.NullPointerException 
at org.alfresco.repo.workflow.activiti.ActivitiEngineInitializer.onShutdown(ActivitiEngineInitializer.java:65) 
at org.springframework.extensions.surf.util.AbstractLifecycleBean.onApplicationEvent(AbstractLifecycleBean.java:67) 
at org.alfresco.repo.management.SafeApplicationEventMulticaster.multicastEventInternal(SafeApplicationEventMulticaster.java:214) 
at org.alfresco.repo.management.SafeApplicationEventMulticaster.multicastEvent(SafeApplicationEventMulticaster.java:190) 
at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:334) 
at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1051) 
at org.springframework.context.support.AbstractApplicationContext.close(AbstractApplicationContext.java:1012) 
at org.springframework.web.context.ContextLoader.closeWebApplicationContext(ContextLoader.java:586) 
at org.springframework.web.context.ContextLoaderListener.contextDestroyed(ContextLoaderListener.java:143) 
at org.apache.catalina.core.StandardContext.listenerStop(StandardContext.java:5063) 
at org.apache.catalina.core.StandardContext.stopInternal(StandardContext.java:5719) 
at org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:232) 
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:160) 
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901) 
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877) 
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:649) 
at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:1081) 
at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1877) 
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745)

1 Solution

Accepted Solutions
afaust
Master

Re: Unable to rolling restart my cluster due to Hazelcast timeouts

Hazelcast timeouts can be caused by many things, ranging from actual networking issues, through CPU overload, to memory / garbage collection issues on the other cluster node. The issue I have seen most often is the latter: a poorly configured system sitting very close to garbage collection hell, where only a slight change in circumstances would bring down the entire cluster. You need to investigate which issue you were actually suffering from. I'd advise running some JVM monitoring, e.g. via jvisualvm, during startup (on all cluster nodes) to get a picture of what's going on.
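
If attaching jvisualvm interactively is not practical, a rough idea of garbage collection pressure can also be read from the standard JVM management beans. The following is only a minimal sketch, not a replacement for proper monitoring: the class name `GcPressureProbe`, the 5-second sampling interval and the one-minute run length are arbitrary choices for illustration. Without arguments it samples the JVM it runs in; passing a JMX service URL as the first argument assumes that remote JMX is enabled and reachable on the Alfresco node, which may not be the case in your setup.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class GcPressureProbe {

    /** Sums the accumulated GC time (ms) of all collectors; -1 ("undefined") values are skipped. */
    public static long totalGcTimeMillis(List<GarbageCollectorMXBean> beans) {
        long total = 0;
        for (GarbageCollectorMXBean bean : beans) {
            long millis = bean.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Share of a wall-clock interval spent in GC, as a percentage clamped to 0..100. */
    public static double gcPercent(long gcMillisDelta, long wallMillisDelta) {
        if (wallMillisDelta <= 0) {
            return 0.0;
        }
        double pct = 100.0 * gcMillisDelta / wallMillisDelta;
        return Math.max(0.0, Math.min(100.0, pct));
    }

    public static void main(String[] args) throws Exception {
        JMXConnector connector = null;
        List<GarbageCollectorMXBean> beans;
        if (args.length > 0) {
            // Assumption: args[0] is a JMX service URL of a node with remote JMX enabled.
            connector = JMXConnectorFactory.connect(new JMXServiceURL(args[0]));
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            beans = ManagementFactory.getPlatformMXBeans(connection, GarbageCollectorMXBean.class);
        } else {
            // No argument: sample the JVM this class is running in.
            beans = ManagementFactory.getGarbageCollectorMXBeans();
        }
        try {
            long lastGc = totalGcTimeMillis(beans);
            long lastWall = System.currentTimeMillis();
            // Twelve samples, five seconds apart (about one minute in total).
            for (int i = 0; i < 12; i++) {
                Thread.sleep(5000);
                long gc = totalGcTimeMillis(beans);
                long wall = System.currentTimeMillis();
                System.out.printf("GC share of last interval: %.1f%%%n",
                        gcPercent(gc - lastGc, wall - lastWall));
                lastGc = gc;
                lastWall = wall;
            }
        } finally {
            if (connector != null) {
                connector.close();
            }
        }
    }
}
```

A node that reports a consistently high GC share while another node is starting up would fit the "close to garbage collection hell" picture described above; a low share points more towards CPU or network causes.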

In some circumstances you might even be able to avoid doing a full restart of your entire cluster, e.g. if only the communication / cluster state is affected. Using the JavaScript Console you can restart only the Hazelcast layer, and using the Caches tool of the OOTBee Support Tools addon you can purge data caches to remove potentially stale data.

4 Replies
jpotts
Professional

Re: Unable to rolling restart my cluster due to Hazelcast timeouts

Have you tried disabling multicast and instead listing the members of the cluster individually in the Hazelcast config?

It looks something like:

            <hz:join>
               <hz:multicast enabled="false"
                     multicast-group="224.2.2.5"
                     multicast-port="54327"/>
               <hz:tcp-ip enabled="true">
                  <hz:members>10.84.1.151,10.84.1.152</hz:members>
               </hz:tcp-ip>
            </hz:join>
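
When members are listed explicitly like this, each node has to be able to open a TCP connection to the others. A quick way to rule out basic connectivity problems is a small check such as the sketch below. It is only an illustration: the class name `MemberPortCheck` is made up, port 5701 is Hazelcast's default and is assumed here rather than taken from the config above, and the 2-second timeout is arbitrary. A successful connect only shows that the port is open, not that the Hazelcast member behind it is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

public class MemberPortCheck {

    /** Splits a comma-separated member list (as in hz:members) into trimmed host names. */
    public static List<String> parseMembers(String csv) {
        List<String> hosts = new ArrayList<>();
        for (String part : csv.split(",")) {
            String host = part.trim();
            if (!host.isEmpty()) {
                hosts.add(host);
            }
        }
        return hosts;
    }

    /** Returns true if a TCP connection to host:port can be opened within the timeout. */
    public static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, unreachable hosts and connect timeouts.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults: the member list from the example config; 5701 is an assumed port.
        String members = args.length > 0 ? args[0] : "10.84.1.151,10.84.1.152";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 5701;
        for (String host : parseMembers(members)) {
            System.out.println(host + ":" + port + " reachable=" + canConnect(host, port, 2000));
        }
    }
}
```

Run it from each cluster node in turn, since firewall rules are often asymmetric.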

afaust
Master

Re: Unable to rolling restart my cluster due to Hazelcast timeouts

On the Repository tier, Hazelcast multicast is disabled by default. The config example from Jeff applies only to the Share tier, where the Hazelcast config is embedded in Spring. For Share, the Alfresco documentation provides the configuration with multicast enabled. The error messages in the logs point to Repository-tier issues, though.

josh_barrett
Active Member II

Re: Unable to rolling restart my cluster due to Hazelcast timeouts

Thanks for the replies, Axel Faust and Jeff Potts. The actual root problem was that all of the Alfresco servers in our cluster were close to being maxed out on CPU.

Under peak load, a few custom background processes kicked off and pushed the servers over the edge.

In the heat of the moment we removed all of the servers from the cluster and simply had our API layer talk to Alfresco via CMIS through a load balancer, unclustered. We thought we were all good: the servers seemed healthy in terms of CPU, JVM, and the number of requests we were handling. But after looking into the logs, we found that a majority of the document update calls were failing, with messages like the following in our custom API logs:
Expected xxxx bytes but retrieved 0 bytes!

We reproduced this issue in our Performance environment and resolved it by adding the servers back into the cluster. The weird thing is that only updates triggered it. New document adds didn't have any issues, only binary updates. I wonder if this is a bug in the CMIS implementation.