searchService returns same nodeRef twice (duplicate index in solr)

vincent-kali
Established Member

Dear all,

For various reasons, we use a custom REST API to run searches on the repo (Alfresco Community 5.1.g). We discovered that this custom API sometimes returns the same nodeRef more than once.

Search code is below:

<code>

ResultSet resultSet = null;
List<NodeRef> results = null;
try {
    SearchParameters sp = new SearchParameters();
    // The language constant is static, so reference it through SearchService itself
    sp.setLanguage(SearchService.LANGUAGE_SOLR_FTS_ALFRESCO);
    sp.addStore(new StoreRef("workspace://SpacesStore"));
    sp.setQuery(query);
    sp.setMaxItems(maxItems);
    // Sort on the requested property; "@" marks a property field
    SortDefinition sortDefinition = new SortDefinition(
            SearchParameters.SortDefinition.SortType.FIELD, "@" + sortField, sortAscending);
    sp.addSort(sortDefinition);
    logger.debug("search - query: " + sp.getQuery());
    resultSet = this.services.getSearchService().query(sp);
    logger.debug("Results found: " + resultSet.getNumberFound());
    results = resultSet.getNodeRefs();
} finally {
    // Always release the result set, even if the query fails
    if (resultSet != null) {
        resultSet.close();
    }
}
return results;

</code>

The solr4 report indicates "Count of duplicate nodes in the index":"100", which means there are errors in the solr4 index.

1) Does anybody know how to fix this?

2) When running the same query through the "alfresco/service/slingshot/node/search" API, only one result is returned. Does this mean that a duplicate check is performed on the Java side? (I did not find anything related to this in the code.)
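As a stopgap until the index itself is repaired, the custom API could drop repeated entries from the result list before returning it. The following is only a minimal sketch: `Dedup` and `distinctInOrder` are hypothetical names, and the helper is generic so that it runs without Alfresco on the classpath. It relies on the element type implementing `equals`/`hashCode`, which `NodeRef` does. Note that after de-duplication a page may contain fewer than `maxItems` entries.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

class Dedup {
    // Removes repeated elements while keeping the first occurrence of each,
    // so the sort order delivered by the search is preserved.
    static <T> List<T> distinctInOrder(List<T> items) {
        return new ArrayList<>(new LinkedHashSet<>(items));
    }

    public static void main(String[] args) {
        // Strings stand in for NodeRef values here (assumption for the demo).
        List<String> refs = List.of(
                "workspace://SpacesStore/a",
                "workspace://SpacesStore/b",
                "workspace://SpacesStore/a");
        System.out.println(distinctInOrder(refs));
    }
}
```

In the search code above, this would be applied to the list returned by `resultSet.getNodeRefs()`. It hides the symptom only; the duplicate index entries remain in Solr.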

Thanks

vincent

21 Replies
mehe
Senior Member II

Re: searchService returns same nodeRef twice (duplicate index in solr)

I had a similar problem with 5.0.?, there was also a JIRA for this [MNT-13767] Using disjunction "OR" in CMIS query returns wrong number of results when SOLR 4 is used... 

To get rid of the duplicates, you should reindex your repo (see JIRA) or use the fix option described in Troubleshooting Solr Index | Alfresco Documentation 
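For reference, the report and fix actions from the troubleshooting page are plain HTTP calls against the Solr admin endpoint. The sketch below builds both URLs and fetches the report. The base URL is an assumption (host, port and scheme depend on the installation, and many setups require HTTPS with a client certificate for Solr, which this sketch does not handle). The class and method names are hypothetical; the `action=REPORT` and `action=FIX` parameters are quoted from memory of the Alfresco documentation, so check them against the page for your version.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class SolrAdminActions {
    // Assumption: where the solr4 web app is reachable; adjust to your install.
    static final String DEFAULT_BASE = "http://localhost:8080/solr4";

    // URL of the index report (contains "Count of duplicate nodes in the index").
    static URI reportUri(String baseUrl) {
        return URI.create(baseUrl + "/admin/cores?action=REPORT&wt=xml");
    }

    // URL of the fix action mentioned in the troubleshooting documentation.
    static URI fixUri(String baseUrl) {
        return URI.create(baseUrl + "/admin/cores?action=FIX");
    }

    // Plain GET without authentication; returns the response body as text.
    static String get(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        String base = args.length > 0 ? args[0] : DEFAULT_BASE;
        // Only the read-only report is fetched here; trigger the fix deliberately.
        System.out.println(get(reportUri(base)));
    }
}
```

Running the report before and after the fix shows whether the duplicate count actually went down.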

If your system keeps creating new duplicates, you could try updating to a newer version, unless you are bound to the older version for some reason.

vincent-kali
Established Member

Re: searchService returns same nodeRef twice (duplicate index in solr)

Thanks for your reply.

The repo contains millions of documents. Could a complete reindex take days? Do you have any benchmark for this?

Is Solr available for queries during the reindex? Does the Alfresco repo have to re-transform all the content to post it to Solr? (That would be unacceptable in production.)

I also noticed that some duplicates are cleaned up automatically by Solr. Is there a background process in Solr that cleans up this kind of error?

I finally found that the NodeBrowser/standard search API performs a DB query, while my code performs a Solr query (that's why I get duplicates from my API but not in the NodeBrowser). This is the case even when I set the query consistency to "QueryConsistency.TRANSACTIONAL_IF_POSSIBLE" in my code.

I was expecting a query like =cm\:name:myFileName.txt to go to the DB instead of Solr. Am I wrong?

Thanks for your comments and advice

Vincent

mehe
Senior Member II

Re: searchService returns same nodeRef twice (duplicate index in solr)

Solr reindexing is (so far) a "destructive" operation: you have to delete the index and restart Alfresco. I always rename the index dir instead of deleting it, so I can switch back to the old index if the planned downtime turns out to be too short for the rebuild. Alfresco will then rebuild the index and, as you feared, the content will also be transformed and reindexed. In the newer Alfresco versions there are a kind of "content segments" in the Solr data store, so maybe there is a way to spare the indexing process from transforming every document again, but I don't know.

Solr is available during reindex, but you can only "see" the data that is already processed - and solr is under heavy load while reindexing. So you should only reindex when nobody is working with alfresco (Weekend, planned downtime)

Auto cleaning? Don't know...   But you can use the described "fix" option as a first try to eliminate the duplicates, that won't be so harmful to your users - but on big repos I do that at night.

Reindexing time also depends on the size of your content and on your server hardware. Reindexing took me 2 days for a repo with about 10.000.000 docs and 3 TB of content. You can tune the reindexing process in solrcore.properties (batch size, number of threads and so on). Storing the index on SSD might help as well.

If a reindex on your production system is not possible, you can clone your system from a backup (DB and content), reindex the clone, transfer the new index to your production system and switch Solr to the new index (stop Solr, move the index data dir and start it again). The index tracker will recognize the missing transactions and catch up in a short(er) time. This minimizes your Alfresco downtime.

You can also use the clone for a benchmark of your reindexing process.

Are you using the same tomcat for alfresco and solr?

I think queries like =cm\:name:myFileName.txt are FTS (fulltext) queries that always operate on the index, but I haven't tried QueryConsistency.TRANSACTIONAL_IF_POSSIBLE until now. I thought this had to be configured in the repo too, and that some extra DB indexes have to be created when using it...

regards,

Martin

afaust
Master

Re: searchService returns same nodeRef twice (duplicate index in solr)

There is technically no need for a downtime during re-indexing. You can always create a new SOLR core to build a new index while you keep the old core around for continued search availability. Once the new SOLR core is done indexing, you can simply switch out the index.

Vincent, did you check your historical SOLR logs for any indexing errors? Often I find that index inconsistencies are the result of exceptions that people - for some reason - keep ignoring.

The 10 million documents in 2 days that Martin mentions sounds like a reasonable amount for a "standard" (non-optimised) system. There are a lot of factors that affect the duration, e.g. number/size of transactions, ACLs etc. The best I have seen without extreme resources / scaling is about 300.000 - 400.000 documents per hour.
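To put those figures side by side, here is a small back-of-the-envelope calculation using only the numbers quoted in this thread. The class and method names are made up for the example, and real throughput will vary with the factors listed above.

```java
class ReindexEstimate {
    // Average throughput for a finished reindex.
    static double docsPerHour(long documents, double hours) {
        return documents / hours;
    }

    // Expected duration for a repo of the given size at a given throughput.
    static double hoursNeeded(long documents, double docsPerHour) {
        return documents / docsPerHour;
    }

    public static void main(String[] args) {
        long docs = 10_000_000L;
        // Martin's run: 10 million documents in 2 days (48 hours).
        System.out.printf("observed rate: %.0f docs/hour%n", docsPerHour(docs, 48));
        // The same repo at the 300.000 - 400.000 docs/hour range quoted above.
        System.out.printf("at 300k/hour: %.1f hours%n", hoursNeeded(docs, 300_000));
        System.out.printf("at 400k/hour: %.1f hours%n", hoursNeeded(docs, 400_000));
    }
}
```

So 10 million documents in 2 days corresponds to roughly 208.000 documents per hour, and the same repo at the best rate mentioned would take about 25 to 33 hours.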

Queries like =cm:name:"myFileName.txt" are DB-compatible and by default Alfresco is set to the query consistency TRANSACTIONAL_IF_POSSIBLE. Martin is correct that additional indices have to be created on the DB and unfortunately Alfresco by default does not do this unless you configure:

system.metadata-query-indexes.ignored=false
system.metadata-query-indexes-more.ignored=false

For more information, see the full session about transactional metadata queries that I did at BeeCon (slides).

mehe
Senior Member II

Re: searchService returns same nodeRef twice (duplicate index in solr)

Hi Axel,

Have you tried indexing a new core with a big repo and solr4? I had a very slow, nearly inaccessible system when trying this with the users online. Also the libreoffice conversion was a bottleneck (no jod converter on community).

So I decided to use a planned downtime...


Regards,

Martin


cesarista
Customer

Re: searchService returns same nodeRef twice (duplicate index in solr)

Hi Martin Ehe

In these cases, I reindex in parallel on a dedicated bare-bones SOLR machine (with as many resources as possible: CPU, SSD disks for a shorter reindex time) that also runs alfresco.war. The indexing then happens on that local machine and only loads the database, not the other Alfresco nodes and the service they provide. When the indices are ready, I copy them to the original SOLR machine(s), using the dedicated machine as the replacement in the SOLR balancer (if any). So downtime is not strictly necessary, but the reindex may take a long time depending on your CPU and disk resources.

It is not exactly auto-cleaning, but reindexing always yields a healthier index, without the "deleted files" that may degrade your searches and inflate the index size.

Regards.

--C.

mehe
Senior Member II

Re: searchService returns same nodeRef twice (duplicate index in solr)

Hi Cesar Capillas ,

Thank you for this recommendation! So far I have only been able to reindex Enterprise systems without downtime, using a spare Solr/alfresco.war node.

I didn't dare to spin up a second node in the community version. 

So the proposed setup would be:

- temporary server with alfresco.war (from prod system, in case there are models applied) and Solr

- connect the temporary server to prod DB and prod filesystem

- reindex on temp system

Do I have to take care of disabling some cleanup jobs or something like this?


cesarista
Customer

Re: searchService returns same nodeRef twice (duplicate index in solr)

Well, I did it for the Enterprise edition. I'm not sure whether it applies exactly to the Community edition.

--C.

mehe
Senior Member II

Re: searchService returns same nodeRef twice (duplicate index in solr)

ok, on Enterprise Systems I always use 2 solr nodes with alfresco in the cluster and don't have a downtime when reindexing one of the machines - conversion of documents to text is also better on enterprise versions, because they can scale the libreoffice conversion via jod converter.

The original question was in context of alfresco 5 community, which has no cluster option.

But the scenario could work with community, even if it's not cluster aware, because the index-tracker just asks the db about the transactions and reindexes metadata and reads content...  have to test that...  would make the clone unnecessary... hmmm...

Hi Axel Faust, have you ever tried something like what Cesar Capillas proposed for Enterprise on the Community edition? Or is there an easier way (besides the extra core)? It could be a poor man's Solr cluster :)

Is it possible to set the "second" Alfresco to read-only mode and still do a full Solr reindex?