Tips on troubleshooting individual file indexing with Alfresco 5.2 / Solr 6
I have an Alfresco Community (201707) installation that I am using to compare the default Solr 4 against Solr 6 from the alfresco-search-services-1.1.0 install.
After a full index with Solr 4, I get the following info from the solr4 admin page:
Num Docs: 163458
Max Docs: 163458
Deleted Docs: 0
Master (Searching): Version 1524504594659, Gen 159, Size 6.5 GB
Nodes in Index: 70921
Transactions in Index: 80844
Approx transactions remaining: 0
Unindexed Nodes: 11441
Error Nodes in Index: 0
In the Solr 4 SUMMARY report, I can see that it's done:
Node count with FTSStatus Clean: 69165
Node count with FTSStatus Dirty: 0
Node count with FTSStatus New: 0
When I test the Solr 6 setup, I stop the Alfresco app, make the changes to the Alfresco install for Solr 6, start the Solr server and the Alfresco server, and let it reindex. It plugs along for a few hours, and then completes with the following stats:
Deleted Docs: 0
Master (Searching): Version 1524581958240, Gen 586, Size 2.48 GB
And in the SUMMARY report:
Alfresco Nodes in Index: 70937
Alfresco Transactions in Index: 81470
Alfresco Unindexed Nodes: 11698
Alfresco Error Nodes in Index: 0
Node count with FTSStatus Clean: 69181
Node count with FTSStatus Dirty: 0
Node count with FTSStatus New: 0
So the indexer looks done and comparable volume-wise to the solr4 setup.
What first concerned me was the significantly smaller index size after a complete reindex: 6.5 GB for Solr 4 vs. 2.5 GB for Solr 6, when I was expecting a 15% size increase with the introduction of fingerprints.
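To put numbers on that gap, here is a rough sketch of the arithmetic using only the figures reported above. The 15% growth is my own expectation, not a documented value, and the variable names are just for illustration.

```python
# Figures copied from the Solr 4 and Solr 6 admin pages / SUMMARY reports above.
solr4 = {"size_gb": 6.5, "nodes": 70921, "unindexed": 11441, "fts_clean": 69165}
solr6 = {"size_gb": 2.48, "nodes": 70937, "unindexed": 11698, "fts_clean": 69181}

# Assumption: fingerprints add roughly 15% to the index size.
expected_growth = 0.15
expected_solr6_gb = solr4["size_gb"] * (1 + expected_growth)

# Actual Solr 6 size as a fraction of the Solr 4 size.
actual_ratio = solr6["size_gb"] / solr4["size_gb"]

# Differences in the node counters (Solr 6 minus Solr 4).
deltas = {key: solr6[key] - solr4[key] for key in ("nodes", "unindexed", "fts_clean")}

print(f"expected about {expected_solr6_gb:.2f} GB, got {solr6['size_gb']} GB")
print(f"Solr 6 index is {actual_ratio:.0%} of the Solr 4 size")
print(f"counter deltas: {deltas}")
```

So the node counters differ by only a handful, while the index is well under half the size I expected.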
There are some docs that I can't get in a full text search result set, even though the docs have the index aspect attached. I can try to reindex one of those docs, but no luck. I do see errors like these in the logs:
"FlateFilter: stop reading corrupt stream due to a DataFormatException"
"An error occured when reading table hmtx"
But no more than I saw on the Solr 4 setup.
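For reference, this is roughly how I am checking and reindexing a single node through the Solr admin API. It is a minimal sketch: the host, port, core name ("alfresco") and the node DB id 12345 are placeholders I made up, and it only builds the URLs for the NODEREPORT and REINDEX actions rather than calling them.

```python
from urllib.parse import urlencode

# Assumption: Solr 6 from alfresco-search-services listening on localhost:8983.
SOLR_ADMIN = "http://localhost:8983/solr/admin/cores"


def admin_url(action, node_db_id, core="alfresco", base=SOLR_ADMIN):
    """Build an admin URL for one node; nodeid is the node's numeric DB id."""
    params = {"action": action, "nodeid": node_db_id, "core": core}
    return f"{base}?{urlencode(params)}"


# Hypothetical node id, used only for illustration.
node_db_id = 12345

report_url = admin_url("NODEREPORT", node_db_id)   # what Solr knows about the node
reindex_url = admin_url("REINDEX", node_db_id)     # queue the node for reindexing

print(report_url)
print(reindex_url)
```

I open the NODEREPORT URL before and after the REINDEX call to see whether the node's status in the index changes.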
Any thoughts on how best to troubleshoot the inconsistencies?
Also, I know I can't upgrade to pdfbox 2.0.X in 5.2, but has anyone been able to replace the pdfbox-1.8.10.jar and pdfbox-1.8.10.jar with pdfbox-1.8.13.jar and pdfbox-1.8.13.jar to get over the pdfbox problems?