Lucene Extensions and Issues


{{Obsolete}}

Alfresco replaced Lucene with Solr in 4.0, and Lucene was removed in 5.0.b.

This page accumulates the outstanding issues we have with Lucene, our extensions to Lucene, and our workarounds and solutions.

We are currently based on the 1.4.3 release of Lucene, and all our modifications are backwards compatible with this release. The modified source zip is available in the distribution so you can see these changes. Because we rely on one API extension, we cannot use the stock 1.4.3 release; the issues below still apply.

NOTE: This is somewhat dated information. The version of Lucene used in Alfresco CE 3.3 and 3.4 is Lucene 2.4.1. The information below may or may not represent current problems.


Lucene's use of thread locals


Stress testing showed the memory footprint of the repository slowly increasing over time, eventually producing an out-of-memory error. The only objects that were accumulating were Lucene thread locals; removing them fixed the issue.

Lucene uses thread locals in a number of places, and objects held in thread locals are garbage collected in an ad hoc way. Lucene is essentially caching information; it should be held in thread locals as SoftReferences so that unused cached information can be garbage collected when required.

Other code also uses thread locals, and the interaction between Lucene and these other thread-local users seems to be the issue. Lucene alone can be shown to hold memory, but does not produce the monotonic increase in memory use. I have yet to reproduce the issue in a simple test.

It could be argued that we create too many IndexReaders when they could be reused; we have a task to do this. It would mitigate the issue but not resolve it.

See LUCENE-529 and the discussion related to this in the Lucene developer mailing list.

We have fixed this by clearing the thread local on close.
We will also fix it to use soft references; I have been testing this change before adding it.
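
As an illustration, here is a minimal sketch (our own, not the actual Alfresco patch) of the two fixes described above: caching per-thread state behind a SoftReference so that the JVM can reclaim it under memory pressure, and clearing the thread local on close. The class and method names are hypothetical.

    import java.lang.ref.SoftReference;
    import java.util.function.Supplier;

    // Hypothetical helper: a per-thread cache whose values the GC may reclaim.
    public class SoftThreadLocalCache<T> {
        private final ThreadLocal<SoftReference<T>> cache =
                new ThreadLocal<SoftReference<T>>();

        public T get(Supplier<T> loader) {
            SoftReference<T> ref = cache.get();
            T value = (ref == null) ? null : ref.get();
            if (value == null) {                 // never cached, or reclaimed by GC
                value = loader.get();
                cache.set(new SoftReference<T>(value));
            }
            return value;
        }

        public void close() {
            cache.remove();                      // the "clear on close" fix
        }
    }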




Standard tokenisation of numeric and date fields


We submitted our approach for this as a contribution, and were referred to how it is done in the Solr spin-off. We could use that scheme, but it would require a patch to update the tokens we hold in the index.

See LUCENE-530



Here is how Solr did it:
http://svn.apache.org/viewcvs.cgi/incubator/solr/trunk/src/java/org/apache/solr/util/NumberUtils.java?rev=382610&view=markup

It is a binary representation transformed so that it sorts correctly and fits into Java chars:
a 4-byte int or float is transformed into 3 Java chars;
an 8-byte long or double is transformed into 5 Java chars.
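
As a rough sketch of the idea (in the spirit of Solr's NumberUtils, but not the exact code): flip the sign bit so that negative values order before positive ones under the unsigned, char-by-char comparison performed by String.compareTo(), then pack the 32 bits big-endian into three chars.

    // Sketch only: 8 + 12 + 12 bits across three chars; the real Solr
    // encoding may pack the bits differently.
    public class SortableNumber {
        public static String intToSortableStr(int val) {
            int v = val ^ 0x80000000;               // flip sign bit
            char[] arr = new char[3];
            arr[0] = (char) (v >>> 24);             // top 8 bits
            arr[1] = (char) ((v >>> 12) & 0x0fff);  // middle 12 bits
            arr[2] = (char) (v & 0x0fff);           // low 12 bits
            return new String(arr);
        }

        public static int sortableStrToInt(String s) {
            int v = (s.charAt(0) << 24) | (s.charAt(1) << 12) | s.charAt(2);
            return v ^ 0x80000000;                  // restore sign bit
        }
    }

With this encoding, intToSortableStr(-5).compareTo(intToSortableStr(3)) is negative, so string order matches numeric order and range queries over such terms behave correctly.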




Index corruption


If the JVM crashes or is terminated while a segment is being written, the partial new segment file is left around. When the index next writes a new segment it reuses this existing file, which leads to a corrupted segment. A clean/empty file should be used each time.

See LUCENE-415

We have fixed this by zeroing the file length of new segment files.

Alternatively, Lucene could check whether the segment file already exists.
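
A minimal sketch of the workaround (assumed detail, not the actual Alfresco patch): open each new segment file in read-write mode and truncate it, so any partial content left by a crashed JVM is discarded rather than built upon.

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class SegmentFiles {
        static RandomAccessFile createSegmentFile(File dir, String name)
                throws IOException {
            RandomAccessFile raf = new RandomAccessFile(new File(dir, name), "rw");
            raf.setLength(0);   // zero out any stale content from a previous crash
            return raf;
        }
    }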




Stale file handles under Windows


There was an issue when creating new segment files: we found stale file handles when writing to a file we had just created. This does not happen if we immediately open a channel to the file.

See LUCENE-415

Fixed by creating a channel immediately.
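
A sketch of the fix as we understand it (details assumed): obtain a FileChannel as soon as the file is created, rather than writing through a stream opened later.

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public class ChannelWrite {
        static void writeThroughChannel(File f, byte[] data) throws IOException {
            FileOutputStream out = new FileOutputStream(f);   // creates the file
            FileChannel channel = out.getChannel();           // open channel immediately
            try {
                channel.write(ByteBuffer.wrap(data));
            } finally {
                channel.close();
                out.close();
            }
        }
    }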




Lock file may not be deleted


The Lock implementation uses File.delete() and does not check that the lock file was actually removed. This can leave a stale lock file around.

NOT YET RAISED

We check that the file has been deleted and retry; if it has not been deleted within a reasonable time we throw an error, so that we at least know a lock file has been left behind.
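
A minimal sketch of the delete-and-retry workaround; the retry count and sleep interval are illustrative, not Alfresco's actual values.

    import java.io.File;
    import java.io.IOException;

    public class LockFiles {
        static void deleteLockFile(File lock) throws IOException {
            for (int i = 0; i < 10; i++) {
                if (lock.delete() || !lock.exists()) {
                    return;                        // deletion confirmed
                }
                try {
                    Thread.sleep(100);             // wait before retrying
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            throw new IOException("Failed to delete lock file: " + lock);
        }
    }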




Lock files are not appropriate for IPC signalling


See the Javadoc for File.createNewFile(), which warns against relying on it for locking/signalling between processes.

This mechanism is not reliable. We reuse lock objects with synchronisation,
which means in-JVM locks are safe; inter-process locks are not, and would also
require a channel-level lock. WE DO NOT SUPPORT INTER-PROCESS LOCKS. All clients must implement the same lock mechanism.



Others have raised this or related issues and seem to have been ignored.
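
A sketch of what reusing a lock object with synchronisation could look like (our assumption, not the actual Alfresco code): every writer in the JVM synchronises on one shared object, mirroring Lucene's Lock.obtain()/release() semantics, so in-JVM exclusion does not depend on lock files at all. As stated above, this gives no protection across JVMs.

    // Hypothetical in-JVM lock; offers no inter-process guarantees.
    public final class InJvmIndexLock {
        private static final Object LOCK = new Object();
        private static boolean held = false;

        public static boolean obtain() {
            synchronized (LOCK) {
                if (held) {
                    return false;   // someone in this JVM already holds the lock
                }
                held = true;
                return true;
            }
        }

        public static void release() {
            synchronized (LOCK) {
                held = false;
            }
        }
    }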




Lock mechanism could reuse objects


Many lock objects are created. We avoid this by registering lock objects with the single instance of FSDirectory and using them for lock synchronisation.

NOT YET RAISED
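
A sketch of the reuse scheme (hypothetical names): a single registry, such as the one FSDirectory instance, hands out one canonical lock object per lock name, so every caller synchronises on the same instance.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical registry: one lock object per lock name, created once.
    public class LockRegistry {
        private final Map<String, Object> locks = new HashMap<String, Object>();

        public synchronized Object getLock(String name) {
            Object lock = locks.get(name);
            if (lock == null) {
                lock = new Object();
                locks.put(name, lock);   // register once, reuse thereafter
            }
            return lock;
        }
    }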




IndexReader.indexExists() can report incorrect results


This tests for the existence of the segments file. During an index commit this file may briefly not exist while the new file is copied into place; the operation does not appear to be atomic. The index is then reported as not existing and we try to create a new one, which produces all sorts of errors because the file exists and cannot be deleted. It is possible (and has happened once) that we would delete the index as a result. This arises from the semantics of IndexWriter and creating indexes.

Lucene should signal index existence with a simple marker file that never changes.

Seen once using FTP to upload many files.

NOT YET RAISED
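
A sketch of the suggested marker-file approach (hypothetical, not a Lucene API): create a marker file once when the index is created and never touch it again, so existence checks can never race with segment commits.

    import java.io.File;
    import java.io.IOException;

    public class IndexMarker {
        private static final String MARKER = "index.exists";   // illustrative name

        public static void markCreated(File indexDir) throws IOException {
            new File(indexDir, MARKER).createNewFile();   // written once, never changed
        }

        public static boolean indexExists(File indexDir) {
            return new File(indexDir, MARKER).exists();   // never races a commit
        }
    }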


Combining indexes is expensive


The target index is optimised before and after the standard merge, which is very expensive. We want to be able to control index optimisation, and would like an algorithm similar to adding documents to the index. Ideally this would strike a compromise between read speed and the speed of merging an index.
(This is how we do transactional updates to the index.)



Others have raised this and been ignored.



We have extended the IndexWriter to merge indexes in a way more suitable for us.

This is the only issue that stops us from being backwards compatible with 1.4.3 (unless we test for the extended method being supported using reflection, as sketched below).
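
A sketch of the reflection probe mentioned above (the method name is illustrative only): test whether the extended merge method exists, and fall back to the stock behaviour when running against an unmodified 1.4.3 jar.

    import java.lang.reflect.Method;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;

    public class MergeSupport {
        static boolean supportsExtendedMerge() {
            try {
                // "mergeIndexesNoOptimise" is a hypothetical name for our extension
                Method m = IndexWriter.class.getMethod(
                        "mergeIndexesNoOptimise", new Class[] { Directory[].class });
                return m != null;
            } catch (NoSuchMethodException e) {
                return false;   // stock 1.4.3: use the standard addIndexes() path
            }
        }
    }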
