lucene-java-user mailing list archives

From kiwi clive <>
Subject How do I write in 3.x format to an upgraded index using Lucene 4.10
Date Wed, 01 Feb 2017 01:38:57 GMT
Hi Guys
We have several hundred thousand indexes that have been written in Lucene 3.x format. I am
trying to migrate to Lucene 4.10 without the need to reindex, and the process should be transparent
to our customers. Reindexing all our legacy data is not an option.

The predominant analyzer we currently use is ClassicAnalyzer, as we needed to be backwards compatible
with the old StandardAnalyzer from pre-Lucene 3.x days.

Our latest application uses the 4.10 Lucene jars and we knobble it to use Lucene 3.x format.
When we create IndexWriters, we are doing this:

IndexWriterConfig idxCfg = new IndexWriterConfig(Version.LUCENE_3_6, new ClassicAnalyzer());

New indexes could be written in Lucene 4.10 format and we aim to apply newer analyzers to
these new indexes. So all new index reading and writing should be fine. We need to query lucene
4.10 indexes with lucene 4.10 analyzers and our architecture is such that we can query lucene
3.x indexes with lucene 3.x analyzers (using Lucene Versions etc).

However, there is a difference between how Lucene 3.x and Lucene 4.10 write indexes which
breaks phrase queries. 

Lucene 3.x
Lucene 3.x seems to write tokens to the index adjacent to each other (and I assume the positions
are stored elsewhere). This means if we index "Thanks for coming", it gets indexed as:
"thanks", "coming" after stop-word removal.
If we use a phrase query and pass it to QueryParser as:
content:"Thanks for coming"
queryParser will (using lowercase and stopword removal via ClassicAnalyzer) apply the phrase
query content:"thanks coming" and find the document correctly.

Lucene 4.10
My understanding is that Lucene 4.10 keeps the position increments in the index
as placeholders in the data. I believe this is due to a change in how StopFilter works. So
if we index our "Thanks for coming" data in Lucene 4.10, it appears to be stored as:
"thanks", <pim>, "coming"
Where <pim> is some kind of position increment marker (excuse my ignorance, I don't
know the low level details).
Now if we send the query through QueryParser, after lowercasing and stopword removal it
is the same as before: content:"thanks coming".
This query fails because there is a <pim> between the two terms. 
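
To make the two behaviours described above concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class StopPositionsSketch, its tiny stop-word list and the keepGaps flag are all hypothetical names for illustration, and it only assumes that a removed stop word either leaves a hole in the position sequence (the 4.10 description) or does not (the 3.x description).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of stop-word removal and token positions; not Lucene code. */
class StopPositionsSketch {
    // Hypothetical stop set for illustration only; a real analyzer's list is longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("for", "the", "a", "of"));

    /** One surviving token: the term text and its position in the field. */
    static final class Token {
        final String term;
        final int position;
        Token(String term, int position) { this.term = term; this.position = position; }
        @Override public String toString() { return term + "@" + position; }
    }

    /**
     * Lowercases, splits on whitespace and drops stop words.
     * keepGaps=true leaves a hole where each stop word was removed;
     * keepGaps=false packs the surviving terms into consecutive positions.
     */
    static List<Token> analyze(String text, boolean keepGaps) {
        List<Token> out = new ArrayList<>();
        int position = -1;   // position of the last emitted token
        int pendingGap = 0;  // stop words skipped since the last emitted token
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (STOP_WORDS.contains(word)) {
                pendingGap++;
                continue;
            }
            position += 1 + (keepGaps ? pendingGap : 0);
            pendingGap = 0;
            out.add(new Token(word, position));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Thanks for coming", false)); // packed, as described for 3.x
        System.out.println(analyze("Thanks for coming", true));  // with a hole, as described for 4.10
    }
}
```

In this model the <pim> above is not a stored token at all, just a jump in the position numbers: "coming" sits at position 2 instead of 1.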

If we turn on position increments in QueryParser (setEnablePositionIncrements(true)), the
document is returned. A PhraseQuery with a slop of 1 also finds the document, but that is a
different query to that used on Lucene 3.x.

So, knowing which are 3.x indexes and which are 4.x indexes means we can toggle position increments
in QueryParser and our customers should be unaware of any changes while we start our migration.
Well, that was the plan!
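
The matching behaviour described above (exact phrase, query-side position increments, slop of 1) can be sketched with a simplified model. This is plain Java, not Lucene's scorer: PhraseMatchSketch and matches are hypothetical names, and the model assumes each phrase term occurs exactly once in the document.

```java
/** Simplified phrase matching on term positions; not Lucene's actual scorer. */
class PhraseMatchSketch {
    /**
     * docPos[i] and queryPos[i] are the positions of phrase term i in the
     * document and in the query. The phrase matches when all terms can be
     * lined up with a total displacement of at most slop.
     */
    static boolean matches(int[] docPos, int[] queryPos, int slop) {
        if (docPos.length != queryPos.length || docPos.length == 0) {
            throw new IllegalArgumentException("need one position per phrase term");
        }
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < docPos.length; i++) {
            int offset = docPos[i] - queryPos[i];
            min = Math.min(min, offset);
            max = Math.max(max, offset);
        }
        return max - min <= slop;
    }

    public static void main(String[] args) {
        int[] packedDoc = {0, 1};   // "thanks", "coming" as described for 3.x
        int[] gappedDoc = {0, 2};   // "thanks", <pim>, "coming" as described for 4.10
        int[] queryNoIncrements = {0, 1};
        int[] queryWithIncrements = {0, 2};

        System.out.println(matches(packedDoc, queryNoIncrements, 0));
        System.out.println(matches(gappedDoc, queryNoIncrements, 0));
        System.out.println(matches(gappedDoc, queryWithIncrements, 0));
        System.out.println(matches(gappedDoc, queryNoIncrements, 1));
    }
}
```

In this model a slop of 1 matches both layouts, but it would equally match a document where any other word, not just a stop word, sits between "thanks" and "coming", which is why it is not the same query.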

Lucene IndexUpgrader

If we take our old 3.x index and apply IndexUpgrader to it, we end up with a 4.10 index. There
are several lucene 4.x files created in the index directory and no errors are thrown. However,
it appears that the index data is still in the 3.x format, namely it remains:
"thanks", "coming"
and not:
"thanks", <pim>, "coming"
This means that although the newly upgraded index is in theory a 4.10 index, we still have
to use a 3.x QueryParser setting (position increments off) for phrase queries. Not the end
of the world, but if a new document, "Ivan the Terrible", is added, we end up with the index:
"thanks", "coming"
"ivan", <pim>, "terrible"
So now we have one record in 3.x format and one in 4.10 format and having a hybrid index means
we cannot meaningfully use phrase queries on it.
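
A small sketch of why the hybrid index is a problem, again as a toy model in plain Java rather than Lucene code. HybridIndexSketch, doc, exactPhrase and hits are hypothetical names, and the two documents are hard-coded with the positions described above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy positional index holding one packed and one gapped document; not Lucene code. */
class HybridIndexSketch {
    /** Builds a two-term document as a term -> position map. */
    static Map<String, Integer> doc(String t1, int p1, String t2, int p2) {
        Map<String, Integer> d = new LinkedHashMap<>();
        d.put(t1, p1);
        d.put(t2, p2);
        return d;
    }

    // Document 1 was written before the upgrade, document 2 after it.
    static final List<Map<String, Integer>> HYBRID = Arrays.asList(
            doc("thanks", 0, "coming", 1),
            doc("ivan", 0, "terrible", 2));

    /** Exact (slop 0) phrase check: relative positions must equal the query's. */
    static boolean exactPhrase(Map<String, Integer> d, String[] terms, int[] queryPos) {
        Integer first = d.get(terms[0]);
        if (first == null) {
            return false;
        }
        for (int i = 1; i < terms.length; i++) {
            Integer p = d.get(terms[i]);
            if (p == null || p - first != queryPos[i] - queryPos[0]) {
                return false;
            }
        }
        return true;
    }

    /** Number of documents in the hybrid index that match the phrase. */
    static int hits(String[] terms, int[] queryPos) {
        int count = 0;
        for (Map<String, Integer> d : HYBRID) {
            if (exactPhrase(d, terms, queryPos)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] incrementsOff = {0, 1};
        int[] incrementsOn = {0, 2};
        System.out.println(hits(new String[]{"thanks", "coming"}, incrementsOff));
        System.out.println(hits(new String[]{"thanks", "coming"}, incrementsOn));
        System.out.println(hits(new String[]{"ivan", "terrible"}, incrementsOff));
        System.out.println(hits(new String[]{"ivan", "terrible"}, incrementsOn));
    }
}
```

Whichever way the query-side toggle is set, one of the two documents is missed by its own phrase, so no single setting works for the whole index.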

The Problem
So we need a way to write documents in 3.x format (no <pim>) to our upgraded
indexes; new indexes can use the native 4.10 format.

I have tried turning off positions when writing indexes (DOCS_AND_FREQS only), but I can't see
how phrase queries can work without positional information, and Lucene complains about an illegal
state when querying such an index.

So, my cry for help here is: how do I write documents in 3.x format to a 3.x index upgraded
to a 4.10 index so that phrase queries work?
Any other suggestions welcome!
Thank you for taking the time to work through this rather lengthy explanation and please let
me know if I have not described the issue clearly.
If it's just "Duh, you need to just do this..", I'd be a happy man :-)
Many thanks,
