lucene-java-user mailing list archives

From Paul Allan Hill <p...@metajure.com>
Subject Upgrade Path Lucene 3.0.2 to 3.4
Date Wed, 16 Nov 2011 21:55:24 GMT
As it says in the title, we are moving from 3.0.2 to 3.4.  I am interested in whether we
need to build a new index or can just keep updating the current one.   My company has been
busy building software and has not upgraded the Lucene and Tika libraries since last year,
but I'm trying to remedy that as quickly as I can.   We have production indices with 5,000,000
to 1,000,000 English language documents.  These are business documents (the usual MS Word,
PDF ... ) with only the very occasional phrase in other character sets (for example, a Japanese
or Chinese company name inserted in an otherwise English document).

So here are my high-level questions when doing such an upgrade jump:

1.       Do we need to start from scratch and create a new index or can I re-crawl documents
into the existing index?
My impression is that, if we were using 2.x, the answer would definitely be that a rebuild
is required, but the answer doesn't jump out at me in the releases since then. I think the
answer is no, a rebuild is not required.

2.       If we don't HAVE TO RE-CREATE the index, are there advantages to doing so anyway?

a.       Should I be looking into eventually leveraging org.apache.lucene.index.IndexUpgrader
(see LUCENE-3082<http://issues.apache.org/jira/browse/LUCENE-3082>)?

In our application there is one Lucene "service" running in the system and it will be running
the latest code, so there are no issues with old code needing to access the index.

Because of the improvements over the last year in Tika, we will set our system to re-crawl
all documents, so I believe this eliminates the various issues involving tokenizing fixes.
We have tests which demonstrate that the new Lucene libraries, when used to index and then
search, return the same (or improved) results.  We also have tests which verify that Tika has
improved its ability to parse (three cheers to the Tika folks for parsing half the
previously failing PDFs and 40% of the old MS Word-95 docs).  Hats off to the folks involved
in both - great job on both bug fixes and the new features!

But my question is about (1) updating libraries while (2) using an existing index that will
have all documents (eventually) replaced. Given my scenario, what are my issues, if any?  I
attempt to answer my own question below, and I think the answer is that I don't need to create
a new clean index.
I would be interested in any feedback.

-Paul
p.s. If I had one suggestion, I would suggest that in the release note summary of a bug, it
would be better form to eliminate any shorthand acronyms (or just throw in a link to either
an appropriate description or even the JavaDoc).  Obviously, in the bug discussion there will
be all kinds of terse usage, but one liners in release notes are read by folks a little less
informed about some of the parts of Lucene.

*********** Detailed Review Follows *******

Reviewing the releases at http://lucene.apache.org/java/docs/index.html

The Java 7 JVM optimization bug has been fixed.  This is great; we were aware of this, so
never used Java 7.

The Unicode changes across JVMs referenced in the Java 7 and other JVM upgrade notes are interesting.
See for example the copy at:
https://github.com/apache/lucene-solr/blob/trunk/lucene/JRE_VERSION_MIGRATION.txt

In my case, we will be running the code under Java 7 while re-indexing, so I think all will
be properly upgraded.

Reviewing the 3.4 bugs, there seem to be only a few that relate to the files in the index on
disk:
LUCENE-3409<http://issues.apache.org/jira/browse/LUCENE-3409>: IndexWriter.deleteAll
was [....], leading to unused files accumulating in the Directory.
My Comment: Curiously the details for this bug describe a memory leak, not a problem with
files on disk, but anyway we aren't using Near Real-Time Readers (yet) and only use deleteAll
when testing in test indexes.

LUCENE-3358<http://issues.apache.org/jira/browse/LUCENE-3358>, LUCENE-3361<http://issues.apache.org/jira/browse/LUCENE-3361>:
StandardTokenizer and UAX29URLEmailTokenizer wrongly [...in ...] Han or Hiragana characters...
My Comment: This (if even relevant to us) would be fixed by re-indexing which we will be doing
anyway.
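
To gauge whether this one is even relevant, a document's extracted text can be scanned for the
scripts the bug mentions. The following is only a rough sketch using the JDK (Java 7 or later),
not anything from Lucene or Tika; the class name ScriptCheck and the method name are made up
for illustration, and it assumes the text has already been extracted to a String.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper (not part of Lucene or Tika): reports whether a piece
// of extracted text contains any Han or Hiragana code points.
final class ScriptCheck {

    static boolean containsHanOrHiragana(String text) {
        // Walk by code point rather than by char so that supplementary
        // characters (surrogate pairs) are classified correctly.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HAN || script == UnicodeScript.HIRAGANA) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // "Acme" followed by four Han characters (a Japanese company suffix).
        System.out.println(containsHanOrHiragana("Acme \u682A\u5F0F\u4F1A\u793E"));
        System.out.println(containsHanOrHiragana("Plain English text"));
    }
}
```

Which characters count as Han or Hiragana depends on the Unicode version the running JVM
supports, which is the same caveat the JRE migration notes raise.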

LUCENE-3368<http://issues.apache.org/jira/browse/LUCENE-3368> IndexWriter applies wrong
deletes during concurrent flush-all
My Comment: Only occurs when there are two writers, which we don't have.  I thought only one
writer was allowed, so I'm really not grokking this bug. Can anyone explain this one to me?

LUCENE-3365<http://issues.apache.org/jira/browse/LUCENE-3365>: ... can cause IndexWriter
overriding an existing index.
My Comment: I think we would have known about this one if it did occur in our system, but
it is now fixed.

LUCENE-3418<http://issues.apache.org/jira/browse/LUCENE-3418>: Lucene was failing to
fsync index files on commit, meaning an operating system or hardware crash, or power loss,
could easily corrupt the index.
My Comment:  This is the issue mentioned in the release announcement.  Luckily for us, even
though we've had production environments crash during a power outage, we didn't see this.
Reading the notes on this, it seems this was a hard fail that was obvious when it occurred.
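
For anyone less familiar with what "fsync on commit" means, here is a small sketch in plain
Java 7 NIO. It is not Lucene's code and says nothing about how Lucene implements its fix; the
class name FsyncSketch and the method writeDurably are hypothetical names chosen for the
example. It only shows the general idea of asking the operating system to flush a file to the
storage device before treating the write as complete.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration of a durable write; not taken from Lucene.
final class FsyncSketch {

    static void writeDurably(Path file, byte[] data) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buffer = ByteBuffer.wrap(data);
            // A single write() call may write fewer bytes than requested.
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            // Ask the OS to push file content and metadata to the device.
            // Without this, the data may sit only in OS caches and can be
            // lost on a crash or power failure.
            channel.force(true);
        }
    }
}
```

A caller would invoke FsyncSketch.writeDurably(path, bytes) for each file that must survive a
crash before recording that the commit succeeded.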

Reviewing the 3.3 release:
There appear to be no bugs which affected the files on disk that are not fixed by re-indexing.
Reviewing the 3.2.0 release:
LUCENE-3065<http://issues.apache.org/jira/browse/LUCENE-3065>:  In API changes it says,
Document.getField() was deprecated. In changes in runtime behavior it says "... Document.getFieldable()
returns NumericField instances".
My Comment:  We have more than one numeric field in our index and so have moved to using
Document.getFieldable(), so we're doing this the right way.

Reviewing 3.1.0 release:
There appear to be no bugs which affected the files on disk that are not fixed by re-indexing
documents (for example LUCENE-2911<http://issues.apache.org/jira/browse/LUCENE-2911>).
Reviewing 3.0.3 release:
There appear to be no bugs which affected the files on disk that are not fixed by re-indexing
documents.

That doesn't seem bad at all!
Comments?









