lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: MultiFields#getTerms docs clarification
Date Wed, 31 Aug 2016 06:15:47 GMT
Hi,

If you have an untokenized StringField and index the "empty token", it will appear in the index.
If you are reindexing by hand (parsing the stored fields of your 3.x index), I'd suggest
adding a length==0 check before adding the field.
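
The length==0 check could look roughly like the sketch below. It is only an illustration under stated assumptions: the stored fields are modelled as a plain Map from field name to values rather than as Lucene Document objects, and the names EmptyValueFilter and dropEmptyValues are made up for this example. In real reindexing code the same test would sit directly in front of the call that adds the StringField to the new document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: a plain-Java stand-in for the stored fields of one document.
final class EmptyValueFilter {

    /**
     * Returns a copy of the stored field values with null and zero-length
     * values dropped. Fields left with no values are omitted entirely.
     */
    static Map<String, List<String>> dropEmptyValues(Map<String, List<String>> storedFields) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : storedFields.entrySet()) {
            List<String> values = new ArrayList<>();
            for (String value : entry.getValue()) {
                // The length==0 check: skip values that would otherwise be
                // indexed as a single empty token.
                if (value != null && value.length() > 0) {
                    values.add(value);
                }
            }
            if (!values.isEmpty()) {
                kept.put(entry.getKey(), values);
            }
        }
        return kept;
    }
}
```

Only the fields that survive this filter would then be added to the new document, so a field whose only stored value is the empty string never reaches the index.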

With IndexUpgrader you cannot easily get rid of the field, unless you use a FilterAtomicReader
that removes empty tokens, together with IndexWriter.addIndexes(), to rebuild your index.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Trejkaz [mailto:trejkaz@trypticon.org]
> Sent: Wednesday, August 31, 2016 6:33 AM
> To: Lucene Users Mailing List <java-user@lucene.apache.org>
> Subject: Re: MultiFields#getTerms docs clarification
> 
> On Mon, Aug 29, 2016 at 8:23 PM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
> > Seems like you need to scrutinize exactly what documents were indexed in
> > step 3?
> >
> > How exactly did you copy documents out of the old index?  Note that
> > when Lucene's IndexReader returns a Document, it's not the same
> > Document that was indexed in the first place: it will only have fields
> > that were stored, and it does not store certain metadata about how
> > those field values were indexed.  But I don't see how that alone can
> > lead to indexing an empty string token.
> 
> The root cause is that, apparently, in some older version, we *did*
> index an empty field, which at some point later had already been fixed
> by someone else. I verified that this empty field was in fact present
> in the stored fields for the document before the index was migrated to
> Lucene 5.
> 
> So the only obvious difference then is between Lucene 3 indexing no
> tokens for this field, and Lucene 5 indexing a single empty token?
> 
> I have ended up putting in a migration to delete the spurious empty
> term in the postings as well as deleting the empty field from all the
> documents where it's present.
> 
> TX
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



