From: "Uwe Schindler" <uwe@thetaphi.de>
To: <java-user@lucene.apache.org>
Subject: RE: MultiFields#getTerms docs clarification
Date: Wed, 31 Aug 2016 08:15:47 +0200
archived-at: Wed, 31 Aug 2016 06:16:08 -0000

Hi,

If you have an untokenized StringField and index an empty string, the
resulting "empty token" will appear in the index. If you are reindexing
by hand (parsing the stored fields of your 3.x index), I'd suggest
adding a length == 0 check before adding the field.
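
A minimal sketch of such a check, in plain Java with no Lucene
dependency. It assumes the stored fields have already been read into a
map of field name to values; the class and method names
(EmptyFieldGuard, dropEmptyValues) are hypothetical and only illustrate
the idea. In a real migration, each surviving value would then be added
to the new document as a StringField.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: filters stored field values
// before they are re-added to a new document.
final class EmptyFieldGuard {

    // Returns a copy of the stored fields with null and zero-length values
    // removed. A field whose values are all empty is dropped altogether.
    static Map<String, List<String>> dropEmptyValues(Map<String, List<String>> storedFields) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : storedFields.entrySet()) {
            List<String> values = new ArrayList<>();
            for (String value : entry.getValue()) {
                if (value != null && value.length() > 0) { // the length == 0 check
                    values.add(value);
                }
            }
            if (!values.isEmpty()) {
                kept.put(entry.getKey(), values);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, List<String>> stored = new LinkedHashMap<>();
        stored.put("id", Arrays.asList("42"));
        stored.put("tag", Arrays.asList("", "lucene"));
        stored.put("note", Arrays.asList(""));
        // Only "id" and the non-empty "tag" value survive.
        System.out.println(dropEmptyValues(stored));
    }
}
```

Filtering before the field is added keeps the empty value out of both
the postings and the stored fields of the new index, so no later cleanup
step is needed.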

IndexUpgrader alone cannot easily get rid of the field. To do that you
would need a FilterAtomicReader (FilterLeafReader since Lucene 5.0) that
removes empty tokens, combined with IndexWriter.addIndexes() to rebuild
your index.
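
The filtering idea can be illustrated without any Lucene classes. The
sketch below is a plain-Java stand-in, not Lucene API: a wrapper
iterator that hides empty terms from whatever consumes it, which is the
role the filter reader would play for each field's terms while
addIndexes() copies them into the new index. The names
NonEmptyTermIterator and rebuild are hypothetical, and the actual Lucene
wiring is left out on purpose because it differs between versions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Plain-Java stand-in for a filtering terms enumeration; not a Lucene class.
final class NonEmptyTermIterator implements Iterator<String> {
    private final Iterator<String> in; // the wrapped (unfiltered) term source
    private String next;               // next non-empty term, or null at the end

    NonEmptyTermIterator(Iterator<String> in) {
        this.in = in;
        advance();
    }

    // Move to the next underlying term that is not empty.
    // Assumes the underlying source never yields null.
    private void advance() {
        next = null;
        while (in.hasNext()) {
            String candidate = in.next();
            if (!candidate.isEmpty()) {
                next = candidate;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String result = next;
        advance();
        return result;
    }

    // Drain the filtered view into a new list, the way a rebuild would copy
    // only the visible terms into the new index.
    static List<String> rebuild(List<String> sortedTerms) {
        List<String> out = new ArrayList<>();
        new NonEmptyTermIterator(sortedTerms.iterator()).forEachRemaining(out::add);
        return out;
    }
}
```

Note that this only models the terms side. The empty value would still
be present in the stored fields unless those are filtered as well.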

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Trejkaz [mailto:trejkaz@trypticon.org]
> Sent: Wednesday, August 31, 2016 6:33 AM
> To: Lucene Users Mailing List <java-user@lucene.apache.org>
> Subject: Re: MultiFields#getTerms docs clarification
>
> On Mon, Aug 29, 2016 at 8:23 PM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
> > Seems like you need to scrutinize exactly what documents were indexed in
> > step 3?
> >
> > How exactly did you copy documents out of the old index?  Note that
> > when Lucene's IndexReader returns a Document, it's not the same
> > Document that was indexed in the first place: it will only have fields
> > that were stored, and it does not store certain metadata about how
> > those field values were indexed.  But I don't see how that alone can
> > lead to indexing an empty string token.
>
> The root cause is that, apparently, in some older version, we *did*
> index an empty field, which at some point later had already been fixed
> by someone else. I verified that this empty field was in fact present
> in the stored fields for the document before the index was migrated to
> Lucene 5.
>
> So the only obvious difference then is between Lucene 3 indexing no
> tokens for this field, and Lucene 5 indexing a single empty token?
>
> I have ended up putting in a migration to delete the spurious empty
> term in the postings as well as deleting the empty field from all the
> documents where it's present.
>
> TX
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
