From: "Uwe Schindler"
Subject: RE: MultiFields#getTerms docs clarification
Date: Wed, 31 Aug 2016 08:15:47 +0200

Hi,

if you have an untokenized StringField and index the "empty token", it will appear in the index. If you are reindexing by hand (parsing the stored fields of your 3.x index), I'd suggest adding a length==0 check before adding the field.

With IndexUpgrader you cannot easily get rid of the field, unless you use a FilterAtomicReader that removes empty tokens and IndexWriter.addIndexes() to rebuild your index.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Trejkaz [mailto:trejkaz@trypticon.org]
> Sent: Wednesday, August 31, 2016 6:33 AM
> To: Lucene Users Mailing List
> Subject: Re: MultiFields#getTerms docs clarification
>
> On Mon, Aug 29, 2016 at 8:23 PM, Michael McCandless wrote:
> > Seems like you need to scrutinize exactly what documents were indexed in
> > step 3?
> >
> > How exactly did you copy documents out of the old index?
> > Note that when Lucene's IndexReader returns a Document, it's not the same
> > Document that was indexed in the first place: it will only have fields
> > that were stored, and it does not store certain metadata about how
> > those field values were indexed. But I don't see how that alone can
> > lead to indexing an empty string token.
>
> The root cause is that, apparently, in some older version, we *did*
> index an empty field, which at some point later had already been fixed
> by someone else. I verified that this empty field was in fact present
> in the stored fields for the document before the index was migrated to
> Lucene 5.
>
> So the only obvious difference then is between Lucene 3 indexing no
> tokens for this field, and Lucene 5 indexing a single empty token?
>
> I have ended up putting in a migration to delete the spurious empty
> term in the postings as well as deleting the empty field from all the
> documents where it's present.
>
> TX
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
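
For the reindex-by-hand route mentioned in the reply, the length==0 check could look roughly like the sketch below. It is only an illustration under stated assumptions: the class EmptyFieldFilter, its methods, and the Map-of-lists stand-in for a document's stored fields are all hypothetical and not part of Lucene. The comment marks where a real migration would presumably add the StringField to the new document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: filters stored values before re-indexing.
final class EmptyFieldFilter {

    // The length==0 check: only non-null, non-empty values should be indexed.
    static boolean shouldIndex(String value) {
        return value != null && !value.isEmpty();
    }

    // Returns a copy of the stored fields without empty values.
    // A field whose values are all empty is dropped entirely.
    static Map<String, List<String>> dropEmptyValues(Map<String, List<String>> storedFields) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : storedFields.entrySet()) {
            List<String> values = new ArrayList<>();
            for (String value : entry.getValue()) {
                if (shouldIndex(value)) {
                    // Assumption: in real reindexing code the field would be added here,
                    // e.g. doc.add(new StringField(name, value, Field.Store.YES)).
                    values.add(value);
                }
            }
            if (!values.isEmpty()) {
                kept.put(entry.getKey(), values);
            }
        }
        return kept;
    }
}
```

Running each old document's stored fields through dropEmptyValues before building the new document means an empty value never reaches the untokenized field, so no empty term ends up in the postings.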