Subject: Version 2.3 Does Not Index/Digest All Document Tokens
Date: Fri, 16 May 2008 11:37:11 -0400
From: "Dan Rugg"
To: java-user@lucene.apache.org

After upgrading to version 2.3.x from 2.2.0, we started experiencing issues with our index searches. Some searches produced false positives, while others produced no hits for terms known to be in specific documents that were digested. After setting up tests that created indexes containing single documents, we found that version 2.3.x did not add all the tokens from a document to the index, while 2.2.0 did. The only thing that changed between the tests was the Lucene jar being used, and a fresh index was created for each test.

Whatever 2.3.x is doing, or not doing, seems to be random. A token such as 'traffic' will not be digested in one document but will be in another. Token frequency, order, and relative position do not seem to matter, as indexed and non-indexed tokens were across the board. The documents being ingested were XML, and the tokenizer for the documents was the same for 2.2.0 and 2.3.x. We even did a token dump of the documents and verified that the documents were being tokenized correctly.

I did notice that rebuilding the index was quicker with 2.3.x and that the index was smaller, but I guess if you aren't adding tokens to the index, it is bound to be smaller. BTW, we tested versions 2.3.1, 2.3.2, and 2.2.0. We are now back to using 2.2.0.

Daniel Rugg
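
For anyone trying to reproduce the single-document test described above, the comparison between the token dump and the index contents might look roughly like the sketch below. It is only an illustration: TokenDiff and missingTokens are hypothetical names, not Lucene classes, the sample tokens are made up, and the code assumes the dumped tokens and the indexed terms have already been collected into plain string collections (in Lucene 2.x the terms could be gathered by walking the TermEnum returned by IndexReader.terms()).

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of Lucene: diffs a token dump against the terms found in an index. */
final class TokenDiff {

    /** Returns the dumped tokens that are absent from the indexed terms, in first-seen order. */
    static Set<String> missingTokens(Collection<String> dumpedTokens, Collection<String> indexedTerms) {
        // Copy into a set so repeated tokens are reported only once.
        Set<String> missing = new LinkedHashSet<>(dumpedTokens);
        missing.removeAll(indexedTerms);
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample data; in a real test both lists would come from the same single document.
        List<String> dumped = List.of("heavy", "traffic", "on", "the", "bridge", "traffic");
        List<String> indexed = List.of("heavy", "on", "the", "bridge");
        System.out.println("Missing from index: " + missingTokens(dumped, indexed));
    }
}
```

Running the same comparison against an index built with the 2.2.0 jar and one built with a 2.3.x jar, from the same document and tokenizer, would show directly which tokens each version left out.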