Date: Wed, 14 Nov 2012 21:16:12 +0000 (UTC)
From: "Tim Smith (JIRA)"
To: dev@lucene.apache.org
Message-ID: <273154815.115908.1352927772226.JavaMail.jiratomcat@arcas>
In-Reply-To: <480354093.113940.1352903892214.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge

    [ https://issues.apache.org/jira/browse/LUCENE-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497459#comment-13497459 ]

Tim Smith commented on LUCENE-4557:
-----------------------------------

getSegmentForMerge could of course take an AtomicReader to support addIndexes as well.

CheckIndex validates indexed positions/offsets against term vectors? Isn't this really slow?

Also, if term vectors were indexed with offsets, the positions did not have offsets, and offsets are being added to positions as part of the merge, I could easily have my MergeReader enhance the indexed positions with offsets from the term vectors. Of course this would be a slower merge, but it would then have 100% the right data and not result in the corruption you allude to. This would then make term vectors consistent and suitable for bulk merge.

(Right now I don't have a use case that would have offsets indexed for both term vectors and positions (it'd be one or the other), but it's helpful you pointed this issue out so I can make sure it is handled properly in the future.)

How about I look at working on a patch going down the pluggable segment data merging route, and we can iterate from there?

> Indexed Offsets Can Be Lost During Merge
> ----------------------------------------
>
>                 Key: LUCENE-4557
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4557
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.0
>            Reporter: Tim Smith
>         Attachments: OffsetsTest.java
>
>
> Primary Use case:
> Start with pre-4.0 index (no indexed offsets available)
> Start indexing new documents with indexed offsets (IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS, previously was IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
> merge/optimize index
> newly indexed documents will now no longer have offsets available
> In general, it is impossible to ever change a field to have offsets indexed when starting with an existing index as a merge will cause offsets to be removed from the index.
> Desirable behavior would be for new documents to have offsets indexed properly, and old documents would have offset of "0, 0" for all positions after merging with a segment that contains offsets
> Current behavior can be very dangerous.
> for example:
> * Start indexing documents with indexed offsets
> * change config to not index offsets by accident
> * index 1 document
> * revert config back
> * offsets will start disappearing from documents as segments are merged
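
The difference between the reported merge behavior and the desired one can be modelled with a small, self-contained Java sketch. This is not Lucene code: OffsetMergeSketch, Options, Posting, Segment, mergeDowngrading and mergeUpgrading are hypothetical names, and using -1 to mean "no offsets indexed" is an assumption made purely for illustration. The downgrading variant assumes the merged field falls back to the weakest options seen in any segment, which matches the symptom described in the issue; the upgrading variant keeps the strongest options and pads positions from offset-less segments with "0, 0".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these types are Lucene classes.
class OffsetMergeSketch {

    // Declared from weakest to strongest so compareTo() orders them.
    enum Options { POSITIONS, POSITIONS_AND_OFFSETS }

    // One indexed position; start/end are -1 when no offsets were indexed (assumed marker).
    static final class Posting {
        final int position, start, end;
        Posting(int position, int start, int end) {
            this.position = position;
            this.start = start;
            this.end = end;
        }
        @Override public String toString() {
            return position + "[" + start + "," + end + "]";
        }
    }

    // A segment: the options its field was indexed with, plus its postings.
    static final class Segment {
        final Options options;
        final List<Posting> postings;
        Segment(Options options, Posting... postings) {
            this.options = options;
            this.postings = Arrays.asList(postings);
        }
    }

    // Reported behavior (assumed model): the merged field uses the weakest
    // options of any input segment, so every offset is dropped.
    static Segment mergeDowngrading(List<Segment> segments) {
        return merge(segments, false);
    }

    // Desired behavior from the issue: keep the strongest options and give
    // positions from offset-less segments an offset of "0, 0".
    static Segment mergeUpgrading(List<Segment> segments) {
        return merge(segments, true);
    }

    private static Segment merge(List<Segment> segments, boolean upgrade) {
        // Pick the strongest (upgrade) or weakest (downgrade) options seen.
        Options target = segments.get(0).options;
        for (Segment s : segments) {
            int cmp = s.options.compareTo(target);
            if (upgrade ? cmp > 0 : cmp < 0) {
                target = s.options;
            }
        }
        boolean keepOffsets = target == Options.POSITIONS_AND_OFFSETS;
        List<Posting> out = new ArrayList<Posting>();
        for (Segment s : segments) {
            boolean hasOffsets = s.options == Options.POSITIONS_AND_OFFSETS;
            for (Posting p : s.postings) {
                if (!keepOffsets) {
                    out.add(new Posting(p.position, -1, -1)); // offsets lost
                } else if (hasOffsets) {
                    out.add(p);                               // offsets preserved
                } else {
                    out.add(new Posting(p.position, 0, 0));   // padded with "0, 0"
                }
            }
        }
        return new Segment(target, out.toArray(new Posting[0]));
    }

    public static void main(String[] args) {
        Segment old = new Segment(Options.POSITIONS, new Posting(0, -1, -1));
        Segment fresh = new Segment(Options.POSITIONS_AND_OFFSETS, new Posting(0, 0, 5));
        List<Segment> both = Arrays.asList(old, fresh);
        System.out.println(mergeDowngrading(both).postings); // [0[-1,-1], 0[-1,-1]]
        System.out.println(mergeUpgrading(both).postings);   // [0[0,0], 0[0,5]]
    }
}
```

In this model a single offset-less segment is enough to strip offsets from everything it is merged with, which is the "index 1 document by accident" scenario from the issue. The upgrading variant trades that risk for placeholder offsets on old documents.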