Date: Thu, 1 Mar 2012 17:08:00 +0000 (UTC)
From: "Robert Muir (Commented) (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Message-ID: <1534550963.7760.1330621680799.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <520116482.7676.1330619999668.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

[
https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220163#comment-13220163 ]

Robert Muir commented on LUCENE-3837:
-------------------------------------

Some concerns about scoring:

# the stats problem: maybe we should allow overlay readers to just return -1 for docFreq? I don't like the situation today where the preflex codec doesn't implement all the stats (the whole -1 situation and 'optional' stats is frustrating), but I think it's worse to return out-of-bounds stuff, e.g. where docFreq > maxDoc. I think totalTermFreq is safe to just sum up though (it's wrong, but not out of bounds), and similarity could safely use this to compute an expected IDF instead. Still, this part will be messy: unlike the newer stats in 4.0, lots of code I think expects that docFreq is always supported.

Another possibility, which I think I like more, is to treat this conceptually just like deletes in every way, so all stats are supported but "maxDoc" is wrong (includes masked-away documents); then nothing is out of bounds. So in this case we would add maxDoc(field), which is only used for scoring. For a normal reader this just returns maxDoc() as implemented today...

# the norms problem: although norms are implemented as docValues, currently all similarities assume that getArray()/hasArray() is implemented... but here I'm not sure that would be the case? We should probably measure whether the method call really even hurts; in general I think it's a burden on the codec to require that norms actually be representable as an array (maybe other use cases would want other data structures for less RAM)...

We could solve both of these issues separately and independently if we decide what we want to do.

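To make the out-of-bounds concern concrete, here is a minimal sketch of what can happen when docFreq is summed across a primary segment and its overlay while maxDoc still counts only the primary documents. It is not Lucene code: the class OverlayStatsSketch, the helper maxDocForField, and the sample numbers are all hypothetical, and the idf formula is a generic BM25-style one chosen for illustration. The reading of maxDoc(field) as "primary docs plus overlay docs" is an assumption about the comment above, not something it spells out.

```java
class OverlayStatsSketch {

    /** BM25-style idf: log(1 + (N - n + 0.5) / (n + 0.5)), with N = docCount, n = docFreq. */
    static double bm25Idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * Hypothetical maxDoc(field): counts primary docs plus overlay docs, so
     * masked-away documents are still included, as with deletes. (Assumption.)
     */
    static long maxDocForField(long primaryMaxDoc, long overlayMaxDoc) {
        return primaryMaxDoc + overlayMaxDoc;
    }

    public static void main(String[] args) {
        long primaryMaxDoc = 10, primaryDocFreq = 8; // term occurs in 8 of 10 primary docs
        long overlayMaxDoc = 5, overlayDocFreq = 5;  // 5 updates, all containing the term

        // Naively summing docFreq gives 13, which exceeds the primary maxDoc of 10.
        long summedDocFreq = primaryDocFreq + overlayDocFreq;

        // Out of bounds: the idf becomes negative.
        System.out.println("idf vs. maxDoc():      " + bm25Idf(summedDocFreq, primaryMaxDoc));

        // With a per-field maxDoc that includes masked docs, docFreq <= maxDoc holds again.
        long fieldMaxDoc = maxDocForField(primaryMaxDoc, overlayMaxDoc);
        System.out.println("idf vs. maxDoc(field): " + bm25Idf(summedDocFreq, fieldMaxDoc));
    }
}
```

The second value is still "wrong" in the same way that stats ignore deletes today, but it stays within bounds, which is the property the comment is after.
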
> A modest proposal for updateable fields
> ---------------------------------------
>
>                 Key: LUCENE-3837
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3837
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: core/index
>    Affects Versions: 4.0
>            Reporter: Andrzej Bialecki
>
> I'd like to propose a simple design for implementing updateable fields in Lucene. This design has some limitations, so I'm not claiming it will be appropriate for every use case, and it's obvious it has some performance consequences, but at least it's a start...
>
> This proposal uses a concept of "overlays" or "stacked updates", where the original data is not removed but instead it's overlaid with the new data. I propose to reuse as much of the existing APIs as possible, and represent updates as an IndexReader. Updates to documents in a specific segment would be collected in an "overlay" index specific to that segment, i.e. there would be as many overlay indexes as there are segments in the primary index.
>
> A field update would be represented as a new document in the overlay index. The document would consist of just the updated fields, plus a field that records the id in the primary segment of the document affected by the update. These updates would be processed as usual via secondary IndexWriter-s, as many as there are primary segments, so the same analysis chains would be used, the same field types, etc.
>
> On opening a segment with updates the SegmentReader (see also LUCENE-3836) would check for the presence of the "overlay" index, and if so it would open it first (as an AtomicReader? or it would open individual codec format readers? perhaps it should load the whole thing into memory?), and it would construct an in-memory map between the primary's docId-s and the overlay's docId-s. And finally it would wrap the original format readers with "overlay readers", initialized also with the id map.
>
> Now, when consumers of the 4D API would ask for specific data, the "overlay readers" would first re-map the primary's docId to the overlay's docId, and check whether overlay data exists for that docId and this type of data (e.g. postings, stored fields, vectors) and return this data instead of the original. Otherwise they would return the original data.
>
> One obvious performance issue with this approach is that the sequential access to primary data would translate into random access to the overlay data. This could be solved by sorting the overlay index so that at least the overlay ids increase monotonically as primary ids do.
>
> Updates to the primary index would be handled as usual, i.e. segment merges (since the segments with updates would pretend to have no overlays) would just work as usual; only the overlay index would have to be deleted once the primary segment is deleted after merge.
>
> Updates to the existing documents that already had some fields updated would be again handled as usual, only underneath they would open an IndexWriter on the overlay index for a specific segment.
>
> That's the broad idea. Feel free to pipe in - I started some coding at the codec level but got stuck using the approach in LUCENE-3836. The approach that uses a modified SegmentReader seems more promising.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
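
The docId re-mapping described in the quoted proposal can be sketched roughly as follows, for stored fields only. This is not based on any Lucene API: the class OverlayLookupSketch, its method names, and the use of plain maps for documents are hypothetical stand-ins. The sketch also assumes at most one overlay document per primary document, which the proposal does not state.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class OverlayLookupSketch {
    private final List<Map<String, String>> primary; // index = primary docId
    private final List<Map<String, String>> overlay; // index = overlay docId
    private final int[] primaryToOverlay;            // -1 means "no update"

    /**
     * overlayTargets[i] is the primary docId recorded in overlay doc i.
     * Assumption: at most one overlay doc per primary doc.
     */
    OverlayLookupSketch(List<Map<String, String>> primary,
                        List<Map<String, String>> overlay,
                        int[] overlayTargets) {
        this.primary = primary;
        this.overlay = overlay;
        this.primaryToOverlay = new int[primary.size()];
        Arrays.fill(primaryToOverlay, -1);
        for (int overlayId = 0; overlayId < overlayTargets.length; overlayId++) {
            primaryToOverlay[overlayTargets[overlayId]] = overlayId;
        }
    }

    /** Returns the overlay value if one exists for this doc and field, else the original. */
    String storedField(int primaryDocId, String field) {
        int overlayId = primaryToOverlay[primaryDocId];
        if (overlayId >= 0) {
            String updated = overlay.get(overlayId).get(field);
            if (updated != null) {
                return updated;
            }
        }
        return primary.get(primaryDocId).get(field);
    }

    /** Builds a small two-document example with one updated title. */
    static OverlayLookupSketch demo() {
        List<Map<String, String>> primary = List.of(
                Map.of("title", "old title", "body", "first body"),
                Map.of("title", "second", "body", "second body"));
        List<Map<String, String>> overlay = List.of(Map.of("title", "new title"));
        return new OverlayLookupSketch(primary, overlay, new int[] {0});
    }

    public static void main(String[] args) {
        OverlayLookupSketch reader = demo();
        System.out.println(reader.storedField(0, "title")); // new title
        System.out.println(reader.storedField(0, "body"));  // first body
        System.out.println(reader.storedField(1, "title")); // second
    }
}
```

Because the map is filled by walking the overlay documents in order, sorting the overlay by target docId (as the proposal suggests) would make overlay ids increase together with primary ids during sequential scans.
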