From: "Marvin Humphrey (JIRA)"
To: java-dev@lucene.apache.org
Date: Wed, 26 Nov 2008 09:20:44 -0800 (PST)
Subject: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651067#action_12651067 ]

Marvin Humphrey commented on LUCENE-1458:
-----------------------------------------

> Well, maybe both? Ie, each of these IndexComponents could have many
> different codecs to write/read the data to/from the index. So when I
> implement PostingsComponent, when writing a segment I could choose my
> own codec; when reading it, I retrieve the matching codec to decode
> it.

Yes, both -- that sounds good.

However, I'm not sure whether you're proposing the creation of a class
named "Codec", which I think we should avoid unless all of our "codecs"
can descend from it. So: PostingsCodec, TermsDictCodec (or LexiconCodec,
for Lucy/KS), and so on would be base classes.

> Subclassing Schema seems like the right approach.

Groovy. How are you going to handle it in Lucene? I think you just have
to require the end user to be consistent about supplying the necessary
arguments to the IndexReader and IndexWriter constructors.

How do we handle auxiliary IndexComponents? I've long wanted to implement
an RTreeComponent for geographic searching, so I'll use that as an example.

At index-time, I think we just create an array of SegDataWriter objects
and feed each document to each writer in turn. The SegDataWriter abstract
base class will define all the necessary abstract methods: addDoc(),
addSegment(SegReader) (for Lucy/KS), various commands related to merging
(for Lucene), finish()/close(), and so on. RTreeWriter would simply
subclass SegDataWriter.

At search-time, things get a little trickier. Say we hand our Searcher
object an RTreeRadiusQuery. At some point, the RTreeRadiusQuery will need
to be compiled to an RTreeRadiusScorer, which will involve accessing an
RTreeReader which presumably resides within an IndexReader. However,
right now, IndexReader hides all of its inner readers and provides access
through specific methods, e.g. IndexReader.document(int docNum), which
ultimately hands off to FieldsReader internally. This model doesn't scale
with the addition of arbitrary IndexComponents.
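The index-time side of that design might be sketched roughly as below. This is a hypothetical illustration only: SegDataWriter, RTreeWriter, addDoc() and finish() are the names floated in the discussion, not a shipped Lucene/Lucy API, and the Doc stand-in class is invented for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a document being indexed.
class Doc {
    final int docNum;
    Doc(int docNum) { this.docNum = docNum; }
}

// Proposed abstract base class: one writer per IndexComponent.
abstract class SegDataWriter {
    abstract void addDoc(Doc doc);   // called once per document
    abstract void finish();          // flush/close the component's segment data
}

// Example auxiliary component: an R-tree writer for geographic search.
class RTreeWriter extends SegDataWriter {
    int docsAdded = 0;
    void addDoc(Doc doc) { docsAdded++; }  // a real impl would insert into an R-tree
    void finish() { /* a real impl would write the tree to the segment */ }
}

public class SegWriterDemo {
    public static void main(String[] args) {
        // Index-time: feed each document to each component writer in turn.
        List<SegDataWriter> writers = new ArrayList<>();
        RTreeWriter rtree = new RTreeWriter();
        writers.add(rtree);
        for (int i = 0; i < 3; i++) {
            Doc doc = new Doc(i);
            for (SegDataWriter w : writers) w.addDoc(doc);
        }
        for (SegDataWriter w : writers) w.finish();
        System.out.println(rtree.docsAdded); // 3
    }
}
```

The point of the sketch is that the segment-writing loop never needs to know which components exist; adding RTreeWriter is just one more element in the writers array.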
The only thing I can think of is an IndexReader.getReader(String name) method.

> Further steps towards flexible indexing
> ---------------------------------------
>
>                 Key: LUCENE-1458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback. All tests pass, though back-compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg calling TermPositions.nextPosition() too many
> times, which the new API asserts against).
>
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package-private APIs on that branch, then fix the nightly build to use
> the tip of that branch?]
>
> There's still plenty to do before this is committable! This is a
> rather large change:
>
> * Switches to a new, more efficient terms dict format. This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo). At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas. Also, tis/tii
> are structured by field, so we don't have to record the field number
> in every term.
> .
> On the first 1 M docs of Wikipedia, the tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading the terms dict index is significantly less,
> since we only load an array of offsets and an array of Strings (no
> more TermInfo array). It should be faster to init too.
> .
> This part is basically done.
> * Introduces a modular reader codec that strongly decouples the terms
> dict from the docs/positions readers. EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file)
> {code}
> This part is basically done.
>
> * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/TermDocs/TermPositions. SegmentReader emulates
> the old API on top of the new API to keep back-compat.
>
> Next steps:
>
> * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>
> * Expose the new API out of IndexReader, deprecate the old API but
> emulate the old API on top of the new one, and switch all core/contrib
> users to the new API.
>
> * Maybe switch to AttributeSource as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility). EG if someone wanted
> to store a payload at the term-doc level instead of the
> term-doc-position level, you could just add a new attribute.
>
> * Test performance & iterate.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
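The flex iteration chain quoted above (FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum) can be pictured as nested cursors, each handing out the next level down. The sketch below uses the interface names from the issue, but the method signatures and the in-memory backing map are assumptions made for illustration, not the patch's actual API:

```java
import java.util.*;

// Hypothetical flex-chain sketch: a TermsEnum walks terms, and for the
// current term hands out a DocsEnum over that term's postings.
interface DocsEnum  { int nextDoc(); }                  // returns -1 when exhausted
interface TermsEnum { String next(); DocsEnum docs(); } // next() returns null at end

// Trivial in-memory implementation backed by term -> sorted doc IDs.
class MemTermsEnum implements TermsEnum {
    private final Iterator<Map.Entry<String, int[]>> it;
    private int[] current = new int[0];

    MemTermsEnum(SortedMap<String, int[]> postings) {
        it = postings.entrySet().iterator();
    }

    public String next() {
        if (!it.hasNext()) return null;
        Map.Entry<String, int[]> e = it.next();
        current = e.getValue();
        return e.getKey();
    }

    public DocsEnum docs() {
        final int[] docs = current;   // cursor over the current term's postings
        return new DocsEnum() {
            int i = 0;
            public int nextDoc() { return i < docs.length ? docs[i++] : -1; }
        };
    }
}

public class FlexDemo {
    public static void main(String[] args) {
        SortedMap<String, int[]> postings = new TreeMap<>();
        postings.put("lucene", new int[]{1, 4, 7});
        postings.put("index",  new int[]{2, 4});
        // Walk the chain: each term, then each of its docs.
        TermsEnum te = new MemTermsEnum(postings);
        for (String term = te.next(); term != null; term = te.next()) {
            StringBuilder sb = new StringBuilder(term + ":");
            DocsEnum de = te.docs();
            for (int d = de.nextDoc(); d != -1; d = de.nextDoc()) sb.append(' ').append(d);
            System.out.println(sb);
        }
    }
}
```

Compared with the old TermEnum/TermDocs/TermPositions trio, the appeal of this shape is that each level is an independent cursor a codec can reimplement without touching the others.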