lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-1301) Refactor DocumentsWriter
Date Mon, 16 Jun 2008 10:17:45 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1301:
---------------------------------------

    Attachment: LUCENE-1301.take2.patch

Whoops, sorry, I forgot to svn add that.  I'm attaching my current
state, with that file added.  Does this one work?  (You may need to
forcefully remove DocumentsWriterFieldData.java if applying the patch
doesn't do so.)



> Refactor DocumentsWriter
> ------------------------
>
>                 Key: LUCENE-1301
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1301
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.3, 2.3.1, 2.3.2, 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1301.patch, LUCENE-1301.take2.patch
>
>
> I've been working on refactoring DocumentsWriter to make it more
> modular, so that adding new indexing functionality (like column-stride
> stored fields, LUCENE-1231) is just a matter of adding a plugin into
> the indexing chain.
> This is an initial step towards flexible indexing (but there is still
> a lot more to do!).
> And it's very much still a work in progress -- there are intermittent
> thread safety issues, I need to add test cases and test/iterate on
> performance, many "nocommits", etc.  This is a snapshot of my current
> state...
> The approach introduces "consumers" (abstract classes defining the
> interface) at different levels during indexing.  E.g., DocConsumer
> consumes the whole document.  DocFieldConsumer consumes separate
> fields, one at a time.  InvertedDocConsumer consumes tokens produced
> by running each field through the analyzer.  TermsHashConsumer writes
> its own bytes into in-memory posting lists stored in byte slices,
> indexed by term, etc.
> DocumentsWriter*.java is then much simpler: it only interacts with a
> DocConsumer and has no idea what that consumer is doing.  Under that
> DocConsumer there is a whole "indexing chain" that does the real work:
>   * NormsWriter holds norms in memory and then flushes them to _X.nrm.
>   * FreqProxTermsWriter holds postings data in memory and then flushes
>     to _X.frq/prx.
>   * StoredFieldsWriter flushes immediately to _X.fdx/fdt
>   * TermVectorsTermsWriter flushes immediately to _X.tvx/tvf/tvd
> DocumentsWriter still manages things like flushing a segment, closing
> doc stores, buffering & applying deletes, freeing memory, aborting
> when necessary, etc.
> In this first step, everything is package-private, and the indexing
> chain is hardwired (instantiated in DocumentsWriter) to the chain
> currently matching Lucene trunk.  Over time we can open this up.
> There are no changes to the index file format.
> For the most part this is just a [large] refactoring, except for these
> two small actual changes:
>   * Improved concurrency with mixed large/small docs: previously the
>     thread state would be tied up when docs finished indexing
>     out-of-order.  Now, it's not: instead I use a separate class to
>     hold any pending state to flush to the doc stores, and immediately
>     free up the thread state to index other docs.
>   * Buffered norms in memory now remain sparse, until flushed to the
>     _X.nrm file.  Previously we would "fill holes" in norms in memory,
>     as we go, which could easily use way too much memory.  Really this
>     isn't a solution to the problem of sparse norms (LUCENE-830); it
>     just delays that issue from causing memory blowup during indexing;
>     memory use will still blow up during searching.
> I expect performance (indexing throughput) will be worse with this
> change.  I'll profile & iterate to minimize this, but I think we can
> accept some loss.  I also plan to measure the benefit of manually
> recycling RawPostingList instances from our own pool, vs letting GC
> recycle them.
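
The consumer layering described in the issue above can be sketched
roughly as follows.  Only the names DocConsumer and DocFieldConsumer
come from the description; the method signatures, the Map-based
document, and the PerFieldDocConsumer, RecordingFieldConsumer and
Writer classes are hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: signatures are assumptions, not the patch's API.
class IndexingChainSketch {

    /** Consumes a whole document (name taken from the issue description). */
    static abstract class DocConsumer {
        abstract void processDocument(Map<String, String> doc);
        abstract void flush();
    }

    /** Consumes one field at a time (name taken from the issue description). */
    static abstract class DocFieldConsumer {
        abstract void processField(String name, String value);
        abstract void flush();
    }

    /** Hypothetical DocConsumer that hands each field to every field consumer. */
    static final class PerFieldDocConsumer extends DocConsumer {
        private final List<DocFieldConsumer> consumers;

        PerFieldDocConsumer(List<DocFieldConsumer> consumers) {
            this.consumers = consumers;
        }

        @Override
        void processDocument(Map<String, String> doc) {
            for (Map.Entry<String, String> field : doc.entrySet()) {
                for (DocFieldConsumer c : consumers) {
                    c.processField(field.getKey(), field.getValue());
                }
            }
        }

        @Override
        void flush() {
            for (DocFieldConsumer c : consumers) {
                c.flush();
            }
        }
    }

    /** Hypothetical stand-in for a real writer: buffers in memory, "flushes" to a list. */
    static final class RecordingFieldConsumer extends DocFieldConsumer {
        final List<String> buffered = new ArrayList<>();
        final List<String> flushed = new ArrayList<>();

        @Override
        void processField(String name, String value) {
            buffered.add(name + "=" + value);
        }

        @Override
        void flush() {
            flushed.addAll(buffered);
            buffered.clear();
        }
    }

    /** The top-level writer only talks to a DocConsumer, never to what sits below it. */
    static final class Writer {
        private final DocConsumer consumer;

        Writer(DocConsumer consumer) {
            this.consumer = consumer;
        }

        void addDocument(Map<String, String> doc) {
            consumer.processDocument(doc);
        }

        void flush() {
            consumer.flush();
        }
    }

    public static void main(String[] args) {
        RecordingFieldConsumer recorder = new RecordingFieldConsumer();
        Writer writer = new Writer(
            new PerFieldDocConsumer(Collections.singletonList(recorder)));
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "hello");
        doc.put("body", "world");
        writer.addDocument(doc);
        writer.flush();
        System.out.println(recorder.flushed);
    }
}
```

In this sketch, adding new per-field functionality would mean passing
one more DocFieldConsumer to PerFieldDocConsumer, which is the kind of
"plugin into the indexing chain" the description aims for.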

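The sparse norms buffering mentioned in the issue above could look
something like the following.  This is a minimal sketch under stated
assumptions: the class name, method names and the defaultNorm
parameter are illustrative and not taken from the patch, and "flush"
here returns a dense in-memory array rather than writing a _X.nrm file.

```java
import java.util.Arrays;

// Hypothetical sketch: names and layout are illustrative, not from the patch.
class SparseNormsSketch {
    // Parallel arrays holding entries only for docs that actually have the field.
    private int[] docIDs = new int[4];
    private byte[] norms = new byte[4];
    private int count;

    /** Buffers one norm; docIDs must be non-negative and strictly increasing. */
    void add(int docID, byte norm) {
        if (docID < 0 || (count > 0 && docID <= docIDs[count - 1])) {
            throw new IllegalArgumentException("docIDs must be increasing: " + docID);
        }
        if (count == docIDs.length) {
            // Grow both arrays together; memory tracks entries, not total docs.
            docIDs = Arrays.copyOf(docIDs, count * 2);
            norms = Arrays.copyOf(norms, count * 2);
        }
        docIDs[count] = docID;
        norms[count] = norm;
        count++;
    }

    /** Number of norms currently buffered (not the number of docs). */
    int bufferedEntries() {
        return count;
    }

    /** Fills holes only at flush time, producing one byte per doc. */
    byte[] flush(int docCount, byte defaultNorm) {
        if (count > 0 && docIDs[count - 1] >= docCount) {
            throw new IllegalArgumentException("docCount too small: " + docCount);
        }
        byte[] dense = new byte[docCount];
        Arrays.fill(dense, defaultNorm);
        for (int i = 0; i < count; i++) {
            dense[docIDs[i]] = norms[i];
        }
        count = 0; // the buffer is reusable for the next segment
        return dense;
    }

    public static void main(String[] args) {
        SparseNormsSketch buffer = new SparseNormsSketch();
        buffer.add(2, (byte) 10);
        buffer.add(7, (byte) 20);
        System.out.println(Arrays.toString(buffer.flush(10, (byte) 1)));
    }
}
```

While buffering, memory grows with the number of docs that have the
field rather than with the total doc count.  The dense array only
exists at flush, which matches the point above that this delays the
sparse norms cost rather than removing it.
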
-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

