lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael Busch (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1301) Refactor DocumentsWriter
Date Wed, 18 Jun 2008 01:36:45 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605798#action_12605798
] 

Michael Busch commented on LUCENE-1301:
---------------------------------------

Just a quick update, Mike: 
With your latest patch it's compiling fine now. Thanks!
I'm seeing NullPointerExceptions in TestStressIndexing2 though,
but I guess this patch is not final yet.

I haven't read the patch yet, hope I'll find some time soon.

> Refactor DocumentsWriter
> ------------------------
>
>                 Key: LUCENE-1301
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1301
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.3, 2.3.1, 2.3.2, 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1301.patch, LUCENE-1301.take2.patch
>
>
> I've been working on refactoring DocumentsWriter to make it more
> modular, so that adding new indexing functionality (like column-stride
> stored fields, LUCENE-1231) is just a matter of adding a plugin into
> the indexing chain.
> This is an initial step towards flexible indexing (but there is still
> alot more to do!).
> And it's very much still a work in progress -- there are intemittant
> thread safety issues, I need to add tests cases and test/iterate on
> performance, many "nocommits", etc.  This is a snapshot of my current
> state...
> The approach introduces "consumers" (abstract classes defining the
> interface) at different levels during indexing.  EG DocConsumer
> consumes the whole document.  DocFieldConsumer consumes separate
> fields, one at a time.  InvertedDocConsumer consumes tokens produced
> by running each field through the analyzer.  TermsHashConsumer writes
> its own bytes into in-memory posting lists stored in byte slices,
> indexed by term, etc.
> DocumentsWriter*.java is then much simpler: it only interacts with a
> DocConsumer and has no idea what that consumer is doing.  Under that
> DocConsumer there is a whole "indexing chain" that does the real work:
>   * NormsWriter holds norms in memory and then flushes them to _X.nrm.
>   * FreqProxTermsWriter holds postings data in memory and then flushes
>     to _X.frq/prx.
>   * StoredFieldsWriter flushes immediately to _X.fdx/fdt
>   * TermVectorsTermsWriter flushes immediately to _X.tvx/tvf/tvd
> DocumentsWriter still manages things like flushing a segment, closing
> doc stores, buffering & applying deletes, freeing memory, aborting
> when necesary, etc.
> In this first step, everything is package-private, and, the indexing
> chain is hardwired (instantiated in DocumentsWriter) to the chain
> currently matching Lucene trunk.  Over time we can open this up.
> There are no changes to the index file format.
> For the most part this is just a [large] refactoring, except for these
> two small actual changes:
>   * Improved concurrency with mixed large/small docs: previously the
>     thread state would be tied up when docs finished indexing
>     out-of-order.  Now, it's not: instead I use a separate class to
>     hold any pending state to flush to the doc stores, and immediately
>     free up the thread state to index other docs.
>   * Buffered norms in memory now remain sparse, until flushed to the
>     _X.nrm file.  Previously we would "fill holes" in norms in memory,
>     as we go, which could easily use way too much memory.  Really this
>     isn't a solution to the problem of sparse norms (LUCENE-830); it
>     just delays that issue from causing memory blowup during indexing;
>     memory use will still blowup during searching.
> I expect performance (indexing throughput) will be worse with this
> change.  I'll profile & iterate to minimize this, but I think we can
> accept some loss.  I also plan to measure benefit of manually
> re-cycling RawPostingList instances from our own pool, vs letting GC
> recycle them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message