lucene-java-user mailing list archives

From Glen Newton <>
Subject Re: Field with reader limitation arbitrary
Date Tue, 15 Sep 2009 19:10:34 GMT
I appreciate your explanation, but I think that the use case I
described merits a deeper exploration:

Scenario 1: 16 threads indexing; queue size = 1000; present API; need to store
In this scenario, there are always 1000 Strings holding the full contents
of their respective files.
Averaging 50k per document = 50MB of String objects in memory.

Scenario 2: 16 threads indexing; queue size = 1000; a Field constructor
taking a Reader plus Index/Store/Tokenize options; need to store
At any one time, there are only 16 Strings holding their respective file
contents (i.e. read in at index time), and 984 Readers waiting in the queue.
Averaging 50k per document = 800k of String objects in memory +
the overhead of 984 Readers.
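
(For concreteness, a rough sketch of the two constructions I am comparing,
using the 2.x API; the field name "contents" and the file handling are just
placeholders:)

import java.io.FileReader;
import java.io.Reader;

import org.apache.lucene.document.Field;

public class FieldConstructionSketch {

    // Scenario 1: the whole file must already be in memory as a String,
    // because only the String constructor supports Field.Store.YES.
    static Field storedStringField(String contents) {
        return new Field("contents", contents,
                         Field.Store.YES, Field.Index.ANALYZED);
    }

    // Scenario 2: hand Lucene a Reader; the file is only consumed when
    // the Document is actually indexed, but it cannot be stored.
    static Field readerField(String path) throws Exception {
        Reader reader = new FileReader(path);
        return new Field("contents", reader);
    }
}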

Or am I not understanding something in your explanation? My
understanding is that the IndexWriter serializes Document writes but
does not queue them (blocking multiple waiting calls on a lock is not
an explicit queue).
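
(To make my setup explicit, this is roughly the shape of the code on my
side, not Lucene internals: the 1000-entry queue lives in my application
and 16 worker threads drain it into one shared IndexWriter. Names and
sizes here are just illustrative.)

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class IndexingPool {

    // Application-side queue from the scenarios above: up to 1000
    // Documents (or, ideally, 1000 lightweight Readers) waiting.
    private final BlockingQueue<Document> queue =
        new ArrayBlockingQueue<Document>(1000);

    // 16 workers share one IndexWriter; addDocument() is thread-safe,
    // but the writer itself does no application-level queueing.
    void startWorkers(final IndexWriter writer) {
        for (int i = 0; i < 16; i++) {
            new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            writer.addDocument(queue.take());
                        }
                    } catch (Exception e) {
                        // real code would handle interruption/IO errors
                    }
                }
            }).start();
        }
    }
}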

Of course, one change I could make would be to defer populating the
Field from the file's Reader until just before the Document gets
indexed, using the String Field constructor, which is equivalent to
Scenario 2 above. But pushing this into the API would be easier.
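
(Something like the following, where slurp() is just my own hypothetical
helper, not part of the Lucene API: the Reader is buffered into a String
only at the moment the Document is built and handed to the IndexWriter.)

import java.io.FileReader;
import java.io.Reader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class DeferredFieldPopulation {

    // Hypothetical helper: buffer a Reader fully into a String just
    // before indexing, so the String constructor (and Store.YES) can be
    // used without holding ~1000 file contents in memory at once.
    static String slurp(Reader reader) throws Exception {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        reader.close();
        return sb.toString();
    }

    static void indexFile(IndexWriter writer, String path) throws Exception {
        String contents = slurp(new FileReader(path));
        Document doc = new Document();
        doc.add(new Field("contents", contents,
                          Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }
}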


2009/9/15 Chris Hostetter <>:
> : Someone has made the decision that we will not be interested in
> : storing files read using a Reader (at least not with these
> : constructors).
> : This is rather arbitrary.
> No, it was not arbitrary at all.
> The javadocs there are not a "decree" of what shall or shan't be
> supported; they are an explanation of how the constructor works so that
> there isn't any confusion.
> The *reason* the code works that way is that when Lucene processes
> Fields that use a Reader, it doesn't buffer the Reader, so it can't
> store the full contents of that Reader. The *reason* it doesn't buffer
> the Reader is that it would then have to make very arbitrary memory
> decisions that could easily result in an OOM (Readers can, in fact, be
> infinite streams of characters).
> Clients that want to store the value need to be able to provide the
> entire value as a String.
> : might want to also store files in the index,  having a queue of 1000
> : Documents with 1000 Readers to files is vastly preferable to having
> : 1000 documents with 1000 (perhaps very large) Strings with all the
> : contents of the files. While this is not the best for all cases (#open
> You've just pointed out exactly why it's not feasible for Lucene to store
> the contents of the Reader -- it would have to do the exact same thing you
> describe.  The current API leaves it up to the client to decide when to
> do this, and what to do if the Reader produces a String bigger than it
> wants to deal with.
> -Hoss


