lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Glen Newton <glen.new...@gmail.com>
Subject Re: Field with reader limitation arbitrary
Date Tue, 15 Sep 2009 19:10:34 GMT
I appreciate your explanation, but I think that the use case I
described merits a deeper exploration:

Scenario 1: 16 threads indexing; queue size = 1000; present api; need to store
In this scenario, there are always 1000 Strings with all the contents
of their respective files.
Averaging 50k per document = 50MB of String objects in memory.

Scenario 2: 16 threads indexing; queue size = 1000; Field constructor
with Reader and Index/Store/Tokenize; need to store
At any one time, there are only 16 Strings with their respective file
contents (i.e. read in at index time); and 984 Readers waiting in the
queue.
Averaging 50k per document = 800k of String objects in memory +
overhead of 984 Readers

Or am I not understanding something in your explanation? My
understanding is that the IndexWriter serializes Document writes, but
does not queue them (explicitly: locking multiple calls that are
waiting is not an explicit queue).

Of course, a change I could make would be to defer populating the
Field from the file Reader until just before it gets indexed, using
the String Field constructor, resulting in the equivalent of #2 above.
But pushing this to the API would be easier.

Thanks,
Glen

2009/9/15 Chris Hostetter <hossman_lucene@fucit.org>:
>
> : Someone has made the decision that we will not be interested in
> : storing files read using a Reader (at least not with these
> : constructors).
> : This is rather arbitrary.
>
> No, it was not arbitrary at all.
>
> The javadocs there are not a "decree" of what shall or shan't be
> supported, they are an explanation of how the constructor works so that
> there isn't any confusion.
>
> The *reason* why the code works that way, is because when Lucene
> processes Fields that use a Reader, it doesn't buffer the Reader so it
> can't store the full contents of that Reader. the *reason* it doesnt'
> buffer the reader is because then it would have to make very arbitrary
> memory decisions that could easily result in OOM (Readers can in fact be
> infinite streams of characters)
>
> clients that want to store the value, need to be able to provide the
> entire value as a String.
>
> : might want to also store files in the index,  having a queue of 1000
> : Documents with 1000 Readers to files is vastly preferable to having
> : 1000 documents with 1000 (perhaps very large) Strings with all the
> : contents of the files. While this is not the best for all cases (#open
>
> You've just pointed out exactly why it's not feasible for Lucene to store
> the contents of the Reader -- it would have to do the exact same thing you
> describe.  The curerent API leaves it up to the client to decide when to
> do this, and what to do if the Reader produces a String bigger then it
> wants to deal with.
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



-- 

-

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message