lucene-dev mailing list archives

From Dmitry Serebrennikov <>
Subject Re: InputStream handling problem
Date Thu, 25 Apr 2002 22:05:02 GMT
Roman Rokytskyy wrote:

>>Yes, I forgot about that one. It's even more interesting than that! The
>>stream objects that Doug coded are not streams in the OS sense. They are
>>wrappers on top of those. Each clone maintains its own seek offset.
>>Essentially, they share the same OS file handle but present an
>>abstraction of multiple independent streams into the same file.
>Sorry, but isn't file handle sharing something specific to FSInputStream?
>Why do we force that on our abstract class level?
I'm sorry, I should have been more specific. The file handle is only in 
the picture when FSInputStream is cloned. From what I can tell after a 
quick look, InputStream is responsible for buffering and it delegates to 
subclasses (via a call to readInternal) to refill the buffer from the 
underlying data store. When cloned, the InputStream clones the buffer 
(in the hope that the next read will still hit the buffered data I 
suppose), but after that it has its own seek position and its own 
buffer. In the case of FSInputStream, there is a Descriptor object that 
is shared between the clones. In the case of RAMInputStream - RAMFile is 
the shared object.
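To make the clone semantics concrete, here is a minimal sketch of the scheme described above. The class and method names are illustrative, not Lucene's actual API: clones share the underlying byte source (standing in for the Descriptor / RAMFile) while each keeps a private buffer and seek position, refilled via a readInternal-style hook.

```java
// Illustrative sketch, not Lucene's actual classes: a buffered stream
// whose clones share the underlying data store but keep independent
// buffers and seek positions, as described for InputStream above.
abstract class BufferedIn implements Cloneable {
    static final int BUF_SIZE = 8;
    byte[] buffer = new byte[BUF_SIZE];
    long bufferStart = 0;   // file offset corresponding to buffer[0]
    int bufferLength = 0;   // number of valid bytes in the buffer
    int bufferPosition = 0; // next byte to return from the buffer

    byte readByte() {
        if (bufferPosition >= bufferLength) { // buffer exhausted: refill
            bufferStart += bufferLength;
            bufferLength = readInternal(buffer, bufferStart);
            bufferPosition = 0;
        }
        return buffer[bufferPosition++];
    }

    long getFilePointer() { return bufferStart + bufferPosition; }

    // Subclasses refill the buffer from the shared underlying store.
    abstract int readInternal(byte[] dest, long fileOffset);

    @Override
    public BufferedIn clone() {
        try {
            BufferedIn c = (BufferedIn) super.clone();
            c.buffer = buffer.clone(); // private buffer; the store stays shared
            return c;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }
}

// RAM-backed variant: the byte[] plays the role of the shared RAMFile
// (or, in the file case, the shared Descriptor / OS handle).
class RamIn extends BufferedIn {
    final byte[] data; // shared between clones

    RamIn(byte[] data) { this.data = data; }

    @Override
    int readInternal(byte[] dest, long fileOffset) {
        int n = Math.max(0, Math.min(dest.length, data.length - (int) fileOffset));
        System.arraycopy(data, (int) fileOffset, dest, 0, n);
        return n;
    }
}
```

A clone made after a few reads starts at the same logical position (and with the same buffered bytes), but from then on the two advance independently.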

>I would suggest a factory pattern, where input stream is created for a file,
>and how this is handled is up to the implementation. FSDirectory will share
>handles, RAMDirectory will have references to same RAMFile object, my
>JDataStoreDirectory will rely on JDataStore to manage it effectively.
Perhaps a factory pattern would be more flexible, but it looks like the 
existing code does a pretty good job for the RAM and FS cases. Would the 
factory pattern allow a better database implementation?
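For reference, the factory approach Roman suggests might look something like the following sketch. All names here are hypothetical, not Lucene's API: each Directory implementation decides for itself how to open a stream for a file, so handle sharing becomes an implementation detail rather than part of the abstract stream class.

```java
// Hypothetical factory-pattern sketch: the Directory is the factory,
// and each implementation (FS, RAM, database) controls how streams
// are created and what they share. Names are illustrative only.
interface DataInput {
    byte readByte();
}

interface Directory {
    DataInput openInput(String name); // factory method
}

class RamDirectory implements Directory {
    private final java.util.Map<String, byte[]> files = new java.util.HashMap<>();

    void writeFile(String name, byte[] contents) { files.put(name, contents); }

    @Override
    public DataInput openInput(String name) {
        byte[] shared = files.get(name); // all opens share this array
        return new DataInput() {
            private int pos = 0; // but each stream has its own position
            public byte readByte() { return shared[pos++]; }
        };
    }
}
```

Two streams opened on the same name share the stored bytes but read from independent positions; an FS implementation could share a file handle, and a database-backed one could delegate sharing to the database entirely.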

>Should I try to rewrite it? (I also would appreciate your opinion if I
>should try to touch that code at all).
I don't know; I have not heard many complaints about that code recently. 
There is activity in terms of creating a crawler / content handler 
framework. There is also a need to handle "update" better, I think. For 
example, I think it would be great to have deletes go through 
IndexWriter and get "cached" in the new segment, to be later applied to 
the prior segments during optimization. This would make deletes and adds 
go through a single, consistent path.
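A toy sketch of that delete-caching idea, under my own naming (none of this is actual Lucene code): deletes are recorded alongside new additions and only applied to the older, immutable segments when everything is merged during optimization.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of caching deletes in the writer: the prior
// segment stays untouched until optimize() merges it with the new
// segment and applies the pending deletes in one pass.
class BufferingWriter {
    final List<String> priorSegment;                    // existing, immutable segment
    final List<String> newSegment = new ArrayList<>();  // documents added now
    final Set<String> pendingDeletes = new HashSet<>(); // "cached" deletes

    BufferingWriter(List<String> priorSegment) { this.priorSegment = priorSegment; }

    void addDocument(String doc) { newSegment.add(doc); }

    void deleteDocument(String doc) { pendingDeletes.add(doc); }

    // "Optimization": merge segments, applying the cached deletes.
    List<String> optimize() {
        List<String> merged = new ArrayList<>();
        for (String doc : priorSegment)
            if (!pendingDeletes.contains(doc)) merged.add(doc);
        for (String doc : newSegment)
            if (!pendingDeletes.contains(doc)) merged.add(doc);
        return merged;
    }
}
```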

Another thing on my wish / todo list is to reduce the number of OS files 
that must be open. Once you get a lot of indexes, with a number of 
stored fields, and keep re-indexing them, the number of open files grows 
rather quickly. And if Lucene is part of another program that already 
has other file IO needs, you quickly end up pushing against the OS's 
limit on open files. The idea I have for this one is to implement a 
different kind of segment - one that is composed of a single file. Once 
a segment is created by IndexWriter, it never changes (besides the 
deletes), so it could easily be stored as a single file.
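The single-file segment could be as simple as the sub-files concatenated behind a small table of contents. The layout and names below are hypothetical, just to illustrate the idea, not an actual Lucene format:

```java
import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a single-file segment: since a segment never changes after
// IndexWriter creates it (deletes aside), all of its files can be
// concatenated into one container with a table of contents, costing
// only one OS file handle instead of one per sub-file.
class CompoundSegment {
    final Map<String, long[]> toc = new LinkedHashMap<>(); // name -> {offset, length}
    final byte[] data; // the single "file"

    CompoundSegment(Map<String, byte[]> files) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            toc.put(e.getKey(), new long[]{out.size(), e.getValue().length});
            out.write(e.getValue(), 0, e.getValue().length);
        }
        data = out.toByteArray();
    }

    // Read one logical sub-file back out of the container.
    byte[] readFile(String name) {
        long[] entry = toc.get(name);
        byte[] result = new byte[(int) entry[1]];
        System.arraycopy(data, (int) entry[0], result, 0, result.length);
        return result;
    }
}
```

On disk, the table of contents would be written into the container as well, and readers would seek within the one open file instead of opening each sub-file separately.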

These are just a few areas that are my favorites... But then again, if 
you see another problem that's in your way, chances are that there are 
other people out there with the same issue. 

In any case, good luck!

>Roman Rokytskyy

