lucene-dev mailing list archives

From "Chuck Williams (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size
Date Mon, 20 Nov 2006 22:43:03 GMT
    [ http://issues.apache.org/jira/browse/LUCENE-709?page=comments#action_12451462 ] 
            
Chuck Williams commented on LUCENE-709:
---------------------------------------

> In your merging documents scenario, you state "Thread 1 adds a new document, creating
a new segment with new index files, leading to segment merging, that creates new larger segment
index files, and then deletes all replaced segment index files."

> If a different thread calls getSizeInBytes() after the merge but before the deletes,
you will see both the old segments and new segments created by the merge and will be double
counting. Synchronizing the directory-level getSizeInBytes() will not solve that... it requires
higher level synchronization.

Except there is no double counting there.  The size after the merge but before the deletes really
is that big!  This is what I mean when I say that any computation involving a loop is accurate at
that instant.  Without the synchronization, you can get a result that was *never* accurate, i.e.
one that represents a file set that never existed.  For a size computation, that result could be larger
or smaller than any actual size the directory ever attained.  That is the point of my example.
For a list() computation with an unprotected loop (as in Lucene now) you can get a set of
files that were never the contents of the directory at any instant.

No higher level synchronization is required to achieve the semantics that a looping computation
is accurate at the instant it is performed.  Without directory (or files Hashtable) synchronization
protecting the whole loop, the result can be random, having no correlation to any actual state
the directory ever attained.
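
To make the semantics concrete, here is a minimal sketch of what I mean by protecting the
whole loop.  It is not the code from the patch and not Lucene's RAMDirectory: the class
SnapshotDirectory and its methods are hypothetical stand-ins, and the file table is assumed
to be a Hashtable mapping file names to lengths.

```java
import java.util.Hashtable;

// Hypothetical stand-in for a RAM-based directory; NOT Lucene's RAMDirectory.
class SnapshotDirectory {
    // File name -> file length in bytes (assumed representation).
    private final Hashtable<String, Long> files = new Hashtable<>();

    void createFile(String name, long length) { files.put(name, length); }

    void deleteFile(String name) { files.remove(name); }

    // Hashtable's own methods lock the table instance, so holding that same
    // monitor for the whole loop keeps put/remove out until the sum is done.
    long getSizeInBytes() {
        synchronized (files) {
            long size = 0;
            for (long length : files.values()) {
                size += length;
            }
            return size;
        }
    }

    // Same idea for list(): the returned names are one state the table really had.
    String[] list() {
        synchronized (files) {
            return files.keySet().toArray(new String[0]);
        }
    }
}
```

With the lock held for the entire loop, a concurrent merge can only be observed before it
starts, between its individual file operations, or after it ends, so every result corresponds
to a file set that actually existed.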

> Anyway, I think the point is moot as I think we should handle the size incrementally.


Not quite, because the bug already exists in Lucene in RAMDirectory.list().  My version of
the patch fixes this.  It should be fixed.

>>Counting buffer sizes rather than file length may be slightly more accurate, but at
least for me it is not material.

> It could be *much* more accurate though. All buffering of documents in IndexWriter is
done with single doc segments. That 1 byte norm file takes up 1024 bytes of buffer space!


Point taken that this is important in general.  (These numbers are still small in my app because
maxBufferedDocs is not large and I have some very large documents that cannot be truncated.)
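
For reference, the arithmetic behind the 1 byte versus 1024 bytes remark can be sketched as
follows.  This is only an illustration: the class and method names are hypothetical, and it
assumes a file is held in whole fixed-size buffers of 1024 bytes, as the quoted figure implies.

```java
// Illustrative only; names are hypothetical and not part of Lucene.
class BufferAccounting {
    // Buffer size taken from the "1 byte norm file takes up 1024 bytes" remark.
    static final int BUFFER_SIZE = 1024;

    // Bytes actually held when a file of the given length is stored
    // in whole fixed-size buffers (length rounded up to a buffer multiple).
    static long allocatedBytes(long fileLength) {
        long buffers = (fileLength + BUFFER_SIZE - 1) / BUFFER_SIZE;
        return buffers * BUFFER_SIZE;
    }
}
```

Summing file lengths therefore undercounts most when there are many tiny files, which is
the single-doc segment case described above.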

I can update my version of the patch with this improvement if that would be helpful.  Or if
you are going to merge my test case into your version of the patch (and I hope fix the remaining
synchronization issues in RAMDirectory.list() and the long synchronization issues in fileLength()
and fileModified(), and the rename bug which will need to be fixed for the test case to succeed),
then I'll just hold off.

Yonik, thanks for your interest and effort in this issue!


> [PATCH] Enable application-level management of IndexWriter.ramDirectory size
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-709
>                 URL: http://issues.apache.org/jira/browse/LUCENE-709
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.0.1
>         Environment: All
>            Reporter: Chuck Williams
>         Attachments: ramdir.patch, ramdir.patch, ramDirSizeManagement.patch, ramDirSizeManagement.patch,
ramDirSizeManagement.patch
>
>
> IndexWriter currently only supports bounding of the in-memory index cache using maxBufferedDocs,
which limits it to a fixed number of documents.  When document sizes vary substantially, especially
when documents cannot be truncated, this leads either to inefficiencies from a too-small value
or OutOfMemoryErrors from a too-large value.
> This simple patch exposes IndexWriter.flushRamSegments(), and provides access to size
information about IndexWriter.ramDirectory so that an application can manage this based on
total number of bytes consumed by the in-memory cache, thereby allowing a larger number of smaller
documents or a smaller number of larger documents.  This can lead to much better performance
while eliminating the possibility of OutOfMemoryErrors.
> The actual job of managing to a size constraint, or any other constraint, is left up
to the application.
> The addition of synchronized to flushRamSegments() is only for safety of an external
call.  It has no significant effect on internal calls since they all come from a synchronized
caller.
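
The kind of application-level policy described above might look roughly like the sketch
below.  It is written against a hypothetical RamBufferedWriter interface rather than the real
IndexWriter: flushRamSegments() is the method named in this issue, while ramSizeInBytes(),
the String document type, and the addAll helper are assumptions made for illustration.

```java
// Sketch of an application-side size policy; not the patch itself.
class SizeBoundedIndexing {
    // Hypothetical view of a writer that buffers documents in RAM.
    interface RamBufferedWriter {
        void addDocument(String doc);
        long ramSizeInBytes();     // assumed accessor name for the buffered size
        void flushRamSegments();   // method named in the issue description
    }

    // Adds every document, flushing whenever the buffered size reaches the limit.
    // Returns the number of flushes triggered by the size check.
    static int addAll(RamBufferedWriter writer, Iterable<String> docs, long maxRamBytes) {
        int flushes = 0;
        for (String doc : docs) {
            writer.addDocument(doc);
            if (writer.ramSizeInBytes() >= maxRamBytes) {
                writer.flushRamSegments();
                flushes++;
            }
        }
        return flushes;
    }
}
```

Because the check runs after each add, many small documents accumulate before a flush while
a single very large document triggers one immediately.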

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

