lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
Date Thu, 18 Mar 2010 14:58:27 GMT


Michael McCandless commented on LUCENE-2328:

Keeping track of not-yet-sync'd files instead of sync'd files is
better, but it still requires upkeep (i.e., when a file is deleted you
have to remove it from the set), because files can be opened, written
to, closed and deleted without ever being sync'd.
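
A minimal sketch of that upkeep, with hypothetical class and method names (this is not the Lucene Directory API itself): the crucial line is that deleteFile also forgets the name, so a file created and deleted between commits never lingers in the pending set.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a Dir that tracks files written but not yet sync'd.
// deleteFile must also remove the name, since a file can be opened,
// written, closed and deleted without ever being sync'd.
class TrackingDir {
    private final Set<String> unSynced = new HashSet<>();

    void createOutput(String name) { unSynced.add(name); }    // file written

    void deleteFile(String name)   { unSynced.remove(name); } // upkeep on delete

    void sync(String name)         { unSynced.remove(name); } // now durable

    int pendingSyncCount()         { return unSynced.size(); }

    public static void main(String[] args) {
        TrackingDir dir = new TrackingDir();
        dir.createOutput("_x.fdx");
        dir.createOutput("_x.tmp");
        dir.deleteFile("_x.tmp");   // never sync'd -- must not linger
        System.out.println(dir.pendingSyncCount()); // prints 1
    }
}
```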

And I like moving this tracking under Dir -- that's where it belongs.

bq. I assume that on calling syncEveryoneAndHisDog() you should sync all files that have been
written to, and were closed, and not yet deleted.

This will over-sync in some situations, i.e., cause commit to take
longer than it should.

E.g., say a merge has finished with the first set of files (say
_X.fdx/t, since it merges fields first) but is still working on
postings when the user calls commit.  We should not then sync _X.fdx/t
because they are unreferenced by the segments_N we are committing.

Or the merge has finished (so _X.* has been created) but is now off
building the _X.cfs file -- we don't want to sync _X.*, only _X.cfs
when it's done.

Another example: we don't do this today, but addIndexes should really
run fully outside of IW's normal segments file, merging away, and then
only on final success alter IW's segmentInfos.  If we switch to that,
we don't want to sync all the files that addIndexes is temporarily
creating.

The knowledge of which files "make up" the transaction lives above
Directory... so I think we should retain the per-file control.
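
As a sketch of that point, with illustrative names only: the writer, not the Directory, knows which files the pending segments_N references, so it can pass exactly that set down rather than asking the Dir to sync everything written so far.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the transaction-level view lives above Directory.
// Only files referenced by the segments being committed are sync'd;
// half-done merge output (e.g. _X.fdx written while postings are still
// merging) is simply absent from the computed set.
class CommitPoint {
    private final Map<String, List<String>> segmentFiles = new HashMap<>();

    void addSegment(String name, List<String> files) {
        segmentFiles.put(name, files);
    }

    Set<String> filesToSync(Collection<String> committedSegments) {
        Set<String> result = new HashSet<>();
        for (String seg : committedSegments) {
            result.addAll(segmentFiles.get(seg));
        }
        return result;
    }

    public static void main(String[] args) {
        CommitPoint cp = new CommitPoint();
        cp.addSegment("_1", Arrays.asList("_1.cfs"));
        cp.addSegment("_X", Arrays.asList("_X.fdx", "_X.fdt")); // in-flight merge
        // Committing only _1: the _X.* files are not in the sync set.
        System.out.println(cp.filesToSync(Arrays.asList("_1")));
    }
}
```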

I proposed the bulk-sync API so that Dir impls could choose to do a
system-wide sync -- or, more generally, so that any Dir that can be
more efficient when it knows the precise set of files that must be
sync'd right now gets that information.
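
A sketch of the shape of such an API (the names and the default fallback are illustrative, not a committed signature): a Dir that can only sync the whole system satisfies the set with one call, while a file-level Dir falls back to syncing each file in turn, in any order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical bulk-sync sketch.  Guarantee on return: all bytes of every
// named file are durable.  No ordering among the files is promised --
// that's the freedom exposed to impls.
abstract class Dir {
    void sync(Collection<String> names) {
        for (String name : names) {
            syncOne(name); // file-by-file fallback for simple impls
        }
    }

    protected abstract void syncOne(String name);
}

// Toy impl that records what it sync'd, for demonstration only.
class LoggingDir extends Dir {
    final List<String> synced = new ArrayList<>();

    @Override
    protected void syncOne(String name) { synced.add(name); }

    public static void main(String[] args) {
        LoggingDir dir = new LoggingDir();
        dir.sync(Arrays.asList("_x.cfs", "segments_2"));
        System.out.println(dir.synced);
    }
}
```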

If we stick with the file-by-file API, doing a system-wide sync is
somewhat trickier... because you can't assume from one call to the
next that nothing else has changed.

Also, bulk sync better matches the semantics IW/IR require: these
consumers don't care about the order in which the files are sync'd;
they just care that the requested set is sync'd.  So it exposes a
degree of freedom to the Dir impls that's otherwise hidden today.

> IndexWriter.synced  field accumulates data leading to a Memory Leak
> -------------------------------------------------------------------
>                 Key: LUCENE-2328
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
>         Environment: all
>            Reporter: Gregor Kaczor
>            Priority: Minor
>             Fix For: 3.1
>   Original Estimate: 1h
>  Remaining Estimate: 1h
> I am running into a strange OutOfMemoryError. My small test application
> indexes and deletes a few files; this is repeated 60k times. Optimization
> is run every 2k indexed files. The index size is 50KB. I analyzed the
> heap dump and found that the IndexWriter.synced field occupied more than
> half of the heap. That field is a private HashSet without a getter; its
> task is to hold files which have already been sync'd.
> There are two calls to addAll and one call to add on synced, but no
> remove or clear throughout the lifecycle of the IndexWriter instance.
> According to the Eclipse Memory Analyzer, synced contains 32618 entries
> which look like file names ("_e065_1.del" or "_e067.cfs"), while the
> index directory contains only 10 files.
> I guess synced is holding obsolete data.
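
The growth pattern the reporter describes can be modeled with a simplified sketch (this is not the real IndexWriter, just an illustration of the add-only set): each commit adds file names that are never pruned when the files are deleted, so a long-lived writer accumulates entries without bound even though only a handful of files stay live.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of the reported leak: a set that receives adds on
// every commit but never sees remove or clear, mirroring the behavior
// described for IndexWriter.synced.
class LeakModel {
    private final Set<String> synced = new HashSet<>(); // only ever grows

    void commit(int gen) {
        String seg = "_" + Integer.toString(gen, 36);
        synced.add(seg + ".cfs");    // sync'd, recorded...
        synced.add(seg + "_1.del");  // ...and never forgotten on delete
    }

    int size() { return synced.size(); }

    public static void main(String[] args) {
        LeakModel w = new LeakModel();
        for (int gen = 0; gen < 60000; gen++) {
            w.commit(gen);
        }
        // 120000 retained entries, though only ~10 files would be live.
        System.out.println(w.size());
    }
}
```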

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
