MemoryIndex was designed to maximize performance for a specific use
case: a pure in-memory data structure, at most one document per
MemoryIndex instance, any number of fields, high-frequency reads,
high-frequency index writes, no thread-safety required, and optional
support for storing offsets.
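To make that use case concrete, here is a minimal sketch of a single-document, multi-field, in-memory index. It is not the actual MemoryIndex code: the class name SingleDocIndex, its methods, and the whitespace tokenization are all hypothetical and chosen only for illustration. Like the design described above, it holds one document, allows any number of fields, and makes no attempt at thread-safety.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; NOT Lucene's MemoryIndex implementation. */
final class SingleDocIndex {
    // field name -> (term -> token positions of that term within the field)
    private final Map<String, Map<String, List<Integer>>> fields = new HashMap<>();

    /**
     * Indexes one field of the single document. Tokenization is a naive
     * whitespace split plus lower-casing (an assumption of this sketch).
     */
    void addField(String field, String text) {
        if (fields.containsKey(field)) {
            // Simplification of this sketch: each field is added once.
            throw new IllegalArgumentException("field already added: " + field);
        }
        Map<String, List<Integer>> terms = new HashMap<>();
        int position = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only when the text is blank
            }
            String term = token.toLowerCase(Locale.ROOT);
            terms.computeIfAbsent(term, t -> new ArrayList<>()).add(position++);
        }
        fields.put(field, terms);
    }

    /** Positions of the term in the field, or an empty list if absent. */
    List<Integer> positions(String field, String term) {
        Map<String, List<Integer>> terms = fields.get(field);
        if (terms == null) {
            return Collections.emptyList();
        }
        List<Integer> found = terms.get(term);
        return found == null ? Collections.<Integer>emptyList() : found;
    }

    /** Number of occurrences of the term in the field. */
    int termFrequency(String field, String term) {
        return positions(field, term).size();
    }
}
```

Because there is only ever one document, a posting is just a list of positions with no document ids. That is a large part of why a multi-document variant would likely need a different internal layout.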
I briefly considered extending it to the multi-document case, but
eventually refrained from doing so, because I didn't really need such
functionality myself (no itch). Here are some issues to consider when
attempting such an extension:
- The internal data structure would probably look quite different
- Data structure/algorithmic trade-offs regarding time vs. space, read
vs. write frequency, common vs. less common use cases
- Hence, it may well turn out that there's not much to reuse.
- A priori, it isn't clear whether a new solution would be
significantly faster than normal RAMDirectory usage. Thus...
- Need a benchmark suite to evaluate the chosen trade-offs.
- Need tests to ensure correctness (in practice, meaning it behaves
just like the existing alternative).
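The "behaves just like the existing alternative" requirement could be checked with a small differential test: run the same inputs through a trusted reference and through the new candidate, and report the first input on which they disagree. The sketch below is generic and uses only the JDK; DifferentialCheck and firstMismatch are hypothetical names, and wiring in a real reference (say, a search over a RAMDirectory) and a real candidate is left to the caller.

```java
import java.util.Objects;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical helper for differential testing; not part of Lucene. */
final class DifferentialCheck {
    private DifferentialCheck() {}

    /**
     * Applies both implementations to every input (inputs must be non-null)
     * and returns the first input whose results differ, or empty if the
     * candidate agrees with the reference on all of them.
     */
    static <I, O> Optional<I> firstMismatch(
            Iterable<I> inputs,
            Function<? super I, ? extends O> reference,
            Function<? super I, ? extends O> candidate) {
        for (I input : inputs) {
            O expected = reference.apply(input);
            O actual = candidate.apply(input);
            if (!Objects.equals(expected, actual)) {
                return Optional.of(input);
            }
        }
        return Optional.empty();
    }
}
```

In such a setup the inputs would be (document, query) pairs, and the two functions would return something comparable with equals(), such as a list of hits or scores.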
I'd say it's a non-trivial undertaking. For example, right now, I
don't have time for such an effort. That doesn't mean it's impossible
or shouldn't be done, of course. If someone would like to run with it
that would be great, but in light of the above issues, I'd suggest
doing it in a new class (say MultiMemoryIndex or similar).
I believe Mark has done some initial work in that direction, based on
an independent (and different) implementation strategy.
Wolfgang.
On May 2, 2006, at 12:25 AM, Robert Engels wrote:
> Along the lines of Lucene-550, what about having a MemoryIndex that
> accepts multiple documents, then wrote the index once at the end in the
> Lucene file format (so it could be merged) during close.
>
> When adding documents using an IndexWriter, a new segment is created for
> each document, and then the segments are periodically merged in memory,
> and/or with disk segments. It seems that when constructing an Index or
> updating a "lot" of documents in an existing index, the write, read, merge
> cycle is inefficient, and if the documents/field information were
> maintained in order (TreeMaps) greater efficiency would be realized.
>
> With a memory index, the memory needed during update will increase
> dramatically, but this could still be bounded, and a "disk based" index
> segment written when too many documents are in the memory index (max
> buffered documents).
>
> Does this "sound" like an improvement? Has anyone else tried something
> like this?