lucene-dev mailing list archives

From "Marvin Humphrey (JIRA)" <>
Subject [jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
Date Sun, 20 Dec 2009 02:10:18 GMT


Marvin Humphrey commented on LUCENE-2026:

> But, that's where Lucy presumably takes a perf hit. Lucene can share
> these in RAM, not using the filesystem as the intermediary (eg we do
> that today with deletions; norms/field cache/eventual CSF can do the
> same.) Lucy must go through the filesystem to share.

For a flush(), I don't think there's a significant penalty.  The only extra
costs Lucy will pay are the bookkeeping costs to update the file system state
and to create the objects that read the index data.  Those are real, but since
we're skipping the fsync(), they're small.  As far as the actual data goes, I
don't see that there's a difference.  Reading from memory mapped RAM isn't any
slower than reading from malloc'd RAM.
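
To make the comparison concrete, here is a minimal Python sketch (not Lucy
code) that reads the same flat file twice: once copied into process RAM with
read(), once through a read-only memory map.  The file name, the 32-bit
little-endian layout, and the helper names are all assumptions made up for
the illustration.

```python
import mmap
import os
import struct
import tempfile

def write_ints(path, values):
    """Write values as little-endian unsigned 32-bit ints (assumed layout)."""
    with open(path, "wb") as f:
        f.write(struct.pack("<%dI" % len(values), *values))

def sum_ints(buf):
    """Sum every 32-bit int in a bytes-like object (bytes or mmap)."""
    count = len(buf) // 4
    return sum(struct.unpack_from("<I", buf, i * 4)[0] for i in range(count))

path = os.path.join(tempfile.mkdtemp(), "ints.dat")
write_ints(path, list(range(1000)))

# Path 1: copy the file into memory owned by this process.
with open(path, "rb") as f:
    copied = f.read()
total_copied = sum_ints(copied)

# Path 2: map the file; pages come straight from the OS page cache.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        total_mapped = sum_ints(mapped)
    finally:
        mapped.close()
```

The consuming code (sum_ints) is identical in both cases; only the way the
bytes were obtained differs.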

If we have to fsync(), there'll be a cost, but in Lucene you have to pay that
same cost, too.  Lucene expects to get around it with IndexWriter.getReader().
In Lucy, we'll get around it by having you call flush() and then reopen a
reader somewhere, often in another process.

  * In both cases, the availability of fresh data is decoupled from the fsync.  
  * In both cases, the indexing process has to be careful about dropping data
    on the floor before a commit() succeeds.
  * In both cases, it's possible to protect against unbounded corruption by
    rolling back to the last commit.
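
The three points above can be sketched with a toy writer.  This is a
hypothetical illustration in Python, not the Lucy or Lucene API: ToyWriter
and its flush(), commit(), and rollback() methods are invented names, and a
real index would also track a snapshot file.

```python
import os
import tempfile

class ToyWriter:
    """Hypothetical writer: flush() publishes data, commit() makes it durable."""

    def __init__(self, directory):
        self.dir = directory
        self.flushed = []    # segments visible to newly opened readers
        self.committed = []  # segments that have been fsync'd

    def flush(self, name, payload):
        # Write and close the file but skip fsync.  The bytes sit in the OS
        # page cache, where other processes can already read them.
        with open(os.path.join(self.dir, name), "wb") as f:
            f.write(payload)
        self.flushed.append(name)

    def commit(self):
        # Pay the fsync cost once per segment that is not yet durable.
        for name in self.flushed:
            if name not in self.committed:
                with open(os.path.join(self.dir, name), "r+b") as f:
                    os.fsync(f.fileno())
                self.committed.append(name)

    def rollback(self):
        # Drop everything flushed since the last successful commit.
        for name in self.flushed:
            if name not in self.committed:
                os.remove(os.path.join(self.dir, name))
        self.flushed = list(self.committed)

directory = tempfile.mkdtemp()
writer = ToyWriter(directory)
writer.flush("seg_1", b"first batch")
writer.commit()
writer.flush("seg_2", b"second batch")
visible_before = sorted(os.listdir(directory))  # fresh data, not yet durable
writer.rollback()
visible_after = sorted(os.listdir(directory))   # back to the last commit
```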

> Mostly I was thinking performance, ie, trusting the OS to make good
> decisions about what should be RAM resident, when it has limited
> information...

Right, for instance because we generally can't force the OS to pin term
dictionaries in RAM, as discussed a while back.  It's not an ideal situation,
but Lucene's approach isn't bulletproof either, since Lucene's term
dictionaries can get paged out too.  

We're sure not going to throw away all the advantages of mmap and go back to
reading data structures into process RAM just because of that.

> But, also risky is that all important data structures must be "file-flat",
> though in practice that doesn't seem like an issue so far? 

It's a constraint.  For instance, to support mmap, string sort caches
currently require three "files" each: ords, offsets, and UTF-8 character data.  
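
As a rough illustration of that three-file layout, here is a Python sketch.
The file names ("ords", "offsets", "chars"), the integer widths, and the
SortCacheReader class are assumptions for the example and not Lucy's actual
format.  It also assumes at least one non-empty value, since an empty file
can't be mapped.

```python
import mmap
import os
import struct
import tempfile

def write_sort_cache(directory, doc_values):
    """Write ords, offsets, and UTF-8 character data as three flat files."""
    uniq = sorted(set(doc_values))
    ord_of = {v: i for i, v in enumerate(uniq)}
    encoded = [v.encode("utf-8") for v in uniq]
    offsets = [0]
    for e in encoded:
        offsets.append(offsets[-1] + len(e))
    ords = [ord_of[v] for v in doc_values]
    files = {
        "ords": struct.pack("<%di" % len(ords), *ords),            # one per doc
        "offsets": struct.pack("<%dq" % len(offsets), *offsets),   # n + 1 entries
        "chars": b"".join(encoded),                                # all values
    }
    for name, blob in files.items():
        with open(os.path.join(directory, name), "wb") as f:
            f.write(blob)

class SortCacheReader:
    """Reads the three files through mmap, using only primitive types."""

    def __init__(self, directory):
        self._maps = {}
        for name in ("ords", "offsets", "chars"):
            with open(os.path.join(directory, name), "rb") as f:
                self._maps[name] = mmap.mmap(f.fileno(), 0,
                                             access=mmap.ACCESS_READ)

    def ord(self, doc_id):
        return struct.unpack_from("<i", self._maps["ords"], doc_id * 4)[0]

    def value(self, doc_id):
        o = self.ord(doc_id)
        start, end = struct.unpack_from("<2q", self._maps["offsets"], o * 8)
        return self._maps["chars"][start:end].decode("utf-8")

    def close(self):
        for m in self._maps.values():
            m.close()

directory = tempfile.mkdtemp()
write_sort_cache(directory, ["pear", "apple", "fig", "apple"])
reader = SortCacheReader(directory)
first_value = reader.value(0)
# Sorting needs only the ords; no string is ever materialized.
sorted_docs = sorted(range(4), key=reader.ord)
reader.close()
```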

The compound file system makes the file proliferation bearable, though.  And
it's actually nice in a way to have data structures as named files, strongly
separated from each other and persistent.

If we were willing to ditch portability, we could cast to arrays of structs in
Lucy -- but so far we've just used primitives.  I'd like to keep it that way,
since it would be nice if the core Lucy file format was at least theoretically
compatible with a pure Java implementation.  But Lucy plugins could break that
rule and cast to structs if desired.  

> The RAM resident things Lucene has - norms, deleted docs, terms index, field
> cache - seem to "cast" just fine to file-flat. 

There are often benefits to keeping stuff "file-flat", particularly when the
file-flat form is compressed.  If we were to expand those sort caches to
string objects, they'd take up more RAM than they do now.

I think the only significant drawback is security: we can't trust memory
mapped data the way we can data which has been read into process RAM and
checked on the way in.  For instance, we need to perform UTF-8 sanity checking
each time a string sort cache value escapes the controlled environment of the
cache reader.  If the sort cache value was instead derived from an existing
string in process RAM, we wouldn't need to check it.
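
A sketch of what that check might look like, in Python for brevity.  The
escape_value() helper is hypothetical, and a bytearray stands in for the
mapped region; the point is only that validation happens at the moment the
bytes leave the reader.

```python
def escape_value(buf, start, end):
    """Return buf[start:end] as a str, refusing anything that isn't UTF-8."""
    raw = bytes(buf[start:end])
    try:
        # A strict decode rejects malformed sequences instead of
        # silently replacing them.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(
            "corrupt sort cache value at %d..%d" % (start, end)) from exc

good = bytearray("naïve".encode("utf-8"))
bad = bytearray(b"na\xffve")  # 0xFF never appears in well-formed UTF-8

value = escape_value(good, 0, len(good))
try:
    escape_value(bad, 0, len(bad))
    rejected = False
except ValueError:
    rejected = True
```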

> If we switched to an FST for the terms index I guess that could get
> tricky...

Hmm, I haven't been following that.  Too much work to keep up with those
giganto patches for flex indexing, even though it's a subject I'm intimately
acquainted with and deeply interested in.  I plan to look it over when you're
done and see if we can simplify it.  :)

> Wouldn't shared memory be possible for process-only concurrent models?

IPC is a platform-compatibility nightmare.  By restricting ourselves to
communicating via the file system, we save ourselves oodles of engineering
time.  And on really boring, frustrating work, to boot.

> Also, what popular systems/environments have this requirement (only process
> level concurrency) today?

Perl's threads suck.  Actually all threads suck.  Perl's are just worse than
average -- and so many Perl binaries are compiled without them.  Java threads
suck less, but they still suck -- look how much engineering time you folks
blow on managing that stuff.  Threads are a terrible programming model.

I'm not into the idea of forcing Lucy users to use threads.  They should be
able to use processes as their primary concurrency model if they want.

> It's wonderful that Lucy can startup really fast, but, for most apps that's
> not nearly as important as searching/indexing performance, right? 


Total indexing throughput in both Lucene and KinoSearch has been pretty decent
for a long time.  However, there's been a large gap between average index
update performance and worst case index update performance, especially when
you factor in sort cache loading.  There are plenty of applications that may
not have very high throughput requirements but where it may not be acceptable
for an index update to take several seconds or several minutes every once in a
while, even if it usually completes faster.

> I mean, you start only once, and then you handle many, many
> searches / index many documents, with that process, usually?

Sometimes the person who just performed the action that updated the index is
the only one you care about.  For instance, to use a feature request that came
in from Slashdot a while back, if someone leaves a comment on your website,
it's nice to have it available in the search index right away.

Consistently fast index update responsiveness makes personalization of the
customer experience easier.

> But you really need to also test search/indexing throughput, reopen time
> (I think) once that's online for Lucy...


> Is reopen even necessary in Lucy?

Probably.  If you have a boatload of segments and a boatload of fields, you
might start to see file opening and metadata parsing costs come into play.  If
it turns out that for some indexes reopen() can knock down the time from, say,
100 ms to 10 ms or less, I'd consider that sufficient justification.

> Refactoring of IndexWriter
> --------------------------
>                 Key: LUCENE-2026
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
> One could be called a SegmentWriter and as the
> name says its job would be to write one particular index segment. The
> default one just as today will provide methods to add documents and
> flushes when its buffer is full.
> Other SegmentWriter implementations would do things like e.g. appending or
> copying external segments [what addIndexes*() currently does].
> The second component's job would be to manage writing the segments
> file and merging/deleting segments. It would know about
> DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
> provide hooks that allow users to manage external data structures and
> keep them in sync with Lucene's data during segment merges.
> API wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
> Of course these should be lower level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy to use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.
