lucene-dev mailing list archives

From "Jake Mannix (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl
Date Thu, 12 Nov 2009 17:20:40 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777068#action_12777068
] 

Jake Mannix commented on LUCENE-1526:
-------------------------------------

bq. OK. It's clear Zoie's design is optimized for insanely fast reopen.

That, and maxing out QPS and indexing rate while keeping query latency degradation to a minimum.
 When I tried turning off the extra deleted check on a 5M doc index, queries took 12-13ms
with the extra check turned on and 10ms without it. You only really start to notice at the
extreme edges (queries hitting all 5 million docs by way of an actual query, not
MatchAllDocs), where latency goes from maybe 100ms without the check to 140-150ms with it.

bq. EG what I'd love to see is, as a function of reopen rate, the "curve" of QPS vs docs per
sec. Ie, if you reopen 1X per second, that consumes some of your machine's resources. What's
left can be spent indexing or searching or both, so, it's a curve/line. So we should set up
fixed rate indexing, and then redline the QPS to see what's possible, and do this for multiple
indexing rates, and for multiple reopen rates.

Yes, that curve would be a very useful benchmark.  Now that I think of it, it wouldn't be
too hard to just sneak some reader caching into the ZoieSystem with a tunable parameter for
how long you hang onto it, so that we could see how much that can help.  One of the nice things
that we can do in Zoie with this kind of index-latency backoff comes from having an in-memory
two-way mapping between zoie-specific UID and docId: if we actually have time (in the
background, since we're caching these readers now), we can zip through and update the real
delete BitVectors on the segments and lose the extra check at query time. The extra check
would then only be used if you have the index-latency time set below some threshold
(determined by how long it takes the system to do this resolution - mapping docId to UID is
an array lookup, the reverse is a little slower).
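
For illustration only, here is a minimal sketch of that two-way mapping and the background
delete resolution. It is not Zoie's actual code: the class name UidDocIdMapper, the method
resolveDeletes, and the use of java.util.BitSet as a stand-in for Lucene's BitVector are all
assumptions made for the example.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and structure are assumptions, not Zoie's API.
final class UidDocIdMapper {
    // docId -> UID: a plain array lookup.
    private final long[] uidByDocId;
    // UID -> docId: a hash lookup, a little slower than the array.
    private final Map<Long, Integer> docIdByUid = new HashMap<>();

    UidDocIdMapper(long[] uidByDocId) {
        this.uidByDocId = uidByDocId.clone();
        for (int docId = 0; docId < this.uidByDocId.length; docId++) {
            docIdByUid.put(this.uidByDocId[docId], docId);
        }
    }

    long uid(int docId) {
        return uidByDocId[docId];
    }

    // Returns -1 when the UID is not present in this segment.
    int docId(long uid) {
        Integer docId = docIdByUid.get(uid);
        return docId == null ? -1 : docId;
    }

    // Background step: turn pending UID deletes into real per-segment delete bits,
    // so the extra per-hit check is no longer needed at query time.
    // BitSet stands in for the segment's delete BitVector.
    int resolveDeletes(long[] deletedUids, BitSet segmentDeletes) {
        int resolved = 0;
        for (long uid : deletedUids) {
            int docId = docId(uid);
            if (docId >= 0 && !segmentDeletes.get(docId)) {
                segmentDeletes.set(docId);
                resolved++;
            }
        }
        return resolved;
    }
}
```

The cost of resolveDeletes is what would set the threshold mentioned above: it does one
reverse (hash) lookup per pending delete.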

bq. Right, Zoie is making determined tradeoffs. I would expect that most apps are fine with
controlled reopen frequency, ie, they would choose to not lose indexing and searching performance
if it means they can "only" reopen, eg, 2X per second.

In theory Zoie is making tradeoffs - in practice, at least against what is on trunk, Zoie's
just going way faster in both indexing and querying in the redline perf test.  I agree that
in principle, once LUCENE-1313 and other improvements are in and the bugs have been worked
out of NRT, query performance should be faster, and if zoie's default BalancedMergePolicy
(née ZoieMergePolicy) is in use for NRT, the indexing performance should be faster too - it's
just not quite there yet at this point.

bq. I agree - having such well defined API semantics ("once updateDoc returns, searches can
see it") is wonderful. But I think they can be cleanly built on top of Lucene NRT as it is
today, with a pre-determined (reopen) latency.

Of course!  These API semantics are already built up on top of plain-old Lucene - even without
NRT, so I can't imagine how NRT would *remove* this ability! :)

bq. I think the "large merge just finished" case is the most costly for such apps (which the
"merged segment warmer" on IW should take care of)? (Because otherwise the segments are tiny,
assuming everything is cutover to "per segment").

Definitely.  One thing that Zoie benefited from, from an API standpoint, which might be nice
in Lucene now that Java 1.5 is in place, is that the IndexReaderWarmer could replace the raw
SegmentReader with a warmed user-specified subclass of SegmentReader:

{code}
public abstract class IndexReaderWarmer<R extends IndexReader> {
  // Warms the raw reader and returns the reader (possibly a user subclass) to pool in its place.
  public abstract R warm(IndexReader rawReader);
}
{code}

The warmer could then replace the reader in the readerPool with the now-warmed, possibly
user-overridden subclass of SegmentReader (now that SegmentReader is as public as IndexReader
itself is).  For users who like to decorate their readers to keep additional state, instead
of using them as keys into separately kept WeakHashMaps, this could be extremely useful (I know
that the people I talked to at Apple's iTunes store do this, as well as in bobo, and zoie,
to name a few examples off the top of my head).

bq.  I think Lucene could handle this well, if we made an IndexReader impl that directly searches
DocumentWriter's RAM buffer. But that's somewhat challenging

Jason mentioned this approach in his talk at ApacheCon, but I'm not at all convinced it's
necessary - if a single box can handle indexing a couple hundred smallish documents a second
(into a RAMDirectory), and that could be sped up by using multiple IndexWriters (writing into
multiple RAMDirectories in parallel, if you were willing to give up some CPU cores to
indexing), and you can search against them without having to do any deduplication / bloom
filter check against the disk, then I'd be surprised if searching the pre-indexed RAM buffer
would really be much of a speedup in comparison to just doing it the simple way.  But I could
be wrong, as I'm not sure how much faster such a search could be.

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
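
To make the idea in the issue title concrete, here is a minimal sketch of a paged
copy-on-write bit set: a clone shares every page with its source, and only the page that
is actually written gets copied, rather than the entire underlying array. This is not the
code in LUCENE-1526.patch; the class name PagedCowBits, the page size, and the method names
are assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical sketch of a paged copy-on-write bit set; not the attached patch.
final class PagedCowBits {
    private static final int PAGE_BITS = 1 << 13;          // bits per page (assumed size)
    private static final int WORDS_PER_PAGE = PAGE_BITS >>> 6;

    private final int size;
    private final long[][] pages;
    // owned[p] is true when this instance may mutate page p in place.
    private final boolean[] owned;

    PagedCowBits(int size) {
        this.size = size;
        int pageCount = (size + PAGE_BITS - 1) / PAGE_BITS;
        this.pages = new long[pageCount][WORDS_PER_PAGE];
        this.owned = new boolean[pageCount];
        Arrays.fill(owned, true);
    }

    private PagedCowBits(int size, long[][] sharedPages) {
        this.size = size;
        this.pages = sharedPages.clone();                   // copies page references only
        this.owned = new boolean[sharedPages.length];       // nothing owned yet
    }

    // Cheap clone: both instances give up ownership and copy a page on their next write to it.
    PagedCowBits cloneShared() {
        Arrays.fill(owned, false);
        return new PagedCowBits(size, pages);
    }

    boolean get(int index) {
        checkIndex(index);
        long word = pages[index / PAGE_BITS][(index % PAGE_BITS) >>> 6];
        return (word & (1L << (index & 63))) != 0;
    }

    void set(int index) {
        checkIndex(index);
        int page = index / PAGE_BITS;
        if (!owned[page]) {
            pages[page] = pages[page].clone();              // copy just this one page
            owned[page] = true;
        }
        pages[page][(index % PAGE_BITS) >>> 6] |= 1L << (index & 63);
    }

    // True when both instances still reference the same underlying page array.
    boolean sharesPageWith(PagedCowBits other, int page) {
        return pages[page] == other.pages[page];
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
    }
}
```

With this layout, a delete after a clone copies one page (1 KB here) instead of the whole
bit array, which is the cost the description above is trying to avoid.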

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

