lucene-dev mailing list archives

From "Jason Rutherglen" <jason.rutherg...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
Date Fri, 21 Nov 2008 17:58:32 GMT
> In KS, the relevant IndexReader methods no longer take a Term object.  (In
> fact, there IS no Term object any more -- KinoSearch::Index::Term has been
> removed.)  Instead, they take a string field and a generic "Obj".

Allowing pluggable data for the term text would be good.  Numeric data could
be stored as bytes instead of strings (which are costly in terms of parsing
and garbage collection).  Have term payloads been discussed?
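[As a sketch of the idea above (my own code, not from the thread): if numeric term data is stored as big-endian bytes, plain byte comparison agrees with numeric order, so no string parsing is needed at compare time.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Encode a 64-bit unsigned value big-endian so that memcmp() on the
 * encoded bytes agrees with numeric order -- one way numeric "term
 * text" could be stored as raw bytes instead of parsed strings. */
static void
encode_u64_sortable(uint64_t value, unsigned char out[8])
{
    for (int i = 7; i >= 0; i--) {
        out[i] = (unsigned char)(value & 0xFF);
        value >>= 8;
    }
}
```

[Signed values would additionally need the sign bit flipped so that negatives sort before positives.]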

On Thu, Nov 20, 2008 at 5:50 PM, Marvin Humphrey (JIRA) <jira@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649569#action_12649569]
>
> Marvin Humphrey commented on LUCENE-1458:
> -----------------------------------------
>
> > Take a large Jira instance, where the app itself is also
> > consuming a lot of RAM, doing a lot of its own IO, etc., where perhaps
> > searching is done infrequently enough relative to other operations
> > that the OS may no longer think the pages you hit for the terms index
> > are hot enough to keep around.
>
> Search responsiveness is already compromised in such a situation, because
> we can all but guarantee that the posting list files have already been
> evicted from cache.  If the box has enough RAM for the large JIRA
> instance, including the Lucene index, search responsiveness won't be a
> problem.  As soon as you start running a little short on RAM, though,
> there's no way to stop infrequent searches from being sluggish.
>
> Nevertheless, the terms index isn't that big in comparison to, say, the
> size of a posting list for a common term, so the cost of re-heating it
> isn't astronomical in the grand scheme of things.
>
> > Similarly, when a BG merge is burning through data, or say backup kicks
> > off and moves many GB, or the simple act of iterating through a big
> > postings list, the OS will gleefully evict my terms index or norms in
> > order to populate its IO cache with data it won't need again for a very
> > long time.
>
> When that background merge finishes, the new files will be hot.  So, if we
> open a new IndexReader right away and that IndexReader uses mmap() to get
> at the file data, new segments will be responsive right away.
>
> Even better, any IO caches for old segments used by the previous
> IndexReader may still be warm.  All of this without having to decompress a
> bunch of stream data into per-process data structures at IndexReader
> startup.
>
> The terms index could indeed get evicted some of the time on busy systems,
> but the point is that the system IO cache usually works in our favor, even
> under load.
>
> As far as backup daemons blowing up everybody's cache, that's stupid,
> pathological behavior: <http://kerneltrap.org/node/3000#comment-8573>.
> Such apps ought to be calling madvise(ptr, len, MADV_SEQUENTIAL) so that
> the kernel knows it can recycle the cache pages as soon as they're
> cleared.
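[A minimal sketch of the hint being described, assuming a POSIX system; the wrapper name is mine:]

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Advise the kernel that [ptr, ptr+len) will be read sequentially, so
 * its cache pages may be recycled as soon as they are consumed rather
 * than evicting hotter data (e.g. a terms index).  Returns 0 on
 * success, -1 on failure. */
static int
advise_sequential(void *ptr, size_t len)
{
    return madvise(ptr, len, MADV_SEQUENTIAL);
}
```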
>
> >> But hey, we can simplify even further! How about dispensing with the
> >> index file? We can just divide the main dictionary file into blocks
> >> and binary search on that.
> >
> > I'm not convinced this'll be a win in practice. You are now paying an
> > even higher overhead cost for each "check" of your binary search,
> > especially with something like pulsing which inlines more stuff into
> > the terms dict. I agree it's simpler, but I think that's trumped by
> > the performance hit.
>
> I'm persuaded that we shouldn't do away with the terms index.  Even if
> we're operating on a dedicated search box with gobs of RAM, loading entire
> cache pages when we only care about the first few bytes of each is poor
> use of memory bandwidth.  And, just in case the cache does get blown, we'd
> like to keep the cost of rewarming down.
>
> Nathan Kurz and I brainstormed this subject in a phone call this morning,
> and we came up with a three-file lexicon index design:
>
>  * A file which is a solid stack of 64-bit file pointers into the lexicon
>    index term data.  Term data UTF-8 byte length can be determined by
>    subtracting the current pointer from the next one (or the file length
>    at the end).
>  * A file which contains solid UTF-8 term content.  (No string lengths,
>    no file pointers, just character data.)
>  * A file which is a solid stack of 64-bit file pointers into the primary
>    lexicon.
>
> Since the integers are already expanded and the raw UTF-8 data can be
> compared as-is, those files can be memory-mapped and used as-is for binary
> search.
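[As a rough illustration of that layout (names and signatures are mine, not from KS), the binary search reduces to memcmp() over two arrays; in real use both arrays would be mmap()ed file data, here they are plain in-memory stand-ins:]

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Binary-search a lexicon index laid out as described above: `offsets`
 * is the solid stack of file pointers into `term_data`, with a final
 * sentinel equal to the data length, so the i-th term occupies bytes
 * [offsets[i], offsets[i+1]).  Raw UTF-8 compares correctly with
 * memcmp, so no decoding is needed.  Returns the index of the greatest
 * term <= target, or -1 if target precedes all terms. */
static int64_t
lex_index_search(const uint64_t *offsets, int64_t num_terms,
                 const char *term_data,
                 const char *target, size_t target_len)
{
    int64_t lo = 0, hi = num_terms - 1, result = -1;
    while (lo <= hi) {
        int64_t mid = lo + (hi - lo) / 2;
        size_t len = (size_t)(offsets[mid + 1] - offsets[mid]);
        size_t min_len = len < target_len ? len : target_len;
        int cmp = memcmp(term_data + offsets[mid], target, min_len);
        if (cmp == 0)
            cmp = (len > target_len) - (len < target_len);
        if (cmp <= 0) { result = mid; lo = mid + 1; }
        else          { hi = mid - 1; }
    }
    return result;
}
```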
>
> > In Lucene java, the concurrency model we are aiming for is a single JVM
> > sharing a single instance of IndexReader.
>
> When I mentioned this to Nate, he remarked that we're using the OS kernel
> like you're using the JVM.
>
> We don't keep a single IndexReader around, but we do keep the bulk of its
> data cached so that we can just slap a cheap wrapper around it.
>
> > I do agree, if fork() is the basis of your concurrency model then
> > sharing pages becomes critical.  However, modern OSs implement
> > copy-on-write sharing of VM pages after a fork, so that's another good
> > path to sharing?
>
> Lucy/KS can't enforce that, and we wouldn't want to.  It's very convenient
> to be able to launch a cheap search process.
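[A small illustration (mine, not from Lucy/KS) of why a forked search process is cheap: after fork(), an mmap()ed region is shared with the parent rather than reloaded or copied.]

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a forked child can read the parent's MAP_SHARED mapping
 * (a stand-in for an mmap()ed index file) without any copy or reload,
 * 0 otherwise. */
static int
forked_child_sees_mapping(void)
{
    size_t len = 4096;
    char *index_data = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (index_data == MAP_FAILED) return 0;
    strcpy(index_data, "terms index data");

    pid_t pid = fork();
    if (pid < 0) { munmap(index_data, len); return 0; }
    if (pid == 0) {
        /* Child "search process": reads the same physical pages. */
        _exit(strcmp(index_data, "terms index data") == 0 ? 0 : 1);
    }
    int status;
    waitpid(pid, &status, 0);
    munmap(index_data, len);
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```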
>
> > Have you tried any actual tests swapping these approaches in as your
> > terms index impl?
>
> No -- changing something like this requires a lot of coding, so it's
> better to do thought experiments first to winnow down the options.
>
> > Tests of fully hot and fully cold ends of the
> > spectrum would be interesting, but also tests where a big segment
> > merge or a backup is running in the background...
>
> >> That doesn't meet the design goals of bringing the cost of
> >> opening/warming an IndexReader down to near-zero and sharing backing
> >> buffers among multiple forks.
> >
> > That's a nice goal. Our biggest cost in Lucene is warming the
> > FieldCache, used for sorting, function queries, etc.
>
> Exactly. It would be nice to add a plug-in indexing component that writes
> sort caches to files that can be memory mapped at IndexReader startup.
> There would be multiple files: both a solid array of 32-bit integers
> mapping document number to sort order, and the field cache values.  Such a
> component would allow us to move the time it takes to read in a sort cache
> from IndexReader-startup-time to index-time.
>
> Hmm, maybe we can conflate this with a column-stride field writer and
> require that sort fields have a fixed width?
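[A sketch of what consuming such a cache could look like (names are hypothetical): with the ords array mmap()ed at IndexReader startup, comparing two documents for a sorted search is just two array loads.]

```c
#include <assert.h>
#include <stdint.h>

/* `ords` is the solid array of 32-bit integers mapping document number
 * to sort order, written at index time and mmap()ed at IndexReader
 * startup.  Returns <0, 0, or >0 as doc_a sorts before, equal to, or
 * after doc_b. */
static int
compare_docs_by_sort_cache(const int32_t *ords, int32_t doc_a, int32_t doc_b)
{
    int32_t a = ords[doc_a], b = ords[doc_b];
    return (a > b) - (a < b);
}
```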
>
> > In my approach here, the blob is opaque to the terms dict reader: it
> > simply seeks to the right spot in the tis file, and then asks the
> > codec to decode the entry. TermsDictReader is entirely unaware of
> > what/how is stored there.
>
> Sounds good.  Basically, a hash lookup.
>
> In KS, the relevant IndexReader methods no longer take a Term object.  (In
> fact, there IS no Term object any more -- KinoSearch::Index::Term has been
> removed.)  Instead, they take a string field and a generic "Obj".
>
>    Lexicon*
>    SegReader_lexicon(SegReader *self, const CharBuf *field, Obj *term)
>    {
>        return (Lexicon*)LexReader_Lexicon(self->lex_reader, field, term);
>    }
>
> I suppose we genericize this by adding a TermsDictReader/LexReader
> argument to the IndexReader constructor?  That way, someone can supply a
> custom subclass that knows how to decode custom dictionary files.
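[A hypothetical sketch of that genericization in the C style of the KS snippet above; the struct layout and function-pointer wiring are mine, only the LexReader/SegReader names come from the quoted code:]

```c
#include <assert.h>
#include <stddef.h>

typedef struct LexReader {
    /* Decodes one lexicon entry; the index reader treats the bytes as
     * an opaque blob and delegates all interpretation here, so a
     * custom subclass can decode custom dictionary files. */
    void (*decode_entry)(struct LexReader *self, const void *blob);
} LexReader;

typedef struct SegReader {
    LexReader *lex_reader;
} SegReader;

/* The constructor takes the pluggable LexReader. */
static void
SegReader_init(SegReader *self, LexReader *lex_reader)
{
    self->lex_reader = lex_reader;
}

/* An example custom decoder a user might supply. */
static void
noop_decode(LexReader *self, const void *blob)
{
    (void)self;
    (void)blob;
}
```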
>
>
> > Further steps towards flexible indexing
> > ---------------------------------------
> >
> >                 Key: LUCENE-1458
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
> >             Project: Lucene - Java
> >          Issue Type: New Feature
> >          Components: Index
> >    Affects Versions: 2.9
> >            Reporter: Michael McCandless
> >            Assignee: Michael McCandless
> >            Priority: Minor
> >             Fix For: 2.9
> >
> >         Attachments: LUCENE-1458.patch, LUCENE-1458.patch,
> LUCENE-1458.patch
> >
> >
> > I attached a very rough checkpoint of my current patch, to get early
> > feedback.  All tests pass, though back compat tests don't pass due to
> > changes to package-private APIs plus certain bugs in tests that
> > happened to work (eg call TermPositions.nextPosition() too many times,
> > which the new API asserts against).
> > [Aside: I think, when we commit changes to package-private APIs such
> > that back-compat tests don't pass, we could go back, make a branch on
> > the back-compat tag, commit changes to the tests to use the new
> > package private APIs on that branch, then fix nightly build to use the
> > tip of that branch?]
> > There's still plenty to do before this is committable! This is a
> > rather large change:
> >   * Switches to a new more efficient terms dict format.  This still
> >     uses tii/tis files, but the tii only stores term & long offset
> >     (not a TermInfo).  At seek points, tis encodes term & freq/prox
> >     offsets absolutely instead of with deltas.  Also, tis/tii
> >     are structured by field, so we don't have to record field number
> >     in every term.
> > .
> >     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> >     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> > .
> >     RAM usage when loading terms dict index is significantly less
> >     since we only load an array of offsets and an array of String (no
> >     more TermInfo array).  It should be faster to init too.
> > .
> >     This part is basically done.
> >   * Introduces modular reader codec that strongly decouples terms dict
> >     from docs/positions readers.  EG there is no more TermInfo used
> >     when reading the new format.
> > .
> >     There's nice symmetry now between reading & writing in the codec
> >     chain -- the current docs/prox format is captured in:
> > {code}
> > FormatPostingsTermsDictWriter/Reader
> > FormatPostingsDocsWriter/Reader (.frq file) and
> > FormatPostingsPositionsWriter/Reader (.prx file).
> > {code}
> >     This part is basically done.
> >   * Introduces a new "flex" API for iterating through the fields,
> >     terms, docs and positions:
> > {code}
> > FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> > {code}
> >     This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> >     old API on top of the new API to keep back-compat.
> >
> > Next steps:
> >   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> >     fix any hidden assumptions.
> >   * Expose new API out of IndexReader, deprecate old API but emulate
> >     old API on top of new one, switch all core/contrib users to the
> >     new API.
> >   * Maybe switch to AttributeSources as the base class for TermsEnum,
> >     DocsEnum, PostingsEnum -- this would give readers API flexibility
> >     (not just index-file-format flexibility).  EG if someone wanted
> >     to store payload at the term-doc level instead of
> >     term-doc-position level, you could just add a new attribute.
> >   * Test performance & iterate.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
