lucene-dev mailing list archives

From "Jason Rutherglen" <jason.rutherg...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
Date Tue, 25 Nov 2008 22:39:15 GMT
> simple default could be B-Tree with prefix compression, it never
> disappoints and is relatively simple to implement. Berkeley DB (java
> edition) uses this

Just to clarify, are you saying BDB Java performs prefix compression by
default?

On Sat, Nov 22, 2008 at 6:38 AM, eks dev <eksdev@yahoo.co.uk> wrote:

> that's the way to go!
>
> simple default could be B-Tree with prefix compression, it never
> disappoints and is relatively simple to implement. Berkeley DB (java
> edition) uses this,  I think apache xindice as well ...
>
> if you really go heavyweight, then the String B-Tree looks interesting, as
> it mixes Patricia Trie features into the B-Tree framework.
>
> but, having anything around FSTs, Aho-Corasick automata, the trie family
> (Patricia, Ternary Search Tree), DAWGs... would make Fuzzy search usable
> (think turbo SpellChecker without a separate index). It also makes simple
> and not-so-simple wildcard searches trivial!
> There is a simple way to calculate (weighted) Edit Distance by walking trie
> structures with some sort of beam (a few examples of how this works on
> in-memory structures: http://code.google.com/p/trie/wiki/Status , the
> Lingpipe spell checker and chunker... ). The trick is that you do not need
> to calculate the distance against all terms; you recycle the DP matrix for
> prefixes.
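A minimal sketch of the DP-matrix-recycling idea described above, assuming a plain in-memory trie. All class and method names here are illustrative, not Lucene or Lingpipe APIs; each trie edge extends the parent's DP row by one, so the work for a shared prefix is done once for every term under it, and a simple beam (minimum cell in the row) prunes whole subtrees:

```java
import java.util.*;

// Minimal in-memory trie with fuzzy lookup; hypothetical names, not a real API.
class TrieFuzzy {
    static class Node {
        TreeMap<Character, Node> children = new TreeMap<>();
        String term; // non-null iff a term ends at this node
    }

    final Node root = new Node();

    void add(String term) {
        Node n = root;
        for (char c : term.toCharArray())
            n = n.children.computeIfAbsent(c, k -> new Node());
        n.term = term;
    }

    // Collect all terms within maxDist edits of 'query'.
    List<String> search(String query, int maxDist) {
        List<String> hits = new ArrayList<>();
        int[] firstRow = new int[query.length() + 1];
        for (int i = 0; i <= query.length(); i++) firstRow[i] = i;
        for (Map.Entry<Character, Node> e : root.children.entrySet())
            walk(e.getValue(), e.getKey(), query, firstRow, maxDist, hits);
        return hits;
    }

    private void walk(Node node, char ch, String query, int[] prevRow,
                      int maxDist, List<String> hits) {
        int cols = query.length() + 1;
        // One new DP row per trie edge: the prefix rows are "recycled"
        // from the parent instead of recomputed per term.
        int[] row = new int[cols];
        row[0] = prevRow[0] + 1;
        for (int col = 1; col < cols; col++) {
            int insert = row[col - 1] + 1;
            int delete = prevRow[col] + 1;
            int replace = prevRow[col - 1]
                          + (query.charAt(col - 1) == ch ? 0 : 1);
            row[col] = Math.min(insert, Math.min(delete, replace));
        }
        if (node.term != null && row[cols - 1] <= maxDist) hits.add(node.term);
        // Beam: descend only if some cell is still within the budget.
        int min = Integer.MAX_VALUE;
        for (int v : row) min = Math.min(min, v);
        if (min <= maxDist)
            for (Map.Entry<Character, Node> e : node.children.entrySet())
                walk(e.getValue(), e.getKey(), query, row, maxDist, hits);
    }
}
```

Because the prune happens on the row minimum, entire subtrees vanish as soon as no completion can still come within the distance budget, which is what makes the "turbo SpellChecker" idea practical.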
>
> just opening the possibility to do it at all is an obvious step one :)
>
> Great work Mike, as usual.
>
> cheers, eks
>
> ----- Original Message ----
> > From: Michael McCandless <lucene@mikemccandless.com>
> > To: java-dev@lucene.apache.org
> > Sent: Saturday, 22 November, 2008 14:46:38
> > Subject: Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
> >
> > Thinking about this more...
> >
> > One way to compress the terms index in RAM and on disk would be to use
> > a pre-compiled (during indexing) letter FST (finite state transducer).
> >
> > It would be similar to a letter-trie, which is similar to a binary
> > tree except at each node you have N letters as children, so that
> > common prefixes are shared.  But an FST also shares the common
> > suffixes, and "outputs" the data you need (in our case the long
> > offsets into the main terms dict file) in the middle of the word when
> > you've identified its unique prefix.
> >
> > To look up a term in the index, you walk letter by letter down this
> > tree.  At some point you may find no letter matching your term, at
> > which point you'd fallback to the highest letter lower than the one
> > you wanted (since we want to look up highest term <= target).
> > Eventually you will hit an output, which you record as the offset into
> > main terms file.  After the output there's only 1 edge leaving every
> > node (in the common suffix part of the FST) but you have to keep
> > walking those edges to fully reconstruct the term you just found.
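Independent of the FST encoding Mike sketches, the lookup contract of the terms index he describes (find the highest indexed term <= target, and get back the long offset into the main terms dict file) can be illustrated with a plain sorted map; the class and method names below are hypothetical:

```java
import java.util.*;

// Illustrates only the lookup *contract* of a terms index: highest
// indexed term <= target, yielding a long offset into the main terms
// dict file. Hypothetical names; a real FST would encode this far more
// compactly by sharing prefixes and suffixes.
class TermsIndexSketch {
    private final TreeMap<String, Long> index = new TreeMap<>();

    void put(String indexedTerm, long offsetInTermsFile) {
        index.put(indexedTerm, offsetInTermsFile);
    }

    // Returns -1 if target sorts before every indexed term.
    long seekFloor(String target) {
        Map.Entry<String, Long> e = index.floorEntry(target);
        return e == null ? -1L : e.getValue();
    }
}
```

The "fallback to the highest letter lower than the one you wanted" walk in the FST is exactly what `floorEntry` does here in one call; the FST wins on memory, not on semantics.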
> >
> > I'm sure there are clever implementations that would minimize RAM
> > usage.
> >
> > If we [eventually] make the terms index pluggable then people can
> > explore these different approaches...
> >
> > Mike
> >
> > Jason Rutherglen wrote:
> >
> > > It would be nice to have btree-like features such as previous(), min
> > > and max.  Also a unique sequence id per term that enables faster
> > > lookup if the term id is known.
> > >
> > > On Wed, Nov 19, 2008 at 1:38 PM, Michael McCandless wrote:
> > >
> > > I think we wouldn't do any term compression for the btree, at least
> > > for the parts loaded in RAM (we don't today, ie, we create the full
> > > Term or String as an array).
> > >
> > > For the parts left on disk we should be able to do something similar
> > > to what we do today, eg for child nodes only encode the "delta" wrt
> > > the parent node?
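A sketch of the kind of prefix-delta encoding suggested here: each term stores only the length of the prefix it shares with the previous (parent) term plus the remaining suffix. This is an illustration of the idea, not Lucene's actual on-disk format, and the pipe-separated string encoding is purely for readability:

```java
import java.util.*;

// Prefix-delta term encoding sketch: (shared-prefix length, suffix) per
// term, relative to the previous term. Hypothetical format, not Lucene's.
class PrefixDelta {
    static List<String> encode(List<String> sortedTerms) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String t : sortedTerms) {
            int shared = 0;
            int max = Math.min(prev.length(), t.length());
            while (shared < max && prev.charAt(shared) == t.charAt(shared))
                shared++;
            // e.g. "4|er" means: reuse 4 chars of the previous term, append "er".
            out.add(shared + "|" + t.substring(shared));
            prev = t;
        }
        return out;
    }

    static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String e : encoded) {
            int bar = e.indexOf('|');
            int shared = Integer.parseInt(e.substring(0, bar));
            prev = prev.substring(0, shared) + e.substring(bar + 1);
            out.add(prev);
        }
        return out;
    }
}
```

On sorted terms the shared prefixes are long, so most entries collapse to a small integer plus a short suffix, which is where the on-disk savings come from.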
> > >
> > > Mike
> > >
> > >
> > > Jason Rutherglen wrote:
> > >
> > > Michael B: Are you interested in making column stride fields realtime
> > > and using the btree for the terms?  This is an idea I started on,
> > > called the tag index, where the postings are divided into blocks.  The
> > > blocks can then be replaced in memory, with periodic flushes to disk
> > > as the in-RAM postings grow.
> > >
> > > Michael M: How would the term compression be handled in a btree model?
> > >
> > > On Wed, Nov 19, 2008 at 2:29 AM, Michael McCandless (JIRA) wrote:
> > >
> > >   [
> >
> https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648971#action_12648971
> > ]
> > >
> > > Michael McCandless commented on LUCENE-1458:
> > > --------------------------------------------
> > >
> > > bq. So something like a B+Tree would probably work better.
> > >
> > > I agree, a btree is a better fit, though we don't need insertion &
> > > deletion operations since each segment is write-once.
> > >
> > > > Further steps towards flexible indexing
> > > > ---------------------------------------
> > > >
> > > >                 Key: LUCENE-1458
> > > >                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
> > > >             Project: Lucene - Java
> > > >          Issue Type: New Feature
> > > >          Components: Index
> > > >    Affects Versions: 2.9
> > > >            Reporter: Michael McCandless
> > > >            Assignee: Michael McCandless
> > > >            Priority: Minor
> > > >             Fix For: 2.9
> > > >
> > > >         Attachments: LUCENE-1458.patch, LUCENE-1458.patch,
> > > >                      LUCENE-1458.patch
> > > >
> > > >
> > > > I attached a very rough checkpoint of my current patch, to get early
> > > > feedback.  All tests pass, though back compat tests don't pass due to
> > > > changes to package-private APIs plus certain bugs in tests that
> > > > happened to work (eg call TermPostions.nextPosition() too many times,
> > > > which the new API asserts against).
> > > > [Aside: I think, when we commit changes to package-private APIs such
> > > > that back-compat tests don't pass, we could go back, make a branch on
> > > > the back-compat tag, commit changes to the tests to use the new
> > > > package private APIs on that branch, then fix the nightly build to
> > > > use the tip of that branch?]
> > > > There's still plenty to do before this is committable! This is a
> > > > rather large change:
> > > >   * Switches to a new more efficient terms dict format.  This still
> > > >     uses tii/tis files, but the tii only stores term & long offset
> > > >     (not a TermInfo).  At seek points, tis encodes term & freq/prox
> > > >     offsets absolutely instead of with deltas.  Also, tis/tii are
> > > >     structured by field, so we don't have to record the field number
> > > >     in every term.
> > > > .
> > > >     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> > > >     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> > > > .
> > > >     RAM usage when loading terms dict index is significantly less
> > > >     since we only load an array of offsets and an array of String (no
> > > >     more TermInfo array).  It should be faster to init too.
> > > > .
> > > >     This part is basically done.
> > > >   * Introduces a modular reader codec that strongly decouples the
> > > >     terms dict from docs/positions readers.  EG there is no more
> > > >     TermInfo used when reading the new format.
> > > > .
> > > >     There's nice symmetry now between reading & writing in the codec
> > > >     chain -- the current docs/prox format is captured in:
> > > > {code}
> > > > FormatPostingsTermsDictWriter/Reader
> > > > FormatPostingsDocsWriter/Reader (.frq file) and
> > > > FormatPostingsPositionsWriter/Reader (.prx file).
> > > > {code}
> > > >     This part is basically done.
> > > >   * Introduces a new "flex" API for iterating through the fields,
> > > >     terms, docs and positions:
> > > > {code}
> > > > FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> > > > {code}
> > > >     This replaces TermEnum/Docs/Positions.  SegmentReader emulates
> > > >     the old API on top of the new API to keep back-compat.
> > > >
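The consumer side of the enum chain named above (fields -> terms -> docs -> positions) can be sketched with a tiny in-memory "segment". The interfaces and method names below are illustrative guesses at the shape of such a chain, not the actual signatures in the patch, and positions are omitted for brevity:

```java
import java.util.*;

// Sketch of nested enumeration over a tiny in-memory "segment":
// term -> (doc -> freq). Hypothetical interfaces, not the patch's APIs.
class FlexChainSketch {
    interface TermsEnum { String nextTerm(); DocsEnum docs(); }
    interface DocsEnum  { int nextDoc(); int freq(); }

    static TermsEnum termsOf(SortedMap<String, SortedMap<Integer, Integer>> postings) {
        Iterator<Map.Entry<String, SortedMap<Integer, Integer>>> terms =
            postings.entrySet().iterator();
        return new TermsEnum() {
            Map.Entry<String, SortedMap<Integer, Integer>> current;
            public String nextTerm() {
                current = terms.hasNext() ? terms.next() : null;
                return current == null ? null : current.getKey();
            }
            public DocsEnum docs() {
                // One DocsEnum per term, positioned before the first doc.
                Iterator<Map.Entry<Integer, Integer>> docs =
                    current.getValue().entrySet().iterator();
                return new DocsEnum() {
                    Map.Entry<Integer, Integer> doc;
                    public int nextDoc() {
                        doc = docs.hasNext() ? docs.next() : null;
                        return doc == null ? -1 : doc.getKey();
                    }
                    public int freq() { return doc.getValue(); }
                };
            }
        };
    }
}
```

The point of the chain is exactly this nesting: each level hands out an enumerator for the level below, so a codec can swap the storage behind any one level without the consumer noticing.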
> > > > Next steps:
> > > >   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> > > >     fix any hidden assumptions.
> > > >   * Expose new API out of IndexReader, deprecate old API but emulate
> > > >     old API on top of new one, switch all core/contrib users to the
> > > >     new API.
> > > >   * Maybe switch to AttributeSources as the base class for TermsEnum,
> > > >     DocsEnum, PostingsEnum -- this would give readers API flexibility
> > > >     (not just index-file-format flexibility).  EG if someone wanted
> > > >     to store payload at the term-doc level instead of
> > > >     term-doc-position level, you could just add a new attribute.
> > > >   * Test performance & iterate.
> > >
> > > --
> > > This message is automatically generated by JIRA.
> > > -
> > > You can reply to this email to add a comment to the issue online.
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-dev-help@lucene.apache.org
> > >
> >
