lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: postings lists deduplication
Date Thu, 06 Jun 2013 10:24:14 GMT
Neat idea!

Would this idea allow a single term to point to (the union of) N other
posting lists?  It seems like that's necessary e.g. to handle the
exact/inexact case.

And then, to produce the DocsEnum/DocsAndPositionsEnum you'd need to do
a merge sort across those N posting lists?
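
A minimal sketch of that merge, assuming each postings list is just a
sorted int[] of doc IDs (positions and Lucene's actual enum classes are
left out; PostingsUnion and Cursor are hypothetical names, not Lucene
APIs). A priority queue keeps the N cursors ordered by current doc ID:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Lucene: unions N sorted doc ID lists.
public class PostingsUnion {

    /** Cursor over one sorted postings list (doc IDs in increasing order). */
    private static final class Cursor {
        final int[] docs;
        int pos = 0;

        Cursor(int[] docs) { this.docs = docs; }

        int current() { return docs[pos]; }

        /** Moves to the next doc ID; returns false when the list is exhausted. */
        boolean advance() { return ++pos < docs.length; }
    }

    /** Merges N sorted doc ID lists into one sorted list without duplicates. */
    public static List<Integer> union(List<int[]> postingsLists) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a.current(), b.current()));
        for (int[] docs : postingsLists) {
            if (docs.length > 0) {
                heap.add(new Cursor(docs));
            }
        }

        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor top = heap.poll();
            int doc = top.current();
            // Skip a doc ID that another list already contributed.
            if (merged.isEmpty() || merged.get(merged.size() - 1) != doc) {
                merged.add(doc);
            }
            if (top.advance()) {
                heap.add(top);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<int[]> lists = Arrays.asList(
            new int[] {1, 4, 9},
            new int[] {2, 4, 7},
            new int[] {4, 9, 12});
        System.out.println(union(lists)); // prints [1, 2, 4, 7, 9, 12]
    }
}
```

A real implementation would stream doc IDs lazily and merge positions
as well, rather than materializing the whole result.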

Such a thing might also be doable as a runtime-only wrapper around the
postings API (FieldsProducer), if you could do the reverse expansion at
runtime (e.g. stem -> all of its surface forms).
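
As a rough illustration of the runtime-wrapper idea, assuming the index
stores postings only for surface forms and that a stem -> surface forms
table is available at search time (both maps and the class name
StemExpandingLookup are hypothetical; this does not use the real
FieldsProducer API). A TreeSet stands in for a proper streaming merge
to keep the sketch short:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical wrapper, not a Lucene class: answers stem lookups by
// expanding the stem to its surface forms and unioning their postings.
public class StemExpandingLookup {
    private final Map<String, int[]> surfacePostings;          // surface form -> sorted doc IDs
    private final Map<String, List<String>> stemToSurfaceForms; // stem -> surface forms

    public StemExpandingLookup(Map<String, int[]> surfacePostings,
                               Map<String, List<String>> stemToSurfaceForms) {
        this.surfacePostings = surfacePostings;
        this.stemToSurfaceForms = stemToSurfaceForms;
    }

    /** Returns the sorted, duplicate-free doc IDs for every surface form of the stem. */
    public int[] postingsForStem(String stem) {
        SortedSet<Integer> docs = new TreeSet<>();
        List<String> forms =
            stemToSurfaceForms.getOrDefault(stem, Collections.<String>emptyList());
        for (String form : forms) {
            int[] postings = surfacePostings.get(form);
            if (postings == null) {
                continue; // surface form never occurred in the index
            }
            for (int doc : postings) {
                docs.add(doc);
            }
        }
        return docs.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

With this shape, no stemmed postings are stored at all; the cost moves
to query time, where each stem lookup touches several postings lists.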

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jun 6, 2013 at 3:51 AM, Dmitry Kan <dmitry.lucene@gmail.com> wrote:

> Robert Muir and I discussed what Robert eventually named "postings
> lists deduplication" at the bbuzz 2013 conference in Berlin.
>
> The idea is to allow multiple terms to point to the same postings list to
> save space.
>
> This would benefit synonyms, exact / inexact terms, leading wildcard
> support via storing the reversed term, etc.
>
> At the moment, when supporting exact (unstemmed) and inexact (stemmed)
> searches, we store both the unstemmed and stemmed variants of a word
> form, and that leads to index bloat. For example, we had to remove the
> leading wildcard support via reversing a token at index and query time
> because of the same index size considerations.
>
> Would you like a JIRA issue for this?
>
> Thanks,
>
> Dmitry Kan
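
The deduplication idea in the quoted message could be pictured roughly
as follows: a term dictionary in which several terms reference the same
postings object, for example a term and its reversed form, whose
postings are identical. This is only an in-memory sketch under that
assumption; SharedPostingsDictionary and its methods are hypothetical
and say nothing about how a Lucene codec would encode it on disk.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical in-memory model, not a Lucene class: several terms may
// point to one shared postings array instead of storing copies.
public class SharedPostingsDictionary {
    private final Map<String, int[]> termToPostings = new HashMap<>();

    /** Registers a term with its own postings list. */
    public void add(String term, int[] postings) {
        termToPostings.put(term, postings);
    }

    /** Makes alias point to the postings list already stored for existingTerm. */
    public void alias(String alias, String existingTerm) {
        int[] postings = termToPostings.get(existingTerm);
        if (postings == null) {
            throw new IllegalArgumentException("unknown term: " + existingTerm);
        }
        termToPostings.put(alias, postings); // same array reference, no copy
    }

    public int[] postings(String term) {
        return termToPostings.get(term);
    }

    /** Counts physically distinct postings lists (by reference identity). */
    public int distinctPostingsLists() {
        Map<int[], Boolean> seen = new IdentityHashMap<>();
        for (int[] postings : termToPostings.values()) {
            seen.put(postings, Boolean.TRUE);
        }
        return seen.size();
    }
}
```

For instance, after add("lucene", docs) and alias("enecul", "lucene"),
both terms resolve to the same array and distinctPostingsLists()
reports one list for two terms.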
