lucene-dev mailing list archives

From Robert Muir
Subject Re: postings lists deduplication
Date Thu, 06 Jun 2013 10:44:34 GMT
On Thu, Jun 6, 2013 at 3:24 AM, Michael McCandless wrote:

> Neat idea!
> Would this idea allow a single term to point to (the union of) N other
> posting lists?  It seems like that's necessary e.g. to handle the
> exact/inexact case.
> And then, to produce the Docs/AndPositionsEnum you'd need to do the
> merge sort across those N posting lists?
> Such a thing might also be do-able as runtime only wrapper around the
> postings API (FieldsProducer), if you could at runtime do the reverse
> expansion (e.g. stem -> all of its surface forms).

I think the exact/inexact is trickier (detecting it would be the hard
part), and you are right, another solution might work better.

But for the reverse wildcard and synonyms situation, it seems we could even
detect it at write time if we kept a hash of the previous terms' postings.
If the hash matches for the current term, we know it might be a "duplicate"
and would then have to do the costly check that the postings really are the same.
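
A rough sketch of that write-time check, in plain Java rather than against the real postings API. Everything here is an assumption for illustration: postings are modeled as sorted int[] doc IDs, a list index stands in for a file pointer, and the class and method names (PostingsDedupSketch, addTerm, slotFor) are hypothetical. It also assumes the hash is compared against all earlier terms, not only the immediately preceding one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Lucene API: postings are modeled as sorted int[] doc IDs.
class PostingsDedupSketch {
    // Distinct postings lists written so far; the list index stands in for a file pointer.
    private final List<int[]> stored = new ArrayList<>();
    // Hash of a postings list -> slots of the stored lists that have that hash.
    private final Map<Integer, List<Integer>> slotsByHash = new HashMap<>();
    // Term -> slot of the postings list it points to.
    private final Map<String, Integer> termToSlot = new HashMap<>();

    /** Adds a term and returns the slot its postings live in (shared if a duplicate). */
    int addTerm(String term, int[] docIds) {
        int hash = Arrays.hashCode(docIds);
        List<Integer> candidates = slotsByHash.computeIfAbsent(hash, h -> new ArrayList<>());
        for (int slot : candidates) {
            // A hash match only means "might be a duplicate": do the costly full comparison.
            if (Arrays.equals(stored.get(slot), docIds)) {
                termToSlot.put(term, slot);
                return slot;
            }
        }
        // No identical list seen before: store a copy and remember its hash.
        int slot = stored.size();
        stored.add(docIds.clone());
        candidates.add(slot);
        termToSlot.put(term, slot);
        return slot;
    }

    /** Number of distinct postings lists actually stored. */
    int distinctPostingsLists() {
        return stored.size();
    }

    /** Slot the given term points to, or null if the term was never added. */
    Integer slotFor(String term) {
        return termToSlot.get(term);
    }
}
```

With this, two synonyms added with identical doc ID arrays end up pointing at the same slot, and only one copy of the list is kept. A real implementation would also have to compare freqs, positions, and payloads, and could not hold every list in memory like this.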

Maybe there are better ways to do it, but it might be a fun PostingsFormat
experiment to try.
