lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Peter Keegan" <peterlkee...@gmail.com>
Subject Re: Payloads
Date Mon, 29 Dec 2008 15:27:53 GMT
Hi Karl,

I use payloads for weight only, too, with BoostingTermQuery (see:
http://www.nabble.com/BoostingTermQuery-scoring-td20323615.html#a20323615)

A custom tokenizer looks for the reserved character '\b' followed by a 2
byte 'boost' value. It then creates a special Token type for a custom filter
which sets the payload on the token. Another reserved character '\t' is used
to set a position increment value. Since we often boost multiple tokens in
the stream, the payload 'boost' value is reapplied to subsequent tokens
until a 'boost' value of '0' is encountered, which disables payloads.

This is a bit messy and I agree that it would be nice to come up with a nice
API for this.

Peter

On Fri, Dec 26, 2008 at 8:22 PM, Karl Wettin <karl.wettin@gmail.com> wrote:

> I would very much like to hear how people use payloads.
>
> Personally I use them for weight only. And I use them a lot, almost in all
> applications. I factor the weight of synonyms, stems, dediacritization and
> what not. I create huge indices that contains lots tokens at the same
> position but with different weights. I might for instance create the stream
> "(1)motörhead^1", "(0)motorhead^0.7" and I'll do this at both index and
> query time, i.e. I use the payload weight to calculate both payload weight
> used by the BoostingTermQuery scorer AND to set the boost in the query at
> the same time.
>
> In order to handle this I use an interface that looks something like this:
>
> public interface PayloadWeightHandler {
>  public void setWeight(Token token, float weight);
>  public float getWeight(Token token);
> }
>
> In order to use this I had to patch pretty much any filter I use and pass
> down a weight factor, something like:
>
> TokenStream ts = analyzer.tokenStream(f, new StringReader("motörhead ace of
> spaces"));
> ts = new SynonymTokenFilter(ts, synonyms, 0.7f);
> ts = new StemmerFilter(ts, 0.7f);
> ts = new ASCIIFoldingFilter(ts, 0.5f);
>
> All these filters would, if applicable, create new synonym tokens with
> slightly less weight than the input rather than replace token content:
>
> "(1)mötorhead^1", "(0)motorhead^0.5", "(1)ace^1", "(1)of^1", "(1)spades^1",
> "(1)spad^0.7"
>
> I usually use 4 byte floats while creating the stream and then convert it
> to 8 bit floats in a final filter before adding it to the document.
>
> Is anyone else doing something similar? It would be nice to normalize this
> and perhaps come up with a reusable API for this. It would also be cool if
> all the existing filters could be rewritten to handle this stuff.
>
> I find it to be extemely useful when creating indices with rather niched
> content such as song titles, names of people, street addresses, et c. For
> the last year or so I've done several (3) commercial implementations where I
> try to extend the index with incorrect typed queries but unique enough that
> it does not interfere with the quality of the results. It has been very
> successful, people get great responses in great time even though they enter
> an "incorrect" query.
>
> On a side note, in these implementaions I've completely replaced phrase
> queries using shingles. ShingleMatrixQuery has some built in goodies for
> calculating weight. Combined with SSD I see awesome results with very short
> response time even in fairly large indices (10M-100M documents). I'm talking
> about 100ms-500ms for rather complex queries under heavy load.
>
>
>      karl
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message