lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doron Cohen <DOR...@il.ibm.com>
Subject Re: Sorting by a percentage On a field
Date Sat, 16 Dec 2006 06:01:48 GMT
I think there's a better way - using SpanFirstQuery.

This query scores higher a term positioned closer to the beginning of a
document.

Assume your resolution is 1%, i.e. 1 to 100, create the percentage field
with max token position 99.  Generate the tokens such that all IDs with
percentage 100 would have token position of 0, all those with percentage 99
would have token position of 1, all those with percentage n would have
token position of 100-n. I never done it myself but I know it is doable,
see Token.setPositionIncrement().

So for document A below the percentage field would look like this: "2[50]
5[20] 13[0] 3[20]" - read this as tokenWord[tokenPositionIcrement]. So this
would places "2" in position 50, "5" and "13" in position 70, and "3" in
position 90.

The query for "2" and "3" would be composed of two SpanFirstQuery objects -
one with "2" and one with "3" - composed in a BooleanQuery I guess.

You can change the effect of the distance on the score by providing your
Similarity class and overriding sloppyFreq().

Regards,
Doron

mmoser <mmoser@balihoo.com> wrote on 15/12/2006 21:09:27:
>
> Thanks for your reply. We have currently thought about both of these
> approaches, so that definitely makes me feel better about things. The
first
> approach you had mentioned, we had thought about our tagging problem and
how
> to make a product tag come to the top, but again, with a lot of tags, the
> data becomes extremely large. Even if we were to take some magnitude and
> apply it, it could possibly be huge due to an infinite number of tags.
>
> The second approach we had considered doing a version almost like that. I
> had considered that if we have a query such as the one you are talking
about
> mixed with other fields, then the query would be extremely big.
Especially
> if the user was looking for 15 attribute ids which would be 15 * 10 = 150
> different ORed term searches alone  if we were only searching every 10
> percent.
>
> I definitely thank you for this input and would love to hear if anyone
has
> any other approaches. If not, we will probably attempt one of these
routes
> and see which one is a bigger storage / performance hit.
>
> Mike
>
>
>
> Doron Cohen wrote:
> >
> > I think the right solution for this would use "payloads", where extra
data
> > can be added for each index token. However Lucene currently does not
> > support this. Without this I can think of two options, each with its
own
> > disadvantage:
> >
> > 1) more tokens at indexing time - decide on the resolution of the
> > percentage - say it is 5% - and add more tokens of the same. For
example,
> > the attributes field for product A in your example would look like: "2
2 2
> > 2 2 2 2 2 2 2 5 5 5 5 5 5 3 3 13 13 13 13 13 13".
> >
> > 2) more tokens at search time - at indexing, include the percentage in
the
> > token. So for product A you would have: "2x50 5x30 3x10 13x30". At
search
> > time, expand the query accordingly. So the query for attribute 5 would
be
> > expanded to: "5x5^5 5x10^10 5x15^15 5x20^20 ... 5x95^95 5x100^100".
> >
> > The first approach would enlarge the index, so if you have lots of data
> > that could eventually be a problem.
> > The second approach would end up with a large query, so, again, if you
> > have
> > lots of data that could eventually be a problem with search time.
> >
> > Also, depending how strict you want the scoring to be, you may want to
> > omit
> > norms for this field.
> >
> > Hope this helps,
> > Doron
> >
> > mmoser <mmoser@balihoo.com> wrote on 15/12/2006 13:17:05:
> >>
> >> So, I am still new to Lucene, so please take this into consideration
when
> >> reading this. Up until now, a novice like myself has been able to
finagle
> >> Lucene into doing what we want. But now we have a problem that I have
> > been
> >> searching for the answer to. We allow users to profile our products
with
> > a
> >> predetermined profile attribute id. We then want to take all the users
> >> profiles on a product and take a particular number of times that this
> >> particular profile attribute id has been chosen and come out with a
> >> percentage for it. This is no problem. Where the problem comes into
play
> > is
> >> that we want the user to be able to search for products that match
that
> >> particular profile attribute id. We want the higher percentages to
come
> > up
> >> on top. To add to the complexity, we want to be able to allow for the
> > user
> >> to select multiple profile attribute ids and still have a combination
of
> > the
> >> score to come up higher. Keep in mind, we would like to somehow keep
> > these
> >> in one field, because we are trying to use the same algorithm for
> > something
> >> that could potentially become very large. Any suggestions. The more
> > detail,
> >> the better.
> >>
> >> Example:
> >>
> >> Product A
> >> Attribute ID = 2    Percentage Chosen = 50%
> >> Attribute ID = 5    Percentage Chosen = 30%
> >> Attribute ID = 3    Percentage Chosen = 10%
> >> Attribute ID = 13    Percentage Chosen = 30%
> >>
> >> Product B
> >> Attribute ID = 1    Percentage Chosen = 50%
> >> Attribute ID = 2    Percentage Chosen = 20%
> >> Attribute ID = 3    Percentage Chosen = 75%
> >>
> >> So if a user selected the attributes that correspond to 2 and 3, then
> >> Product B should show up before Product A because it has a combined
score
> > of
> >> 95% and A has a combined score of 40%.
> >>
> >> Thanks for any help.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
>
> --
> View this message in context: http://www.nabble.com/Sorting-by-a-
> percentage-On-a-field-tf2829501.html#a7903444
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message