lucene-java-user mailing list archives

From Rune Stilling <s...@rdfined.dk>
Subject Re: Adding custom weights to individual terms
Date Fri, 14 Feb 2014 20:14:57 GMT
Hi Lukai

That was a great help. Thank you.

I’m now reading up on payloads:

http://searchhub.org/2009/08/05/getting-started-with-payloads/

I didn’t know about that concept at all.

Regards,
Rune

On 13/02/2014 at 23:12, lukai <lukai1984@gmail.com> wrote:

> Hi, Rune:
>  Per your requirement, you can generate a separate field for the document
> before sending the document to Lucene. Let's say its name is score_field.
> The content of this field would look like this:
> Doc 1#score_field:
>  Lucene:0.7 is:0 ...
> Doc 2#score_field:
>  Lucene:0.5 is:0 ...
> 
> Store this field as "indexed", and store the other fields as "stored". Store
> the weight value as the payload of each term (wrap your analyzer to consume
> the weight value; basically you can leverage DelimitedPayloadTokenFilter and
> WhitespaceTokenizer to form a basic analyzer which can take this input
> format). Make sure each term in score_field is unique within a document
> (according to your description this is already fulfilled). You can also
> disable indexing of position information for this field, because you don't
> need it.
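
To make the input format concrete, here is a minimal sketch in plain Java (no Lucene dependency) that splits a score_field string such as "Lucene:0.7 is:0" into term/weight pairs and enforces the one-entry-per-term rule. The class and method names are made up for illustration, and ':' is assumed to be the delimiter, as in the example above. In a real setup this splitting would be done by the analyzer chain, not by hand. One caveat worth checking against your Lucene version: payloads are stored with position data, so positions probably have to stay enabled on score_field for the weights to reach the index.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: parses whitespace-separated
// "term:weight" tokens into an ordered term -> weight map.
class ScoreFieldParser {

    static Map<String, Float> parse(String scoreField) {
        Map<String, Float> weights = new LinkedHashMap<>();
        for (String token : scoreField.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields a single empty token
            }
            // Use the last ':' so that terms containing ':' still parse.
            int sep = token.lastIndexOf(':');
            if (sep <= 0 || sep == token.length() - 1) {
                throw new IllegalArgumentException("Expected term:weight, got: " + token);
            }
            String term = token.substring(0, sep);
            float weight = Float.parseFloat(token.substring(sep + 1));
            // Each term may appear only once per document in this field.
            if (weights.put(term, weight) != null) {
                throw new IllegalArgumentException("Duplicate term: " + term);
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        System.out.println(parse("Lucene:0.7 is:0 powerful:0.9"));
    }
}
```

Rejecting duplicates up front mirrors the uniqueness requirement above; whether to reject or merge repeated terms is a design choice.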
> 
> Then, at query time:
> 1. If you want a score like cosine similarity between the query and the
> document, implement a query parser that parses the weights you assigned
> to the individual terms in the query string.
> 2. Create a new query type, customize your score function, and tell Lucene
> to use your scorer.
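
As a rough illustration of the scoring arithmetic in step 1, the sketch below computes a dot product and a cosine similarity between a weighted query and a weighted document, both represented as plain term -> weight maps. It is standalone Java and does not touch Lucene's Query or Scorer classes; all names are hypothetical. With a query weight of 1 for "Lucene", the plain dot product yields the stored document weight (about 0.7 for Doc 1 and 0.5 for Doc 2 in the example further down the thread), while the cosine variant additionally normalizes by vector length.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical scoring helper, independent of Lucene's scoring classes.
class WeightedScoring {

    // Sum of products over the terms present in both weight vectors.
    static double dot(Map<String, Float> a, Map<String, Float> b) {
        double sum = 0.0;
        for (Map.Entry<String, Float> e : a.entrySet()) {
            Float other = b.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    // Euclidean length of a weight vector.
    static double norm(Map<String, Float> v) {
        return Math.sqrt(dot(v, v));
    }

    // Cosine similarity; defined as 0 when either vector has zero length.
    static double cosine(Map<String, Float> query, Map<String, Float> doc) {
        double denom = norm(query) * norm(doc);
        return denom == 0.0 ? 0.0 : dot(query, doc) / denom;
    }

    public static void main(String[] args) {
        Map<String, Float> query = new HashMap<>();
        query.put("Lucene", 1.0f);

        Map<String, Float> doc1 = new HashMap<>();
        doc1.put("Lucene", 0.7f);
        doc1.put("powerful", 0.9f);

        System.out.println("dot    = " + dot(query, doc1));
        System.out.println("cosine = " + cosine(query, doc1));
    }
}
```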
> 
>  Here is a small snippet of a query type I created before; basically
> you can follow this logic to manipulate your score value:
> 
>         // Look up the query term in the field's term dictionary (Lucene 4.x API).
>         final Terms terms = fields.terms(fieldName);
>         if (terms != null) {
>             final TermsEnum termsEnum = terms.iterator(null);
>             final BytesRef bytes = new BytesRef(wandTerm.queryTerm);
>             if (termsEnum.seekExact(bytes)) {
>                 // maxFeatureValue() is not part of the stock TermsEnum API;
>                 // it presumably comes from a customized index format.
>                 float ub = termsEnum.maxFeatureValue();
>                 int docFreq = termsEnum.docFreq();
>                 // logger.warn("term:" + wandTerm.queryTerm + "   :" + ub);
>                 DocsAndPositionsEnum docsPositionEnum =
>                         termsEnum.docsAndPositions(acceptDocs, null);
>                 // The last argument is an IDF-like factor: (totalDocNum + 1) / docFreq.
>                 tts.add(newWandPosting(fieldName, bytes, docsPositionEnum, ub,
>                         wandTerm.featureValue, (totalDocNum + 1) * 1.0f / docFreq));
>             }
>         }
> 
> 
> 
> On Thu, Feb 13, 2014 at 10:49 AM, Rune Stilling <subs@rdfined.dk> wrote:
> 
>> I'm not sure how I would do that, when Lucene is meant to use my custom
>> weights when calculating document weights when executing a search query.
>> 
>> Doc 1
>> Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99)
>> API(0.3)
>> 
>> Doc 2
>> Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1)
>> 
>> Query
>> Lucene
>> 
>> 0.7 and 0.5 are my custom weights and should be used to return Doc 1 with
>> weight 0.7 and Doc 2 with weight 0.5 as an answer to my query.
>> 
>> /Rune
>> 
>> On 13/02/2014 at 13:27, Shai Erera <serera@gmail.com> wrote:
>> 
>>> I often prefer to manage such weights outside the index. Usually managing
>>> them inside the index leads to problems in the future when e.g. the
>>> weights change. If they are encoded in the index, it means re-indexing.
>>> Also, if the weight changes then in some segments the weight will be
>>> different than in others. I think that if you manage the weights e.g. in
>>> a simple FST (which is very compact), it will give you the best
>>> flexibility and it's very easy to use.
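
A very small sketch of the "weights outside the index" idea, in plain Java: a side table that maps a term to its weight and can be updated without re-indexing. The class is hypothetical, a sorted map stands in for the FST purely for illustration (it is not Lucene's FST API), and the table is assumed to be keyed by term only; per-document weights would need a composite key.

```java
import java.util.TreeMap;

// Hypothetical side table of term weights kept outside the index.
// A TreeMap stands in for the FST mentioned above.
class ExternalTermWeights {
    private final TreeMap<String, Float> weights = new TreeMap<>();
    private final float defaultWeight;

    ExternalTermWeights(float defaultWeight) {
        this.defaultWeight = defaultWeight;
    }

    // Add or change a weight; the index itself is not touched.
    void set(String term, float weight) {
        weights.put(term, weight);
    }

    // Weight for a term, or the default when the term is unknown.
    float get(String term) {
        return weights.getOrDefault(term, defaultWeight);
    }
}
```

A custom scorer would consult such a table at search time, which is what makes weight changes cheap compared with values baked into index segments.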
>>> 
>>> Shai
>>> 
>>> 
>>> On Thu, Feb 13, 2014 at 1:36 PM, Michael McCandless <
>>> lucene@mikemccandless.com> wrote:
>>> 
>>>> You could stuff your custom weights into a payload, and index that,
>>>> but this is per term per document per position, while it sounds like
>>>> you just want one float for each term regardless of which
>>>> documents/positions where that term occurred?
>>>> 
>>>> Doing your own custom attribute would be a challenge: not only must
>>>> you create & set this attribute during indexing, but you then must
>>>> change the indexing process (custom chain, custom codec) to get the
>>>> new attribute into the index, and then make a custom query that can
>>>> pull this attribute at search time.
>>>> 
>>>> What are these term weights?  Are you sure you can't compute these
>>>> weights at search time with a custom similarity using the stats that
>>>> are already stored (docFreq, totalTermFreq, maxDoc, etc.)?
>>>> 
>>>> Mike McCandless
>>>> 
>>>> http://blog.mikemccandless.com
>>>> 
>>>> 
>>>> On Thu, Feb 13, 2014 at 2:40 AM, Rune Stilling <subs@rdfined.dk> wrote:
>>>>> Hi list
>>>>> 
>>>>> I'm trying to figure out how customizable scoring and weighting is in
>>>>> the Lucene API. I read about the APIs but still can't figure out if the
>>>>> following is possible.
>>>>> 
>>>>> I would like to do normal document text indexing, but I would like to
>>>>> control the weights added to tokens myself. I would also like to control
>>>>> the weighting of query tokens and how things are added together.
>>>>> 
>>>>> When indexing a word I would like to attach my own weights to the word,
>>>>> and use these weights when querying for documents. For example:
>>>>> 
>>>>> Doc 1
>>>>> Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99)
>>>>> API(0.3)
>>>>> 
>>>>> Doc 2
>>>>> Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1)
>>>>> 
>>>>> The floats in parentheses are values I would like to add in the indexing
>>>>> process, not something coming from Lucene's tf-idf, for example.
>>>>> 
>>>>> When querying I would like to repeat this, creating the weights for
>>>>> each term myself and controlling how the final doc score is calculated.
>>>>> 
>>>>> I have read that it's possible to attach your own custom attributes to
>>>>> tokens. Is this the way to go? I.e., should I add my custom weights as
>>>>> attributes to tokens, and then access these attributes when calculating
>>>>> the document score in the search process (described here
>>>>> https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/analysis/package-summary.html
>>>>> under "adding a custom attribute")?
>>>>> 
>>>>> The reason why I'm asking is that I can't find any examples of this
>>>>> being done anywhere. But I found someone stating "With Lucene, it is
>>>>> impossible to increase or decrease the weight of individual terms in a
>>>>> document".
>>>>> 
>>>>> With regards
>>>>> Rune
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org