lucene-java-user mailing list archives

From Rune Stilling <s...@rdfined.dk>
Subject Re: Adding custom weights to individual terms
Date Thu, 13 Feb 2014 18:45:21 GMT
On 13/02/2014 at 12:36, Michael McCandless <lucene@mikemccandless.com> wrote:

> You could stuff your custom weights into a payload, and index that,
> but this is per term per document per position, while it sounds like
> you just want one float for each term regardless of which
> documents/positions where that term occurred?

No, I want to store a weight per term per document. The point is that my custom term weight
is semantically dependent on the document context in exactly the same way the other standard
term weights are.

It doesn’t make sense to also have a separate weight per position.
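
For reference, a payload is just a byte array, so if payloads were used anyway, the same per-document weight would have to be repeated at every position of the term. A minimal sketch of packing a float weight into four bytes and reading it back, using only the JDK; the class and method names are hypothetical, and attaching the bytes to tokens and reading them at search time is not shown:

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: packs a custom term weight into a 4-byte payload. */
class WeightPayload {
    /** Encode the weight as 4 big-endian bytes (ByteBuffer's default order). */
    static byte[] encode(float weight) {
        return ByteBuffer.allocate(Float.BYTES).putFloat(weight).array();
    }

    /** Decode a weight previously produced by encode(). */
    static float decode(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }
}
```

The encoding is lossless for floats, so a weight such as 0.62 survives the round trip unchanged.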

> Doing your own custom attribute would be a challenge: not only must
> you create & set this attribute during indexing, but you then must
> change the indexing process (custom chain, custom codec) to get the
> new attribute into the index, and then make a custom query that can
> pull this attribute at search time.

Hmm, well, but would it solve my problem then?

> What are these term weights?  Are you sure you can't compute these
> weights at search time with a custom similarity using the stats that
> are already stored (docFreq, totalTermFreq, maxDoc, etc.)?

Yes, I’m sure. I’m doing a semantic analysis of the documents before they are indexed,
and it’s the result of this that I want to store as a custom weight on a per-term, per-document basis.
The docFreq, etc. reflect a quite simple approach to term weighting (i.e. tf-idf),
which just isn’t precise enough in my case.

So it seems I might as well build my own term lists and code the indexing and searching process
manually?
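
To make the "do it manually" option concrete, here is a rough sketch of what such a hand-built term list could look like: an in-memory map from term to (document, weight), scored as the sum of query weight times stored weight. This is plain Java, not a Lucene API; the class name, the method names and the scoring formula are all assumptions made for illustration, and the weights are taken from the Doc 1 / Doc 2 example in the original question quoted below.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy inverted index: term -> (docId -> custom weight). Not a Lucene API. */
class WeightedTermIndex {
    private final Map<String, Map<Integer, Float>> postings = new HashMap<>();

    /** Record the externally computed weight of a term in one document. */
    void add(int docId, String term, float weight) {
        postings.computeIfAbsent(term, t -> new HashMap<>()).put(docId, weight);
    }

    /** Score each doc as the sum over query terms of queryWeight * storedWeight. */
    Map<Integer, Float> search(Map<String, Float> weightedQuery) {
        Map<Integer, Float> scores = new HashMap<>();
        for (Map.Entry<String, Float> q : weightedQuery.entrySet()) {
            Map<Integer, Float> docs = postings.get(q.getKey());
            if (docs == null) {
                continue; // query term does not occur in any document
            }
            for (Map.Entry<Integer, Float> d : docs.entrySet()) {
                scores.merge(d.getKey(), q.getValue() * d.getValue(), Float::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        WeightedTermIndex index = new WeightedTermIndex();
        index.add(1, "lucene", 0.7f);
        index.add(1, "search", 0.99f);
        index.add(2, "lucene", 0.5f);
        index.add(2, "people", 0.1f);

        Map<String, Float> query = new HashMap<>();
        query.put("lucene", 1.0f); // query-side weights are also chosen by the caller
        query.put("search", 0.5f);
        System.out.println(index.search(query)); // doc 1 scores higher than doc 2
    }
}
```

A structure like this gives full control over both the stored weights and the score combination, but it gives up everything Lucene provides around the postings (analysis, on-disk storage, query parsing, and so on).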

With regards,
Rune

> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 
> On Thu, Feb 13, 2014 at 2:40 AM, Rune Stilling <subs@rdfined.dk> wrote:
>> Hi list
>> 
>> I'm trying to figure out how customizable scoring and weighting are in the Lucene
>> API. I have read about the APIs but still can't figure out whether the following is possible.
>> 
>> I would like to do normal document text indexing, but I would like to control the
>> weights added to tokens myself. I would also like to control the weighting of query tokens
>> and how things are added together.
>> 
>> When indexing a word I would like to attach my own weights to the word, and use these
>> weights when querying for documents. For example:
>> 
>> Doc 1
>> Lucene(0.7) is(0) a(0) powerful(0.9) indexing(0.62) and(0) search(0.99) API(0.3)
>> 
>> Doc 2
>> Lucene(0.5) is(0) used by(0) a(0) lot of(0) smart(0) people(0.1)
>> 
>> The floats in parentheses are values I would like to add in the indexing process, not
>> something coming from Lucene's tf-idf, for example.
>> 
>> When querying I would like to repeat this and also create the weights for each term
>> "myself" and control how the final doc score is calculated.
>> 
>> I have read that it's possible to attach your own custom attributes to tokens. Is
>> this the way to go? I.e., should I add my custom weights as attributes to tokens and then access
>> these attributes when calculating the document score in the search process (described under
>> "adding a custom attribute" at
>> https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/analysis/package-summary.html)?
>> 
>> The reason I'm asking is that I can't find any examples of this being done anywhere.
>> But I found someone stating "With Lucene, it is impossible to increase or decrease the weight
>> of individual terms in a document".
>> 
>> With regards
>> Rune
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 



