lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Markus Fischer <>
Subject [Resent] Document boosting based on .. semantics?
Date Wed, 20 Feb 2008 07:51:14 GMT

[Resent: guess I sent the first before I completed my subscription, just 
in case it comes up twice ...]

the subject may be a bit weird but I couldn't find a better way to 
describe a problem I'm trying to solve.

If I'm not mistaken, one factor of scoring is the distance of the word 
within the document and the length of the document. I'm titling my 
problem as the cauliflower-problem, but it's related to any "type of" 

When searching for "cauliflower", I get x hits. Now documents in the top 
range (pos < 10) are most likely selected by the user. Unfortunately, 
although "cauliflower" is in the documents and it's word position is 
like around 3000 characters in the document, the document itself has 
nothing much to do with "cauliflower" or "vegetables" or "eating", etc.

Unfortunately, the most relevant documents come much later in the index 
(pos >= 10) because the "cauliflower" word is positioned like around 
5000 characters within the document.

Based on the relation on the content, these later documents are much 
more appropriate to the search term, because the also deal with 
"vegetables" and "eating", etc.

I'm stuck here how I can signal Lucene to boost those later documents, 
because frankly I don't know on what. I would probably have to tag the 
relation of the document (is-about vegetables, is-about eating) and also 
detect that the searched term is-a vegetable. This gets even more 
complex with non-single-term queries.

On a related topic, I'm also searching for a way to suggest alternate 
spelling of words to the user, when we found a word which is very less 
frequent used in the index or not in the index at all. I'm Austrian 
based, when I e.g. search for "retthich" (wrong spelled "rettich" which 
is radish), Google suggests me the proper spelled word. I'm searching 
for a way to figure how to accomplish this, but again this may be lucene 
off-topic and/or I should properly start a separate thread ...

Has someone an advice how to approach this kind of problems? Is it 
appropriate/can it be solved with Lucene? Am I right here on this list 
anyway? :)

thanks for any feedback,
- Markus

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message