lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Magnus Johansson <mag...@technohuman.com>
Subject Re: QueryParser and compound words
Date Thu, 13 Mar 2003 07:52:30 GMT
Tatu Saloranta wrote:

>On Wednesday 12 March 2003 01:19, Magnus Johansson wrote:
>  
>
>>Well, the problem arise when a user enters a query with a compound word
>>and the compound word itself is not indexed, only one of its parts.
>>    
>>
>
>Yes, but neither is compound word itself ever user in query either, assuming
>same analyser is used (like it always should)?
>
>  
>
>>For example the index contains a document with the following word:
>>fotboll (football).
>>
>>Let's say the users searches for fotbollsmatch (football game). The word
>>is split into fotboll and match and the phrase "fotboll match" is
>>searched for.
>>The user finds no matching document.
>>    
>>
>
>But same happens during indexing; fotbollsmatch should be properly
>split and stemmed to "fotboll" and "match" terms, right?
>  
>
Yes but the word fotbollsmatch was never indexed in this example. Only 
the word fotboll.
I want a query for fotbollsmatch to match a document containing the word 
fotboll.

>  
>
>>Comparing this to english the user would have found a document, however
>>scored
>>slightly lower than a document containing both the words football and game.
>>
>>I agree with you that this might not be a problem. The user could be
>>instructed
>>to reformulate his query. However the behaviour for an english index and
>>    
>>
>
>I actually think that if user has to be aware of internal stemming and 
>reformulate query I think this would be bit of a problem. :-)
>But I'm not 100% sure search string would differ from indexed string, assuming 
>same base token (unprocessed token, ie "fotbollsmatch") was both contained
>in the document and searched for using QueryParser.
>
>  
>
>>a swedish
>>index would be different.
>>    
>>
>
>I think that in general behaviour is heavily dependant on analyser (tokenizer 
>+ stemmer) being used, so it's probably different between most languages.
>
I think I'll accept how it works now. It is perhaps unlikely that the 
user would query the index
using a compound word and expecting documents containing only one of its 
parts in result.
The more I think about it the more difficult it becomes to come up with 
a realistic example
of why the behaviour would need to be changed.

Thank you for your feedback

/magnus



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message