lucene-java-user mailing list archives

From "Jong Kim" <j...@sitescape.com>
Subject RE: Stop-words comparison in MoreLikeThis class in Lucene's contrib/queries project
Date Mon, 09 Jul 2007 17:34:34 GMT
Our requirements are simply these:

1. Do not throw away any information at indexing time - so we preserve case
information and keep all tokens.

2. Search functionality is provided at two levels - 

2.1 End User search - stop word filtering is done on the search terms, and
the same stop word list is used for the MoreLikeThis function.

2.2 Admin search - this is more like a raw index lookup than a typical
end-user search, and the search terms can include stop words.
 
The point here is that case matters only for those words that should be
included. For the words we do not want included in the end user search, we
do not care about case (which to me is quite reasonable).

I still think it makes sense to refactor the MoreLikeThis class so that it
can serve a wide variety of use cases (however weird they may look) rather
than trying to dictate the use cases. I think that is the better approach:
it makes a useful class even more useful.

/Jong

-----Original Message-----
From: mark harwood [mailto:markharw00d@yahoo.co.uk] 
Sent: Monday, July 09, 2007 11:54 AM
To: java-user@lucene.apache.org
Subject: Re: Stop-words comparison in MoreLikeThis class in Lucene's
contrib/queries project

OK. I can see the logic that says it might be useful/convenient to filter
case-sensitive search terms using a case-insensitive list of stop words.

What seems slightly odd is that you want exactness in the choice of case yet
are using an imprecise matching technique (MoreLikeThis) - effectively
saying "I really care about the exact use of case but really don't care
exactly which words match". Is this really the requirement? I would have
thought in most cases the user would be willing to relax the exact case
match requirement along with their choice of precisely which words are used
to match. If this applies to your app you could run MoreLikeThis on the
lower-cased version of the field in the index.


Cheers
Mark


----- Original Message ----
From: Jong Kim <jkim@sitescape.com>
To: java-user@lucene.apache.org
Sent: Monday, 9 July, 2007 3:55:03 PM
Subject: RE: Stop-words comparison in MoreLikeThis class in Lucene's
contrib/queries project

>>Or are you saying that you have deliberately chosen to index the content
>>with a case-sensitive analyzer and that you want to supply stop words in a
>>case-insensitive fashion?

Correct. 
To be precise, we index each token up to twice - original token and its
all-lowercase equivalent.
Due to a product requirement, no token is thrown away at the time of
indexing, that is, no stopwords filtering at indexing time.
However, when executing the MoreLikeThis feature, we do use a stopwords list
(the fact that we indexed each and every word does not mean that they all
have to be included in the execution of MoreLikeThis), and we want the
stopwords filtering to be case-insensitive.
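
The filtering step described above can be sketched in plain Java. This is
only an illustration under stated assumptions: TermVectorFilter and
dropStopWords are hypothetical names, the term-to-frequency map stands in
for whatever is read from the stored term vector, and the stop word set is
assumed to hold lower-cased entries. No Lucene API is used.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class TermVectorFilter {
    // Returns the terms (with their original case and frequency) whose
    // lower-cased form is NOT in the lower-cased stop word set.
    static Map<String, Integer> dropStopWords(Map<String, Integer> termFreqs,
                                              Set<String> lowerCasedStopWords) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : termFreqs.entrySet()) {
            // Fold case only for the comparison; the kept key is unchanged.
            String folded = entry.getKey().toLowerCase(Locale.ROOT);
            if (!lowerCasedStopWords.contains(folded)) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }
}
```

Nothing is removed from the index here; the stop words are dropped only
from the candidate terms at query time, and the surviving terms keep their
original case.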

/Jong

-----Original Message-----
From: mark harwood [mailto:markharw00d@yahoo.co.uk]
Sent: Monday, July 09, 2007 10:33 AM
To: java-user@lucene.apache.org
Subject: Re: Stop-words comparison in MoreLikeThis class in Lucene's
contrib/queries project

>>My application stores term vectors with the index

And those stored term vectors contain terms produced by your choice of
analyzer, no? 
Or are you saying that you have deliberately chosen to index the content
with a case-sensitive analyzer and that you want to supply stop words in a
case-insensitive fashion?



----- Original Message ----
From: Jong Kim <jkim@sitescape.com>
To: java-user@lucene.apache.org
Sent: Monday, 9 July, 2007 3:00:05 PM
Subject: RE: Stop-words comparison in MoreLikeThis class in Lucene's
contrib/queries project

My application stores term vectors with the index, and uses that information
to implement more-like-this rather than tokenizing the original text using
an analyzer. Consequently, the option of achieving the effect by specifying
a different analyzer is no good for my case.

/Jong

-----Original Message-----
From: mark harwood [mailto:markharw00d@yahoo.co.uk]
Sent: Monday, July 09, 2007 5:01 AM
To: java-user@lucene.apache.org
Subject: Re: Stop-words comparison in MoreLikeThis class in Lucene's
contrib/queries project

>>I need this comparison to be case-insensitive

The choice of case-sensitivity (and preservation of punctuation, numbers,
etc.) is controlled by your choice of analyzer that you pass to MoreLikeThis.
If you want to ensure your list of stop words adheres to the same logic -
use the same analyzer to construct the set from wherever you store your stop
words e.g. a file. 
I don't imagine there should be a need to change the MoreLikeThis source.
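
A minimal sketch of that idea, in plain Java rather than against the Lucene
API: the normalizer argument is a stand-in for whatever token normalization
the chosen analyzer applies, and StopWordSetBuilder is a hypothetical name.
The point is only that the stop words go through the same transformation as
the indexed terms before being compared.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Lucene.
class StopWordSetBuilder {
    // Applies the same per-token normalization that is used for indexed
    // terms to every raw stop word, so both sides of the comparison agree.
    static Set<String> build(List<String> rawStopWords,
                             UnaryOperator<String> normalizer) {
        Set<String> result = new HashSet<>();
        for (String word : rawStopWords) {
            String normalized = normalizer.apply(word.trim());
            if (!normalized.isEmpty()) {   // skip blank lines from the file
                result.add(normalized);
            }
        }
        return result;
    }
}
```

For a lower-casing analyzer, the normalizer could be something like
w -> w.toLowerCase(Locale.ROOT); for a case-preserving analyzer it would
leave the words unchanged.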


Cheers
Mark



----- Original Message ----
From: Jong Kim <jkim@sitescape.com>
To: java-user@lucene.apache.org
Sent: Sunday, 8 July, 2007 10:12:08 PM
Subject: Stop-words comparison in MoreLikeThis class in Lucene's
contrib/queries project

Hi,
 
The MoreLikeThis class in Lucene's contrib/queries project performs noise
word filtering based on the case-sensitive comparison of the terms against
the user-supplied stopwords set. 
 
I need this comparison to be case-insensitive, but I don't see any way of
achieving it by extending this class. I would have created a subclass of
MoreLikeThis and overridden the isNoiseWord() method. However, the problem
is that neither the isNoiseWord() method nor the instance variables
referenced inside that method are declared protected. They are all private.
Has anyone solved this problem without modifying and rebuilding the
MoreLikeThis class directly?
 
An alternative approach would be to supply a stopwords list containing all
variants of the stop words with all possible mixed cases. Needless to say,
that isn't likely to be a workable solution for many.
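
One way to avoid enumerating case variants, sketched here under the
assumption that the noise-word check calls contains() on the supplied set
(the message only says the terms are compared against the user-supplied
stopwords set): supply a set whose own lookup ignores case. The class name
CaseInsensitiveStopWords is hypothetical; only java.util is used.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene.
class CaseInsensitiveStopWords {
    // Builds a Set whose contains() ignores case, because the TreeSet
    // orders and looks up elements with String.CASE_INSENSITIVE_ORDER.
    static Set<String> of(String... words) {
        Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        set.addAll(Arrays.asList(words));
        return set;
    }
}
```

Note that String.CASE_INSENSITIVE_ORDER is not locale-aware, so this may
not be suitable for every language, and it would not help if the class
copied the entries into a set of its own.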
 
Ultimately, it would be nice if those methods and variables were made
protected so that applications could override some of the default
behaviors without having to modify the class directly.
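
To illustrate the kind of extension point meant here, the following sketch
uses a hypothetical stand-in class, TermSelector, in place of MoreLikeThis.
It is not the Lucene source; it only shows how a protected isNoiseWord()
would let a subclass change the comparison without touching the base class.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a class with an overridable noise-word hook.
class TermSelector {
    protected final Set<String> stopWords;

    TermSelector(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Default behaviour: exact, case-sensitive match against the set.
    protected boolean isNoiseWord(String term) {
        return stopWords != null && stopWords.contains(term);
    }

    public boolean accepts(String term) {
        return !isNoiseWord(term);
    }
}

// Subclass that folds case before delegating to the default check.
// Assumes the supplied stop words are already lower-cased.
class CaseInsensitiveTermSelector extends TermSelector {
    CaseInsensitiveTermSelector(Set<String> lowerCasedStopWords) {
        super(lowerCasedStopWords);
    }

    @Override
    protected boolean isNoiseWord(String term) {
        return super.isNoiseWord(term.toLowerCase(Locale.ROOT));
    }
}
```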
 
Any help would be appreciated.
 
Thanks
/Jong





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

