lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Stop-words comparison in MoreLikeThis class in Lucene's contrib/queries project
Date Mon, 09 Jul 2007 18:06:05 GMT
I have to disagree here. What you get when you try to encompass
more and more use cases is a bloated system that is not
performant.

This gets very close to  "make the engine solve my problems for me
out of the box". Which may be reasonable for an application that is
designed as an end-to-end solution, but is less than reasonable
for an *engine*. IMO.

Leaving that aside, how big is your index? Is it an option to index
the relevant data twice in two separate fields, one with and one without
stopwords and search the relevant fields based on whether it's an
admin search or not? Unless your current index is pretty big this
may be a better option than you think.

Alternatively, what *real* purpose does having the stop-words-included
form of search? Or is this just something somebody threw in without
real justification? I've spent waaaay too much time implementing some
"gosh that would be nice even though I don't know why" requirements,
so I'm suspicious <G>.... And I've actually had some success going
back to the product manager and telling her something like "Sure, we
can do that. But it'll take X people Y weeks. Is it worth that?". Of course
if she answers "yes", I still whimper. But at least the PM can make
an informed decision.

Best
Erick

On 7/9/07, Jong Kim <jkim@sitescape.com> wrote:
>
> Our requirement is simply that -
>
> 1. Do not throw away any information at indexing time - so we preserve
> case
> information and keep all tokens.
>
> 2. Search functionality is provided at two levels -
>
> 2.1 End User search - stop word filtering is done on the search terms, the
> same stop word list is used for MoreLikeThis function.
>
> 2.2 Admin search - this is more like raw index lookup than typical
> end-user
> search, can include stop words in the search terms.
>
> The point here is that, the case matters only for those words that should
> be
> included. For the words we do not want included in the end user search, we
> do not care about the case (which to me is quite reasonable).
>
> I still think it makes sense to re-factor the MoreLikeThis class so that
> it
> can serve a wide variety of use cases (however weird it may look) than
> trying to dictate the use cases. I think it is better approach to making
> the
> useful class even more useful.
>
> /Jong
>
> -----Original Message-----
> From: mark harwood [mailto:markharw00d@yahoo.co.uk]
> Sent: Monday, July 09, 2007 11:54 AM
> To: java-user@lucene.apache.org
> Subject: Re: Stop-words comparison in MoreLikeThis class in Lucene's
> contrib/queries project
>
> OK. I can see the logic that says it might be useful/convenient to filter
> case-sensitive search terms using a case-insensitive list of stop words.
>
> What seems slightly odd is that you want exactness in the choice of case
> yet
> are using an imprecise matching technique (MoreLikeThis) - effectively
> saying "I really care about the exact use of case but really don't care
> exactly which words match". Is this really the requirement? I would have
> thought in most cases the user would be willing to relax the exact case
> match requirement along with their choice of precisely which words are
> used
> to match. If this applies to your app you could run MoreLikeThis on the
> lower-cased version of the field in the index.
>
>
> Cheers
> Mark
>
>
> ----- Original Message ----
> From: Jong Kim <jkim@sitescape.com>
> To: java-user@lucene.apache.org
> Sent: Monday, 9 July, 2007 3:55:03 PM
> Subject: RE: Stop-words comparison in MoreLikeThis class in Lucene's
> contrib/queries project
>
> >>Or are you saying that you have deliberately chosen to index the
> >>content
> with a case-sensitive analyzer and that you want to supply stop words in a
> case-insensitive fashion?
>
> Correct.
> To be precise, we index each token up to twice - original token and its
> all-lowercase equivalent.
> Due to a product requirement, no token is thrown away at the time of
> indexing, that is, no stopwords filtering at indexing time.
> However, when executing MoreLikeThis feature, we do use a stopwords list
> (the fact that we indexed each and every word does not mean that they have
> to be included in the execution of MoreLikeThis), and we want the
> stopwords
> filtering to be case insensitive.
>
> /Jong
>
> -----Original Message-----
> From: mark harwood [mailto:markharw00d@yahoo.co.uk]
> Sent: Monday, July 09, 2007 10:33 AM
> To: java-user@lucene.apache.org
> Subject: Re: Stop-words comparison in MoreLikeThis class in Lucene's
> contrib/queries project
>
> >>My application stores term vectors with the index
>
> And those stored term vectors contain terms produced by your choice of
> analyzer, no?
> Or are you saying that you have deliberately chosen to index the content
> with a case-sensitive analyzer and that you want to supply stop words in a
> case-insensitive fashion?
>
>
>
> ----- Original Message ----
> From: Jong Kim <jkim@sitescape.com>
> To: java-user@lucene.apache.org
> Sent: Monday, 9 July, 2007 3:00:05 PM
> Subject: RE: Stop-words comparison in MoreLikeThis class in Lucene's
> contrib/queries project
>
> My application stores term vectors with the index, and use that
> information
> to implement more-like-this rather than tokenizing the original text using
> an analyzer. Consequently the option of achieving the effect by specifying
> different analyzer is no good for my case.
>
> /Jong
>
> -----Original Message-----
> From: mark harwood [mailto:markharw00d@yahoo.co.uk]
> Sent: Monday, July 09, 2007 5:01 AM
> To: java-user@lucene.apache.org
> Subject: Re: Stop-words comparison in MoreLikeThis class in Lucene's
> contrib/queries project
>
> >>I need this comparison to be case-insensitive
>
> The choice of case-sensitivity (and preservation of punctuation, numbers
> etc
> etc) is controlled by your choice of analyzer that you pass to
> MoreLikeThis.
> If you want to ensure your list of stop words adheres to the same logic -
> use the same analyzer to construct the set from wherever you store your
> stop
> words e.g. a file.
> I don't imagine there should be a need to change the MoreLikeThis source.
>
>
> Cheers
> Mark
>
>
>
> ----- Original Message ----
> From: Jong Kim <jkim@sitescape.com>
> To: java-user@lucene.apache.org
> Sent: Sunday, 8 July, 2007 10:12:08 PM
> Subject: Stop-words comparison in MoreLikeThis class in Lucene's
> contrib/queries project
>
> Hi,
>
> The MoreLikeThis class in Lucene's contrib/queries project performs noise
> word filtering based on the case-sensitive comparison of the terms against
> the user-supplied stopwords set.
>
> I need this comparison to be case-insensitive, but I don't see any way of
> achieving it by extending this class. I would have created a subclass of
> MoreLikeThis and override the isNoiseWord() method. However, the problem
> is
> that, neither isNoiseWord() method nor the instance variables referenced
> inside that method are declared protected. They are all private. Has
> anyone
> solved this problem without modifying and building MoreLikeThis class
> directly?
>
> An alternative approach would be to supply a stopwords list containing all
> variants of the stop words with all possible mixed cases. Needless to say,
> that isn't likely to be a workable solution for many.
>
> Ultimately it would be nice if those methods and variables would have been
> made protected so that applications could override some of the default
> behaviors without having to modify the class directly.
>
> Any help would be appreciated.
>
> Thanks
> /Jong
>
>
>
>
>
>       ___________________________________________________________
> Yahoo! Answers - Got a question? Someone out there knows the answer. Try
> it
> now.
> http://uk.answers.yahoo.com/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
>
>
>       ___________________________________________________________
> Yahoo! Mail is the world's favourite email. Don't settle for less, sign up
> for your free account today
>
> http://uk.rd.yahoo.com/evt=44106/*http://uk.docs.yahoo.com/mail/winter07.htm
> l
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
>
>
>       ___________________________________________________________
> Yahoo! Mail is the world's favourite email. Don't settle for less, sign up
> for your free account today
>
> http://uk.rd.yahoo.com/evt=44106/*http://uk.docs.yahoo.com/mail/winter07.htm
> l
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message