lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris Lu" <chris...@gmail.com>
Subject Re: factor in stopwords when searching
Date Sun, 23 Mar 2008 02:47:35 GMT
Hi, Erik, I understand your rant. :) Well, the solution I finalized
with is this, as suggested by Jake and Grant.

For those stop words, when indexing content, I will treat them as normal words.
When processing the user query, there will be normal query with stop
words skipped, and another part that has the full query strings trying
to get an exact match. For example, "search the database" will be
parsed like:

content:"search" content:"database" content:"search the database"^1000

This should rank exact match of "search the database" to the top. And
this should be what companies like Yahoo/Google are already doing, I
guess. Can someone confirm this?

-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request)
got 2.6 Million Euro funding!


On Sat, Mar 22, 2008 at 2:12 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> Well, whether it's a good user experience is exactly the question. I've
>  spent far too much time satisfying customer (or product manager)
>  requests that add zero value to the product *in the user's eyes*.
>
>  And I quote:
>
> "This is asked by some customer, who may not know what's "stop words" at
>  all."
>
>  which raises red flags to me. You might be far ahead spending the time up
>  front to work with the customer and see if it's really worthwhile.
>  Especially
>  when asking them to provide actual examples that show the "better user
>  experience". The question is *always* "is the benefit *in the real world* to
>  the user worth the extra time and money it'll cost to develop this?".
>  Especially
>  when you factor in ongoing maintenance, and the next developer having to
>  spend time  figuring out what the heck is going on, and ......
>
>  Double especially when you ask the customer "would you rather have this
>  feature or some other feature?". It's amazing how often customers don't
>  need a particular feature when the costs are made explicit. Either in
>  additional time or other features not getting worked on.
>
>  As a profession, we spend way too much time doing whatever the customer
>  asks for, rather than what the customer actually needs. We've done a pretty
>  poor job of reminding the customer what something costs and letting *them*
>  make the decision. Usually, we say something like "sure, we can do that".
>  The
>  problem is that the customer never gets a chance to say "but it's not worth
>  it",
>  because we often neglect to add "and it'll cost 6 developer-weeks which
>  means
>  we won't be able to do X, Y and Z in the time frame".
>
>  Anyway, enough ranting <G>. If the customers are willing to pay for it
>  that's
>  their business I guess..
>
>  Best
>  Erick
>
>
>
>  On Sat, Mar 22, 2008 at 2:38 PM, Chris Lu <chris.lu@gmail.com> wrote:
>
>  > This is asked by some customer, who may not know what's "stop words" at
>  > all.
>  >
>  > Jake's approach should be quite similar to what some search engine
>  > companies are doing. It'll cost some storage, but can achieve a good
>  > user experience.
>  >
>  > The benefit is kind of obvious in real world. When users enter some
>  > query, they do not really know stop words like "the" are not in the
>  > index at all.
>  > If they are looking for something, like a book titled "search the
>  > database", other books like "search database" is ranked top, which is
>  > not a good user experience.
>  >
>  > --
>  > Chris Lu
>  > -------------------------
>  > Instant Scalable Full-Text Search On Any Database/Application
>  > site: http://www.dbsight.net
>  > demo: http://search.dbsight.com
>  > Lucene Database Search in 3 minutes:
>  >
>  > http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>  > DBSight customer, a shopping comparison site, (anonymous per request)
>  > got 2.6 Million Euro funding!
>  >
>  >
>  > On Sat, Mar 22, 2008 at 10:53 AM, Erick Erickson
>  > <erickerickson@gmail.com> wrote:
>  > > What's your reason for trying? The whole point of stop words is that
>  > >  they should be considered "no ops". That is, they add nothing to the
>  > >  semantics of whatever is being processed. I' don't understand the use
>  > >  case for why you want to go outside that assumption.
>  > >
>  > >  Another way of asking this is "what tangible benefit to your users
>  > >  are you trying to implement"?
>  > >
>  > >  Best
>  > >  Erick
>  > >
>  > >
>  > >
>  > >  On Fri, Mar 21, 2008 at 9:20 PM, Chris Lu <chris.lu@gmail.com> wrote:
>  > >
>  > >  > Let's say "the" is considered stopword. And for example two documents
>  > are
>  > >  > document A, content: "... search the database..."
>  > >  > document B, content: "... search database..."
>  > >  >
>  > >  > So when the user's input is "search the database", searching with
>  > >  > query content:"search database"~1 can return both.
>  > >  > But is there any way to translate that into a query that can rank the
>  > >  > document A higher than document B?
>  > >  >
>  > >  > Thanks!
>  > >  >
>  > >  > --
>  > >  > Chris Lu
>  > >  > -------------------------
>  > >  > Instant Scalable Full-Text Search On Any Database/Application
>  > >  > site: http://www.dbsight.net
>  > >  > demo: http://search.dbsight.com
>  > >  > Lucene Database Search in 3 minutes:
>  > >  >
>  > >  >
>  > http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>  > >  > DBSight customer, a shopping comparison site, (anonymous per request)
>  > >  > got 2.6 Million Euro funding!
>  > >  >
>  > >
>  > >
>  > > > ---------------------------------------------------------------------
>  > >  > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>  > >  > For additional commands, e-mail: java-user-help@lucene.apache.org
>  > >  >
>  > >  >
>  > >
>  >
>  > ---------------------------------------------------------------------
>  > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>  > For additional commands, e-mail: java-user-help@lucene.apache.org
>  >
>  >
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message