lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: AnalyZer HELP Please
Date Wed, 18 Aug 2004 18:48:37 GMT
On Aug 18, 2004, at 2:39 PM, Tate Avery wrote:
> Anyway, that is how I interpreted my Google tests.  And, as an 
> observation, one would need to be a bit creative to get the same 
> behaviour with Lucene given the current analyzer setup (IMO).

Again, see Nutch for the creative part of this, since it is Lucene 
under the covers.  It's some clever tricks indeed.

> In short, it is interesting that they do it that way, but it certainly 
> doesn't mean that it is THE way to be done.

Absolutely agreed.  A full world wide web with practically no knowledge 
of the domain it is indexing is a completely different beast from what 
most of us build with Lucene where we have intimate domain knowledge we 
can leverage to customize the analysis process.  So, in essence, it is 
not easy (or even possible?) to say what is the best analyzer to use 
and whether removing stop words is a good idea or not.  It really 
depends on what you're trying to do.

I would say in my recent immersion into the search world, though, that 
removing stop words is not often what one really wants.  Being clever 
(see above) with them is more likely the desired effect.  And this was 
primarily my point, in case it didn't come across that way before.

	Erik

>
>
> T
>
>
> -----Original Message-----
> From: Erik Hatcher [mailto:erik@ehatchersolutions.com]
> Sent: Wednesday, August 18, 2004 2:00 PM
> To: Lucene Users List
> Subject: Re: AnalyZer HELP Please
>
>
> Thanks for doing the legwork.  My favorite example is "to be or not to
> be" with and without quotes.  The top hit without quotes is quite
> funny.
>
> So, Google doesn't throw away stop words, but they do special query
> processing to keep you from doing silly things like "show me all
> documents with 'the' in them".  Look at Nutch for how it does something
> very similar.
>
> 	Erik
>
> On Aug 18, 2004, at 11:52 AM, Tate Avery wrote:
>
>>
>> That is interesting.
>>
>> I went to lookup the cases for this (on Google).
>> Here are my 4 queries and the results:
>>
>>
>> a) of the from it
>>
>> 	- 25,500,000 matches containing 'of' and 'the' and 'from' and 'it'
>> 	- i.e. stop list NOT used if query is only stopwords
>>
>> b) "of the from it"
>>
>> 	- 49 results for exact phrase match 'of the from it'
>> 	- i.e. stop list NOT used (see next 2 for real phrase effect)
>>
>> c) of the from it test
>>
>> 	- The following words are very common and were not included in your
>> search: of the from it.
>> 	- In short, 241,000,000 matching the word 'test'
>> 	- i.e. stop list used if there is a non-stopword in the query
>>
>> d) "of the from it test"
>> 	
>> 	- 0 matches for this exact phrase
>> 	- i.e. stoplist NOT used for any words in a phrase query
>>
>>
>> Tate
>>
>> p.s.  Um... did you say that was a rhetorical question?  ;-)
>>
>>
>> -----Original Message-----
>> From: Erik Hatcher [mailto:erik@ehatchersolutions.com]
>> Sent: Wednesday, August 18, 2004 6:17 AM
>> To: Lucene Users List
>> Subject: Re: AnalyZer HELP Please
>>
>>
>>
>>
>> On Aug 18, 2004, at 3:41 AM, Karthik N S wrote:
>>> Hi Guys
>>>
>>>   Finally with lot's experimentation, I came to know that
>>>
>>>     A word  such as  'new'  already present in  Analyzer,
>>>
>>>     will  not return  any hits [ Even when enclosed with Quotes "\""]
>>>
>>>     such as  "New Year"....
>>>
>>>
>>>    That's really Intresting....    :(
>>
>>
>> That's why it's call stop word *removal*.  The purpose of removing
>> words is to save space and eliminate words that are ultra common.
>> Tuning the analysis process to your domain/environment is by far the
>> trickiest part of using Lucene, and often is not even much of a
>> consideration as the built-in Analyzers suffice.  It sounds to me that
>> your stop word list is far too aggressive and you should consider
>> trimming down the list of words that are removed.
>>
>> Or, even consider not removing words at all.  From the What Would
>> Google Do (WWGD)? category.... does Google remove stop words?  I'll
>> leave that as a rhetorical question for now :)
>>
>> 	Erik
>>
>>>
>>>
>>> Thx
>>> Karthik
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Erik Hatcher [mailto:erik@ehatchersolutions.com]
>>> Sent: Tuesday, August 17, 2004 7:35 PM
>>> To: Lucene Users List
>>> Subject: Re: AnalyZer HELP Please
>>>
>>>
>>> On Aug 17, 2004, at 9:47 AM, Karthik N S wrote:
>>>> I did as Erik  replied in his mail ,
>>>> and  searched for the complete word   "\"New Year\""  ,
>>>> but the QueryParser Still returns me hit for "Year"  Only.
>>>>
>>>> [ The Analyzer I use has 555 English Stop words  with  "new" present
>>>> in it ]
>>>
>>> No wonder!
>>>
>>>> That's when I checked up with Analyzer's to verify,
>>>> If u look at the list  Analyzer's  o/p
>>>> GrammerAnalyzer is the one that has 555 English STOPWORDS.
>>>>
>>>> Do u think this is the bug in my Code.
>>>
>>> Whether this is a "bug" or not is really for your users to determine
>>> :)
>>>   But it is absolutely the expected behavior.  QueryParser analyzes
>>> the
>>> expression too.  Even if you somehow changed QueryParser, if you 
>>> never
>>> indexed the word "new" then you certainly cannot expect to search on
>>> it
>>> and find it.
>>>
>>> 	Erik
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message