lucene-java-user mailing list archives

From Larry Hendrix <lahend...@wisc.edu>
Subject Re: Stemming Problem
Date Thu, 20 May 2010 02:03:14 GMT
Thanks for the advice. I want to keep the capitalization because our application
mines specific contact and company names from news articles. About 99% of the time,
when we match a contact or company name and it's capitalized, it is not a false match.

--Larry

On May 18, 2010, at 7:46 PM, Erick Erickson wrote:

> You can construct your own analyzer by creating
> it from a pre-existing Tokenizer
> (e.g. WhitespaceTokenizer) and any number
> of TokenFilters (i.e. subclasses of TokenFilter). You can
> string any number of TokenFilters together
> to get many different effects.
> 
> But I have to ask: why do you want to keep capitalization
> and punctuation? Do you really want text indexed as
> "Erickson, Erick" to fail to match the query
> "erick erickson"? That's often a source of frustration
> rather than goodness.
> 
> HTH
> Erick
> 
> On Tue, May 18, 2010 at 2:05 PM, Larry Hendrix <lahendrix@wisc.edu> wrote:
> 
>> Hi,
>> 
>> Right now I'm using Lucene with a basic WhitespaceAnalyzer, but I'm having
>> problems with stemming. Does anyone have a recommendation for other text
>> analyzers that handle stemming and also keep capitalization, stop words, and
>> punctuation?
>> 
>> Thanks,
>> Larry
>> 
>> 
>> Larry A. Hendrix, Graduate Student
>> Computer Science Department
>> University of Wisconsin-Madison
>> 1300 University Ave Rm 6749
>> Madison, WI 53711
>> Office: (608) 263-7624
>> lhendrix@cs.wisc.edu
>> Grambling State University Alum
>> 
>> 
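
The "Tokenizer followed by a chain of TokenFilters" idea from Erick's reply can be sketched in plain Java. This is only an illustration of the pattern, not Lucene code: `AnalyzerChainSketch`, `whitespaceTokenize`, `naiveStem`, and `analyze` are hypothetical names, and the stemmer is a deliberately crude suffix stripper rather than the Porter algorithm. In Lucene itself the equivalent would presumably be an Analyzer subclass wrapping a WhitespaceTokenizer in filters such as PorterStemFilter, with constructor details that depend on the Lucene version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Plain-Java illustration of the tokenizer-plus-filter-chain pattern.
// None of these names are Lucene classes; they are hypothetical stand-ins.
class AnalyzerChainSketch {

    // Stand-in for a whitespace tokenizer: splits on runs of whitespace only,
    // so capitalization and punctuation inside tokens are left untouched.
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stand-in for a stemming filter: a naive suffix stripper (NOT Porter).
    // It only removes trailing characters, so letter case is preserved.
    static String naiveStem(String token) {
        if (token.length() > 5 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Tokenize, then pass every token through each filter in the given order.
    @SafeVarargs
    static List<String> analyze(String text, UnaryOperator<String>... filters) {
        List<String> out = new ArrayList<>();
        for (String token : whitespaceTokenize(text)) {
            for (UnaryOperator<String> filter : filters) {
                token = filter.apply(token);
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stemming only: case and punctuation survive.
        System.out.println(analyze("Acme Systems is acquiring Widgets, Inc.",
                AnalyzerChainSketch::naiveStem));
        // Lowercasing added to the chain: case is lost, punctuation still remains.
        System.out.println(analyze("Erickson, Erick",
                s -> s.toLowerCase(Locale.ROOT), AnalyzerChainSketch::naiveStem));
    }
}
```

With this sketch, "Systems" is stemmed to "System" with its capital intact, while "Widgets," is left alone because the trailing comma hides the suffix from the stemmer. The second call yields "erickson," and "erick", so a query term "erickson" would still not equal the indexed token unless a punctuation-stripping filter were also added to the chain.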
