lucene-java-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com.INVALID>
Subject Re: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2
Date Mon, 10 Nov 2014 15:18:58 GMT
Hi,

Regarding Uwe's warning, the SnowballFilter Javadoc confirms that it needs lowercased input:

"NOTE: SnowballFilter expects lowercased text." [1]

[1] https://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/snowball/SnowballFilter.html



On Monday, November 10, 2014 4:43 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
Hi,

> Uwe
> 
> Thanks for the reply. Given that SnowballAnalyzer is made up of a series of
> filters, I was thinking about something like this where I 'pipe' output from
> one filter to the next:
> 
> standardTokenizer = new StandardTokenizer(...);
> standardFilter = new StandardFilter(standardTokenizer, ...);
> stopFilter = new StopFilter(standardFilter, ...);
> snowballFilter = new SnowballFilter(stopFilter, ...);
> 
> That is, with LowerCaseFilter left out. Does this make sense?

Exactly. Create a clone of SnowballAnalyzer (from the Lucene source package) in your own package
and remove LowerCaseFilter. But be aware that Snowball may need lowercased terms
to stem correctly! I don't know this filter well; I just want to make you aware of the risk.

The same applies to the stop filter, but that one lets you handle it: you should make StopFilter
case-insensitive (there is a boolean for this):
StopFilter(boolean enablePositionIncrements, TokenStream input, Set<?> stopWords, boolean ignoreCase)
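
To illustrate what case-insensitive stop-word removal means for the tokens that survive, here is a minimal plain-Java sketch. It does not use Lucene at all: the class name CaseInsensitiveStopWords and the method removeStopWords are hypothetical, and the sketch assumes the stop words are stored in lowercase. It only shows the intended behaviour, namely that stop words are matched regardless of case while the remaining tokens keep their original case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-Java illustration (no Lucene dependency) of case-insensitive stop-word removal. */
public class CaseInsensitiveStopWords {

    /**
     * Drops every token whose lowercased form is in stopWords.
     * Surviving tokens are returned with their original case untouched.
     * Assumption: stopWords contains lowercase entries only.
     */
    public static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            // Lowercase only for the lookup, never for the output.
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<String>(Arrays.asList("the", "and"));
        List<String> tokens = Arrays.asList("The", "Quick", "fox", "AND", "the", "Dog");
        System.out.println(removeStopWords(tokens, stopWords)); // prints [Quick, fox, Dog]
    }
}
```

With a case-sensitive match, "The" and "AND" would have slipped through because only "the" and "and" are in the set. That is the situation the ignoreCase flag is meant to avoid once LowerCaseFilter is gone.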

Uwe

> Martin O'Shea.
> -----Original Message-----
> From: Uwe Schindler [mailto:uwe@thetaphi.de]
> Sent: 10 Nov 2014 14:06
> To: java-user@lucene.apache.org
> Subject: RE: How to disable LowerCaseFilter when using SnowballAnalyzer in
> Lucene 3.0.2
> 
> Hi,
> 
> In general, you cannot change Analyzers; they are "examples" and can be
> seen as "best practice". If you want to modify them, write your own Analyzer
> subclass which uses the Tokenizers and TokenFilters you want. You
> can, for example, clone the source code of the original and remove
> LowerCaseFilter. Analyzers are very simple: there is no logic in them, it's just
> some "configuration" (which Tokenizer and which TokenFilters). In later
> Lucene 3 and Lucene 4, this is very simple: you just need to override
> createComponents in the Analyzer class and add your "configuration" there.
> 
> If you use Apache Solr or Elasticsearch you can create your analyzers by XML
> or JSON configuration.
> 
> Uwe
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
> 
> > -----Original Message-----
> > From: Martin O'Shea [mailto:m.oshea@dsl.pipex.com]
> > Sent: Monday, November 10, 2014 2:54 PM
> > To: java-user@lucene.apache.org
> > Subject: How to disable LowerCaseFilter when using SnowballAnalyzer in
> > Lucene 3.0.2
> >
> > I realise that 3.0.2 is an old version of Lucene, but if I have Java
> > code as follows:
> >
> > int nGramLength = 3;
> > Set<String> stopWords = new HashSet<String>();
> > stopWords.add("the");
> > stopWords.add("and");
> > ...
> > SnowballAnalyzer snowballAnalyzer =
> >     new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
> > ShingleAnalyzerWrapper shingleAnalyzer =
> >     new ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
> >
> > which will generate the frequency of ngrams from a particular string
> > of text without stop words, how can I disable the LowerCaseFilter
> > which forms part of the SnowballAnalyzer? I want to preserve the case
> > of the ngrams generated so that I can perform various counts according
> > to the presence / absence of upper case characters in the ngrams.
> >
> >
> >
> > I am something of a Lucene newbie. And I should add that upgrading the
> > version of Lucene is not an option here.
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org


