lucene-java-user mailing list archives

From "Martin O'Shea" <app...@dsl.pipex.com>
Subject RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2
Date Tue, 11 Nov 2014 16:56:36 GMT
In the end I edited the code of the StandardAnalyzer and the
SnowballAnalyzer to disable the calls to the LowerCaseFilter. This seems to
work.
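
Once case is preserved, the counts described in the original question (n-grams with or without upper-case characters) can be computed directly. The following is only a plain-Java sketch of that idea and does not use Lucene: the class `ShingleCaseCount` and its helpers are hypothetical names, and `shingles` builds n-grams of exactly `n` tokens, which is an assumption and may differ from what ShingleAnalyzerWrapper emits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleCaseCount {
    // Hypothetical helper: word n-grams of exactly n tokens, joined by single spaces.
    static List<String> shingles(List<String> tokens, int n) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = i; j < i + n; j++) {
                if (j > i) {
                    sb.append(' ');
                }
                sb.append(tokens.get(j));
            }
            out.add(sb.toString());
        }
        return out;
    }

    // True if any character of the n-gram is upper case.
    static boolean hasUpperCase(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isUpperCase(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Counts the n-grams that contain at least one upper-case character.
    static int countWithUpperCase(List<String> ngrams) {
        int count = 0;
        for (String ngram : ngrams) {
            if (hasUpperCase(ngram)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("New", "York", "city", "council", "votes");
        List<String> trigrams = shingles(tokens, 3);
        System.out.println(trigrams);
        System.out.println(countWithUpperCase(trigrams) + " of " + trigrams.size()
                + " trigrams contain upper case");
    }
}
```

If the tokens had been lowercased first, every n-gram would fall into the "no upper case" bucket, which is why the LowerCaseFilter had to go.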

-----Original Message-----
From: Ahmet Arslan [mailto:iorixxx@yahoo.com.INVALID] 
Sent: 10 Nov 2014 15:19
To: java-user@lucene.apache.org
Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in
Lucene 3.0.2

Hi,

Regarding Uwe's warning, 

"NOTE: SnowballFilter expects lowercased text." [1]

[1]
https://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/snowball/SnowballFilter.html
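
To see why a stemmer might care about case, here is a deliberately tiny illustration in plain Java. It is NOT the Snowball algorithm: `stripIng` is a hypothetical toy rule that removes a trailing "ing", and it is an assumption that the real stemmer's suffix rules are case-sensitive in a similar way.

```java
import java.util.Locale;

public class CaseSensitiveSuffixDemo {
    // Toy rule (not Snowball): strip a trailing lowercase "ing" from longer tokens.
    static String stripIng(String token) {
        if (token.endsWith("ing") && token.length() > 5) {
            return token.substring(0, token.length() - 3);
        }
        return token;
    }

    public static void main(String[] args) {
        // The lowercase suffix matches, so the token is shortened.
        System.out.println(stripIng("walking"));
        // The upper-case token does not end with "ing", so it passes through unchanged.
        System.out.println(stripIng("WALKING"));
        // Lowercasing first lets the rule apply again.
        System.out.println(stripIng("WALKING".toLowerCase(Locale.ROOT)));
    }
}
```

With a rule like this, "Walking" is still stemmed but "WALKING" is not, so removing LowerCaseFilter may leave some capitalised tokens unstemmed.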



On Monday, November 10, 2014 4:43 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
Hi,

> Uwe
> 
> Thanks for the reply. Given that SnowballAnalyzer is made up of a 
> series of filters, I was thinking about something like this, where I 
> 'pipe' output from one filter to the next:
> 
> standardTokenizer = new StandardTokenizer(...);
> standardFilter = new StandardFilter(standardTokenizer, ...);
> stopFilter = new StopFilter(standardFilter, ...);
> snowballFilter = new SnowballFilter(stopFilter, ...);
> 
> But ignore LowerCaseFilter. Does this make sense?

Exactly. Create a clone of SnowballAnalyzer (from the Lucene source package)
in your own package and remove LowerCaseFilter. But be aware that Snowball
may need lowercased terms to stem correctly! I don't know this filter well;
I just want to make you aware.

The same applies to the stop filter, but that one lets you handle it: you
should make the stop filter case-insensitive (there is a boolean for this):
StopFilter(boolean enablePositionIncrements, TokenStream input, Set<?> stopWords, boolean ignoreCase)
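
The effect of a case-insensitive stop filter can be sketched in plain Java without Lucene. This is only an illustration of the behaviour, not Lucene's implementation: `removeStopWords` is a hypothetical helper, and it assumes the stop set holds lowercased entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class IgnoreCaseStopDemo {
    // Hypothetical helper: drops stop words regardless of case and
    // leaves every other token exactly as it was.
    static List<String> removeStopWords(List<String> tokens, Set<String> lowerCasedStopWords) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            // Lowercase only the lookup key; the emitted token keeps its case.
            if (!lowerCasedStopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<String>(Arrays.asList("the", "and"));
        List<String> tokens = Arrays.asList("The", "Quick", "AND", "the", "Lazy");
        System.out.println(removeStopWords(tokens, stopWords)); // prints [Quick, Lazy]
    }
}
```

The point is that case-insensitivity is applied to the comparison only, so "The" and "AND" are still removed while "Quick" and "Lazy" reach the next filter with their capitals intact.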

Uwe

> Martin O'Shea.
> -----Original Message-----
> From: Uwe Schindler [mailto:uwe@thetaphi.de]
> Sent: 10 Nov 2014 14:06
> To: java-user@lucene.apache.org
> Subject: RE: How to disable LowerCaseFilter when using 
> SnowballAnalyzer in Lucene 3.0.2
> 
> Hi,
> 
> In general, you cannot change Analyzers; they are "examples" and can 
> be seen as "best practice". If you want to modify them, write your own 
> Analyzer subclass that uses the Tokenizers and TokenFilters you want. 
> You can, for example, clone the source code of the original and remove 
> LowerCaseFilter. Analyzers are very simple; there is no logic in them, 
> just some "configuration" (which Tokenizer and which TokenFilters). In 
> later Lucene 3 and Lucene 4, this is very simple: you just need to 
> override createComponents in the Analyzer class and add your 
> "configuration" there.
> 
> If you use Apache Solr or Elasticsearch you can create your analyzers 
> by XML or JSON configuration.
> 
> Uwe
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
> 
> > -----Original Message-----
> > From: Martin O'Shea [mailto:m.oshea@dsl.pipex.com]
> > Sent: Monday, November 10, 2014 2:54 PM
> > To: java-user@lucene.apache.org
> > Subject: How to disable LowerCaseFilter when using SnowballAnalyzer 
> > in Lucene 3.0.2
> >
> > I realise that 3.0.2 is an old version of Lucene, but if I have Java 
> > code as follows:
> >
> > int nGramLength = 3;
> > Set<String> stopWords = new HashSet<String>();
> > stopWords.add("the");
> > stopWords.add("and");
> > ...
> > SnowballAnalyzer snowballAnalyzer =
> >     new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
> > ShingleAnalyzerWrapper shingleAnalyzer =
> >     new ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
> >
> > which generates the frequency of ngrams from a particular string of 
> > text without stop words, how can I disable the LowerCaseFilter that 
> > forms part of the SnowballAnalyzer? I want to preserve the case of 
> > the ngrams generated so that I can perform various counts according 
> > to the presence / absence of upper case characters in the ngrams.
> >
> >
> >
> > I am something of a Lucene newbie. And I should add that upgrading 
> > the version of Lucene is not an option here.
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org


