jackrabbit-users mailing list archives

From "Ard Schrijvers" <a.schrijv...@onehippo.com>
Subject RE: Excluding words
Date Wed, 22 Oct 2008 12:54:06 GMT
Sorry for not responding sooner, Julio; I was very busy.

As Marcel pointed out, you can simply configure your own analyzer. If your
stop words are a default set, you can use one of the standard Lucene
analyzers instead; see [1] for a list of the available ones. Be aware that
most language-specific analyzers also do stemming by default. The list at
[1] also includes the StopAnalyzer, which does what you want.
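
To make that concrete, here is a rough sketch of the SearchIndex element in
workspace.xml, using the 'analyzer' parameter Marcel mentioned. Treat it as
an untested sketch rather than a verified config: the path param is copied
from your posting, and com.example.MyStopWordAnalyzer is a made-up class
name standing in for whatever analyzer you write yourself.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Option 1: a stock Lucene analyzer with its built-in default stop word set -->
  <param name="analyzer" value="org.apache.lucene.analysis.StopAnalyzer"/>
  <!-- Option 2 (use instead of option 1): your own analyzer class.
       The class name below is hypothetical; it must be on the classpath. -->
  <!-- <param name="analyzer" value="com.example.MyStopWordAnalyzer"/> -->
</SearchIndex>
```

With the first option you get Lucene's default English stop word list. As
far as I know, the analyzer is created through its no-argument constructor,
so a custom class would have to load its own word list itself (for example
from a stopwords.txt on the classpath); I am not aware of a separate
parameter for a stop word file.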

Regards Ard

[1]
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/analysis/Analyzer.html

> 
> Hi,
> 
> the parameter that allows you to configure a custom analyzer is
> called 'analyzer'. The default value for this parameter is
> org.apache.lucene.analysis.standard.StandardAnalyzer. So you just
> have to write your own implementation that supports stop words and
> then configure it in your workspace.xml files.
> 
> see also: http://wiki.apache.org/jackrabbit/Search
> 
> regards
>  marcel
> 
> Julio Castillo wrote:
> > Hi there,
> > Unfortunately there was no response to my previous posting.
> > 
> > I am still looking for sample configuration specifications that would
> > allow me to specify a lucene stop word analyzer.
> > 
> > Does anybody have a sample repository config file where they have
> > referenced a stopwords.txt type file?
> > 
> > Thanks
> > 
> > ** julio
> > 
> > -----Original Message-----
> > From: Julio Castillo [mailto:jcastillo@edgenuity.com]
> > Sent: Wednesday, October 15, 2008 9:30 AM
> > To: 'users@jackrabbit.apache.org'
> > Subject: RE: Excluding words
> > 
> > Thanks Ard,
> > Let me see if I understood you, as the link doesn't exactly show how,
> > but I will guess. Currently my repository.xml has the following entry:
> > 
> > <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
> >   <param name="path" value="${wsp.home}/index"/>
> >   <param name="textFilterClasses"
> >          value="org.apache.jackrabbit.extractor.MsWordTextExtractor,...<list truncated>.."/>
> >   <param name="extractorPoolSize" value="2"/>
> >   <param name="supportHighlighting" value="true"/>
> > </SearchIndex>
> > 
> > I saw an example for synonyms, so I imagine it would look like this (I
> > just need the actual correct parameter names)?
> > 
> >   <param name="stopWordAnalyzerClass"
> >          value="org.apache.lucene.analysis.StopAnalyzer"/>
> >   <param name="stopWordAnalyzerConfigPath" value="../stopwords.txt"/>
> > 
> > Thanks
> > 
> > ** julio
> > 
> > -----Original Message-----
> > From: Ard Schrijvers [mailto:a.schrijvers@onehippo.com]
> > Sent: Wednesday, October 15, 2008 4:39 AM
> > To: users@jackrabbit.apache.org
> > Subject: RE: Excluding words
> > 
> > Hello Julio,
> > 
> > You can define your own Lucene analyzer in Jackrabbit (even per
> > property, see [1] at the bottom). If you configure a Lucene analyzer
> > with a list of stop words that you create yourself, for example, you
> > are done.
> > 
> > Regards Ard
> > 
> > [1] http://wiki.apache.org/jackrabbit/IndexingConfiguration
> > 
> >> Is there a way to perhaps on a per node insertion basis exclude words
> >> from being indexed by Lucene?
> >>
> >> I have to load a large volume of documents. There are certain words
> >> that I want to exclude as they will be present in 99% of the
> >> documents, but I haven't found a way to access or restrict Lucene to
> >> prevent it from indexing such words.
> >>
> >> Any ideas?
> >>
> >> Julio Castillo
> >> Edgenuity Inc.
> >>
> >>
> > 
> > 
> 
> 
