jackrabbit-users mailing list archives

From "Julio Castillo" <jcasti...@edgenuity.com>
Subject RE: Excluding words
Date Wed, 22 Oct 2008 16:22:07 GMT
Thanks Ard for taking the time to respond.

I just responded to Marcel. I have an idea of how to introduce the Lucene
StopAnalyzer into my configuration (see my previous message:
<param name="analyzer" value="org.apache.lucene.analysis.StopAnalyzer"/>).
What I still want to know is how to feed this standard analyzer its input so
that it knows which words to stop/exclude. I also don't know how to get the
standard Jackrabbit/Lucene set of stop words (I wish to add some words to it).

Thanks

** julio 
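
One possible way to build such an extended list is sketched below in plain
Java. StopWordList and its merge method are hypothetical names, not part of
Jackrabbit or Lucene. The sketch assumes the extra words live one per line in
a text source such as the stopwords.txt mentioned earlier in the thread, and
that the default list is available as a String array (in Lucene 2.x,
StopAnalyzer.ENGLISH_STOP_WORDS is such an array, but this varies by version).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: merges a default stop word list with extra words. */
final class StopWordList {

    private StopWordList() {
    }

    /**
     * Returns the defaults followed by every extra word (one per line) that
     * is not already present. Blank lines and lines starting with '#' are
     * skipped, and all words are lower-cased.
     */
    static String[] merge(String[] defaults, Reader extraWords) {
        // LinkedHashSet removes duplicates while keeping insertion order.
        Set<String> words = new LinkedHashSet<>();
        for (String word : defaults) {
            words.add(word.toLowerCase(Locale.ROOT));
        }
        try (BufferedReader in = new BufferedReader(extraWords)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty() && !word.startsWith("#")) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            // Rethrown unchecked so callers such as constructors stay simple.
            throw new UncheckedIOException(e);
        }
        return words.toArray(new String[0]);
    }
}
```

The resulting array could then be handed to an analyzer constructor that
accepts stop words (for example StandardAnalyzer(String[]) in Lucene 2.x) from
a small subclass with a no-argument constructor, and that subclass named in
the 'analyzer' param. This assumes Jackrabbit creates the configured class
through its default constructor, which is worth verifying for your version.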

-----Original Message-----
From: Ard Schrijvers [mailto:a.schrijvers@onehippo.com] 
Sent: Wednesday, October 22, 2008 5:54 AM
To: users@jackrabbit.apache.org
Subject: RE: Excluding words

Sorry Julio for not responding, I was very occupied.

As Marcel pointed out, you can just configure your own analyzer. If your stop
words are some default set, you can just use another standard Lucene
analyzer; see [1] for a list of the available analyzers. Be aware that most
language analyzers also do stemming by default. At [1] you will also find the
StopAnalyzer, which does what you want.

Regards Ard

[1] http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/analysis/Analyzer.html
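
To illustrate what stop word filtering does to the terms that end up in the
index, here is a small plain-Java simulation. It is not Lucene code:
StopFilterDemo is a hypothetical class that only mimics the general idea
(split on non-letters, lower-case, drop stop words), and it does no stemming.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical demo of stop word filtering; not part of Lucene or Jackrabbit. */
final class StopFilterDemo {

    private StopFilterDemo() {
    }

    /** Splits text on non-letter characters, lower-cases, and drops stop words. */
    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            // split() can yield an empty first element when text starts with a separator.
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

With the stop words "the", "is" and "lesson", the text "The Lesson is about
the Jackrabbit index" would be reduced to the terms about, jackrabbit, index.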

> 
> Hi,
> 
> the parameter that allows you to configure a custom analyzer is
> called 'analyzer'. The default value for this parameter is
> org.apache.lucene.analysis.standard.StandardAnalyzer. So, you just
> have to write your own implementation that supports stop words and
> then configure it properly in your workspace.xml files.
> 
> see also: http://wiki.apache.org/jackrabbit/Search
> 
> regards
>  marcel
> 
> Julio Castillo wrote:
> > Hi there,
> > Unfortunately there was no response to my previous posting.
> > 
> > I am still looking for sample configuration specifications that would
> > allow me to specify a Lucene stop word analyzer.
> > 
> > Does anybody have a sample repository config file where they have
> > referenced a stopwords.txt type file?
> > 
> > Thanks
> > 
> > ** julio
> > 
> > -----Original Message-----
> > From: Julio Castillo [mailto:jcastillo@edgenuity.com]
> > Sent: Wednesday, October 15, 2008 9:30 AM
> > To: 'users@jackrabbit.apache.org'
> > Subject: RE: Excluding words
> > 
> > Thanks Ard,
> > Let me see if I understood you, as the link doesn't exactly show how,
> > but I will guess. Currently my repository.xml has the following entry:
> > 
> > <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
> >   <param name="path" value="${wsp.home}/index"/>
> >   <param name="textFilterClasses"
> >          value="org.apache.jackrabbit.extractor.MsWordTextExtractor,...<list truncated>.."/>
> >   <param name="extractorPoolSize" value="2"/>
> >   <param name="supportHighlighting" value="true"/>
> > </SearchIndex>
> > 
> > I saw an example for synonyms, so I imagine it would look like this (I
> > just need the actual correct parameter names)?
> > 
> >   <param name="stopWordAnalyzerClass"
> >          value="org.apache.lucene.analysis.StopAnalyzer"/>
> >   <param name="stopWordAnalyzerConfigPath" value="../stopwords.txt"/>
> > 
> > Thanks
> > 
> > ** julio
> > 
> > -----Original Message-----
> > From: Ard Schrijvers [mailto:a.schrijvers@onehippo.com]
> > Sent: Wednesday, October 15, 2008 4:39 AM
> > To: users@jackrabbit.apache.org
> > Subject: RE: Excluding words
> > 
> > Hello Julio,
> > 
> > You can define your own lucene analyzer in Jackrabbit (even per
> > property, see [1] at the bottom). If you just configure a lucene
> > analyzer having a list of stopwords for example, where you create the
> > list yourself, you are done.
> > 
> > Regards Ard
> > 
> > [1] http://wiki.apache.org/jackrabbit/IndexingConfiguration
> > 
> >> Is there a way, perhaps on a per-node insertion basis, to exclude
> >> words from being indexed by Lucene?
> >>
> >> I have to load a large volume of documents. There are certain words
> >> that I want to exclude as they will be present in 99% of the
> >> documents, but I haven't found a way to access or restrict Lucene to
> >> prevent it from indexing such words.
> >>
> >> Any ideas?
> >>
> >> Julio Castillo
> >> Edgenuity Inc.
> >>
> >>
> > 
> > 
> 
> 

