jackrabbit-users mailing list archives

From "Ard Schrijvers" <a.schrijv...@onehippo.com>
Subject RE: Excluding words
Date Thu, 23 Oct 2008 09:26:52 GMT
Hello Julio,

Currently, there is no way to get a (dynamic) list of stopwords. Do you
have a specific static list? Then just create your own StopAnalyzer
with a final set of stopwords, or have it read the stopwords from a
file in which you define them. If you have a dynamic list of
stopwords, I think you will have to build something smarter.

-Ard
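
The "read the stopwords from some file" suggestion above can be sketched in plain Java. This is only an illustration under stated assumptions, not code from the thread: the class name StopWordList, the BASE_WORDS contents, and the one-word-per-line format with '#' comments are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: merges a fixed base list with words read from a file.
final class StopWordList {

    // Assumed base list for illustration only; replace it with the default
    // set of whichever analyzer you actually use.
    static final Set<String> BASE_WORDS = Collections.unmodifiableSet(
            new LinkedHashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to")));

    /**
     * Reads one stop word per line from the given source and returns the
     * base list plus those words. Words are trimmed and lower-cased; blank
     * lines and lines starting with '#' are skipped.
     */
    public static Set<String> load(Reader source) throws IOException {
        Set<String> words = new LinkedHashSet<>(BASE_WORDS);
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty() && !word.startsWith("#")) {
                    words.add(word);
                }
            }
        }
        return Collections.unmodifiableSet(words);
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a FileReader on a stopwords.txt file.
        Set<String> stopWords =
                load(new StringReader("# domain words\nEdgenuity\n\ncourse\n"));
        System.out.println(stopWords);
    }
}
```

The resulting set would still have to be handed to Lucene. Because the 'analyzer' parameter takes only a class name, the likely place for that is the no-argument constructor of your own Analyzer subclass, which builds the set and delegates to a StopAnalyzer created from it. Check the StopAnalyzer constructors available in your Lucene version before relying on this.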

> 
> Thanks Ard for taking the time to respond.
> 
> I just responded to Marcel. I have an idea of how to introduce
> the Lucene StopAnalyzer into my configuration (see my
> previous message:
> <param name="analyzer"
> value="org.apache.lucene.analysis.StopAnalyzer"/>).
> What I still want to know is how to feed this standard analyzer
> the words it should stop/exclude. I don't
> know how to get the standard Jackrabbit/Lucene set of stop words
> (I wish to add some words to it).
> 
> Thanks
> 
> ** julio 
> 
> -----Original Message-----
> From: Ard Schrijvers [mailto:a.schrijvers@onehippo.com]
> Sent: Wednesday, October 22, 2008 5:54 AM
> To: users@jackrabbit.apache.org
> Subject: RE: Excluding words
> 
> Sorry for not responding, Julio; I was very busy.
> 
> As Marcel pointed out, you can just configure your own
> analyzer. If your stop words are some default set, you can
> just use one of the other standard Lucene analyzers; see [1]
> for the many analyzers that are available. Be aware that most
> language analyzers also do stemming by default. [1] also lists
> the StopAnalyzer, which does what you want.
> 
> Regards Ard
> 
> [1] http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/analysis/Analyzer.html
> 
> > 
> > Hi,
> > 
> > The parameter that allows you to configure a custom analyzer is
> > called 'analyzer'. The default value for this parameter is
> > org.apache.lucene.analysis.standard.StandardAnalyzer. So you just
> > have to write your own implementation that supports stop words and
> > then configure it properly in your workspace.xml files.
> > 
> > see also: http://wiki.apache.org/jackrabbit/Search
> > 
> > regards
> >  marcel
> > 
> > Julio Castillo wrote:
> > > Hi there,
> > > Unfortunately there was no response to my previous posting.
> > > 
> > > > I am still looking for a sample configuration that would
> > > > allow me to specify a Lucene stop word analyzer.
> > > > 
> > > > Does anybody have a sample repository config file in which they
> > > > reference a stopwords.txt type file?
> > > 
> > > Thanks
> > > 
> > > ** julio
> > > 
> > > -----Original Message-----
> > > From: Julio Castillo [mailto:jcastillo@edgenuity.com]
> > > Sent: Wednesday, October 15, 2008 9:30 AM
> > > To: 'users@jackrabbit.apache.org'
> > > Subject: RE: Excluding words
> > > 
> > > Thanks Ard,
> > > > Let me see if I understood you. The link doesn't show exactly how,
> > > > so I will guess. Currently my repository.xml has the following
> > > > entry:
> > > 
> > > > <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
> > > >   <param name="path" value="${wsp.home}/index"/>
> > > >   <param name="textFilterClasses"
> > > >          value="org.apache.jackrabbit.extractor.MsWordTextExtractor,...<list truncated>.."/>
> > > >   <param name="extractorPoolSize" value="2"/>
> > > >   <param name="supportHighlighting" value="true"/>
> > > > </SearchIndex>
> > > 
> > > > I saw an example for synonyms, so I imagine it would look like
> > > > this (I just need the actual correct parameter names)?
> > > > 
> > > >   <param name="stopWordAnalyzerClass"
> > > >          value="org.apache.lucene.analysis.StopAnalyzer"/>
> > > >   <param name="stopWordAnalyzerConfigPath"
> > > >          value="../stopwords.txt"/>
> > > 
> > > Thanks
> > > 
> > > ** julio
> > > 
> > > -----Original Message-----
> > > From: Ard Schrijvers [mailto:a.schrijvers@onehippo.com]
> > > Sent: Wednesday, October 15, 2008 4:39 AM
> > > To: users@jackrabbit.apache.org
> > > Subject: RE: Excluding words
> > > 
> > > Hello Julio,
> > > 
> > > > You can define your own Lucene analyzer in Jackrabbit (even per
> > > > property, see the bottom of [1]). If you configure a Lucene
> > > > analyzer with a list of stopwords that you create yourself,
> > > > you are done.
> > > 
> > > Regards Ard
> > > 
> > > [1] http://wiki.apache.org/jackrabbit/IndexingConfiguration
> > > 
> > > >> Is there a way, perhaps on a per-node-insertion basis, to exclude
> > > >> words from being indexed by Lucene?
> > > >>
> > > >> I have to load a large volume of documents. There are certain words
> > > >> that I want to exclude as they will be present in 99% of the
> > > >> documents, but I haven't found a way to access or restrict Lucene
> > > >> to prevent it from indexing such words.
> > >>
> > >> Any ideas?
> > >>
> > >> Julio Castillo
> > >> Edgenuity Inc.
> > >>
> > >>
> > > 
> > > 
> > 
> > 
> 
> 
