jackrabbit-users mailing list archives

From "Ard Schrijvers" <a.schrijv...@onehippo.com>
Subject RE: Excluding words
Date Fri, 24 Oct 2008 07:50:54 GMT
Yes, that's it. The default list of stop words for this analyzer is:

public static final String[] ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
  };
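
If the goal is to add words to this default list, one possible approach is to merge the defaults with your own words, read one per line from a file, and hand the result to the analyzer you configure. The sketch below uses plain Java only and does not call Lucene; the class name StopWordLists, the merge method and the sample words are hypothetical. It assumes that the merged array would then be passed to a StopAnalyzer constructor taking a String[], and that Jackrabbit creates the class named in the 'analyzer' param through a no-argument constructor, so your own analyzer class would wrap this. Please check both assumptions against the Lucene and Jackrabbit versions you run.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Scanner;
import java.util.Set;

// Hypothetical helper: builds a combined stop word list without touching Lucene.
final class StopWordLists {

    // Copy of the default list quoted above.
    static final String[] ENGLISH_STOP_WORDS = {
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"
    };

    // Returns the defaults followed by the extra words (one per line),
    // lower-cased, skipping blank lines, '#' comments and duplicates.
    static String[] merge(String[] defaults, Reader extra) {
        Set<String> words = new LinkedHashSet<String>();
        for (String word : defaults) {
            words.add(word);
        }
        Scanner in = new Scanner(extra);
        while (in.hasNextLine()) {
            String word = in.nextLine().trim().toLowerCase(Locale.ROOT);
            if (word.length() > 0 && !word.startsWith("#")) {
                words.add(word);
            }
        }
        in.close();
        return words.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // In practice the Reader would be a FileReader on a stopwords.txt type file.
        Reader extra = new StringReader("# hypothetical extra words\nlesson\ncourse\nthe\n");
        String[] merged = merge(ENGLISH_STOP_WORDS, extra);
        System.out.println(merged.length + " stop words");
    }
}
```

Keeping the defaults first and using a LinkedHashSet means duplicates such as "the" are dropped while the original order is preserved.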

Regards Ard

> 
> Thanks Ard for the quick response.
> 
> I'll start with the standard StopAnalyzer, which I presume has 
> its list of stopwords defined somewhere in the bowels 
> of Lucene.
> 
> Is this the right way to specify it within the <SearchIndex> tag?
> 
> <param name="analyzer" 
> value="org.apache.lucene.analysis.StopAnalyzer"/>
> 
> Thanks
> 
> ** julio
> 
> -----Original Message-----
> From: Ard Schrijvers [mailto:a.schrijvers@onehippo.com]
> Sent: Thursday, October 23, 2008 2:27 AM
> To: users@jackrabbit.apache.org
> Subject: RE: Excluding words
> 
> Hello Julio,
> 
> Currently, you cannot get a (dynamic) list of stopwords. 
> Do you have a specific static list? Then just create your 
> own StopAnalyzer with a final set of stopwords, or 
> read the stopwords from a file in which you define them. 
> If you have a dynamic list of stopwords, I think you have 
> to build something smarter.
> 
> -Ard
> 
> > 
> > Thanks Ard for taking the time to respond.
> > 
> > I just responded to Marcel, I have an idea how to introduce into my 
> > configuration the Lucene StopAnalyzer (see my previous message:
> > <param name="analyzer" 
> > value="org.apache.lucene.analysis.StopAnalyzer"/>).
> > I just want to know how do I feed to this standard Analyzer input so 
> > that it knows which words to stop/exclude. I don't know how to get 
> > Jackrabbit/Lucene standard set of stop words (I wish to add some words 
> > to it).
> > 
> > Thanks
> > 
> > ** julio
> > 
> > -----Original Message-----
> > From: Ard Schrijvers [mailto:a.schrijvers@onehippo.com]
> > Sent: Wednesday, October 22, 2008 5:54 AM
> > To: users@jackrabbit.apache.org
> > Subject: RE: Excluding words
> > 
> > Sorry Julio for not responding, I was very occupied.
> > 
> > As Marcel pointed out you can just configure your own analyzer. If 
> > your stop words are some default set, you can just use some other 
> > standard Lucene analyzer; see [1] for a list of available analyzers. 
> > Note that most language analyzers also do stemming by default. At [1] 
> > you will also find the StopAnalyzer, which does what you want.
> > 
> > Regards Ard
> > 
> > [1]
> > http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/analysis/Analyzer.html
> > 
> > > 
> > > Hi,
> > > 
> > > the parameter that allows you to configure a custom analyzer is 
> > > called 'analyzer'. the default value for this parameter is 
> > > org.apache.lucene.analysis.standard.StandardAnalyzer. so, you just 
> > > have to write your own implementation that supports stop words and 
> > > then configure it properly in your workspace.xml files.
> > > 
> > > see also: http://wiki.apache.org/jackrabbit/Search
> > > 
> > > regards
> > >  marcel
> > > 
> > > Julio Castillo wrote:
> > > > Hi there,
> > > > Unfortunately there was no response to my previous posting.
> > > > 
> > > > I am still looking for sample configuration specifications that would
> > > > allow me to specify a lucene stop word analyzer.
> > > > 
> > > > Anybody has a sample repository config file where they have referenced
> > > > a stopwords.txt type file?
> > > > 
> > > > Thanks
> > > > 
> > > > ** julio
> > > > 
> > > > -----Original Message-----
> > > > From: Julio Castillo [mailto:jcastillo@edgenuity.com]
> > > > Sent: Wednesday, October 15, 2008 9:30 AM
> > > > To: 'users@jackrabbit.apache.org'
> > > > Subject: RE: Excluding words
> > > > 
> > > > Thanks Ard,
> > > > Let me see if I understood you, as the link doesn't exactly show how,
> > > > but I will guess. Currently my repository.xml has the following entry:
> > > > 
> > > > <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
> > > >   <param name="path" value="${wsp.home}/index"/>
> > > >   <param name="textFilterClasses"
> > > >     value="org.apache.jackrabbit.extractor.MsWordTextExtractor,...<list truncated>.."/>
> > > >   <param name="extractorPoolSize" value="2"/>
> > > >   <param name="supportHighlighting" value="true"/>
> > > > </SearchIndex>
> > > > 
> > > > I saw an example for synonyms, so I imagine it would look like this (I
> > > > just need the actual correct parameter names)?
> > > > 
> > > >   <param name="stopWordAnalyzerClass"
> > > >     value="org.apache.lucene.analysis.StopAnalyzer"/>
> > > >   <param name="stopWordAnalyzerConfigPath" value="../stopwords.txt"/>
> > > > 
> > > > Thanks
> > > > 
> > > > ** julio
> > > > 
> > > > -----Original Message-----
> > > > From: Ard Schrijvers [mailto:a.schrijvers@onehippo.com]
> > > > Sent: Wednesday, October 15, 2008 4:39 AM
> > > > To: users@jackrabbit.apache.org
> > > > Subject: RE: Excluding words
> > > > 
> > > > Hello Julio,
> > > > 
> > > > You can define your own lucene analyzer in Jackrabbit (even per 
> > > > property, see [1] at the bottom). If you just configure a lucene 
> > > > analyzer having a list of stopwords for example, where you create the
> > > > list yourself, you are done.
> > > > 
> > > > Regards Ard
> > > > 
> > > > [1] http://wiki.apache.org/jackrabbit/IndexingConfiguration
> > > > 
> > > >> Is there a way to perhaps on a per node insertion basis exclude words
> > > >> from being indexed by Lucene?
> > > >>
> > > >> I have to load a large volume of documents. There are certain words
> > > >> that I want to exclude as they will be present in 99% of the 
> > > >> documents, but I haven't found a way to access or restrict Lucene to
> > > >> prevent it from indexing such words.
> > > >>
> > > >> Any ideas?
> > > >>
> > > >> Julio Castillo
> > > >> Edgenuity Inc.
> > > >>
> > > >>
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> 
> 
