jackrabbit-users mailing list archives

From "Julio Castillo" <jcasti...@edgenuity.com>
Subject RE: Excluding words
Date Wed, 22 Oct 2008 16:18:48 GMT
Marcel,
I wish to use the standard Lucene Stop word analyzer:
org.apache.lucene.analysis.StopAnalyzer

So, based on the wiki page describing the search parameter configuration, it
would look something like this?

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <param name="textFilterClasses"
value="org.apache.jackrabbit.extractor.MsWordTextExtractor,...<list
truncated>.."/>
  <param name="analyzer" value="org.apache.lucene.analysis.StopAnalyzer"/>
</SearchIndex>

Where and how do I specify which words should be excluded (stopped)?

Thanks

** julio


-----Original Message-----
From: Marcel Reutegger [mailto:marcel.reutegger@gmx.net] 
Sent: Wednesday, October 22, 2008 5:07 AM
To: users@jackrabbit.apache.org
Subject: Re: Excluding words

Hi,

the parameter that allows you to configure a custom analyzer is called
'analyzer'. the default value for this parameter is
org.apache.lucene.analysis.standard.StandardAnalyzer. so, you just have to
write your own implementation that supports stop words and then configure it
properly in your workspace.xml files.

see also: http://wiki.apache.org/jackrabbit/Search

regards
 marcel
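
A minimal sketch of the configuration described above, assuming a hypothetical class com.example.CustomStopAnalyzer (not part of Jackrabbit or Lucene) that extends a Lucene analyzer and carries its own stop word list. Jackrabbit appears to instantiate the configured class through its no-argument constructor, so the word list would most likely have to be set up inside that constructor rather than passed in through a separate parameter. This should be checked against the Jackrabbit version in use.

```xml
<!-- workspace.xml, SearchIndex section (sketch only) -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- com.example.CustomStopAnalyzer is a hypothetical class name; it is
       assumed to be on Jackrabbit's classpath and to have a public
       no-argument constructor that defines the stop words -->
  <param name="analyzer" value="com.example.CustomStopAnalyzer"/>
</SearchIndex>
```

Content that is already indexed was analyzed with the previous analyzer, so a change like this probably only takes full effect once the workspace index has been rebuilt.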

> -----Original Message-----
> From: Julio Castillo [mailto:jcastillo@edgenuity.com]
> Sent: Wednesday, October 15, 2008 9:30 AM
> To: 'users@jackrabbit.apache.org'
> Subject: RE: Excluding words
> 
> Thanks Ard,
> Let me see if I understood you, as the link doesn't exactly show
> how, but I will guess. Currently my repository.xml has 
> the following entry:
> 
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <param name="textFilterClasses"
value="org.apache.jackrabbit.extractor.MsWordTextExtractor,...<list
truncated>.."/>
  <param name="extractorPoolSize" value="2"/>
  <param name="supportHighlighting" value="true"/>
</SearchIndex>

I saw an example for synonyms, so I imagine it would look like this (I 
just need the actual correct parameter names)?

  <param name="stopWordAnalyzerClass"
value="org.apache.lucene.analysis.StopAnalyzer"/>
  <param name="stopWordAnalyzerConfigPath" value="../stopwords.txt"/>

 Thanks
> 
> ** julio
> 
> -----Original Message-----
> From: Ard Schrijvers [mailto:a.schrijvers@onehippo.com]
> Sent: Wednesday, October 15, 2008 4:39 AM
> To: users@jackrabbit.apache.org
> Subject: RE: Excluding words
> 
> Hello Julio,
> 
> You can define your own Lucene analyzer in Jackrabbit (even per 
> property, see [1] at the bottom). If you just configure a Lucene 
> analyzer with a list of stopwords that you create yourself, 
> you are done.
> 
> Regards Ard
> 
> [1] http://wiki.apache.org/jackrabbit/IndexingConfiguration
> 
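
The per-property setup mentioned in the message above could look roughly like the fragment below. This is a sketch under assumptions: the property name mytext and the class com.example.CustomStopAnalyzer are hypothetical, and the element names should be checked against the IndexingConfiguration wiki page [1]. The file itself is assumed to be referenced from the SearchIndex section through an indexingConfiguration parameter.

```xml
<!-- indexing_configuration.xml (fragment, sketch only); assumed to be
     referenced from the SearchIndex section with something like
     <param name="indexingConfiguration"
            value="${wsp.home}/indexing_configuration.xml"/> -->
<configuration>
  <analyzers>
    <!-- hypothetical analyzer class, applied only to the listed property -->
    <analyzer class="com.example.CustomStopAnalyzer">
      <!-- hypothetical property name -->
      <property>mytext</property>
    </analyzer>
  </analyzers>
</configuration>
```

Properties not listed here would presumably keep using the analyzer configured globally for the workspace.
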
>> Is there a way, perhaps on a per-node insertion basis, to exclude words 
>> from being indexed by Lucene?
>>
>> I have to load a large volume of documents. There are certain words 
>> that I want to exclude as they will be present in 99% of the 
>> documents, but I haven't found a way to access or restrict Lucene to 
>> prevent it from indexing such words.
>>
>> Any ideas?
>>
>> Julio Castillo
>> Edgenuity Inc.
>>
>>
> 
> 

