lucene-solr-user mailing list archives

From AHMET ARSLAN <iori...@yahoo.com>
Subject Re: Solr stemming -> preserve original words
Date Sat, 24 Jan 2009 18:13:23 GMT
I still don't understand your final goal, but if you want to get output in the form of
"run(40) => 20 from running, 10 from run, 8 from runners, 2 from runner"
you need to index your documents using the standard analyzer, which does not stem, so the
original words are kept in the index. Then walk through the index using
org.apache.lucene.index.IndexReader and stem each term with a stemmer. Storing the stems (keys)
and the original word lists (values) in a map will give that kind of output.
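
A rough sketch of the map-building step, in plain Java so it runs without Lucene on the classpath. The class name StemGrouper, the methods groupByStem and format, and the toyStem function are all made up for illustration; toyStem only strips a few suffixes and stands in for a real Porter stemmer. The term/frequency map is assumed to have been filled beforehand by walking the index with IndexReader, which is not shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;
import java.util.function.Function;

class StemGrouper {

    // Toy stand-in for a real stemmer (NOT the Porter algorithm): strips a
    // few suffixes so the example is self-contained.
    static String toyStem(String word) {
        for (String suffix : new String[] {"ners", "ning", "ner"}) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Groups unstemmed terms under their stem: stem -> (original word -> frequency).
    static Map<String, Map<String, Integer>> groupByStem(
            Map<String, Integer> termFreqs, Function<String, String> stemmer) {
        Map<String, Map<String, Integer>> byStem = new TreeMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            String stem = stemmer.apply(e.getKey());
            byStem.computeIfAbsent(stem, k -> new LinkedHashMap<>())
                  .merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return byStem;
    }

    // Renders one stem as "stem(total) => n from word, ...", most frequent word first.
    static String format(String stem, Map<String, Integer> originals) {
        int total = 0;
        for (int n : originals.values()) {
            total += n;
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(originals.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        StringJoiner parts = new StringJoiner(", ");
        for (Map.Entry<String, Integer> e : entries) {
            parts.add(e.getValue() + " from " + e.getKey());
        }
        return stem + "(" + total + ") => " + parts;
    }

    public static void main(String[] args) {
        // Assumed input: unstemmed terms and their frequencies from the index.
        Map<String, Integer> termFreqs = new LinkedHashMap<>();
        termFreqs.put("run", 10);
        termFreqs.put("runner", 2);
        termFreqs.put("runners", 8);
        termFreqs.put("running", 20);

        groupByStem(termFreqs, StemGrouper::toyStem)
                .forEach((stem, originals) -> System.out.println(format(stem, originals)));
    }
}
```

With the sample frequencies above this prints the "run(40) => ..." line from the beginning of this message. Swapping toyStem for a real stemmer should only require passing a different function to groupByStem.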

However, if seeing something like the following list on schema.jsp (not exactly what you want,
but similar) would help you

run=>run
run=>running
run=>runner
run=>runners

add one line of code 

newstr = newstr + "=>" +  new String(termBuffer, 0, len);

to org.apache.solr.analysis.EnglishPorterFilterFactory.java between lines #116 and #117.
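
To show what that one line produces, here is a tiny standalone sketch. It assumes, going by the variable names and the list above, that newstr holds the stemmed form and that termBuffer and len describe the original token's characters. The class StemAnnotation and the method annotate are made up for illustration and are not part of Solr.

```java
class StemAnnotation {

    // Hypothetical helper wrapping the one-line change: appends "=>" and the
    // original token (the first len chars of termBuffer) to the stemmed form.
    static String annotate(String newstr, char[] termBuffer, int len) {
        return newstr + "=>" + new String(termBuffer, 0, len);
    }

    public static void main(String[] args) {
        // The buffer can be longer than the token, which is why len is passed.
        char[] termBuffer = "running   ".toCharArray();
        System.out.println(annotate("run", termBuffer, 7)); // prints run=>running
    }
}
```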

Rename the file, compile the code, and put your jar file in the lib directory under your Solr home.
You can then use your new FilterFactory in your schema.xml.


--- On Sat, 1/24/09, Thushara Wijeratna <thushw@gmail.com> wrote:

> From: Thushara Wijeratna <thushw@gmail.com>
> Subject: Re: Solr stemming -> preserve original words
> To: solr-user@lucene.apache.org, iorixxx@yahoo.com
> Date: Saturday, January 24, 2009, 1:53 AM
> Chris, Ahmet - thanks for the responses.
> 
> Ahmet - yes, i want to see "run" as a top term +
> the original words that
> formed that term
> The reason is that due to mis-stemming, the terms could
> become non-english.
> ex:  "permanent" would stem to "perm",
> "archive" would become "archiv".
> 
> I need to extract a set of keywords from the indexed
> content - I'd like
> these to be correct full english words.
> 
> thanks,
> thushara


      
