lucene-java-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Stemming - limited index expansion
Date Tue, 12 Jun 2012 20:14:23 GMT
I don't completely follow what you want to do, but the 
WordDelimiterFilter is an example of a token filter that outputs an extra 
token at the same position, such as with its CATENATE_ALL/WORDS/NUMBERS 
options.

https://builds.apache.org/job/Lucene-trunk/javadoc/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html

For example, given the input "wi-fi", it would output "wi" with position 0, 
"fi" with position 1, and "wifi" also with position 0.

Or, with its PRESERVE_ORIGINAL option, that same input would output "wi" at 
0, "fi" at 1, and "wi-fi" at 0.
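
The stacking described above can be modelled without Lucene at all. The sketch below is only an illustration: the Token class and the describe method are hypothetical names, not Lucene API. It assumes that each token carries a position increment relative to the previous token, so an increment of 0 places a token at the same position as the one before it. Because positions cannot go backwards in a stream, the stacked "wifi" token is listed right after "wi" here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PositionDemo {
    /** Hypothetical stand-in for a token: its text plus a position increment. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Turns increments into absolute positions; the first token lands at 0. */
    static List<String> describe(List<Token> stream) {
        List<String> out = new ArrayList<>();
        int position = -1; // start before the first token
        for (Token t : stream) {
            position += t.positionIncrement;
            out.add(t.term + "@" + position);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> stream = Arrays.asList(
                new Token("wi", 1),    // position 0
                new Token("wifi", 0),  // increment 0: stacked on "wi"
                new Token("fi", 1));   // position 1
        System.out.println(describe(stream)); // prints [wi@0, wifi@0, fi@1]
    }
}
```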

That said, maybe you could clarify your specific intent with an example. 
Maybe you simply want to internally call some existing stemmer filter and 
output both the original and stemmed term at the same location?
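
If that is the intent, the logic might look roughly like the following Lucene-free sketch. Everything in it is an assumption made for illustration: Token, toyStem, and expand are hypothetical names, and the "stemmer" just strips a trailing "ing" or "s". The point is only the shape of the output: the original term keeps its increment, and the stem is added with an increment of 0, and only when it differs from the original. In a real TokenFilter the same effect would presumably be achieved by buffering the token state and setting the position increment attribute to 0 on the extra token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StemExpansionSketch {
    /** Hypothetical stand-in for a token: its text plus a position increment. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /** Toy stemmer, for illustration only: strips a trailing "ing" or "s". */
    static String toyStem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    /** Keeps every original token and stacks its stem on top when they differ. */
    static List<Token> expand(List<Token> input) {
        List<Token> out = new ArrayList<>();
        for (Token t : input) {
            out.add(t); // original term, original increment
            String stem = toyStem(t.term);
            if (!stem.equals(t.term)) {
                out.add(new Token(stem, 0)); // same position as the original
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> input = Arrays.asList(
                new Token("jumping", 1),
                new Token("dogs", 1),
                new Token("bark", 1));
        // prints [jumping(+1), jump(+0), dogs(+1), dog(+0), bark(+1)]
        System.out.println(expand(input));
    }
}
```

This yields at most two tokens per position, and the original term can still be told apart from the stem, since the stem is always the token with increment 0.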

-- Jack Krupansky

-----Original Message----- 
From: Paul Hill
Sent: Tuesday, June 12, 2012 3:07 PM
To: java-user@lucene.apache.org
Subject: Stemming - limited index expansion

As others have previously proposed on this list, I am interested in 
inserting a second token at some positions in my index.  I'll call this 
Limited Index Expansion.
I want to retain the original token, so that I can score an original word 
that matches in a text better than just any synonym/stem etc.  Maybe I'll 
even do this with payloads (on the 2nd token?).
If I didn't keep the original word all I would be doing is a limited index 
time "reduction".  Saving the original word and sometimes a lemma/stem (or 
something else), means I anticipate at most two tokens at a position in the 
index.

I couldn't find a nearly-right high-level Filter that I could use to add 
logic to call a stemmer and conditionally add another token.  Any 
suggestions?
One idea I had is that adding a second token is much like what a 
SynonymFilter does, but yikes I was starting to grok PendingInputs, 
PendingOutputs,
but wasn't getting very far reading through SynonymMap and its BytesRefHash 
etc.  Obviously it is written to be very good with memory and very fast, but 
it looks a bit tricky to extend for other sources of "synonyms". It is too 
bad that the get synonym part of the operation is not encapsulated in 
something pluggable or overridable, so I could just return an appropriate 
array of CharsRefs.  The SynonymFilter is final anyway.

Can anyone point me toward any existing high-level filter that I could use 
by sub-classing, modifying, plugging, or just as a good example to which I 
might add my additional code to add another token?
Building Filters is new to me, but right now nothing is jumping out at me as 
a basis for such a Filter.  Any suggestions?  Did I miss something in core 
or contrib?
Is there some other combination of buffering, copying, sinking etc. filters 
that I'm missing and should use to build a filter chain that would aid 
this process?

-Paul 



