Hi Mark,
Out of curiosity, what was your use case?
Thanks,
Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html
On Wed, Oct 31, 2012 at 10:56 PM, Mark Bennett <mbennett@ideaeng.com> wrote:
> This filter lets you "glue" tokens back together. This has been discussed
> and posted on the list before, but this updated version uses all the
> preferred 4.x classes.
>
> Normally you wouldn't want to stick tokens back together, but if you've
> found this post, you probably have some atypical need for it (as I did).
> As an example, you could:
> * Let the tokenizer break up text on whitespace
> * Then lowercase
> * Then remove stop words
> * ***Then concatenate all the words back together into one string***
>
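To make the effect of that chain concrete, here is a minimal plain-Java sketch of the same four steps. It does not use Lucene at all; the class name ConcatSketch, the concat method, and the stop-word list are hypothetical and chosen only for illustration, so treat it as a rough picture of what the analysis chain produces rather than as the filter itself.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class ConcatSketch {
    // Hypothetical stop-word list, chosen only for this illustration
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("the", "a", "of"));

    // Split on whitespace, lowercase, drop stop words, then glue the rest together
    static String concat(String text) {
        StringBuilder buffer = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (lower.isEmpty() || STOP_WORDS.contains(lower)) {
                continue;
            }
            buffer.append(lower);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(concat("The Quick Brown Fox")); // prints "quickbrownfox"
    }
}
```

Note that the words are joined with no separator, which matches what the filter below does when it appends each term to a single buffer.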
> You'll need:
> * ConcatFilter.java (for lucene, below)
> * ConcatFilterFactory.java (for solr, below)
> * entry in your schema
>
> schema.xml entry
> ----------------
> ...
> <fieldType ...>
>   <analyzer>
>     ...
>     <filter class="solr.ConcatFilterFactory" />
>     ...
>   </analyzer>
> </fieldType>
> ...
>
> ConcatFilter.java
> -----------------
> package org.apache.lucene.analysis;
>
> import java.io.IOException;
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> public class ConcatFilter extends TokenFilter {
>   protected final CharTermAttribute charTermAttr;
>   // True once the upstream has been drained, so it is never polled again
>   private boolean done = false;
>
>   public ConcatFilter(TokenStream input) {
>     super(input);
>     charTermAttr = addAttribute( CharTermAttribute.class );
>   }
>
>   @Override
>   public boolean incrementToken() throws IOException {
>     if ( done ) {
>       return false;
>     }
>     done = true;
>     // Drain every upstream token into one buffer
>     StringBuilder buffer = new StringBuilder();
>     while( input.incrementToken() ) {
>       buffer.append( charTermAttr );
>     }
>     // We need to clear it either way
>     charTermAttr.setEmpty();
>     if ( buffer.length() > 0 ) {
>       charTermAttr.append( buffer );
>       return true;
>     }
>     return false;
>   }
>
>   @Override
>   public void reset() throws IOException {
>     super.reset();
>     // Allow the filter to be reused for the next field value
>     done = false;
>   }
> }
>
> ConcatFilterFactory.java
> ------------------------
> package org.apache.solr.analysis;
>
> // ConcatFilter lives in a different package, so it must be imported
> import org.apache.lucene.analysis.ConcatFilter;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.util.TokenFilterFactory;
>
> public class ConcatFilterFactory extends TokenFilterFactory {
>   @Override
>   public TokenStream create(TokenStream stream) {
>     return new ConcatFilter(stream);
>   }
> }
>
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>