lucene-dev mailing list archives

From Otis Gospodnetic <otis.gospodne...@gmail.com>
Subject Re: Posting updated ConcatFilter code, using 4.0.0 compatible classes
Date Thu, 01 Nov 2012 19:01:17 GMT
Hi Mark,

Out of curiosity, what was your use case?

Thanks,
Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html


On Wed, Oct 31, 2012 at 10:56 PM, Mark Bennett <mbennett@ideaeng.com> wrote:

> This filter lets you "glue" tokens back together.  This has been discussed
> and posted on the list before, but this updated version uses all the
> preferred 4.x classes.
>
> Normally you wouldn't want to stick tokens back together, but if you've
> found this post, you probably have some atypical need for it (as I did).
> As an example, you could:
> * Let the tokenizer break up text on whitespace
> * Then lowercase
> * Then remove stop words
> * ***Then concatenate all the words back together into one string***
>
> You'll need:
> * ConcatFilter.java  (for Lucene, below)
> * ConcatFilterFactory.java   (for Solr, below)
> * An entry in your schema
>
> schema.xml entry
> ----------------
> ...
> <fieldType ...>
>     <analyzer>
>         ...
>         <filter class="solr.ConcatFilterFactory" />
>         ...
>     </analyzer>
> </fieldType>
> ...
>
> ConcatFilter.java
> -----------------
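> package org.apache.lucene.analysis;
> import java.io.IOException;
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> // Concatenates all tokens of the wrapped stream into one token, with no
> // separator. Declared final because Lucene asserts that TokenStream
> // implementations are final (or have a final incrementToken()).
> public final class ConcatFilter extends TokenFilter {
>     private final CharTermAttribute charTermAttr;
>     // Set once the wrapped stream has been fully consumed
>     private boolean done = false;
>     public ConcatFilter(TokenStream input) {
>         super(input);
>         charTermAttr = addAttribute( CharTermAttribute.class );
>     }
>     @Override
>     public boolean incrementToken() throws IOException {
>         // Don't call input.incrementToken() again after it returned false
>         if ( done ) {
>             return false;
>         }
>         StringBuilder buffer = new StringBuilder();
>         while( input.incrementToken() ) {
>             buffer.append( charTermAttr );
>         }
>         done = true;
>         // We need to clear it either way
>         charTermAttr.setEmpty();
>         if ( buffer.length() > 0 ) {
>             charTermAttr.append( buffer );
>             return true;
>         }
>         else {
>             return false;
>         }
>     }
>     @Override
>     public void reset() throws IOException {
>         // Allow the filter to be reused for the next field value
>         super.reset();
>         done = false;
>     }
> }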
>
> ConcatFilterFactory.java
> ------------------------
> package org.apache.solr.analysis;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.util.TokenFilterFactory;
> public class ConcatFilterFactory extends TokenFilterFactory {
>     @Override
>     public TokenStream create(TokenStream stream) {
>         return new ConcatFilter(stream);
>     }
> }
>
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>
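
The example pipeline in the quoted message (split on whitespace, lowercase, remove stop words, concatenate) can be sketched in plain Java to show what kind of single string comes out the other end. This is only an illustration and does not use Lucene or Solr: the class name ConcatSketch, the helper concat(), and the stop-word set are made up for this example, and the real analysis chain may tokenize or match stop words differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Plain-JDK illustration only; this is not the Lucene/Solr filter itself.
class ConcatSketch {

    // Hypothetical helper: split on whitespace, lowercase, drop stop words,
    // then glue the surviving words together with no separator.
    static String concat(String text, Set<String> stopWords) {
        StringBuilder buffer = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            String lowered = token.toLowerCase(Locale.ROOT);
            // Skip the empty token produced by blank input, and any stop word
            if (lowered.isEmpty() || stopWords.contains(lowered)) {
                continue;
            }
            buffer.append(lowered);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Assumed stop-word list, chosen just for this example
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "of"));
        System.out.println(concat("The Quick Brown Fox", stopWords)); // prints "quickbrownfox"
    }
}
```

Like the ConcatFilter above, the sketch appends terms without a separator, so word boundaries are lost in the output.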
