lucene-dev mailing list archives

From Simon Willnauer <simon.willna...@gmail.com>
Subject Re: Posting updated ConcatFilter code, using 4.0.0 compatible classes
Date Thu, 01 Nov 2012 19:33:13 GMT
I used "combine" filters before too. I think there is a use case for
this stuff; we do similar things in the suggesters with
TokenStreamToAutomaton and finite strings. That is really the same
kind of thing, though. Maybe we can wrap it in a TokenStream and emit
the finite paths as synonyms, i.e. on the same position?

simon
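
A rough sketch of the "emit the finite paths as synonyms on the same position" idea, written in plain Java with no Lucene dependency so it can be run on its own. The `Tok` class and the `asSynonyms` helper are hypothetical names made up for illustration. In a real TokenStream the position would be carried by PositionIncrementAttribute, where an increment of 0 stacks a token on the previous token's position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; Tok and asSynonyms are hypothetical names,
// not part of Lucene.
class SamePositionSketch {

    // A term plus its position increment (0 = same position as the previous token).
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
        @Override public String toString() { return term + "(+" + posInc + ")"; }
    }

    // The first alternative advances the position by one; the rest are stacked on it.
    static List<Tok> asSynonyms(List<String> finiteStrings) {
        List<Tok> out = new ArrayList<>();
        for (String s : finiteStrings) {
            out.add(new Tok(s, out.isEmpty() ? 1 : 0));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(asSynonyms(Arrays.asList("wifi", "wi fi", "wireless")));
    }
}
```

Each finite string here stands in for one path through the automaton; how those paths would actually be enumerated is left out of the sketch.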

On Thu, Nov 1, 2012 at 8:16 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> Hi Otis,
>
>
>
> One use case I had for a similar filter for a customer was an n-gramming
> approach. The tokenization ran first to create “normalized” tokens,
> which were then glued together (with or without whitespace) and n-grammed
> (meaning several n-gram tokens were created from the glued-together string).
>
>
>
> Uwe
>
>
>
> -----
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: uwe@thetaphi.de
>
>
>
> From: Otis Gospodnetic [mailto:otis.gospodnetic@gmail.com]
> Sent: Thursday, November 01, 2012 8:01 PM
> To: dev@lucene.apache.org
> Subject: Re: Posting updated ConcatFilter code, using 4.0.0 compatible
> classes
>
>
>
> Hi Mark,
>
>
>
> Out of curiosity, what was your use case?
>
>
>
> Thanks,
> Otis
>
> --
> Search Analytics - http://sematext.com/search-analytics/index.html
> Performance Monitoring - http://sematext.com/spm/index.html
>
> On Wed, Oct 31, 2012 at 10:56 PM, Mark Bennett <mbennett@ideaeng.com> wrote:
>
> This filter lets you "glue" tokens back together.  This has been discussed
> and posted on the list before, but this updated version uses all the
> preferred 4.x classes.
>
> Normally you wouldn't want to stick tokens back together, but if you've
> found this post, you probably have some atypical need for it (as I did).
> As an example, you could:
> * Let the tokenizer break up text on whitespace
> * Then lowercase
> * Then remove stop words
> * ***Then concatenate all the words back together into one string***
>
> You'll need:
> * ConcatFilter.java  (for lucene, below)
> * ConcatFilterFactory.java   (for solr, below)
> * entry in your schema
>
> schema.xml entry
> ----------
> ...
> <fieldType ...>
>     <analyzer>
>         ...
>         <filter class="solr.ConcatFilterFactory" />
>         ...
>     </analyzer>
> </fieldType>
> ...
>
> ConcatFilter.java
> -----------------
> package org.apache.lucene.analysis;
> import java.io.IOException;
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
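> public class ConcatFilter extends TokenFilter {
>     protected CharTermAttribute charTermAttr;
>     // True once the single concatenated token has been emitted
>     private boolean done = false;
>     public ConcatFilter(TokenStream input) {
>         super(input);
>         charTermAttr = addAttribute( CharTermAttribute.class );
>     }
>     @Override
>     public boolean incrementToken() throws IOException {
>         // Emit at most one token per stream, and do not call the input
>         // again after it has been exhausted
>         if ( done ) {
>             return false;
>         }
>         done = true;
>         StringBuilder buffer = new StringBuilder();
>         while( input.incrementToken() ) {
>             buffer.append( charTermAttr );
>         }
>         // We need to clear it either way
>         charTermAttr.setEmpty();
>         if ( buffer.length() > 0 ) {
>             charTermAttr.append( buffer );
>             return true;
>         }
>         else {
>             return false;
>         }
>     }
>     @Override
>     public void reset() throws IOException {
>         // Allow the filter to be reused for the next document
>         super.reset();
>         done = false;
>     }
> }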
>
> ConcatFilterFactory.java
> ------------------------
> package org.apache.solr.analysis;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.util.TokenFilterFactory;
> public class ConcatFilterFactory extends TokenFilterFactory {
>     @Override
>     public TokenStream create(TokenStream stream) {
>         return new ConcatFilter(stream);
>     }
> }
>
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>
>
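
The pipeline described in the quoted message (split on whitespace, lowercase, remove stop words, concatenate), followed by the n-gramming step from Uwe's use case, can be sketched in plain Java. This is only an illustration of the data flow and does not use Lucene's analysis classes. The class name, the method names and the tiny stop-word list are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration only; the class, method names and stop-word list
// are assumptions, and none of this uses Lucene's analysis classes.
class GlueAndNgramSketch {

    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    // Split on whitespace, lowercase, drop stop words, then glue the rest together.
    static String concat(String text, String separator) {
        StringBuilder glued = new StringBuilder();
        for (String raw : text.trim().split("\\s+")) {
            String term = raw.toLowerCase(Locale.ROOT);
            if (term.isEmpty() || STOP_WORDS.contains(term)) {
                continue;
            }
            if (glued.length() > 0) {
                glued.append(separator);
            }
            glued.append(term);
        }
        return glued.toString();
    }

    // All character n-grams of length n, in order of appearance.
    static List<String> ngrams(String s, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        String glued = concat("The Art of War", "");
        System.out.println(glued);
        System.out.println(ngrams(glued, 3));
    }
}
```

Passing `""` as the separator glues the terms without whitespace, and passing `" "` keeps a single space between them, matching the "with or without whitespace" variants mentioned in the thread.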
