lucene-dev mailing list archives

From Mark Bennett <mbenn...@ideaeng.com>
Subject Re: Posting updated ConcatFilter code, using 4.0.0 compatible classes
Date Thu, 01 Nov 2012 19:37:19 GMT
Hi Simon,

I'd love to see a ConcatFilter and factory find a permanent home as part of
the stable of standard filters.  But perhaps for the Automaton function
it'd need to be packaged differently?

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


On Thu, Nov 1, 2012 at 12:33 PM, Simon Willnauer
<simon.willnauer@gmail.com> wrote:

> I used "combine" filters before too. I think there is a use case for
> this stuff; we do similar things in suggesters with
> TokenStreamToAutomaton and finite strings. That is really the same
> kind of thing, though. Maybe we can wrap it in a TokenStream and emit
> the finite paths as synonyms, i.e. on the same position?
>
> simon
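
To make the "same position" idea concrete: Lucene represents synonyms as tokens whose position increment is 0, so they stack on the position of the token before them. The following is only a rough sketch in plain Java, not Lucene code. It assumes the finite paths have already been enumerated as strings, and the names SamePositionSketch, Tok and stack are hypothetical stand-ins rather than anything from TokenStreamToAutomaton.

```java
import java.util.ArrayList;
import java.util.List;

final class SamePositionSketch {

    // Hypothetical stand-in for a token: its text plus a position increment.
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Assumes the finite paths were already enumerated elsewhere.
    static List<Tok> stack(List<String> paths) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < paths.size(); i++) {
            // The first path advances the position by one; every later
            // path gets increment 0, so it sits on the same position.
            out.add(new Tok(paths.get(i), i == 0 ? 1 : 0));
        }
        return out;
    }
}
```

A real filter would set these values through the term and position-increment attributes instead of building a list.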
>
> On Thu, Nov 1, 2012 at 8:16 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > Hi Otis,
> >
> >
> >
> > One use case I had for a similar filter for a customer was some ngramming
> > approach. The tokenization before was there to create “normalized” tokens,
> > which were then glued together (with or w/o whitespace) and ngrammed
> > (meaning several ngram tokens were created from the glued-together thingie).
> >
> >
> >
> > Uwe
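
A minimal plain-Java sketch of the glue-then-ngram idea described above, for readers who want to see the shape of it. It is not the customer code and does not use Lucene. The class and method names are hypothetical, and the fixed gram size n is an assumption, since the message does not say which sizes were used.

```java
import java.util.ArrayList;
import java.util.List;

final class GlueAndNgramSketch {

    // Glue already-normalized tokens together, then cut the result into
    // overlapping character n-grams of one fixed size (an assumption).
    static List<String> glueAndNgram(List<String> tokens, String separator, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        // Use "" to glue without whitespace, " " to glue with it.
        String glued = String.join(separator, tokens);
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= glued.length(); i++) {
            grams.add(glued.substring(i, i + n));
        }
        return grams;
    }
}
```

For the tokens "new" and "york" glued without a separator and n = 3, this yields new, ewy, wyo, yor, ork.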
> >
> >
> >
> > -----
> >
> > Uwe Schindler
> >
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> >
> > http://www.thetaphi.de
> >
> > eMail: uwe@thetaphi.de
> >
> >
> >
> > From: Otis Gospodnetic [mailto:otis.gospodnetic@gmail.com]
> > Sent: Thursday, November 01, 2012 8:01 PM
> > To: dev@lucene.apache.org
> > Subject: Re: Posting updated ConcatFilter code, using 4.0.0 compatible
> > classes
> >
> >
> >
> > Hi Mark,
> >
> >
> >
> > Out of curiosity, what was your use case?
> >
> >
> >
> > Thanks,
> > Otis
> >
> > --
> > Search Analytics - http://sematext.com/search-analytics/index.html
> > Performance Monitoring - http://sematext.com/spm/index.html
> >
> > On Wed, Oct 31, 2012 at 10:56 PM, Mark Bennett <mbennett@ideaeng.com> wrote:
> >
> > This filter lets you "glue" tokens back together.  This has been discussed
> > and posted on the list before, but this updated version uses all the
> > preferred 4.x classes.
> >
> > Normally you wouldn't want to stick tokens back together, but if you've
> > found this post, you probably have some atypical need for it (as I did).
> > As an example you could:
> > * Let the tokenizer break up text on white space
> > * Then lowercase
> > * Then remove stop words
> > * ***Then concatenate all the words back together into one string***
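
The four steps above can be sketched in plain Java to show what the chain produces end to end. This is only an illustration under stated assumptions: it does not use the Lucene classes, the class name ConcatPipelineSketch is hypothetical, and the three-word stop list is made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ConcatPipelineSketch {

    // Hypothetical stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of"));

    static String analyze(String text) {
        StringBuilder glued = new StringBuilder();
        // Step 1: break the text up on whitespace.
        for (String token : text.trim().split("\\s+")) {
            // Step 2: lowercase.
            String lower = token.toLowerCase(Locale.ROOT);
            // Step 3: drop stop words (and the empty token that
            // split() returns for blank input).
            if (lower.isEmpty() || STOP_WORDS.contains(lower)) {
                continue;
            }
            // Step 4: concatenate the survivors with no separator.
            glued.append(lower);
        }
        return glued.toString();
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Lord of the Rings")); // prints "lordrings"
    }
}
```

In Solr the same chain would be expressed as a tokenizer followed by filters in the analyzer definition, with the concatenating filter placed last.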
> >
> > You'll need:
> > * ConcatFilter.java  (for lucene, below)
> > * ConcatFilterFactory.java   (for solr, below)
> > * entry in your schema
> >
> > schema.xml entry
> > ----------
> > ...
> > <fieldType ...>
> >     <analyzer>
> >         ...
> >         <filter class="solr.ConcatFilterFactory" />
> >         ...
> >     </analyzer>
> > </fieldType>
> > ...
> >
> > ConcatFilter.java
> > -----------------
> > package org.apache.lucene.analysis;
> > import java.io.IOException;
> > import org.apache.lucene.analysis.TokenFilter;
> > import org.apache.lucene.analysis.TokenStream;
> > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
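> > // Final, because TokenStream asserts (when assertions are enabled) that
> > // implementation classes, or at least their incrementToken(), are final.
> > public final class ConcatFilter extends TokenFilter {
> >     private final CharTermAttribute charTermAttr;
> >     // True once the input has been consumed and the single token emitted
> >     private boolean done = false;
> >     public ConcatFilter(TokenStream input) {
> >         super(input);
> >         charTermAttr = addAttribute( CharTermAttribute.class );
> >     }
> >     @Override
> >     public boolean incrementToken() throws IOException {
> >         // Never pull from the input again after it has been exhausted
> >         if ( done ) {
> >             return false;
> >         }
> >         done = true;
> >         // Consume the whole input, gluing the terms together
> >         StringBuilder buffer = new StringBuilder();
> >         while( input.incrementToken() ) {
> >             buffer.append( charTermAttr );
> >         }
> >         // We need to clear it either way
> >         charTermAttr.setEmpty();
> >         if ( buffer.length() > 0 ) {
> >             charTermAttr.append( buffer );
> >             return true;
> >         }
> >         else {
> >             return false;
> >         }
> >     }
> >     @Override
> >     public void reset() throws IOException {
> >         // Allow the filter to be reused for the next document
> >         super.reset();
> >         done = false;
> >     }
> > }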
> >
> > ConcatFilterFactory.java
> > ------------------------
> > package org.apache.solr.analysis;
> > import org.apache.lucene.analysis.TokenStream;
> > import org.apache.lucene.analysis.util.TokenFilterFactory;
> > public class ConcatFilterFactory extends TokenFilterFactory {
> >     @Override
> >     public TokenStream create(TokenStream stream) {
> >         return new ConcatFilter(stream);
> >     }
> > }
> >
> >
> > --
> > Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
> > Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
> >
> >
>
