lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Posting updated ConcatFilter code, using 4.0.0 compatible classes
Date Thu, 01 Nov 2012 19:16:21 GMT
Hi Otis,

 

One use case I had for a similar filter for a customer was an n-gramming approach. The tokenization
beforehand was there to create “normalized” tokens, which were then glued together (with
or w/o whitespace) and n-grammed (meaning several n-gram tokens were created from the
glued-together thingie).
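
To make the idea concrete, here is a minimal plain-Java sketch of "glue, then n-gram". It is only an illustration and not the customer's code: the class name GlueNgramSketch, the method ngrams, and the choice of fixed-size character n-grams are all assumptions, and no Lucene classes are used.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical names, plain Java, no Lucene involved.
final class GlueNgramSketch {
    // Joins the already-normalized tokens with the given separator ("" or " ")
    // and returns every character n-gram of length n from the glued string.
    static List<String> ngrams(List<String> tokens, String separator, int n) {
        String glued = String.join(separator, tokens);
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= glued.length(); i++) {
            grams.add(glued.substring(i, i + n));
        }
        return grams;
    }
}
```

With the tokens "new" and "york", an empty separator, and n = 3, the glued string is "newyork" and the n-grams span the former token boundary ("ewy", "wyo"), which per-token n-gramming would not produce.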

 

Uwe

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: Otis Gospodnetic [mailto:otis.gospodnetic@gmail.com] 
Sent: Thursday, November 01, 2012 8:01 PM
To: dev@lucene.apache.org
Subject: Re: Posting updated ConcatFilter code, using 4.0.0 compatible classes

 

Hi Mark,

 

Out of curiosity, what was your use case?

 

Thanks,
Otis

--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html



On Wed, Oct 31, 2012 at 10:56 PM, Mark Bennett <mbennett@ideaeng.com> wrote:

This filter lets you "glue" tokens back together.  This has been discussed and posted on the
list before, but this updated version uses all the preferred 4.x classes.

Normally you wouldn't want to stick tokens back together, but if you've found this post, you
probably have some atypical need for it (as I did).

As an example, you could:
* Let the tokenizer break up text on whitespace
* Then lowercase
* Then remove stop words
* ***Then concatenate all the words back together into one string***
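
The four steps above can be sketched in plain Java to show what the chain is expected to output. This is an illustration under assumptions, not the Lucene analysis chain itself: the class name ConcatPipelineSketch, the method analyze, and the three-word stop list are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: hypothetical names, plain Java, no Lucene involved.
final class ConcatPipelineSketch {
    // Hypothetical stop word list chosen for this example.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    static String analyze(String text) {
        List<String> kept = new ArrayList<>();
        // Step 1: break the text up on whitespace.
        for (String token : text.trim().split("\\s+")) {
            // Step 2: lowercase each token.
            String lower = token.toLowerCase(Locale.ROOT);
            // Step 3: drop stop words (and the empty token produced by blank input).
            if (!lower.isEmpty() && !STOP_WORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        // Step 4: glue the surviving tokens back together with no separator.
        return String.join("", kept);
    }
}
```

For instance, analyze("The Quick Brown Fox") yields the single string "quickbrownfox". In the real chain, that string is the one token ConcatFilter emits.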

You'll need:
* ConcatFilter.java  (for lucene, below)
* ConcatFilterFactory.java   (for solr, below)
* entry in your schema

schema.xml entry
----------
...
<fieldType ...>
    <analyzer>
        ...
        <filter class="solr.ConcatFilterFactory" />
        ...
    </analyzer>
</fieldType>
...

ConcatFilter.java
-----------------
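package org.apache.lucene.analysis;
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
// Declared final: Lucene asserts that TokenStream subclasses are final
// (or have a final incrementToken()) when assertions are enabled.
public final class ConcatFilter extends TokenFilter {
    private final CharTermAttribute charTermAttr;
    // True once the input has been drained, so an exhausted stream is never pulled again
    private boolean done = false;
    public ConcatFilter(TokenStream input) {
        super(input);
        charTermAttr = addAttribute( CharTermAttribute.class );
    }
    @Override
    public boolean incrementToken() throws IOException {
        if ( done ) {
            return false;
        }
        done = true;
        // Consume every upstream token and collect the term text
        StringBuilder buffer = new StringBuilder();
        while( input.incrementToken() ) {
            buffer.append( charTermAttr );
        }
        if ( buffer.length() == 0 ) {
            return false;
        }
        // Emit the glued text as a single token; other attributes return to their defaults
        clearAttributes();
        charTermAttr.append( buffer );
        return true;
    }
    @Override
    public void reset() throws IOException {
        super.reset();
        done = false;
    }
}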

ConcatFilterFactory.java
------------------------
package org.apache.solr.analysis;
// ConcatFilter lives in a different package, so it must be imported explicitly
import org.apache.lucene.analysis.ConcatFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;
public class ConcatFilterFactory extends TokenFilterFactory {
    @Override
    public TokenStream create(TokenStream stream) {
        return new ConcatFilter(stream);
    }
}


--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

 

