lucene-dev mailing list archives

From Mark Bennett <mbenn...@ideaeng.com>
Subject Re: Posting updated ConcatFilter code, using 4.0.0 compatible classes
Date Thu, 01 Nov 2012 19:32:58 GMT
Hi Otis,

Forgive my vagueness, it's an NDA thing.

Generally speaking you might want to do record matching based on a number
of fields.  But since text fields are input by humans, they can be a bit
inconsistent about how values are entered.

One answer is to remove things like stop words, abbreviations, punctuation,
etc. to normalize the fields a bit.  But then you might want to do some
fuzzy matching with things like Levenshtein or double metaphone, etc., and
treat the entire field as one "unit".
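
A minimal sketch of that idea in plain Java, with no Lucene dependency. The
class name RecordKey, the stop-word list, and the maxEdits threshold are all
hypothetical, chosen only for illustration; this is not the code from the app
being ported. It lowercases a field, strips punctuation, drops stop words,
glues what is left into one key, and then compares two keys by Levenshtein
distance:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class RecordKey {
    // Hypothetical stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "of", "and", "inc"));

    /** Lowercase, strip punctuation, drop stop words, glue the rest together. */
    static String normalize(String field) {
        StringBuilder key = new StringBuilder();
        for (String word : field.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Keep only letters and digits.
            String cleaned = word.replaceAll("[^\\p{L}\\p{N}]", "");
            if (!cleaned.isEmpty() && !STOP_WORDS.contains(cleaned)) {
                key.append(cleaned);
            }
        }
        return key.toString();
    }

    /** Dynamic-programming Levenshtein edit distance using two rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Treats each whole normalized field as one unit when comparing. */
    static boolean fuzzyMatch(String a, String b, int maxEdits) {
        return levenshtein(normalize(a), normalize(b)) <= maxEdits;
    }
}
```

With these assumptions, RecordKey.normalize("The Acme Corp., Inc.") yields
"acmecorp", and RecordKey.fuzzyMatch("The Acme Corp., Inc.", "ACME Corpp", 1)
is true, because the two keys differ by a single edit.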

I realize you could still get much of this by then using traditional
search, but in the app we're porting the business rules are quite specific
and need to support legacy accounts.  And clearly the combination of
normalization and fuzzy matching is potentially quite "lossy", but here
again the business logic has other mitigators for that.

Let me ask you a question back.  We really appreciate your ongoing series
on Solr vs. ElasticSearch (I haven't dived into ES yet).  Looking at your
section on indexing (
http://blog.sematext.com/2012/09/04/solr-vs-elasticsearch-part-2-data-handling/),
can ES be as precise and flexible about creating highly customized tokens?
When I initially heard about schema-less and read their use cases, I had
the impression that ES was more for mainstream use-cases, but your review
got me thinking maybe there's a lot more there?

Mark

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


On Thu, Nov 1, 2012 at 12:01 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Hi Mark,
>
> Out of curiosity, what was your use case?
>
> Thanks,
> Otis
> --
> Search Analytics - http://sematext.com/search-analytics/index.html
> Performance Monitoring - http://sematext.com/spm/index.html
>
>
>
> On Wed, Oct 31, 2012 at 10:56 PM, Mark Bennett <mbennett@ideaeng.com> wrote:
>
>> This filter lets you "glue" tokens back together.  This has been
>> discussed and posted on the list before, but this updated version uses all
>> the preferred 4.x classes.
>>
>> Normally you wouldn't want to stick tokens back together, but if you've
>> found this post, you probably have some atypical need for it (as I did).
>> As an example you could:
>> * Let the tokenizer break up text on whitespace
>> * Then lowercase
>> * Then remove stop words
>> * ***Then concatenate all the words back together into one string***
>>
>> You'll need:
>> * ConcatFilter.java  (for lucene, below)
>> * ConcatFilterFactory.java   (for solr, below)
>> * entry in your schema
>>
>> schema.xml entry
>> ----------
>> ...
>> <fieldType ...>
>>     <analyzer>
>>         ...
>>         <filter class="solr.ConcatFilterFactory" />
>>         ...
>>     </analyzer>
>> </fieldType>
>> ...
>>
>> ConcatFilter.java
>> -----------------
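>> package org.apache.lucene.analysis;
>> import java.io.IOException;
>> import org.apache.lucene.analysis.TokenFilter;
>> import org.apache.lucene.analysis.TokenStream;
>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>> // Concatenates every token from the upstream stream into a single token.
>> public class ConcatFilter extends TokenFilter {
>>     protected CharTermAttribute charTermAttr;
>>     // True once the upstream stream has been fully consumed
>>     private boolean done = false;
>>     public ConcatFilter(TokenStream input) {
>>         super(input);
>>         charTermAttr = addAttribute( CharTermAttribute.class );
>>     }
>>     @Override
>>     public boolean incrementToken() throws IOException {
>>         // Don't call input.incrementToken() again after it returned false
>>         if ( done ) {
>>             return false;
>>         }
>>         done = true;
>>         StringBuilder buffer = new StringBuilder();
>>         while( input.incrementToken() ) {
>>             buffer.append( charTermAttr );
>>         }
>>         // We need to clear it either way
>>         charTermAttr.setEmpty();
>>         if ( buffer.length() > 0 ) {
>>             charTermAttr.append( buffer );
>>             return true;
>>         }
>>         else {
>>             return false;
>>         }
>>     }
>>     @Override
>>     public void reset() throws IOException {
>>         // Allow the filter to be reused for the next field value
>>         super.reset();
>>         done = false;
>>     }
>> }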
>>
>> ConcatFilterFactory.java
>> ------------------------
>> package org.apache.solr.analysis;
>> import org.apache.lucene.analysis.TokenStream;
>> import org.apache.lucene.analysis.util.TokenFilterFactory;
>> public class ConcatFilterFactory extends TokenFilterFactory {
>>     @Override
>>     public TokenStream create(TokenStream stream) {
>>         return new ConcatFilter(stream);
>>     }
>> }
>>
>>
>> --
>> Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
>> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>>
>
>
