lucene-dev mailing list archives

From Otis Gospodnetic <otis.gospodne...@gmail.com>
Subject Re: Posting updated ConcatFilter code, using 4.0.0 compatible classes
Date Thu, 01 Nov 2012 19:54:08 GMT
Hi Mark,

Thanks for the explanation - makes sense!

Re ES - yes.  But I pasted your Q in
http://blog.sematext.com/2012/09/04/solr-vs-elasticsearch-part-2-data-handling/comments,
too, so you should get a more thorough answer there soon.

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html


On Thu, Nov 1, 2012 at 3:32 PM, Mark Bennett <mbennett@ideaeng.com> wrote:

> Hi Otis,
>
> Forgive my vagueness, it's an NDA thing.
>
> Generally speaking you might want to do record matching based on a number
> of fields.  But since text fields are input by humans, they can be a bit
> inconsistent about how values are entered.
>
> One answer is to remove things like stop words, abbreviations,
> punctuation, etc. to normalize the fields a bit.  But then you might want
> to do some fuzzy matching with things like Levenshtein or double metaphone,
> etc., and treat the entire field as one "unit".
>
> I realize you could still get much of this by then using traditional
> search, but in the app we're porting the business rules are quite specific
> and need to support legacy accounts.  And clearly the combination of
> normalization and fuzzy matching is potentially quite "lossy", but here
> again the business logic has other mitigators for that.
>
> Let me ask you a question back.  We really appreciate your ongoing series
> on Solr vs. ElasticSearch (I haven't dived into ES yet).  Looking at your
> section on indexing (
> http://blog.sematext.com/2012/09/04/solr-vs-elasticsearch-part-2-data-handling/),
> can ES be as precise and flexible about creating highly customized tokens?
> When I initially heard about schema-less and read their use cases, I had
> the impression that ES was more for mainstream use-cases, but your review
> got me thinking maybe there's a lot more there?
>
> Mark
>
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>
>
> On Thu, Nov 1, 2012 at 12:01 PM, Otis Gospodnetic <
> otis.gospodnetic@gmail.com> wrote:
>
>> Hi Mark,
>>
>> Out of curiosity, what was your use case?
>>
>> Thanks,
>> Otis
>> --
>> Search Analytics - http://sematext.com/search-analytics/index.html
>> Performance Monitoring - http://sematext.com/spm/index.html
>>
>>
>>
>> On Wed, Oct 31, 2012 at 10:56 PM, Mark Bennett <mbennett@ideaeng.com> wrote:
>>
>>> This filter lets you "glue" tokens back together.  This has been
>>> discussed and posted on the list before, but this updated version uses all
>>> the preferred 4.x classes.
>>>
>>> Normally you wouldn't want to stick tokens back together, but if you've
>>> found this post, you probably have some atypical need for it (as I did).
>>> As an example you could:
>>> * Let tokenizer break up text on white spaces
>>> * Then lowercase
>>> * Then remove stop words
>>> * ***Then concatenate all the words back together into one string***
>>>
>>> You'll need:
>>> * ConcatFilter.java  (for lucene, below)
>>> * ConcatFilterFactory.java   (for solr, below)
>>> * entry in your schema
>>>
>>> schema.xml entry
>>> ----------
>>> ...
>>> <fieldType ...>
>>>     <analyzer>
>>>         ...
>>>         <filter class="solr.ConcatFilterFactory" />
>>>         ...
>>>     </analyzer>
>>> </fieldType>
>>> ...
>>>
>>> ConcatFilter.java
>>> -----------------
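>>> package org.apache.lucene.analysis;
>>> import java.io.IOException;
>>> import org.apache.lucene.analysis.TokenFilter;
>>> import org.apache.lucene.analysis.TokenStream;
>>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>> // Declared final: TokenStream asserts that subclasses (or their
>>> // incrementToken() methods) are final when assertions are enabled.
>>> public final class ConcatFilter extends TokenFilter {
>>>     private final CharTermAttribute charTermAttr;
>>>     // True once the single concatenated token has been produced
>>>     private boolean done = false;
>>>     public ConcatFilter(TokenStream input) {
>>>         super(input);
>>>         charTermAttr = addAttribute( CharTermAttribute.class );
>>>     }
>>>     @Override
>>>     public boolean incrementToken() throws IOException {
>>>         // Do not pull from an already exhausted input a second time
>>>         if ( done ) {
>>>             return false;
>>>         }
>>>         done = true;
>>>         // Consume every upstream token, gluing the terms together
>>>         StringBuilder buffer = new StringBuilder();
>>>         while( input.incrementToken() ) {
>>>             buffer.append( charTermAttr );
>>>         }
>>>         // We need to clear it either way
>>>         charTermAttr.setEmpty();
>>>         if ( buffer.length() > 0 ) {
>>>             charTermAttr.append( buffer );
>>>             return true;
>>>         }
>>>         else {
>>>             return false;
>>>         }
>>>     }
>>>     @Override
>>>     public void reset() throws IOException {
>>>         // Allow the filter to be reused for the next field value
>>>         super.reset();
>>>         done = false;
>>>     }
>>> }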
>>>
>>> ConcatFilterFactory.java
>>> ------------------------
>>> package org.apache.solr.analysis;
>>> import org.apache.lucene.analysis.TokenStream;
>>> import org.apache.lucene.analysis.util.TokenFilterFactory;
>>> // Solr factory so the filter can be referenced from schema.xml
>>> public class ConcatFilterFactory extends TokenFilterFactory {
>>>     @Override
>>>     public TokenStream create(TokenStream stream) {
>>>         // Wrap the incoming stream in a ConcatFilter
>>>         return new ConcatFilter(stream);
>>>     }
>>> }
>>>
>>>
>>> --
>>> Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
>>> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>>>
>>
>>
>
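
For readers who want to see the effect of the analysis chain described in the original post (split on whitespace, lowercase, remove stop words, concatenate) without setting up Lucene or Solr, here is a rough plain-Java sketch of the same idea. It is not the ConcatFilter itself; the class name ConcatSketch, the method concatKey, and the tiny stop word list are all made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class ConcatSketch {
    // Hypothetical stop word list, for illustration only
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    // Split on whitespace, lowercase, drop stop words, then glue the
    // surviving tokens back together into one string
    static String concatKey(String text) {
        StringBuilder buffer = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !STOP_WORDS.contains(lower)) {
                buffer.append(lower);
            }
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // prints "banknewyork"
        System.out.println(concatKey("The Bank of  New York"));
    }
}
```

In a real setup the tokenizer, lowercase filter, and stop filter would come from the analyzer chain in schema.xml, with the ConcatFilter as the last step.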

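
The use case mentioned in the thread, fuzzy matching that treats an entire normalized field as one "unit", could look roughly like the sketch below. It uses a standard dynamic-programming Levenshtein distance on two already-concatenated keys. The class name FuzzyKeySketch, the method sameRecord, and the edit threshold are assumptions for illustration; the actual business rules are not described in the thread.

```java
class FuzzyKeySketch {
    // Levenshtein edit distance, computed with two rolling rows
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical rule: two keys refer to the same record if they are
    // within maxEdits single-character edits of each other
    static boolean sameRecord(String keyA, String keyB, int maxEdits) {
        return levenshtein(keyA, keyB) <= maxEdits;
    }

    public static void main(String[] args) {
        // prints "true" (one substitution: e -> i)
        System.out.println(sameRecord("acmewidgets", "acmewidgits", 1));
    }
}
```

Because the whole field is a single token, one typo anywhere in the value costs one edit, which is part of why the combination is "lossy" and needs other mitigating checks.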