lucene-solr-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com>
Subject Re: Proper analyzer / tokenizer for syslog data?
Date Fri, 04 Nov 2011 08:20:20 GMT
> Example data:
> 01/23/2011 05:12:34 [Test] a=1; hello_there=50;
> data=[1,5,30%];
> 
> I would love to be able to just "grep" the data - ie. if I
> search for "ello", it finds and returns "ello", and if I
> search for "hello_there=5", it would match too.
> 
> Here's what I'm using now:
> 
>    <fieldType name="text_sy" class="solr.TextField">
>      <analyzer>
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.WordDelimiterFilterFactory"
>                generateWordParts="0" generateNumberParts="0"
>                catenateWords="0" catenateNumbers="0" catenateAll="0"
>                splitOnCaseChange="0"/>
>      </analyzer>
>    </fieldType>
> 
> The problem with this is that if I search for a substring,
> I don't get anything back.  For example, searching for
> "ello" or "*ello*" doesn't return.  Any ideas?
> 
> http://localhost:8983/solr/select?q=*ello*&start=0&rows=50&hl.maxAnalyzedChars=2147483647&hl.useFastVectorHighlighter=true&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400

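For substring matching, NGramFilterFactory is required in the index-time analyzer, so that each substring of a token (within the gram size limits) is indexed as its own term: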

<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>

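You may also want to use WhitespaceTokenizerFactory instead of StandardTokenizerFactory. It splits on whitespace only, so a token such as hello_there=50; stays intact. The Analysis page in the admin interface shows how each tokenizer and filter processes a given input.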
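
Putting the two suggestions together, the field type could look roughly like the sketch below. The name text_sy_grep is made up for illustration, and splitting the analyzer into separate index and query chains (with the n-gram filter on the index side only) is an assumption about how you would wire it up, not something tested against your data.

```xml
<!-- Sketch only: "text_sy_grep" is a hypothetical name. -->
<fieldType name="text_sy_grep" class="solr.TextField">
  <!-- Index time: split on whitespace, lowercase, then emit every
       substring of 1 to 15 characters of each token as its own term. -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <!-- Query time: no n-grams, so the search term is matched as typed
       (after lowercasing) against the indexed grams. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a setup like this, a plain query for ello or hello_there=5 should match without wildcards. A search term longer than maxGramSize (15 here) would not match, since no gram of that length is indexed, and a minGramSize of 1 makes the index considerably larger, so both values are worth adjusting to your data.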
