lucene-solr-user mailing list archives

From Konrad Lötzsch <konrad.loetz...@antibodies-online.com>
Subject Re: Any filter to map multiple tokens into one ?
Date Fri, 12 Oct 2012 07:04:10 GMT
You can build shingles and then apply the synonym filter. In that case you 
will have to decide what to do with all the extra tokens the shingle filter 
emits that you don't need.
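
To make the idea concrete, here is a rough Python sketch of what happens to 
the token stream. It is only a simulation of the concept, not Solr or Lucene 
code: the function names, the RULES table and the default shingle sizes are 
assumptions chosen for illustration, and the real filters also track token 
positions, which this ignores.

```python
from typing import Dict, List

# Hypothetical rule table: the shingle "* : *" (three tokens joined by a
# space) is mapped to the single token "*:*".
RULES: Dict[str, str] = {"* : *": "*:*"}


def shingle(tokens: List[str], min_size: int = 2, max_size: int = 3,
            sep: str = " ", output_unigrams: bool = True) -> List[str]:
    """Emit, for each start position, the original token (optional) and
    every run of min_size..max_size adjacent tokens joined by `sep`."""
    out: List[str] = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(sep.join(tokens[start:start + size]))
    return out


def map_synonyms(stream: List[str], rules: Dict[str, str]) -> List[str]:
    """Replace each token that has a rule; pass all other tokens through."""
    return [rules.get(token, token) for token in stream]


if __name__ == "__main__":
    tokens = ["*", ":", "*"]
    # All shingles plus unigrams: most of these are tokens you do not want.
    print(map_synonyms(shingle(tokens), RULES))
    # Only three-token shingles, no unigrams: far fewer leftovers.
    print(map_synonyms(
        shingle(tokens, min_size=3, max_size=3, output_unigrams=False),
        RULES))
```

Restricting the shingle size and switching off unigrams removes most of the 
unwanted tokens in this sketch, but longer inputs still produce shingles 
that match no rule, so some cleanup step after the mapping is still needed.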


On 12.10.2012 01:35, T. Kuro Kurosaka wrote:
> I am looking for a way to fold a particular sequence of tokens into 
> one token.
> Concretely, I'd like to detect a three-token sequence of "*", ":" and 
> "*", and replace it with a token of the text "*:*".
> I tried SynonymFilter but it seems it can only deal with a single 
> input token. "* : * => *:*" seems to be interpreted
> as one input token of 5 characters: "*", space, ":", space and "*".
>
> I'm using Solr 3.5.
>
> Background:
> My tokenizer separates the three-character sequence "*:*" into 3 tokens 
> of one character each.
> The edismax parser, when given the query "*:*", i.e. find every doc, 
> seems to pass the entire string "*:*" to the query analyzer (I suspect 
> a bug) and feeds the tokenized result to a DisjunctionMaxQuery object,
> according to this debug output:
>
> <lst name="debug">
> <str name="rawquerystring">*:*</str>
> <str name="querystring">*:*</str>
> <str name="parsedquery">+MatchAllDocsQuery(*:*) 
> DisjunctionMaxQuery((body:"* : *"~100^0.5 | title:"* : 
> *"~100^1.2)~0.01)</str>
> <str name="parsedquery_toString">+*:* (body:"* : *"~100^0.5 | title:"* 
> : *"~100^1.2)~0.01</str>
>
> Notice that there is a space between * and : in 
> DisjunctionMaxQuery((body:"* : *" ....)
>
> Probably because of this, the hit score is as low as 0.109, while it 
> is 1.000 if an analyzer that doesn't break "*:*" is used.
> So I'd like to stitch together "*", ":", "*" into "*:*" again to make 
> DisjunctionMaxQuery happy.
>
>
> Thanks.
>
>
> T. "Kuro" Kurosaka
>
>

