lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: KeywordTokenizerFactory - trouble with "exact" matches
Date Thu, 30 Jan 2014 14:20:36 GMT
Note, the comments about LowerCaseTokenizer were a red herring. You were
using LowerCaseFilterFactory; note "Filter" rather than "Tokenizer". So it
does just what you expected: it lowercases the entire input. You would have used
LowerCaseTokenizerFactory in place of KeywordTokenizerFactory, not as a filter.
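
To make the difference concrete, here is a rough Python sketch. It is not
Solr code; the two function names are made up for illustration, and the
splitting rule (break at anything that is not a letter) is my approximation
of what the lowercase tokenizer does.

```python
import re

def lowercase_tokenizer(text):
    # Approximation of LowerCaseTokenizerFactory used as the tokenizer:
    # split the input at non-letters and lowercase each piece.
    return [piece.lower() for piece in re.findall(r"[^\W\d_]+", text)]

def keyword_then_lowercase_filter(text):
    # Approximation of KeywordTokenizerFactory followed by
    # LowerCaseFilterFactory: the whole input stays a single token,
    # which is then lowercased.
    return [text.lower()]

print(lowercase_tokenizer("FE 009"))
print(keyword_then_lowercase_filter("FE 009"))
```

With the filter, the value survives as one lowercased token; with the
tokenizer in this sketch, the digits and the space are dropped.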

As for the rest, I expect Jack is right: the problem is in the query parsing,
which happens before the input reaches the field's analysis chain.

Best
Erick

On Thu, Jan 30, 2014 at 6:29 AM, Aleksander Akerø
<aleksander@gurusoft.no> wrote:
> Hi Srinivasa
>
> Yes, I've come to understand that the analyzers will never "see" the
> whitespace, so there is no need for pattern replacement, as Jack points out.
> So the solution would be to set which parser to use for the query. Jack
> has also pointed out that the "field" query parser should work in this
> particular setting -> http://wiki.apache.org/solr/QueryParser
>
> My problem, though, was that I needed this for only one of the fields in the
> schema. For all the other fields, e.g. name,
> description etc., I would very much like to make use of the eDisMax
> functionality. And it seems that only one query parser can be defined
> per query, in other words for all fields. Jack, you may correct me if I'm
> wrong here :)
>
> This particular customer wanted a wildcard search at both ends of the
> phrase, and that complicated the problem. I therefore chose to
> strip all whitespace from this field in SQL at index time, using the DIH,
> and then to use EdgeNGramFilterFactory on both sides of the keyword, as in
> the config below. That seemed to work pretty nicely.
>
> <!-- #### WildCard search number #### -->
> <fieldType name="keyword" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" side="front"/>
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" side="back"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
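
For anyone following along, here is a small Python sketch of the terms an
index chain like the one above might produce. It is only an approximation,
not Solr code: the function names are made up, and it assumes the filters run
in sequence, so the second (side="back") filter sees every gram emitted by
the first.

```python
def edge_ngrams(token, min_size, max_size, side):
    # Prefixes (side="front") or suffixes (side="back") of the token,
    # for every length from min_size up to max_size.
    grams = []
    for n in range(min_size, min(max_size, len(token)) + 1):
        grams.append(token[:n] if side == "front" else token[-n:])
    return grams

def index_terms(value):
    # KeywordTokenizer + LowerCaseFilter: one lowercased token.
    token = value.lower()
    front = edge_ngrams(token, 2, 25, "front")
    # Assumption: the back-side filter is applied to each front gram.
    terms = []
    for gram in front:
        terms.extend(edge_ngrams(gram, 2, 25, "back"))
    return terms

# Values as they would look after whitespace is stripped at index time.
print(sorted(set(index_terms("FE009"))))
print(sorted(set(index_terms("EE009"))))
```

Under that assumption, both values end up with shared terms such as "009",
which may be why a query that gets split into separate terms can still match
both of them.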
>
> I also added a bit of extra weighting for the "keyword" field so that exact
> matches received a higher score.
>
> What this solution doesn't do is exclude values like "EE 009" when
> searching for "FE 009", but they come far down the list, which is OK for the
> customer, because these results are usually somewhat related or
> within the same category.
>
> *Aleksander Akerø*
> Systemkonsulent
> Mobil: 944 89 054
> E-post: aleksander@gurusoft.no
>
> *Gurusoft AS*
> Telefon: 92 44 09 99
> Østre Kullerød
> www.gurusoft.no
>
>
> 2014-01-30 Jack Krupansky <jack@basetechnology.com>
>
>> The standard, keyword-oriented query parsers will all treat unquoted,
>> unescaped white space as term delimiters and ignore the white space. There
>> is no way to bypass that behavior. So, your regex will never even see the
>> white space - unless you enclose the text and white space in quotes or use
>> a backslash to quote each white space character.
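
A rough Python illustration of that point, using the "nospaces" analysis
from Srinivasa's message further down. This is a simulation only: shlex.split
stands in for the query parser's whitespace handling (it splits on unquoted
white space and honours double quotes and backslash escapes), and both
function names are made up.

```python
import re
import shlex

def analyze(text):
    # Stand-in for the "nospaces" analyzer: keep the input as one token,
    # lowercase it, then remove every run of non-word characters.
    return re.sub(r"[^\w]+", "", text.lower())

def parse_then_analyze(query):
    # The parser splits first; the analyzer only ever sees the pieces.
    return [analyze(chunk) for chunk in shlex.split(query)]

print(analyze("East Enders"))                  # what gets indexed
print(parse_then_analyze("east end ers"))      # unquoted: three terms
print(parse_then_analyze('"east end ers"'))    # quoted: one term
print(parse_then_analyze(r"east\ end\ ers"))   # escaped: one term
```

In this sketch only the quoted and the escaped forms produce a single term
that equals the indexed value.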
>>
>> You can use the "field" and "term" query parsers to pass a query string as
>> if it were fully enclosed in quotes, but that only handles a single term
>> and does not allow for multiple terms or any query operators. For example:
>>
>> {!field f=myfield}Foo Bar
>>
>> See:
>> http://wiki.apache.org/solr/QueryParser
>>
>> You can also pre-configure the field query parser with the defType=field
>> parameter.
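
If it helps, this is roughly how the {!field ...} form above could be put
into a request from a client. It is a minimal sketch using only the Python
standard library; the helper name is made up, "myfield" is just the
placeholder from Jack's example, and no host or handler path is assumed.

```python
from urllib.parse import urlencode

def field_query_params(field, text):
    # Build the q parameter using the local-params syntax shown above,
    # so the whole text is handed to the field's analyzer in one piece.
    return {"q": "{!field f=%s}%s" % (field, text)}

params = field_query_params("myfield", "Foo Bar")
# URL-encode so the braces, "!" and the space survive in a query string.
query_string = urlencode(params)
print(query_string)
```

The encoded string is what would be appended to the select request for
whichever core is being queried.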
>>
>> -- Jack Krupansky
>>
>>
>> -----Original Message----- From: Srinivasa7
>> Sent: Thursday, January 30, 2014 6:37 AM
>>
>> To: solr-user@lucene.apache.org
>> Subject: Re: KeywordTokenizerFactory - trouble with "exact" matches
>>
>> Hi,
>>
>> I have a similar kind of problem: I want to search for words that contain
>> spaces, and I want to search with all the spaces stripped.
>>
>> I have used the following schema for that:
>>
>> <fieldType name="nospaces" class="solr.TextField"
>>            autoGeneratePhraseQueries="true">
>>   <analyzer type="index">
>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.PatternReplaceFilterFactory"
>>             pattern="[^\w]+" replacement="" replace="all"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.PatternReplaceFilterFactory"
>>             pattern="[^\w]+" replacement="" replace="all"/>
>>   </analyzer>
>> </fieldType>
>>
>>
>> And
>>
>>
>> <field name="text_nospaces" type="nospaces"  indexed="true" stored="true"
>> omitNorms="true" />
>>        <copyField source="text" dest="text_nospaces" />
>>
>>
>>
>> But it is not matching the right terms, even though we strip the spaces and
>> index lowercase values.
>>
>>
>> For example, the indexed value is: East Enders
>>
>> When I search for the text 'east end ers', it returns nothing and
>> says no document found.
>>
>> I realised that Solr runs the QueryParser before passing the query string to
>> the query analyzer defined in the schema.
>>
>> The query parser tokenizes the query string provided in the query, so
>> it sends each token separately to the query analyzer defined in the schema.
>>
>>
>> So is there any way I can bypass this query parser, or use a
>> query parser that treats the entire string as a single phrase?
>>
>> At the moment I am using the dismax query parser.
>>
>> Any suggestion would be much appreciated.
>>
>> Thanks
>> Srinivasa
>>
>>
>>
>> --
>> View this message in context: http://lucene.472066.n3.nabble.com/
>> KeywordTokenizerFactory-trouble-with-exact-matches-tp4114193p4114432.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
