lucene-solr-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: KeywordTokenizerFactory splits the string for the exclamation mark
Date Wed, 14 May 2014 11:31:14 GMT
The exclamation point is the shortcut for the "NOT" operator. See the minus in
front of the second generated term?

You need to escape it, either with a backslash or by enclosing the full term in
quotes. Or use the term query parser.

Here's a list of the special characters for the query parser:
http://lucene.apache.org/core/4_8_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Escaping_Special_Characters
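
A minimal sketch of the three options in Python, in case it helps. This is not an official client API: the helper names are hypothetical, and the special-character set is copied from the list linked above. Escaping a single & or | is a conservative assumption on my part (the parser only treats && and || as operators).

```python
# Special characters of the classic Lucene query parser (from the linked list).
# '&' and '|' are handled separately below, since only '&&' and '||' are operators.
_SPECIAL = r'+-!(){}[]^"~*?:\/'


def escape_query_chars(text):
    """Option 1: backslash-escape every special character in the text."""
    out = []
    for ch in text:
        if ch in _SPECIAL or ch in "&|":
            out.append("\\")
        out.append(ch)
    return "".join(out)


def quote_phrase(text):
    """Option 2: enclose the full term in double quotes.

    Backslashes and embedded quotes still need escaping inside the quotes.
    """
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def term_parser_query(field, text):
    """Option 3: build a q value that uses the term query parser via local params."""
    return "{!term f=%s}%s" % (field, text)


if __name__ == "__main__":
    email = "d!sdasdsdwasd!asd@dsadsadas.edu"
    print(escape_query_chars(email))
    print(quote_phrase(email))
    print(term_parser_query("Exact_Word", email))
```

With the query string from the original message, the first helper produces d\!sdasdsdwasd\!asd@dsadsadas.edu, which can be sent as the q parameter. As I understand it, the term query parser does not run the field's analyzer, so with a lowercasing field type the value may need to be lowercased before it is sent.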

-- Jack Krupansky

-----Original Message----- 
From: Romani Rupasinghe
Sent: Tuesday, May 13, 2014 11:14 AM
To: solr-user@lucene.apache.org
Subject: KeywordTokenizerFactory splits the string for the exclamation mark

Hi All

I have the following field settings in my Solr schema:

<field name="Exact_Word" omitPositions="true" termVectors="false"
omitTermFreqAndPositions="true" compressed="true" type="string_ci"
multiValued="false" indexed="true" stored="true" required="false"
omitNorms="true"/>

<field name="Word" compressed="true" type="email_text_ptn"
multiValued="false" indexed="true" stored="true" required="false"
omitNorms="true"/>

<fieldtype name="string_ci" class="solr.TextField" sortMissingLast="true"
           omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

<copyField source="Word" dest="Exact_Word"/>

As you can see, Exact_Word uses the KeywordTokenizerFactory, which should
treat the string as it is.

Following is my responseHeader. As you can see, I am searching for my string
only in the field Exact_Word and expecting it to return the Word field and
the score.

"responseHeader":{
    "status":0,
    "QTime":14,
    "params":{
      "explainOther":"",
      "fl":"Word,score",
      "debugQuery":"on",
      "indent":"on",
      "start":"0",
      "q":"d!sdasdsdwasd!asd@dsadsadas.edu",
      "qf":"Exact_Word",
      "wt":"json",
      "fq":"",
      "version":"2.2",
      "rows":"10"}},


But when I enter an email with the following string "d!
sdasdsdwasdasd@dsadsadas.edu", it splits the string in two. I was under the
impression that KeywordTokenizerFactory would treat the string as it is.

Following is the query debug result. There you can see it has split the word:
"parsedquery":"+((DisjunctionMaxQuery((Exact_Word:d))
-DisjunctionMaxQuery((Exact_Word:sdasdsdwasdasd@dsadsadas.edu)))~1)",

Can someone please tell me why it produces this query result?

If I enter a string with "_" in place of the "!" sign, the parsed query is
as below:
"parsedquery":"+DisjunctionMaxQuery((Exact_Word:d_sdasdsdwasd_asd@dsadsadas.edu))",
This is what I expected Solr to do even with the "!" mark. With the "_"
character it does not split the string and treats it as it is.

I thought that if the KeywordTokenizerFactory is applied, it should return
the exact string as it is.

Please help me understand what is going wrong here.