lucene-solr-user mailing list archives

From Romani Rupasinghe <romrom...@gmail.com>
Subject KeywordTokenizerFactory splits the string for the exclamation mark
Date Tue, 13 May 2014 15:14:54 GMT
Hi All

I have the following field settings in my Solr schema:

<field name="Exact_Word" omitPositions="true" termVectors="false"
omitTermFreqAndPositions="true" compressed="true" type="string_ci"
multiValued="false" indexed="true" stored="true" required="false"
omitNorms="true"/>

<field name="Word" compressed="true" type="email_text_ptn"
multiValued="false" indexed="true" stored="true" required="false"
omitNorms="true"/>

<fieldtype name="string_ci" class="solr.TextField" sortMissingLast="true"
omitNorms="true"><analyzer><tokenizer
class="solr.KeywordTokenizerFactory"/><filter
class="solr.LowerCaseFilterFactory"/></analyzer></fieldtype>

<copyField source="Word" dest="Exact_Word"/>

As you can see, Exact_Word uses the KeywordTokenizerFactory, which should
treat the string as it is.

The following is my responseHeader. As you can see, I am searching for my
string only in the field Exact_Word and expecting it to return the Word field
and the score:

"responseHeader":{
    "status":0,
    "QTime":14,
    "params":{
      "explainOther":"",
      "fl":"Word,score",
      "debugQuery":"on",
      "indent":"on",
      "start":"0",
      "q":"d!sdasdsdwasd!asd@dsadsadas.edu",
      "qf":"Exact_Word",
      "wt":"json",
      "fq":"",
      "version":"2.2",
      "rows":"10"}},


But when I enter an email with the string
"d!sdasdsdwasdasd@dsadsadas.edu", the string is split into two. I was under
the impression that KeywordTokenizerFactory would treat the string as it is.

The following is the query debug result, where you can see that the word has been split:
 "parsedquery":"+((DisjunctionMaxQuery((Exact_Word:d))
-DisjunctionMaxQuery((Exact_Word:sdasdsdwasdasd@dsadsadas.edu)))~1)",

Can someone please tell me why it produces this parsed query?

If I use a string without the "!" sign, the parsed query is as follows:
 "parsedquery":"+DisjunctionMaxQuery((Exact_Word:d_sdasdsdwasd_asd@dsadsadas.edu))",
This is what I expected Solr to produce even with the "!" mark. With the "_"
character it does not split the string and treats it as it is.
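
One possible reading of the debug output, offered as an assumption and not as something confirmed in this message: the "-" in front of the second clause suggests the query parser is interpreting "!" as its NOT operator before the field's analyzer ever sees the text, so the KeywordTokenizerFactory may not be involved in the split at all. Under that assumption, a minimal client-side sketch would backslash-escape query-syntax characters before sending the query. The helper name `escape_query_term` and the character set below are illustrative, not part of Solr:

```python
# Characters that Lucene/Solr query syntax treats as operators or delimiters.
# This set is an illustrative assumption; check it against your parser's docs.
QUERY_SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')


def escape_query_term(term: str) -> str:
    """Backslash-escape query-syntax characters and whitespace in `term`.

    The goal is for the whole string to reach the field analyzer as one
    literal term instead of being split on operators such as '!'.
    """
    return "".join(
        "\\" + ch if ch in QUERY_SPECIAL_CHARS or ch.isspace() else ch
        for ch in term
    )


if __name__ == "__main__":
    raw = "d!sdasdsdwasd!asd@dsadsadas.edu"
    # Each '!' is prefixed with a backslash; '@', '.' and '_' are left alone.
    print(escape_query_term(raw))
```

The escaped string would then be sent as the "q" parameter in place of the raw one. A string that contains only "_" as a separator passes through unchanged, which matches the behavior described above.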

I thought if the KeywordTokenizerFactory is applied then it should return
the exact string as it is

Please help me understand what is going wrong here.
