lucene-solr-user mailing list archives

From Arun Rangarajan <arunrangara...@gmail.com>
Subject Re: Special character and wildcard matching
Date Tue, 24 Feb 2015 19:35:39 GMT
Exact query:
/select?q=raw_name:beyonce*&wt=json&fl=raw_name

Response:

{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "fl": "raw_name",
      "q": "raw_name:beyonce*",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      { "raw_name": "beyoncé" },
      { "raw_name": "beyoncé" }
    ]
  }
}
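
For reference, the raw bytes quoted further down in this thread ([62 65 79 6f 6e 63 65 cc 81]) suggest the value is indexed in decomposed form: a plain ASCII 'e' followed by the combining acute accent U+0301. The following is a minimal sketch of that observation, not the actual indexer code. It assumes the value really is decomposed, and the class and method names (NormalizationSketch, utf8Hex, toNfc) are hypothetical. It shows that the decomposed string literally starts with "beyonce", which is why a prefix query such as beyonce* would match it, and that normalizing to NFC (for example in the Java indexer, next to the toLowerCase() call) yields a term that no longer has that prefix.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class NormalizationSketch {

    // Hypothetical helper: space-separated hex dump of a string's UTF-8 bytes.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Hypothetical helper: compose base letter + combining mark into one code point (NFC).
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Assumption: the indexed value is 'e' followed by U+0301 (combining acute accent).
        String decomposed = "beyonce\u0301";
        String composed = toNfc(decomposed);

        // Decomposed form: the ASCII prefix "beyonce" is present, so beyonce* can match.
        System.out.println(utf8Hex(decomposed) + "  startsWith(beyonce)=" + decomposed.startsWith("beyonce"));

        // Composed form: the last character is the single code point U+00E9.
        System.out.println(utf8Hex(composed) + "  startsWith(beyonce)=" + composed.startsWith("beyonce"));
    }
}
```

If the decomposition assumption holds, applying the same normalization to both the indexed value and the user's query text would keep the two sides consistent. Whether that is the right fix depends on how the data reaches the indexer.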



On Tue, Feb 24, 2015 at 11:01 AM, Jack Krupansky <jack.krupansky@gmail.com>
wrote:

> Please post the info I requested - the exact query, and the Solr response.
>
> -- Jack Krupansky
>
> On Tue, Feb 24, 2015 at 12:45 PM, Arun Rangarajan <
> arunrangarajan@gmail.com>
> wrote:
>
> > In our case, the lower-casing is happening in a custom Java indexer code,
> > via Java's String.toLowerCase() method.
> >
> > I used the analysis tool in Solr admin (with Jetty). I believe the raw
> > bytes explain this.
> >
> > Attached are the results for beyonce in file beyonce_no_spl_chars.JPG and
> > beyoncé in file beyonce_with_spl_chars.JPG.
> >
> > Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
> > Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81]
> >
> > So when you look at the bytes, it seems to explain why beyonce* matches
> > beyoncé.
> >
> > I tried your approach with a KeywordTokenizer followed by a
> > LowerCaseFilter, but I see the same behavior.
> >
> >
> >
> > On Mon, Feb 23, 2015 at 5:16 PM, Jack Krupansky <jack.krupansky@gmail.com>
> > wrote:
> >
> >> But how is that lowercasing occurring? I mean, solr.StrField doesn't do
> >> that.
> >>
> >> Some containers default to automatically mapping accented characters, so
> >> that the accented "e" would then get indexed as a normal "e", and then your
> >> wildcard would match it, and an accented "e" in a query would get mapped as
> >> well and then match the normal "e" in the index. What does your query
> >> response look like?
> >>
> >> This blog post explains that problem:
> >> http://bensch.be/tomcat-solr-and-special-characters
> >>
> >> Note that you could make your string field a text field with the keyword
> >> tokenizer and then filter it for lower case, such as when the user query
> >> might have a capital "B". String field is most appropriate when the field
> >> really is 100% raw.
> >>
> >>
> >> -- Jack Krupansky
> >>
> >> On Mon, Feb 23, 2015 at 7:37 PM, Arun Rangarajan <
> >> arunrangarajan@gmail.com>
> >> wrote:
> >>
> >> > Yes, it is a string field and not a text field.
> >> >
> >> > <fieldType name="string" class="solr.StrField" sortMissingLast="true"
> >> > omitNorms="true"/>
> >> > <field name="raw_name" type="string" indexed="true" stored="true" />
> >> >
> >> > The lower-casing is done to allow case-insensitive matching.
> >> >
> >> > On Mon, Feb 23, 2015 at 4:01 PM, Jack Krupansky <jack.krupansky@gmail.com>
> >> > wrote:
> >> >
> >> > > Is it really a string field - as opposed to a text field? Show us the
> >> > > field and field type.
> >> > >
> >> > > Besides, if it really were a "raw" name, wouldn't that be a capital "B"?
> >> > >
> >> > > -- Jack Krupansky
> >> > >
> >> > > On Mon, Feb 23, 2015 at 6:52 PM, Arun Rangarajan <arunrangarajan@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > I have a string field raw_name like this in my document:
> >> > > >
> >> > > > {raw_name: beyoncé}
> >> > > >
> >> > > > (Notice that the last character is a special character.)
> >> > > >
> >> > > > When I issue this wildcard query:
> >> > > >
> >> > > > q=raw_name:beyonce*
> >> > > >
> >> > > > i.e. with the last character simply being the ASCII 'e', Solr returns me
> >> > > > the above document.
> >> > > >
> >> > > > How do I prevent this?
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
