lucene-solr-user mailing list archives

From Steven White <swhite4...@gmail.com>
Subject Re: Getting a hit on "the}" but not on "the" or "}"
Date Tue, 05 Jul 2016 23:07:48 GMT
Thanks for the quick reply, Erick.

Here is the analyzer I'm using:

  <fieldType name="all_raw_text" class="solr.TextField"
             positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt"
              ignoreCase="true"/>
      <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
              generateNumberParts="1" splitOnCaseChange="0" catenateWords="1"
              splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1"
              catenateAll="1" catenateNumbers="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory"
              protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>
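
One reading that fits the debugQuery output quoted further down is that the order of the filters matters: the stop filter sees the whole whitespace-delimited token "the}" before the word delimiter filter splits it. The sketch below is a deliberately simplified Python model of the first stages of the chain (whitespace tokenizer, stop filter, word delimiter with preserveOriginal, lowercase). It is not Solr's implementation; the `STOPWORDS` set is a small illustrative subset rather than the real `lang/stopwords_en.txt`, and the function names are hypothetical.

```python
import re

# Illustrative subset only; the real list lives in lang/stopwords_en.txt.
STOPWORDS = {"the", "a", "an"}


def whitespace_tokenize(text):
    """Split on whitespace only, so punctuation stays attached to words."""
    return text.split()


def stop_filter(tokens):
    """Drop a token only if the WHOLE token is a stopword (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def word_delimiter(tokens):
    """Rough stand-in for a word delimiter filter with preserveOriginal=1:
    keep the original token and also emit its alphanumeric runs."""
    out = []
    for token in tokens:
        out.append(token)
        for part in re.findall(r"[A-Za-z0-9]+", token):
            if part != token:
                out.append(part)
    return out


def analyze(text):
    """Apply the stages in the same order as the analyzer definition."""
    tokens = whitespace_tokenize(text)
    tokens = stop_filter(tokens)
    tokens = word_delimiter(tokens)
    return [t.lower() for t in tokens]


print(analyze("the}"))  # ['the}', 'the']
print(analyze("the"))   # []
print(analyze("}"))     # ['}']
```

In this model "the}" survives the stop filter because it is not equal to "the", and only afterwards is "the" split off. That mirrors the `ALL_FIELDS:the} ALL_FIELDS:the` terms in the parsed query, the empty query for "the", and the single `ALL_FIELDS:}` term for "}". The Analysis screen in the Solr Admin UI shows the real per-filter output for a given field type.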

If my analyzer is in fact the cause, which part of it is responsible?  If
not, I'm not clear on the "TermsComponent" you suggested I look into.  How
do I "point" it at my field?  I have no experience with it.  Is this
something I do from Solr's Admin Console via the Schema Browser link?
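
For reference, the TermsComponent is normally queried over HTTP rather than through the Schema Browser. The following is a minimal sketch of building such a request URL in Python. It assumes a `/terms` request handler is registered in solrconfig.xml (as in Solr's sample configuration) and uses a hypothetical host and core name (`localhost:8983`, `mycore`); `ALL_FIELDS` is the field name that appears in the debugQuery output. The `terms.fl`, `terms.prefix` and `terms.limit` parameters are the ones described on the Terms Component reference page.

```python
from urllib.parse import urlencode


def terms_url(base_url, field, prefix, limit=20):
    """Build a TermsComponent request URL listing indexed terms in `field`
    that start with `prefix`."""
    params = {
        "terms": "true",         # enable the component
        "terms.fl": field,       # field whose indexed terms to list
        "terms.prefix": prefix,  # only return terms starting with this
        "terms.limit": limit,    # maximum number of terms to return
        "wt": "json",
    }
    return f"{base_url}/terms?{urlencode(params)}"


# Hypothetical host and core name; adjust to the actual installation.
url = terms_url("http://localhost:8983/solr/mycore", "ALL_FIELDS", "the")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should list the indexed terms beginning with "the" along with their document frequencies, which shows whether tokens such as "the}" or "the]" are present in the index.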

Steve


On Tue, Jul 5, 2016 at 6:51 PM, Erick Erickson <erickerickson@gmail.com> wrote:

> My guess is that your field analysis isn't stripping the various
> non-alphanumeric characters, thus "the]" is actually a token in your
> index, square bracket and all. If that's true, it certainly doesn't
> match the stopword "the".
>
> You can check by using the TermsComponent, pointing it at your field
> and setting terms.prefix=the
>
> See:
> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
>
> Best,
> Erick
>
> On Tue, Jul 5, 2016 at 2:34 PM, Steven White <swhite4141@gmail.com> wrote:
> > Hi everyone,
> >
> > I'm trying to understand why I get a hit when I search for "the}" but not
> > when I search for "the" (searches are done without the quotes, and "the"
> > is a stopword in my case).
> >
> > Here is the debugQuery output using "the}":
> >   "debug": {
> >     "rawquerystring": "the}",
> >     "querystring": "the}",
> >     "parsedquery": "(+DisjunctionMaxQuery(((ALL_FIELDS:the} ALL_FIELDS:the))~1.0))/no_coord",
> >     "parsedquery_toString": "+((ALL_FIELDS:the} ALL_FIELDS:the))~1.0",
> >     "explain": {
> >       "-1.5.1804": "\n0.14220011 = sum of:\n  0.14220011 = weight(ALL_FIELDS:the in 0) [DefaultSimilarity], result of:\n    0.14220011 = score(doc=0,freq=2.0), product of:\n      0.51863563 = queryWeight, product of:\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n        0.20899205 = queryNorm\n      0.27418116 = fieldWeight in 0, product of:\n        1.4142135 = tf(freq=2.0), with freq of:\n          2.0 = termFreq=2.0\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n        0.078125 = fieldNorm(doc=0)\n",
> >       "-1.5.3552": "\n0.14220011 = sum of:\n  0.14220011 = weight(ALL_FIELDS:the in 0) [DefaultSimilarity], result of:\n    0.14220011 = score(doc=0,freq=2.0), product of:\n      0.51863563 = queryWeight, product of:\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n        0.20899205 = queryNorm\n      0.27418116 = fieldWeight in 0, product of:\n        1.4142135 = tf(freq=2.0), with freq of:\n          2.0 = termFreq=2.0\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n        0.078125 = fieldNorm(doc=0)\n",
> >       "-1.5.3554": "\n0.14220011 = sum of:\n  0.14220011 = weight(ALL_FIELDS:the in 1) [DefaultSimilarity], result of:\n    0.14220011 = score(doc=1,freq=2.0), product of:\n      0.51863563 = queryWeight, product of:\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n        0.20899205 = queryNorm\n      0.27418116 = fieldWeight in 1, product of:\n        1.4142135 = tf(freq=2.0), with freq of:\n          2.0 = termFreq=2.0\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n        0.078125 = fieldNorm(doc=1)\n",
> >       "-1.5.1802": "\n0.1137601 = sum of:\n  0.1137601 = weight(ALL_FIELDS:the in 0) [DefaultSimilarity], result of:\n    0.1137601 = score(doc=0,freq=2.0), product of:\n      0.51863563 = queryWeight, product of:\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n        0.20899205 = queryNorm\n      0.21934493 = fieldWeight in 0, product of:\n        1.4142135 = tf(freq=2.0), with freq of:\n          2.0 = termFreq=2.0\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n        0.0625 = fieldNorm(doc=0)\n"
> >     },
> >     "QParser": "ExtendedDismaxQParser",
> >     "altquerystring": null,
> >     "boost_queries": null,
> >     "parsed_boost_queries": [],
> >     "boostfuncs": null,
> >     "filter_queries": [
> >       "ISBN_GROUP_ID:2"
> >     ],
> >     "parsed_filter_queries": [
> >       "ISBN_GROUP_ID:2"
> >     ],
> >
> > Here is the debugQuery output using "the":
> >   "debug": {
> >     "rawquerystring": "the",
> >     "querystring": "the",
> >     "parsedquery": "(+())/no_coord",
> >     "parsedquery_toString": "+()",
> >     "explain": {},
> >     "QParser": "ExtendedDismaxQParser",
> >     "altquerystring": null,
> >     "boost_queries": null,
> >     "parsed_boost_queries": [],
> >     "boostfuncs": null,
> >     "filter_queries": [
> >       "ISBN_GROUP_ID:2"
> >     ],
> >     "parsed_filter_queries": [
> >       "ISBN_GROUP_ID:2"
> >     ],
> >
> > As expected, I get no hits when I search for just "}":
> >   "debug": {
> >     "rawquerystring": "}",
> >     "querystring": "}",
> >     "parsedquery": "(+DisjunctionMaxQuery((ALL_FIELDS:})~1.0))/no_coord",
> >     "parsedquery_toString": "+(ALL_FIELDS:})~1.0",
> >     "explain": {},
> >     "QParser": "ExtendedDismaxQParser",
> >     "altquerystring": null,
> >     "boost_queries": null,
> >     "parsed_boost_queries": [],
> >     "boostfuncs": null,
> >     "filter_queries": [
> >       "ISBN_GROUP_ID:2"
> >     ],
> >     "parsed_filter_queries": [
> >       "ISBN_GROUP_ID:2"
> >     ],
> >
> > In case it matters, I'm also getting a hit when I search for "the." or
> > "the]" or "the/" or "the," or "the=" etc.
> >
> > Thanks in advance.
> >
> > Steve
>
