lucene-solr-user mailing list archives

From Ravi Kiran <ravi.bhas...@gmail.com>
Subject Re: Weird Facet and KeywordTokenizerFactory Issue
Date Tue, 06 Oct 2009 21:58:29 GMT
You don't see any facet fields in my query because I have configured them in
solrconfig.xml as defaults for the dismax and standard handlers, so that I
don't have to specify each field individually every time I query. All I need
to do is set facet=true.

  <requestHandler name="dismax" class="solr.SearchHandler" default="true">
    <lst name="defaults">
     <str name="defType">dismax</str>
     <str name="echoParams">explicit</str>
     <float name="tie">0.01</float>
     <str name="qf">
        systemid^20.0 headline^20.0 keyword^18.0 person^18.0
organization^18.0 usstate^18.0 country^18.0 subject^18.0 quote^18.0
blurb^15.0 articlesubhead^8.0 byline^7.0 articleblurb^2.0 body^1.5
multimediablurb^1.5
     </str>
     <str name="pf">
        headline^20.5 keyword^18.5 person^18.5 organization^18.5
usstate^18.5 country^18.5 subject^18.5 quote^18.5 blurb^15.5
articlesubhead^8.5 byline^7.5 articleblurb^2.5 body^2.0 multimediablurb^2.0
     </str>
     <str name="bf">
        recip(rord(pubdatetime),1,1000,1000)^1.0
     </str>
     <str name="fl">
        *
     </str>
     <str name="mm">
        2&lt;-1 5&lt;-3 6&lt;90%
     </str>
     <int name="ps">100</int>
     <str name="q.alt">*:*</str>
     <!-- example highlighter config, enable per-query with hl=true -->
     <str name="hl.fl">keyword</str>
     <!-- for this field, we want no fragmenting, just highlighting -->
     <str name="f.body.hl.fragsize">0</str>
     <!-- instructs Solr to return the field itself if no query terms are found -->
     <str name="f.name.hl.alternateField">keyword</str>
     <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
     <str name="facet">false</str>
     <int name="facet.mincount">1</int>
     <int name="f.keyword.facet.mincount">5</int>
     <int name="f.keywordlower.facet.mincount">5</int>
     <int name="f.keywordformatted.facet.mincount">5</int>
     <int name="f.person.facet.mincount">5</int>
     <int name="f.personformatted.facet.mincount">5</int>
     <int name="f.organization.facet.mincount">5</int>
     <str name="facet.field">contenttype</str>
     <str name="facet.field">keyword</str>
     <str name="facet.field">keywordlower</str>
     <str name="facet.field">keywordformatted</str>
     <str name="facet.field">person</str>
     <str name="facet.field">personformatted</str>
     <str name="facet.field">organization</str>
     <str name="facet.field">usstate</str>
     <str name="facet.field">country</str>
     <str name="facet.field">subject</str>
    </lst>
  </requestHandler>


On Tue, Oct 6, 2009 at 5:45 PM, Christian Zambrano <czambran@gmail.com> wrote:

> I am stumped then. I had a similar issue when I was using a field that was
> being heavily tokenized, but I corrected the issue by using a
> field (generated using copyField) that doesn't get analyzed at all.
>
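
For reference, a minimal sketch of the copyField approach described above. The
field name keyword_facet is hypothetical, and a field type named "string"
(solr.StrField, as in the example schema) is assumed to exist already; adjust
both to your setup:

```xml
<!-- Assumption: a fieldType named "string" backed by solr.StrField is already
     defined in schema.xml. StrField values are indexed verbatim, with no
     tokenizer or filters applied. -->

<!-- Hypothetical field that holds the raw, unanalyzed copy used for faceting. -->
<field name="keyword_facet" type="string" indexed="true" stored="false"
       multiValued="true"/>

<!-- Copies the original input of "keyword" into "keyword_facet" at index time,
     before any analysis runs on either field. -->
<copyField source="keyword" dest="keyword_facet"/>
```

With something like this in place, and after reindexing, faceting would target
the unanalyzed copy with facet.field=keyword_facet instead of
facet.field=keyword.
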
> In the query you provided before, I didn't see the parameters that tell Solr
> which fields it should produce facets for.
>
> Something like:
>
>
> http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&facet.field=location
>
> On 10/06/2009 04:09 PM, Ravi Kiran wrote:
>
>> Yes, exactly the same.
>>
>> On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano <czambran@gmail.com> wrote:
>>
>>> And you had the analyzer for that field set up the same way as shown in
>>> your previous e-mail when you indexed the data?
>>>
>>> On 10/06/2009 03:46 PM, Ravi Kiran wrote:
>>>
>>>> I did in fact check it out, and there is no weirdness on the analysis
>>>> page. See the result below.
>>>>
>>>> Columns: term position | term text | term type | source start,end | payload
>>>>
>>>> Index Analyzer
>>>>
>>>> org.apache.solr.analysis.KeywordTokenizerFactory {}
>>>>   1 | New York | word | 0,8 |
>>>> org.apache.solr.analysis.TrimFilterFactory {}
>>>>   1 | New York | word | 0,8 |
>>>> org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
>>>>   1 | New York | word | 0,8 |
>>>> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
>>>>   1 | New York | word | 0,8 |
>>>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
>>>>   1 | New York | word | 0,8 |
>>>>
>>>> Query Analyzer
>>>>
>>>> org.apache.solr.analysis.KeywordTokenizerFactory {}
>>>>   1 | New York | word | 0,8 |
>>>> org.apache.solr.analysis.TrimFilterFactory {}
>>>>   1 | New York | word | 0,8 |
>>>> org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
>>>>   1 | New York | word | 0,8 |
>>>> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
>>>>   1 | New York | word | 0,8 |
>>>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
>>>>   1 | New York | word | 0,8 |
>>>>
>>>>
>>>> On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano <czambran@gmail.com> wrote:
>>>>
>>>>> Have you tried using the Analysis page to see what tokens are generated
>>>>> for the string "New York"? It could be that one of the token filters is
>>>>> adding the token 'new' for all strings that start with 'new'.
>>>>>
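
One way to test that hypothesis, sketched under assumptions: copy the values
into a temporary field whose type keeps only the tokenizer and the trim filter,
reindex a sample, and facet on it. The names keywordTextMinimal and
keyword_debug are hypothetical:

```xml
<!-- Hypothetical stripped-down type: the whole input becomes one token, which
     is then trimmed. No stop, synonym, or remove-duplicates filters. A single
     <analyzer> without a type attribute applies at both index and query time. -->
<fieldType name="keywordTextMinimal" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical debug field fed from the existing "keyword" field. -->
<field name="keyword_debug" type="keywordTextMinimal" indexed="true"
       stored="false" multiValued="true"/>
<copyField source="keyword" dest="keyword_debug"/>
```

If the ghost values do not appear when faceting on keyword_debug, adding the
stop, synonym, and remove-duplicates filters back one at a time should show
which one introduces them. If they do appear, the extra values are likely
already present in the data being indexed.
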
>>>>>
>>>>> On 10/06/2009 02:54 PM, Ravi Kiran wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Hello All,
>>>>>> I am getting some ghost facets in Solr 1.4. Can anybody kindly help me
>>>>>> understand why I get them and how to eliminate them? My schema.xml
>>>>>> snippet is given at the end. I am indexing named entities extracted via
>>>>>> OpenNLP into Solr. My understanding of KeywordTokenizerFactory is that
>>>>>> it treats the entire input as a single token, am I right? For example,
>>>>>> "New York" will be indexed as 'New York' and will not be split, right?
>>>>>> However, I see them split up in facets as follows when running the query
>>>>>>
>>>>>> http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1
>>>>>>
>>>>>> but when I search with the standard handler, qt=standard&q=keyword:"New",
>>>>>> I don't find any doc which has just "New". After digging in a bit, I found
>>>>>> that if several keywords have a common starting word, it is being pulled
>>>>>> out as another facet, like the following. Any help is greatly appreciated.
>>>>>>
>>>>>> Result
>>>>>> ------------
>>>>>> <int name="New">47</int>       -------->    Ghost
>>>>>> <int name="New Hampshire">7</int>
>>>>>> <int name="New Jersey">16</int>
>>>>>> <int name="New Orleans">10</int>
>>>>>> <int name="New York">147</int>
>>>>>> <int name="New York City">23</int>
>>>>>> <int name="New York Giants">8</int>
>>>>>> <int name="New York Islanders">5</int>
>>>>>> <int name="New York Mercantile Exchange">6</int>
>>>>>> <int name="New York Mets">8</int>
>>>>>> <int name="New York Stock Exchange">10</int>
>>>>>> <int name="New York Times">8</int>
>>>>>> <int name="New York University">5</int>
>>>>>> <int name="New Zealand">7</int>
>>>>>>
>>>>>> <int name="Energy">7</int>       -------------->  Ghost
>>>>>> <int name="Energy Department">5</int>
>>>>>> <int name="Energy Information Administration">5</int>
>>>>>>
>>>>>>
>>>>>> <int name="Federal">7</int>     -------------->  Ghost
>>>>>> <int name="Federal Deposit Insurance Corp.">6</int>
>>>>>> <int name="Federal Reserve">26</int>
>>>>>> <int name="Federal Reserve Chairman">6</int>
>>>>>>
>>>>>> <int name="North">27</int>
>>>>>> <int name="North Carolina">8</int>
>>>>>> <int name="North Dakota">7</int>
>>>>>> <int name="North Korea">12</int>
>>>>>>
>>>>>> Schema.xml
>>>>>> -----------------
>>>>>>
>>>>>>     <fieldType name="keywordText" class="solr.TextField"
>>>>>> sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>>>>>>       <analyzer type="index">
>>>>>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>>>>         <filter class="solr.TrimFilterFactory" />
>>>>>>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>> words="stopwords.txt,entity-stopwords.txt"
>>>>>> enablePositionIncrements="true"/>
>>>>>>
>>>>>>         <filter class="solr.SynonymFilterFactory"
>>>>>> synonyms="synonyms.txt"
>>>>>> ignoreCase="true" expand="false" />
>>>>>>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>>>       </analyzer>
>>>>>>       <analyzer type="query">
>>>>>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>>>>         <filter class="solr.TrimFilterFactory" />
>>>>>>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>> words="stopwords.txt,entity-stopwords.txt"
>>>>>> enablePositionIncrements="true"
>>>>>> />
>>>>>>         <filter class="solr.SynonymFilterFactory"
>>>>>> synonyms="synonyms.txt"
>>>>>> ignoreCase="true" expand="false" />
>>>>>>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>>>       </analyzer>
>>>>>>     </fieldType>
>>>>>>
>>>>>>     <field name="person" type="keywordText" indexed="true"
>>>>>> stored="true"
>>>>>> multiValued="true" termVectors="false" termPositions="false"
>>>>>> termOffsets="false"/>
>>>>>>     <field name="organization" type="keywordText" indexed="true"
>>>>>> stored="true" multiValued="true" termVectors="false"
>>>>>> termPositions="false"
>>>>>> termOffsets="false"/>
>>>>>>     <field name="location" type="keywordText" indexed="true"
>>>>>> stored="true"
>>>>>> multiValued="true" termVectors="false" termPositions="false"
>>>>>> termOffsets="false"/>
>>>>>>     <field name="keyword" type="keywordText" indexed="true"
>>>>>> stored="true"
>>>>>> multiValued="true" termVectors="false" termPositions="false"
>>>>>> termOffsets="false"/>
