lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Custom analyzer & frequency
Date Tue, 21 Nov 2017 17:00:21 GMT
One thing you might do is use the termfreq function to see what it
looks like in the index. Also, the schema/analysis page will put terms
in "buckets" by power-of-2, so that might help too.

Best,
Erick
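
As an illustration of the termfreq suggestion above, here is a minimal sketch that builds a query URL returning the per-document term frequency as a pseudo-field. The host, core, field, and docid come from the thread; the use of the /select handler, the alias name tf, and the helper name termfreq_query_url are assumptions made for this example. The term passed to termfreq has to match the indexed (analyzed) form of the word.

```python
from urllib.parse import urlencode

# Host and core name as they appear in the thread.
SOLR_BASE = "http://localhost:8983/solr/alian_test"


def termfreq_query_url(field, term, doc_id, base=SOLR_BASE):
    """Build a /select URL asking Solr for termfreq(field, term) on one document.

    The helper name and the "tf" alias are hypothetical; termfreq() itself is a
    standard Solr function query.
    """
    params = {
        "q": f"docid:{doc_id}",                       # restrict to the test document
        "fl": f"docid,tf:termfreq({field},'{term}')",  # return the function value as "tf"
        "wt": "json",
    }
    return f"{base}/select?{urlencode(params)}"


url = termfreq_query_url("titi_txt_fr", "toto", 666)
print(url)
```

Fetching the printed URL with curl against a running Solr instance should return the document with a tf value reflecting how often the term occurs in that document, rather than the number of documents containing it.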

On Tue, Nov 21, 2017 at 7:55 AM, Barbet Alain <alian123soleil@gmail.com> wrote:
> You rock, thank you so much for this clear answer. I lost 2 days for
> nothing, as I already had the term freq, but now I have an answer :-)
> (And yes, I checked: it's the doc freq, not the term freq).
>
> Thank you again !
>
> 2017-11-21 16:34 GMT+01:00 Emir Arnautović <emir.arnautovic@sematext.com>:
>> Hi Alain,
>> As explained in the previous mail, that is the doc frequency, and each doc is counted once.
>> I am not sure whether Luke can give you the overall term frequency, i.e. the sum of the
>> term frequencies across all docs.
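
To make the distinction concrete, here is a small sketch that counts both quantities for the test document from this thread. It uses plain whitespace splitting as a stand-in for the real analyzer, which is an assumption; the function name term_and_doc_freq is hypothetical.

```python
from collections import Counter

# The single test document posted in the thread, keyed by its docid.
docs = {"666": "toto titi tata toto tutu titi"}


def term_and_doc_freq(docs):
    """Return (per-document term frequencies, document frequencies).

    Whitespace tokenization is an assumption standing in for the Solr analyzer.
    """
    tf = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    df = Counter()
    for counts in tf.values():
        # Each document contributes at most 1 per distinct term.
        df.update(counts.keys())
    return tf, df


tf, df = term_and_doc_freq(docs)
print(dict(tf["666"]))  # {'toto': 2, 'titi': 2, 'tata': 1, 'tutu': 1}
print(dict(df))         # {'toto': 1, 'titi': 1, 'tata': 1, 'tutu': 1}
```

With only one document indexed, every document frequency is 1, which matches what the terms component and Luke report.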
>>
>> Regards,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>>> On 21 Nov 2017, at 16:30, Barbet Alain <alian123soleil@gmail.com> wrote:
>>>
>>> $ cat add_test.sh
>>> DATA='
>>> <add>
>>>  <doc>
>>>    <field name="docid">666</field>
>>>    <field name="titi_txt_fr">toto titi tata toto tutu titi</field>
>>>  </doc>
>>> </add>
>>> '
>>> $ sh add_test.sh
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <response>
>>> <lst name="responseHeader"><int name="status">0</int><int name="QTime">484</int></lst>
>>> </response>
>>>
>>>
>>> $ curl 'http://localhost:8983/solr/alian_test/terms?terms.fl=titi_txt_fr&terms.sort=index'
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <response>
>>> <lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int></lst>
>>> <lst name="terms"><lst name="titi_txt_fr">
>>>   <int name="tata">1</int>
>>>   <int name="titi">1</int>
>>>   <int name="toto">1</int>
>>>   <int name="tutu">1</int>
>>> </lst></lst>
>>> </response>
>>>
>>>
>>> So it's not only on the Luke side; it comes from Solr. Does that sound normal?
>>>
>>> 2017-11-21 11:43 GMT+01:00 Barbet Alain <alian123soleil@gmail.com>:
>>>> Hi,
>>>>
>>>> I built a custom analyzer & set it up in Solr, but it doesn't work as I expect.
>>>> I always get 1 as the frequency for each word, even if it's present
>>>> multiple times in the text.
>>>>
>>>> So I tried with the default analyzer & found the same behavior.
>>>> My schema:
>>>>
>>>>  <fieldType name="text_ami" class="solr.TextField">
>>>>    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
>>>>  </fieldType>
>>>>  <field name="docid" type="string" indexed="true" required="true" stored="true"/>
>>>>  <field name="test_text" type="nametext"/>
>>>>
>>>> alian@yoda:~/solr> cat add_test.sh
>>>> DATA='
>>>> <add>
>>>>  <doc>
>>>>    <field name="docid">666</field>
>>>>    <field name="test_text">toto titi tata toto tutu titi</field>
>>>>  </doc>
>>>> </add>
>>>> '
>>>> curl -X POST -H 'Content-Type: text/xml' \
>>>>   'http://localhost:8983/solr/alian_test/update?commit=true' \
>>>>   --data-binary "$DATA"
>>>>
>>>> When I test in the Solr interface / analysis, I see the right behavior
>>>> (titi & toto are found 2 times).
>>>> But when I look at the Solr index with Luke or the Solr interface / schema,
>>>> the top terms always get 1 as frequency. Can someone tell me what
>>>> I'm missing?
>>>>
>>>> (solr 6.5)
>>>>
>>>> Thank you !
>>
