lucene-solr-user mailing list archives

From Guilherme Viteri <gvit...@ebi.ac.uk>
Subject Re: When search term has two stopwords ('and' and 'a') together, it doesn't work
Date Thu, 07 Nov 2019 12:56:21 GMT
Hi Paras, everyone

Thank you again for your input and suggestions. I am sorry to hear you had trouble with the
attachments; I will host them somewhere and share the links.
I don't tweak my index. I get the data from the graph database, create the documents as they
are, and save them to Solr.

So, I am sending the new analysis screen, queried the way you suggested, along with the results
with params and the Solr query URL.
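
For readers without access to the screenshots, here is a rough sketch of what the two analysis chains from the schema do to the index and query text. It is only an approximation: the regular expression standing in for StandardTokenizer, the four-word stopword excerpt, and the function names are assumptions for illustration, and token positions are not modelled.

```python
import re

# Assumption: a small excerpt of stopwords.txt; the real file is longer.
STOPWORDS = {"a", "an", "and", "are"}


def apply_filters(tokens, min_len=2, max_len=20):
    """Mimic LengthFilter -> LowerCaseFilter -> StopFilter, in schema order."""
    kept = [t for t in tokens if min_len <= len(t) <= max_len]
    lowered = [t.lower() for t in kept]
    return [t for t in lowered if t not in STOPWORDS]


def index_tokens(text):
    # Rough stand-in for StandardTokenizer: runs of ASCII letters and digits.
    return apply_filters(re.findall(r"[A-Za-z0-9]+", text))


def query_tokens(text):
    # PatternTokenizer splits on every character matching [^a-zA-Z0-9/._:].
    pieces = re.split(r"[^a-zA-Z0-9/._:]", text)
    cleaned = []
    for piece in pieces:
        piece = re.sub(r"^[/._:]+", "", piece)  # first PatternReplaceFilter
        piece = re.sub(r"[/._:]+$", "", piece)  # second PatternReplaceFilter
        piece = re.sub(r"[_]", " ", piece)      # third PatternReplaceFilter
        if piece:
            cleaned.append(piece)
    return apply_filters(cleaned)


if __name__ == "__main__":
    doc = "Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell"
    print(index_tokens(doc))
    print(query_tokens("lymphoid and a non-lymphoid cell"))
```

In this simplified model the single-letter "a" is already dropped by the length filter (min=2) on both sides, before the stop filter runs.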

While running the queries you asked for, I found something really weird (at least for
me). By accident, I ended up querying with the default handler (/select) and it worked.
If I use the handler I must use, then sadly it doesn't work. I am posting both results and
will post the handlers as well.
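
To compare the two handlers side by side, something like the sketch below could build the same request for each one with debug=query and echoParams=all, as suggested earlier in the thread. The base URL, the core name "mycore" and the handler path "/search" are placeholders, not my real setup; only /select is the actual default handler.

```python
from urllib.parse import urlencode

# Assumption: hypothetical host and core name, for illustration only.
BASE = "http://localhost:8983/solr/mycore"


def debug_url(handler, query, base=BASE):
    """Build a Solr request URL that echoes all params and the parsed query."""
    params = {
        "q": query,
        "debug": "query",     # show the parsed query, skip score explanations
        "echoParams": "all",  # include handler defaults in the response header
        "wt": "json",
    }
    return f"{base}{handler}?{urlencode(params)}"


if __name__ == "__main__":
    term = "lymphoid and a non-lymphoid cell"
    # "/search" is a hypothetical name standing in for the custom handler.
    for handler in ("/select", "/search"):
        print(debug_url(handler, term))
```

Diffing the "parsedquery" and "params" sections of the two responses should show which handler defaults differ.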

Here is the link with all the files mentioned before
https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
If the link doesn't work www dot dropbox dot com slash sh slash fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a
? dl equals 0

Thanks

> On 7 Nov 2019, at 05:23, Paras Lehana <paras.lehana@indiamart.com> wrote:
> 
> Hi Guilherme.
> 
> I am sending the analysis result and the json result as requested.
> 
> 
> Thanks for the effort. Luckily, I can see your attachments (low quality
> though).
> 
> From the analysis screen, the analysis is working as expected. One of the
> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
> document containing "Lymphoid and a non-Lymphoid cell" I can initially
> think of is: the stopword "a" is probably present in post-analysis either
> of query or index. Did you tweak your index time analysis after indexing?
> 
> Do two things:
> 
>   1. Post the analysis screen for index=*"Immunoregulatory
>   interactions between a Lymphoid and a non-Lymphoid cell"* and
>   query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and
>   providing the link here.
>   2. Give the same JSON output as you have sent but this time with
>   *"echoParams=all"*. Also, post the exact Solr query url.
> 
> 
> 
> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <erickerickson@gmail.com> wrote:
> 
>> I don’t see the attachments, maybe I deleted old e-mails or some such. The
>> Apache server is fairly aggressive about stripping attachments though, so
>> it’s also possible they didn’t make it through.
>> 
>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <gviteri@ebi.ac.uk> wrote:
>>> 
>>> Thanks Erick.
>>> 
>>>> First, your index and analysis chains are considerably different, this
>>>> can easily be a source of problems. In particular, using two different
>>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>>> you’re totally sure you understand the consequences. Additionally, your use
>>>> of the length filter is suspicious, especially since your problem statement
>>>> is about the addition of a single letter term and the min length allowed on
>>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>>> filtered out in both cases, but maybe you’ve found something odd about the
>>>> interactions.
>>> I will investigate the min length and post the results later.
>>> 
>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>>> Used by custom code?
>>> This is the URL in my application, not Solr params. That's the query string.
>>> 
>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>>> all the params with an equal-sign are totally ignored unless it’s just a
>>>> typo.
>>> This is part of the application. Species will be used later on in Solr
>>> to filter the results. That's not Solr; those are my app's params.
>>> 
>>>> Third, the easiest way to see what’s happening under the covers is to
>>>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>>>> relevance calculations for the nonce, or specify “&debug=query” to skip
>>>> that part.
>>> The two JSON files I sent were run with debugQuery=on, and the explain tag
>>> is present.
>>> I will try searching the way you mentioned.
>>> 
>>> Thanks for your input.
>>> 
>>> Guilherme
>>> 
>>>> On 6 Nov 2019, at 14:14, Erick Erickson <erickerickson@gmail.com> wrote:
>>>> 
>>>> Fwd to another server
>>>> 
>>>> First, your index and analysis chains are considerably different, this
>>>> can easily be a source of problems. In particular, using two different
>>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>>> you’re totally sure you understand the consequences. Additionally, your use
>>>> of the length filter is suspicious, especially since your problem statement
>>>> is about the addition of a single letter term and the min length allowed on
>>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>>> filtered out in both cases, but maybe you’ve found something odd about the
>>>> interactions.
>>>> 
>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>>> Used by custom code?
>>>> 
>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>> 
>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>>> all the params with an equal-sign are totally ignored unless it’s just a
>>>> typo.
>>>> 
>>>> Third, the easiest way to see what’s happening under the covers is to
>>>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>>>> relevance calculations for the nonce, or specify “&debug=query” to skip
>>>> that part.
>>>> 
>>>> 90% + of the time, the question “why didn’t this query do what I
>>>> expect” is answered by looking at the “&debug=query” output and the
>>>> analysis page in the admin UI. NOTE: for the analysis page be sure to look
>>>> at _both_ the query and index output. Also, and very important about the
>>>> analysis page (and this is confusing) is that this _assumes_ that what you
>>>> put in the text boxes have made it through the query parser intact and is
>>>> analyzed by the field selected. Consider the search "q=field:word1 word2".
>>>> Now you type “word1 word2” into the analysis text box and it looks like
>>>> what you expect. That’s misleading because the query is _parsed_ as
>>>> "field:word1 default_search_field:word2”. This is where “&debug=query”
>>>> helps.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <paras.lehana@indiamart.com> wrote:
>>>>> 
>>>>> Hi Walter,
>>>>> 
>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>>>> will not be in the index, so they can never match a query.
>>>>> 
>>>>> 
>>>>> I think the OP's concern is different results when adding a stopword. I
>>>>> think he's using the filter factory correctly - the query chain includes
>>>>> the filter as well so it should remove "a" while querying.
>>>>> 
>>>>> *@Guilherme*, please post the results for both queries, the document in
>>>>> the results that you are concerned about, and the full output of the
>>>>> analysis screen (for both query and index).
>>>>> 
>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <wunder@wunderwood.org> wrote:
>>>>> 
>>>>>> No.
>>>>>> 
>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>>>> will not be in the index, so they can never match a query.
>>>>>> 
>>>>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>>>>> schema.xml.
>>>>>> 2. Reload the collection, restart Solr, or whatever to read the new config.
>>>>>> 3. Reindex all of the documents.
>>>>>> 
>>>>>> When indexed with the new analysis chain, the stopwords will not be
>>>>>> removed and they will be searchable.
>>>>>> 
>>>>>> wunder
>>>>>> Walter Underwood
>>>>>> wunder@wunderwood.org
>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>> 
>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <gviteri@ebi.ac.uk> wrote:
>>>>>>> 
>>>>>>> Ok. I am kind of lost now.
>>>>>>> If I open up the console > analysis and perform it, that's the final
>>>>>>> result.
>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>>>> 
>>>>>>> Your suggestion is: get rid of the <filter stopword.txt> in the
>>>>>>> schema.xml and during index phase replaceAll("in stopwords.txt"," ") then
>>>>>>> add to solr. Is that correct ?
>>>>>>> 
>>>>>>> Thanks David
>>>>>>> 
>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <hastings.recursive@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Fwd to another server
>>>>>>>> 
>>>>>>>> no,
>>>>>>>>           <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>> words="stopwords.txt"/>
>>>>>>>> 
>>>>>>>> is still using stopwords and should be removed, in my opinion of course,
>>>>>>>> based on your use case may be different, but i generally axe any reference
>>>>>>>> to them at all
>>>>>>>> 
>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <gviteri@ebi.ac.uk> wrote:
>>>>>>>> 
>>>>>>>>> Thanks.
>>>>>>>>> Haven't I done this here ?
>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>       <analyzer type="index">
>>>>>>>>>           <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>           <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>           <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>           <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>           <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>       </analyzer>
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <hastings.recursive@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Fwd to another server
>>>>>>>>>> 
>>>>>>>>>> The first thing you should do is remove any reference to stop words and
>>>>>>>>>> never use them, then re-index your data and try it again.
>>>>>>>>>> 
>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <gviteri@ebi.ac.uk> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> I am performing a search to match a name (text_field), however this term
>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any records. If I remove 'a'
>>>>>>>>>>> then it works.
>>>>>>>>>>> e.g
>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
>>>>>>>>>>> doesn't work:
>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>> 
>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>>>> works:
>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>> interested in the first result
>>>>>>>>>>> 
>>>>>>>>>>> schema.xml
>>>>>>>>>>> <field name="name" type="text_field" indexed="true" stored="true"
>>>>>>>>>>> omitNorms="false" required="true" multiValued="false"/>
>>>>>>>>>>> 
>>>>>>>>>>>   <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>> positionIncrementGap="100" omitNorms="false">
>>>>>>>>>>>       <analyzer type="index">
>>>>>>>>>>>           <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>           <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>           <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>           <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>           <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>       </analyzer>
>>>>>>>>>>>       <analyzer type="query">
>>>>>>>>>>>           <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>           <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>           <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>           <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>> pattern="[_]" replacement=" "/>
>>>>>>>>>>>           <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>           <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>           <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>       </analyzer>
>>>>>>>>>>>   </fieldType>
>>>>>>>>>>> 
>>>>>>>>>>> stopwords.txt
>>>>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>>>>>>>>> a
>>>>>>>>>>> b
>>>>>>>>>>> c
>>>>>>>>>>> ....
>>>>>>>>>>> an
>>>>>>>>>>> and
>>>>>>>>>>> are
>>>>>>>>>>> 
>>>>>>>>>>> Running SolR 6.6.2.
>>>>>>>>>>> 
>>>>>>>>>>> Is there anything I could do to prevent this?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> Guilherme
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> --
>>>>> --
>>>>> Regards,
>>>>> 
>>>>> *Paras Lehana* [65871]
>>>>> Development Engineer, Auto-Suggest,
>>>>> IndiaMART Intermesh Ltd.
>>>>> 
>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>> Noida, UP, IN - 201303
>>>>> 
>>>>> Mob.: +91-9560911996
>>>>> Work: 01203916600 | Extn:  *8173*
>>>>> 
>>>>> --
>>>>> IMPORTANT:
>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>>> 
>>> 
>> 
>> 
> 

