lucene-solr-user mailing list archives

From Umesh Prasad <umesh.i...@gmail.com>
Subject Re: Searching words with spaces for word without spaces in solr
Date Sun, 03 Aug 2014 01:44:24 GMT
I would suggest breaking the problem into smaller parts:
1.  Identify variations (say, compound words) offline, where you can combine
multiple sources to ensure much better quality.
2. Expand the user query at search time using those sources. So the query
will become
    icecream OR  (ice cream)   (with q.op=AND)
   Parse the expanded query using the Lucene query parser. If you are using
dismax/edismax, then I would suggest plugging in a custom query parser that
combines queries from the Lucene query parser and the dismax query parser.
(dismax/edismax doesn't support the full Lucene query syntax.)
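
A minimal sketch of step 2, assuming the offline step has already produced a lookup table of compound-word variants. The table contents and the `expand_query` helper are hypothetical, not part of Solr; the resulting string would be sent to the Lucene query parser with q.op=AND, so that a parenthesized group such as (ice cream) requires both terms.

```python
# Hypothetical output of the offline step: each known form maps to its variants.
COMPOUND_VARIANTS = {
    "icecream": ["ice cream"],
    "ice cream": ["icecream"],
    "handbag": ["hand bag"],
    "hand bag": ["handbag"],
}


def expand_query(user_query: str) -> str:
    """OR the user's query with its known variants, in Lucene query syntax.

    Multi-word forms are wrapped in parentheses and rely on q.op=AND.
    Assumption: the input has no Lucene special characters; a real
    implementation would need to escape them.
    """
    # Lowercase and collapse runs of whitespace so lookups are stable.
    normalized = " ".join(user_query.lower().split())
    if not normalized:
        return ""
    clauses = []
    for phrase in [normalized] + COMPOUND_VARIANTS.get(normalized, []):
        if " " in phrase:
            clauses.append("(" + phrase + ")")
        else:
            clauses.append(phrase)
    return " OR ".join(clauses)


if __name__ == "__main__":
    print(expand_query("icecream"))
    print(expand_query("ice cream"))
```

A query with no known variants passes through unchanged, so the expansion can sit in front of every search request.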





On 31 July 2014 22:39, sunshine glass <sunshineglassof2day@gmail.com> wrote:

> *Point 1:*
> On Thu, Jul 31, 2014 at 9:32 PM, Dyer, James <James.Dyer@ingramcontent.com> wrote:
>
> > If a user is searching on "ice cream" but your index has "icecream", you
> > can treat this like a spelling error.  WordBreakSolrSpellChecker would
> > identify the fact that while "ice cream" is not in your index, "icecream"
> > is, and then you can re-query for the corrected version without the space.
> >
>
> What if I have 1M records for "ice cream" and the same number for "icecream"?
> Then this trick will not work. What is desired in this case is that whether I
> search for "ice cream" or "icecream", Solr should return 2M results.
>
> *Point 2:*
> On Thu, Jul 31, 2014 at 9:32 PM, Dyer, James <James.Dyer@ingramcontent.com> wrote:
>
> > The problem with solving this with analyzers is that you can analyze
> > "ice-cream" as either "ice cream" or "icecream" (split or catenate on
> > hyphen).  You can even analyze "IceCream > Ice Cream" (catenate on case
> > change).  But how is your analyzer going to know that "icecream" should
> > index as two tokens: "ice" "cream" ?  You're asking analysis to do too much
> > in this case. This is where spellcheck can bridge the gap.
>
> I don't want "icecream" to be indexed as "ice" or "cream". I agree that
> this is not feasible. What I am looking for is to create shingles at
> query time as well. In other words, when querying "ice cream", can't it
> search for "ice" or "cream" or "icecream"?
> That is, forming shingles at query time.
>
> There is a long list of such words in my index, so I do not want to implement
> this via the synonym filter factory.
>
>
> On Thu, Jul 31, 2014 at 9:32 PM, Dyer, James <James.Dyer@ingramcontent.com> wrote:
>
> > If a user is searching on "ice cream" but your index has "icecream", you
> > can treat this like a spelling error.  WordBreakSolrSpellChecker would
> > identify the fact that while "ice cream" is not in your index, "icecream"
> > is, and then you can re-query for the corrected version without the space.
> >
> > The problem with solving this with analyzers is that you can analyze
> > "ice-cream" as either "ice cream" or "icecream" (split or catenate on
> > hyphen).  You can even analyze "IceCream > Ice Cream" (catenate on case
> > change).  But how is your analyzer going to know that "icecream" should
> > index as two tokens: "ice" "cream" ?  You're asking analysis to do too much
> > in this case.  This is where spellcheck can bridge the gap.
> >
> > Of course, if you have a discrete list of words you want split like this,
> > then you can do it with analysis using index-time synonyms.  In this case,
> > you need to provide it with the list.  See
> > https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> > for more information.
> >
> > James Dyer
> > Ingram Content Group
> > (615) 213-4311
> >
> >
> > -----Original Message-----
> > From: sunshine glass [mailto:sunshineglassof2day@gmail.com]
> > Sent: Thursday, July 31, 2014 10:32 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Searching words with spaces for word without spaces in solr
> >
> > I am not clear with this. This link is related to spell check. Can you
> > elaborate it more ?
> >
> >
> > On Wed, Jul 30, 2014 at 9:17 PM, Dyer, James <James.Dyer@ingramcontent.com> wrote:
> >
> > > In addition to the analyzer configuration you're using, you might want to
> > > also use WordBreakSolrSpellChecker to catch possible matches that can't
> > > easily be solved through analysis.  For more information, see the section
> > > for it at
> > > https://cwiki.apache.org/confluence/display/solr/Spell+Checking
> > >
> > > James Dyer
> > > Ingram Content Group
> > > (615) 213-4311
> > >
> > > -----Original Message-----
> > > From: sunshine glass [mailto:sunshineglassof2day@gmail.com]
> > > Sent: Wednesday, July 30, 2014 9:38 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Searching words with spaces for word without spaces in solr
> > >
> > > This is the new configuration:
> > >
> > >     <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> > >       <analyzer type="index">
> > >         <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >         <tokenizer class="solr.StandardTokenizerFactory"/>
> > >         <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
> > >                 outputUnigrams="true" tokenSeparator=""/>
> > >         <filter class="solr.WordDelimiterFilterFactory"
> > >                 generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > >                 catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > >         <filter class="solr.LowerCaseFilterFactory"/>
> > >         <filter class="solr.SnowballPorterFilterFactory"
> > >                 language="English" protected="protwords.txt"/>
> > >         <filter class="solr.SynonymFilterFactory"
> > >                 synonyms="stemmed_synonyms_text_prime_index.txt" ignoreCase="true"
> > >                 expand="true"/>
> > >       </analyzer>
> > >       <analyzer type="query">
> > >         <tokenizer class="solr.StandardTokenizerFactory"/>
> > >         <filter class="solr.LowerCaseFilterFactory"/>
> > >         <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >                 words="stopwords_text_prime_search.txt"
> > >                 enablePositionIncrements="true"/>
> > >         <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
> > >                 outputUnigrams="true" tokenSeparator=""/>
> > >         <filter class="solr.WordDelimiterFilterFactory"
> > >                 generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > >                 catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
> > >         <filter class="solr.SnowballPorterFilterFactory"
> > >                 language="English" protected="protwords.txt"/>
> > >       </analyzer>
> > >     </fieldType>
> > > >
> > > >
> > > These are current docs in my index:
> > >
> > > <result name="response" numFound="3" start="0">
> > > <doc>
> > > <str name="id">2</str>
> > > <str name="title">Icecream</str>
> > > <long name="_version_">1475063961342705664</long>
> > > </doc>
> > > <doc>
> > > <str name="id">3</str>
> > > <str name="title">Ice-cream</str>
> > > <long name="_version_">1475063961344802816</long>
> > > </doc>
> > > <doc>
> > > <str name="id">1</str>
> > > <str name="title">Ice Cream</str>
> > > <long name="_version_">1475063961203245056</long>
> > > </doc>
> > > </result>
> > > </response>
> > >
> > > Query:
> > >
> > > http://localhost:8983/solr/collection1/select?q=title:ice+cream&debug=true
> > >
> > > Response:
> > >
> > > <result name="response" numFound="2" start="0">
> > > <doc>
> > > <str name="id">1</str>
> > > <str name="title">Ice Cream</str>
> > > <long name="_version_">1475063961203245056</long>
> > > </doc>
> > > <doc>
> > > <str name="id">3</str>
> > > <str name="title">Ice-cream</str>
> > > <long name="_version_">1475063961344802816</long>
> > > </doc>
> > > </result>
> > > <lst name="debug">
> > > <str name="rawquerystring">title:ice cream</str>
> > > <str name="querystring">title:ice cream</str>
> > > <str name="parsedquery">
> > > (+(title:ice DisjunctionMaxQuery((title:cream))))/no_coord
> > > </str>
> > > <str name="parsedquery_toString">+(title:ice (title:cream))</str>
> > > <lst name="explain">
> > > <str name="1">
> > > 0.875 = (MATCH) sum of: 0.4375 = (MATCH) weight(title:ice in 0)
> > > [DefaultSimilarity], result of: 0.4375 = score(doc=0,freq=2.0 =
> > > termFreq=2.0 ), product of: 0.70710677 = queryWeight, product of: 1.0 =
> > > idf(docFreq=2, maxDocs=3) 0.70710677 = queryNorm 0.61871845 =
> fieldWeight
> > > in 0, product of: 1.4142135 = tf(freq=2.0), with freq of: 2.0 =
> > > termFreq=2.0 1.0 = idf(docFreq=2, maxDocs=3) 0.4375 = fieldNorm(doc=0)
> > > 0.4375 = (MATCH) weight(title:cream in 0) [DefaultSimilarity], result
> of:
> > > 0.4375 = score(doc=0,freq=2.0 = termFreq=2.0 ), product of: 0.70710677
> =
> > > queryWeight, product of: 1.0 = idf(docFreq=2, maxDocs=3) 0.70710677 =
> > > queryNorm 0.61871845 = fieldWeight in 0, product of: 1.4142135 =
> > > tf(freq=2.0), with freq of: 2.0 = termFreq=2.0 1.0 = idf(docFreq=2,
> > > maxDocs=3) 0.4375 = fieldNorm(doc=0)
> > > </str>
> > > <str name="3">
> > > 0.70710677 = (MATCH) sum of: 0.35355338 = (MATCH) weight(title:ice in
> 2)
> > > [DefaultSimilarity], result of: 0.35355338 = score(doc=2,freq=1.0 =
> > > termFreq=1.0 ), product of: 0.70710677 = queryWeight, product of: 1.0 =
> > > idf(docFreq=2, maxDocs=3) 0.70710677 = queryNorm 0.5 = fieldWeight in
> 2,
> > > product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 1.0 =
> > > idf(docFreq=2, maxDocs=3) 0.5 = fieldNorm(doc=2) 0.35355338 = (MATCH)
> > > weight(title:cream in 2) [DefaultSimilarity], result of: 0.35355338 =
> > > score(doc=2,freq=1.0 = termFreq=1.0 ), product of: 0.70710677 =
> > > queryWeight, product of: 1.0 = idf(docFreq=2, maxDocs=3) 0.70710677 =
> > > queryNorm 0.5 = fieldWeight in 2, product of: 1.0 = tf(freq=1.0), with
> > freq
> > > of: 1.0 = termFreq=1.0 1.0 = idf(docFreq=2, maxDocs=3) 0.5 =
> > > fieldNorm(doc=2)
> > > </str>
> > > </lst>
> > >
> > > Still not working ????
> > >
> > >
> > > On Fri, May 30, 2014 at 9:21 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> > >
> > > > I'd spend some time with the admin/analysis page to understand the exact
> > > > tokenization going on here. For instance, sequencing the
> > > > ShingleFilterFactory before WordDelimiterFilterFactory may produce
> > > > "interesting" results. And then throwing the Snowball factory at it and
> > > > putting synonyms in front.... I suspect you're not indexing or searching
> > > > what you think you are.
> > > >
> > > > Second, what happens when you query with &debug=query? That'll show you
> > > > what the search string looks like.
> > > >
> > > > If that doesn't help, please post the results of looking at those things
> > > > here, that'll provide some information for us to work with.
> > > >
> > > > Best,
> > > > Erick
> > > >
> > > >
> > > > On Fri, May 30, 2014 at 3:32 AM, sunshine glass <
> > > > sunshineglassof2day@gmail.com> wrote:
> > > >
> > > > > Hi Folks,
> > > > >
> > > > > Any updates ??
> > > > >
> > > > >
> > > > > On Wed, May 28, 2014 at 12:13 PM, sunshine glass <
> > > > > sunshineglassof2day@gmail.com> wrote:
> > > > >
> > > > > > Dear Team,
> > > > > >
> > > > > > How can I handle compound word searches in Solr?
> > > > > > How can I search for "hand bag" if I have "handbag" in my index? While
> > > > > > using a shingle filter in the query analyzer, the query "ice cube"
> > > > > > creates three tokens: "ice", "cube", "icecube". Only "ice" and "cube"
> > > > > > are searched, but not "icecube", i.e. it is not working for the pair
> > > > > > even though I am using the shingle filter.
> > > > > >
> > > > > > Here's the schema config.
> > > > > >
> > > > > >
> > > > > >     <fieldType name="text" class="solr.TextField"
> > > > > >                positionIncrementGap="100">
> > > > > >       <analyzer type="index">
> > > > > >         <filter class="solr.SynonymFilterFactory"
> > > > > >                 synonyms="synonyms_text_prime_index.txt" ignoreCase="true"
> > > > > >                 expand="true"/>
> > > > > >         <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > > > > >         <tokenizer class="solr.StandardTokenizerFactory"/>
> > > > > >         <filter class="solr.ShingleFilterFactory"
> > > > > >                 maxShingleSize="2" outputUnigrams="true" tokenSeparator=""/>
> > > > > >         <filter class="solr.WordDelimiterFilterFactory"
> > > > > >                 catenateWords="1" catenateNumbers="1" catenateAll="1"
> > > > > >                 preserveOriginal="1"
> > > > > >                 generateWordParts="1" generateNumberParts="1"/>
> > > > > >         <filter class="solr.LowerCaseFilterFactory"/>
> > > > > >         <filter class="solr.SnowballPorterFilterFactory"
> > > > > >                 language="English" protected="protwords.txt"/>
> > > > > >       </analyzer>
> > > > > >       <analyzer type="query">
> > > > > >         <tokenizer class="solr.StandardTokenizerFactory"/>
> > > > > >         <filter class="solr.SynonymFilterFactory"
> > > > > >                 synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> > > > > >         <filter class="solr.ShingleFilterFactory"
> > > > > >                 maxShingleSize="2" outputUnigrams="true" tokenSeparator=""/>
> > > > > >         <filter class="solr.WordDelimiterFilterFactory"
> > > > > >                 preserveOriginal="1"/>
> > > > > >         <filter class="solr.LowerCaseFilterFactory"/>
> > > > > >         <filter class="solr.SnowballPorterFilterFactory"
> > > > > >                 language="English" protected="protwords.txt"/>
> > > > > >       </analyzer>
> > > > > >     </fieldType>
> > > > > >
> > > > > >    Any help is appreciated.
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>



-- 
---
Thanks & Regards
Umesh Prasad
