lucene-solr-user mailing list archives

From Kaushik <kaushika...@gmail.com>
Subject Re: Mutli term synonyms
Date Wed, 29 Apr 2015 13:17:17 GMT
Hi Roman,

Following is my use case:

*Schema.xml*...

   <field name="name" type="text_autophrase" indexed="true" stored="true"/>

<fieldType name="text_autophrase" class="solr.TextField"
           positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory"
                phrases="autophrases.txt" includeTokens="false"
                replaceWhitespaceWith="X" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
      </analyzer>
    </fieldType>

*SolrConfig.xml...*

  <requestHandler name="/autophrase" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="echoParams">explicit</str>
     <int name="rows">10</int>
     <str name="df">name</str>
     <str name="defType">autophrasingParser</str>
   </lst>
  </requestHandler>

  <queryParser name="autophrasingParser"
               class="com.lucidworks.analysis.AutoPhrasingQParserPlugin" >
    <str name="phrases">autophrases.txt</str>
    <str name="replaceWhitespaceWith">X</str>
  </queryParser>


*Synonyms.txt....*
PEG-20 SORBITAN LAURATE,POLYOXYETHYLENE 20 SORBITAN MONOLAURATE,TWEEN
20,POLYSORBATE 20 [USAN],POLYSORBATE 20 [INCI],POLYSORBATE 20
[II],POLYSORBATE 20 [HSDB],TWEEN-20,PEG-20 SORBITAN,PEG-20 SORBITAN
[VANDF],POLYSORBATE-20,POLYSORBATE 20,SORETHYTAN MONOLAURATE,T-MAZ
20,POLYOXYETHYLENE (20) SORBITAN MONOLAURATE,SORBITAN
MONODODECANOATE,POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE,POLYOXYETHYLENE
SORBITAN MONOLAURATE,POLYSORBATE 20 [MART.],SORBIMACROGOL LAURATE
300,POLYSORBATE 20 [FHFI],FEMA NO. 2915,POLYSORBATE 20 [FCC],POLYSORBATE 20
[WHO-DD],POLYSORBATE 20 [VANDF]

*Autophrases.txt...*

Contains all of the phrases above, one per line.

*Indexed document....*

<doc>
  <field name="id">31</field>
  <field name="name">Polysorbate 20</field>
  </doc>

So when I query Solr at /autophrase for tween 20 or FEMA NO. 2915, I expect
to see the record containing Polysorbate 20, i.e.
http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true
should have retrieved it, but it doesn't.

What could I be doing wrong?
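
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: if the auto-phrasing step joins the words of a phrase with the replaceWhitespaceWith character, then the entries in synonyms.txt would need to be written in that same joined form for the synonym filter to match them. Also, as far as I know, the Solr synonym format treats a comma as a separator unless it is escaped with a backslash, which matters for an entry like POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE. The Python sketch below rewrites a synonyms line into the joined form; the helper names are hypothetical and the exact token form the filter emits has not been verified here.

```python
import re


def split_unescaped_commas(line):
    """Split a synonyms line on commas that are not preceded by a backslash."""
    return [part.strip() for part in re.split(r"(?<!\\),", line) if part.strip()]


def normalize_entry(entry, replacement="X"):
    """Lowercase an entry and join its whitespace-separated words with `replacement`.

    Assumption: this mirrors LowerCaseFilter followed by auto-phrasing with
    replaceWhitespaceWith="X"; the real filter output may differ.
    """
    return replacement.join(entry.lower().split())


def normalize_line(line, replacement="X"):
    """Normalize every entry on one synonyms line and re-join with commas."""
    entries = split_unescaped_commas(line)
    return ",".join(normalize_entry(e, replacement) for e in entries)


if __name__ == "__main__":
    print(normalize_line("TWEEN 20,POLYSORBATE 20,FEMA NO. 2915"))
```

Comparing the printed line against what the Analysis screen in the Solr admin UI shows for the name field would tell whether the two forms actually agree.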

On Wed, Apr 29, 2015 at 2:10 AM, Roman Chyla <roman.chyla@gmail.com> wrote:

> I'm not sure I understand - the autophrasing filter will allow the
> parser to see all the tokens, so that they can be parsed (and
> multi-token synonyms identified). So if you are using the same
> analyzer at query and index time, both should see the same
> tokens.
>
> Are you using multi-token synonyms, or just entries that look like
> multi-token synonyms? In the first case, the tokens are separated by a
> null byte. In the second case, they are just strings that happen to
> contain whitespace, and your synonym file must contain exactly the same
> entries as your analyzer sees them (and in the same order; or you have
> to use the same analyzer to load the synonym files).
>
> can you post the relevant part of your schema.xml?
>
>
> Note: I can confirm that multi-token synonym expansion can be made to
> work, even in complex cases - we do it - but likely, if you need
> multi-token synonyms, you will also need a smarter query parser.
> Sometimes your users will use query strings that contain overlapping
> synonym entries. To handle that, you will have to know how to generate
> all possible 'reads', for example:
>
> synonym:
>
> foo bar, foobar
> hey foo, heyfoo
>
> user input:
>
> hey foo bar
>
> possible readings:
>
> ((hey foo) +bar) OR (hey +(foo bar))
>
> i'm simplifying it here, the fun starts when you are seeing a phrase query
> :)
>
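
The 'possible readings' idea quoted above can be sketched in a few lines of Python. This is only an illustration under simple assumptions (whitespace-split tokens, exact phrase matches, no phrase-query handling); the function name readings is hypothetical and this is not the parser Roman describes. Note that it also returns the plain reading in which no phrase is recognized.

```python
def readings(tokens, phrases):
    """Return every way to segment `tokens` into single tokens and known phrases.

    `phrases` is a set of tuples, one tuple of words per multi-token entry.
    """
    if not tokens:
        return [[]]
    results = []
    for end in range(1, len(tokens) + 1):
        head = tuple(tokens[:end])
        # A single token is always allowed; longer spans must be known phrases.
        if end == 1 or head in phrases:
            for rest in readings(tokens[end:], phrases):
                results.append([" ".join(head)] + rest)
    return results


if __name__ == "__main__":
    known = {("foo", "bar"), ("hey", "foo")}
    for reading in readings(["hey", "foo", "bar"], known):
        print(reading)
```

Each returned list is one reading; a query parser would then OR the readings together, as in the example above.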
> On Tue, Apr 28, 2015 at 10:31 AM, Kaushik <kaushikadya@gmail.com> wrote:
> > Hi there,
> >
> > I tried the solution provided in
> > https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
> > The mentioned solution works when the indexed data does not have
> > alphanumerics or special characters. But in my case the synonyms are
> > something like the below.
> >
> >
> >  T-MAZ 20  POLYOXYETHYLENE (20) SORBITAN MONOLAURATE  SORBITAN
> > MONODODECANOATE  POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE  POLYOXYETHYLENE
> > SORBITAN MONOLAURATE  POLYSORBATE 20 [MART.]  SORBIMACROGOL LAURATE
> > 300  POLYSORBATE
> > 20 [FHFI]  FEMA NO. 2915
> >
> > They have alphanumerics, special characters, spaces, etc. Is there a way
> > to implement synonyms even in such cases?
> >
> > Thanks,
> > Kaushik
> >
> > On Mon, Apr 20, 2015 at 11:03 AM, Davis, Daniel (NIH/NLM) [C] <
> > daniel.davis@nih.gov> wrote:
> >
> >> Handling MESH descriptor preferred terms and such is similar. I
> >> encountered this during evaluation of Solr for a project here at NLM.
> >> We decided to use Solr for different projects instead. I considered the
> >> following approaches:
> >>  - use a custom tokenizer at index time that indexed all of the multiple
> >> term alternatives.
> >>  - index the data, and then have an enrichment process that queries on
> >> each source synonym, and generates an update to add the target synonyms.
> >>    Follow this with an optimize.
> >>  - During the indexing process, but before sending the data to Solr,
> >> process the data to tokenize and add synonyms to another field.
> >>
> >> Both the custom tokenizer and enrichment process share the feature that
> >> they use Solr's own tokenizer rather than duplicate it. The enrichment
> >> process seems to me only workable in environments where you can re-index
> >> all data periodically, so no continuous stream of data to index that
> >> needs to be handled relatively quickly once it is generated. The last
> >> method of pre-processing the data seems the least desirable to me from a
> >> blue-sky perspective, but is probably the easiest to implement and the
> >> most independent of Solr.
> >>
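
The last approach described above (pre-processing before the data reaches Solr) could look roughly like the Python sketch below. It is a minimal illustration, not Dan's implementation: the target field name name_synonyms and both helper functions are hypothetical, and matching is a plain case-insensitive lookup of the whole field value.

```python
def build_synonym_index(groups):
    """Map each lowercased name to the full list of names in its group."""
    index = {}
    for group in groups:
        names = [name.strip() for name in group]
        for name in names:
            index[name.lower()] = names
    return index


def enrich(doc, index, source="name", target="name_synonyms"):
    """Return a copy of `doc` whose `target` field lists the other names of doc[source]."""
    enriched = dict(doc)
    value = doc[source].lower()
    names = index.get(value, [])
    enriched[target] = [n for n in names if n.lower() != value]
    return enriched


if __name__ == "__main__":
    groups = [["Tween 20", "Polysorbate 20", "FEMA NO. 2915"]]
    index = build_synonym_index(groups)
    print(enrich({"id": "31", "name": "Polysorbate 20"}, index))
```

The enriched document would then be sent to Solr with the extra field indexed alongside the original one, so no synonym handling is needed inside the analysis chain.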
> >> Hope this helps,
> >>
> >> Dan Davis, Systems/Applications Architect (Contractor),
> >> Office of Computer and Communications Systems,
> >> National Library of Medicine, NIH
> >>
> >> -----Original Message-----
> >> From: Kaushik [mailto:kaushikadya@gmail.com]
> >> Sent: Monday, April 20, 2015 10:47 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Mutli term synonyms
> >>
> >> Hello,
> >>
> >> Reading up on synonyms it looks like there is no real solution for multi
> >> term synonyms. Is that right? I have a use case where I need to map one
> >> multi term phrase to another, e.g. Tween 20 needs to be translated to
> >> Polysorbate 20.
> >>
> >> Any thoughts as to how this can be achieved?
> >>
> >> Thanks,
> >> Kaushik
> >>
>
