lucene-solr-user mailing list archives

From John Bickerstaff <j...@johnbickerstaff.com>
Subject Re: Solr Cloud and Multi-word Synonyms :: synonym_edismax parser
Date Mon, 30 May 2016 21:02:02 GMT
So I'm looking at the solution mentioned here:
https://lucidworks.com/blog/2014/07/12/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/

The thing that's troubling me slightly is that the way it's documented it
seems to be missing a small but important link...

What exactly causes the results listed to be returned?

Here's my thought process:

1. The entry for /autophrase searchHandler does not specify a default
search field.
2. The field type "text_autophrase" is set up as the one with the
AutoPhrasingFilterFactory as part of its indexing.

There isn't any mention (perhaps because it's too obvious) of the need to
copy or otherwise get data into the "text_autophrase" field at index time.

There isn't any explicit listing of "text_autophrase" as the default search
field in the /autophrase search handler.

There isn't any explicit "df=text_autophrase" in the example query:
[/autophrase?q=New+York]

Therefore it seems to me that if someone tries to implement this, they're
going to be disappointed in the results unless they:
a. copy or otherwise get ALL the text they're interested in -- into the
"text_autophrase" field as part of the schema.xml setup (to happen at index
time)
b. somehow explicitly declare "text_autophrase" as the default search field
- either in the searchHandler or wherever else the default field is
configured.
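
For point (b), one option is to pass the default field explicitly on each
request instead of relying on handler configuration. This is only a minimal
sketch: "df" is Solr's standard default-field parameter, but the host, port,
and collection name below are made-up placeholders, and the helper function
is hypothetical. Point (a) would normally be handled separately, for example
with a copyField into "text_autophrase" in schema.xml, which is not shown.

```python
from urllib.parse import urlencode


def autophrase_query_url(base_url, query, default_field="text_autophrase"):
    """Build a URL for the /autophrase handler with an explicit df parameter.

    base_url is a hypothetical Solr collection URL; default_field assumes the
    field in the blog post's schema is named "text_autophrase".
    """
    # urlencode uses quote_plus, so spaces in the query become "+".
    params = urlencode({"q": query, "df": default_field})
    return f"{base_url}/autophrase?{params}"


# Placeholder host and collection name, for illustration only.
print(autophrase_query_url("http://localhost:8983/solr/collection1", "New York"))
```

If the handler already declares a default field, the explicit df is
redundant but harmless, and it makes the intent visible in the request.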

If anyone out there has done this specific approach - could you validate
whether my thought process is correct and / or if I'm missing something?
Yes - I get that I can set it all up and try - but it's what I don't know I
don't know that bothers me...
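
For anyone following along, the underlying problem Steve describes in the
quoted message below (the query parser splitting on whitespace before
analysis) can be shown with a toy model. This is not Lucene code: the synonym
rule, the function names, and the "filter" are all invented for illustration,
and real analysis chains are far more involved.

```python
# Made-up multi-word synonym rule, keyed by the token sequence it matches.
SYNONYMS = {("heart", "attack"): "myocardial_infarction"}


def apply_synonyms(tokens):
    """Toy synonym filter: keep the tokens and append any synonym whose
    full multi-word key appears as a consecutive run in the token list."""
    out = list(tokens)
    for key, synonym in SYNONYMS.items():
        n = len(key)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == key:
                out.append(synonym)
    return out


def analyze_whole(text):
    """Analyze the whole text at once, so the filter sees both words."""
    return apply_synonyms(text.lower().split())


def analyze_split_first(text):
    """Split on whitespace first, then analyze one word at a time,
    so the filter never sees two adjacent words together."""
    out = []
    for word in text.split():
        out.extend(apply_synonyms([word.lower()]))
    return out


print(analyze_whole("heart attack"))
print(analyze_split_first("heart attack"))
```

In this model only analyze_whole adds the synonym, which is the reason the
index-time approaches (such as auto-phrasing) and the custom query parsers
discussed in this thread exist.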

On Fri, May 27, 2016 at 11:57 AM, John Bickerstaff <john@johnbickerstaff.com> wrote:

> Thank you Steve -- very helpful.
>
> I can see that whatever implementation I decide to try, some testing will
> be in order.  If anyone is aware of significant gotchas with this synonym
> thing that are not mentioned in the already-listed URLs, please feel free
> to comment.
>
> On Fri, May 27, 2016 at 10:28 AM, Steve Rowe <sarowe@gmail.com> wrote:
>
>> I’m working on addressing problems using multi-term synonyms at query
>> time in Lucene and Solr.
>>
>> I recommend these two blogs for understanding the issues (the second one
>> was mentioned earlier in this thread):
>>
>> <http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html>
>> <https://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/>
>>
>> In addition to the already-mentioned projects, there is also:
>>
>> <https://issues.apache.org/jira/browse/SOLR-5379>
>>
>> All of these projects try in various ways to work around the fact that
>> Lucene’s QueryParser splits on whitespace before sending text to analysis,
>> one token at a time, so in a synonym filter, multi-word synonyms can never
>> match and add alternatives.  See
>> <https://issues.apache.org/jira/browse/LUCENE-2605>, where I’ve posted a
>> patch to directly address that problem - note that it’s still a work in
>> progress.
>>
>> Once LUCENE-2605 has been fixed, there is still work to do getting
>> (e)dismax to work with the modified Lucene QueryParser, and addressing
>> problems with how queries are constructed from Lucene’s “sausagized” token
>> stream.
>>
>> --
>> Steve
>> www.lucidworks.com
>>
>> > On May 26, 2016, at 2:21 PM, John Bickerstaff <john@johnbickerstaff.com> wrote:
>> >
>> > Thanks Chris --
>> >
>> > The two projects I'm aware of are:
>> >
>> > https://github.com/healthonnet/hon-lucene-synonyms
>> >
>> > and the one referenced from the Lucidworks page here:
>> >
>> > https://lucidworks.com/blog/2014/07/12/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
>> >
>> > ... which is here: https://github.com/LucidWorks/auto-phrase-tokenfilter
>> >
>> > Is there anything else out there that you would recommend I look at?
>> >
>> > On Thu, May 26, 2016 at 12:01 PM, Chris Morley <chris@depahelix.com> wrote:
>> >
>> >> Chris Morley here, from Wayfair.  (Depahelix = my domain)
>> >>
>> >> Suyash Sonawane and I have worked on multiple word synonyms at Wayfair.
>> >> We worked mostly off of Ted Sullivan's work and also off of some
>> >> suggestions from Koorosh Vakhshoori.  We have gotten to a point where we
>> >> have a more sophisticated internal implementation, however, we've found
>> >> that it is very difficult to make it do what you want it to do, and also be
>> >> sufficiently performant.  Watch out for exceptional situations with mm
>> >> (minimum should match).
>> >>
>> >> Trey Grainger (now at Lucidworks) and Simon Hughes of Dice.com have also
>> >> done work in this area.
>> >>
>> >> It should be very possible to get this kind of thing working on
>> >> SolrCloud.  I haven't tried it yet but I think theoretically, it should
>> >> just work.  The synonyms stuff is mostly about doing things at index time
>> >> and query time.  The index time stuff should translate to SolrCloud
>> >> directly, while the query time stuff might pose some issues, but probably
>> >> not too bad, if there are any issues at all.
>> >>
>> >> I've had decent luck porting our various plugins from 4.10.x to 5.5.0
>> >> because a lot of stuff is just Java, and it still works within the Jetty
>> >> context.
>> >>
>> >> -Chris.
>> >>
>> >>
>> >>
>> >>
>> >> ----------------------------------------
>> >> From: "John Bickerstaff" <john@johnbickerstaff.com>
>> >> Sent: Thursday, May 26, 2016 1:51 PM
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: Solr Cloud and Multi-word Synonyms :: synonym_edismax parser
>> >> Hey Jeff (or anyone interested in multi-word synonyms) here are some
>> >> potentially interesting links...
>> >>
>> >> http://wiki.apache.org/solr/QueryParser (search the page for
>> >> synonym_edismax)
>> >>
>> >> https://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ (blog
>> >> post about what became the synonym_edismax Query Parser)
>> >>
>> >> https://lucidworks.com/blog/2014/07/12/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
>> >>
>> >> This last was useful for lots of reasons and contains links to other
>> >> interesting, related web pages...
>> >>
>> >> On Thu, May 26, 2016 at 11:45 AM, Jeff Wartes <jwartes@whitepages.com> wrote:
>> >>
>> >>> Oh, interesting. I've certainly encountered issues with multi-word
>> >>> synonyms, but I hadn't come across this. If you end up using it with a
>> >>> recent Solr version, I'd be glad to hear your experience.
>> >>>
>> >>> I haven't used it, but I am aware of one other project in this vein that
>> >>> you might be interested in looking at:
>> >>> https://github.com/LucidWorks/auto-phrase-tokenfilter
>> >>>
>> >>>
>> >>> On 5/26/16, 9:29 AM, "John Bickerstaff" <john@johnbickerstaff.com> wrote:
>> >>>
>> >>>> Ahh - for question #3 I may have spoken too soon. This line from the
>> >>>> github repository readme suggests a way.
>> >>>>
>> >>>> Update: We have tested to run with the jar in $SOLR_HOME/lib as well, and
>> >>>> it works (Jetty).
>> >>>>
>> >>>> I'll try that and only respond back if that doesn't work.
>> >>>>
>> >>>> Questions 1 and 2 still stand of course... If anyone on the list has
>> >>>> experience in this area...
>> >>>>
>> >>>> Thanks.
>> >>>>
>> >>>> On Thu, May 26, 2016 at 10:25 AM, John Bickerstaff <john@johnbickerstaff.com> wrote:
>> >>>>
>> >>>>> Hi all,
>> >>>>>
>> >>>>> I'm creating a Solr Cloud that will index and search medical text.
>> >>>>> Multi-word synonyms are a pretty important factor.
>> >>>>>
>> >>>>> I find that there are some challenges around multi-word synonyms and I
>> >>>>> also found on the wiki that there is a recommended 3rd-party parser
>> >>>>> (synonym_edismax parser) created by Nolan Lawson and found here:
>> >>>>> https://github.com/healthonnet/hon-lucene-synonyms
>> >>>>>
>> >>>>> Here's the thing - the instructions on the github site involve bringing
>> >>>>> the jar file into the war file - which is not applicable any more... at
>> >>>>> least I think it's not...
>> >>>>>
>> >>>>> I have three questions:
>> >>>>>
>> >>>>> 1. Is this still a good solution for multi-word synonyms (i.e. Solr Cloud
>> >>>>> doesn't break it in some way)?
>> >>>>> 2. Is there a tool or plug-in out there that the contributors would
>> >>>>> recommend above this one?
>> >>>>> 3. Assuming 1 = yes and 2 = no, can anyone tell me an updated procedure
>> >>>>> for bringing it in to Solr Cloud (I'm running 5.4.x)?
>> >>>>>
>> >>>>> Thanks
>> >>>>>
>> >>>
>> >>>
>> >>
>> >>
>> >>
>>
>>
>
