lucene-solr-user mailing list archives

From Roman Chyla <roman.ch...@gmail.com>
Subject Re: Reverse query?
Date Fri, 02 Oct 2015 17:32:39 GMT
I'd like to offer another option:

you say you want to match a long query against a document - but maybe you
won't know whether to pick "Mad Max" or "Max is" (not to mention the
performance hit of a "*mad max*" wildcard search - or is that not the case
anymore?). Take a look at word n-grams, which Solr calls shingles
(ShingleFilterFactory; the NGram tokenizer itself works on characters),
say of size 2 or bigger. What it does: it splits the input into
overlapping segments of N words (character n-grams work too - just pick
a bigger N)

mad max
max 1979
1979 australian

I'd recommend placing the StopFilter before the n-gram step (that is why
"is" and "a" are missing from the segments above)

 - then for the long query string of "Hey Mad Max is 1979...." you
would search "hey mad" OR "mad max" OR "max 1979"... (perhaps the query
tokenizer could be convinced to do the search for you automatically). And
voila, the more overlapping segments match, the higher the document
scores.
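
A minimal Python sketch of the overlapping-word-segment idea above. It is
not Solr code and only imitates what a stop filter followed by word
n-grams would produce; the stopword list, the field name "text_shingles"
and all function names are made up for illustration.

```python
import re

# Tiny illustrative stopword list (an assumption, not Solr's default list).
STOPWORDS = {"a", "an", "the", "is", "of", "by", "and", "in"}


def tokens(text):
    """Lowercase the text, split on non-alphanumerics, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]


def shingles(text, size=2):
    """Return overlapping segments of `size` consecutive words."""
    toks = tokens(text)
    return [" ".join(toks[i:i + size]) for i in range(len(toks) - size + 1)]


def or_query(text, field="text_shingles", size=2):
    """Join the segments into an OR query of quoted phrases.

    The field name is hypothetical; use whatever field holds the segments.
    """
    return " OR ".join(f'{field}:"{s}"' for s in shingles(text, size))


def overlap(doc, query, size=2):
    """Count the distinct segments shared by a document and a query."""
    return len(set(shingles(doc, size)) & set(shingles(query, size)))


if __name__ == "__main__":
    print(shingles("Mad Max is a 1979 Australian"))
    print(or_query("Hey Mad Max is 1979...."))
    print(overlap("Mad Max is a 1979 Australian", "Hey Mad Max is 1979...."))
```

Here the indexed text yields "mad max", "max 1979" and "1979 australian",
and the long query shares two of those segments with it. In a real setup
the scoring would be left to Solr; the overlap count only stands in for
"more matching segments, higher score".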

hth,

roman



On Fri, Oct 2, 2015 at 12:03 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> The admin/analysis page is your friend here, find it and use it ;)
> Note you have to select a core on the admin UI screen before you can
> see the choice.
>
> Because apart from the other comments, KeywordTokenizer is a red flag.
> It does NOT break anything up into tokens, so if your doc contains:
> Mad Max is a 1979 Australian
> as the whole field, the _only_ match you'll ever get is if you search exactly
> "Mad Max is a 1979 Australian"
> Not Mad, not mad, not Max, exactly all 6 words separated by exactly one space.
>
> Andrea's suggestion is the one you want, but be sure you use one of
> the tokenizing analysis chains, perhaps start with text_en (in the
> stock distro). Be sure to completely remove your node/data directory
> (as in rm -rf data) after you make the change.
>
> And really, explore the admin/analysis page; it's where a LOT of these
> kinds of problems find solutions ;)
>
> Best,
> Erick
>
> On Fri, Oct 2, 2015 at 7:57 AM, Ravi Solr <ravisolr@gmail.com> wrote:
>> Hello Remi,
>>             I am assuming the field where you store the data is analyzed.
>> The field definition might help us answer your question better. If you are
>> using the edismax handler for your search requests, I believe you can achieve
>> your goal by setting your "mm" to 100% and the phrase slop "ps" and query slop
>> "qs" parameters to zero. I think that will force exact matches.
>>
>> Thanks
>>
>> Ravi Kiran Bhaskar
>>
>> On Fri, Oct 2, 2015 at 9:48 AM, Andrea Roggerone <
>> andrearoggerone.osrc@gmail.com> wrote:
>>
>>> Hi Remy,
>>> The question is not really clear, could you explain a little bit better
>>> what you need? Reading your email I understand that you want to get
>>> documents containing all the search terms typed. For instance if you search
>>> for "Mad Max", you wanna get documents containing both Mad and Max. If
>>> that's your need, you can use a phrase query like:
>>>
>>> "Mad Max"~2
>>>
>>> where enclosing your keywords between double quotes means that you want to
>>> get both Mad and Max and the optional parameter ~2 is an example of *slop*.
>>> If you need more info you can look for *Phrase Query* in
>>> https://wiki.apache.org/solr/SolrRelevancyFAQ
>>>
>>> On Fri, Oct 2, 2015 at 2:33 PM, remi tassing <tassingremi@gmail.com>
>>> wrote:
>>>
>>> > Hi,
>>> > I have medium-low experience on Solr and I have a question I couldn't
>>> > quite solve yet.
>>> >
>>> > Typically we have quite short query strings (a couple of words) and the
>>> > search is done through a set of bigger documents. What if the logic is
>>> > turned a little bit around? I have a document and I need to find out what
>>> > strings appear in the document. A string here could be a person name
>>> > (including space for example) or a location...which are indexed in Solr.
>>> >
>>> > A concrete example, we take this text from Wikipedia (Mad Max):
>>> > "*Mad Max is a 1979 Australian dystopian action film directed by George
>>> > Miller <https://en.wikipedia.org/wiki/George_Miller_%28director%29>.
>>> > Written by Miller and James McCausland from a story by Miller and producer
>>> > Byron Kennedy <https://en.wikipedia.org/wiki/Byron_Kennedy>, it tells a
>>> > story of societal breakdown
>>> > <https://en.wikipedia.org/wiki/Societal_collapse>, murder, and vengeance
>>> > <https://en.wikipedia.org/wiki/Revenge>. The film, starring the
>>> > then-little-known Mel Gibson <https://en.wikipedia.org/wiki/Mel_Gibson>,
>>> > was released internationally in 1980. It became a top-grossing Australian
>>> > film, while holding the record in the Guinness Book of Records
>>> > <https://en.wikipedia.org/wiki/Guinness_Book_of_Records> for decades as the
>>> > most profitable film ever created,[1]
>>> > <https://en.wikipedia.org/wiki/Mad_Max_%28franchise%29#cite_note-1> and has
>>> > been credited for further opening the global market to Australian New Wave
>>> > <https://en.wikipedia.org/wiki/Australian_New_Wave> films.*
>>> > <https://en.wikipedia.org/wiki/Mad_Max_%28franchise%29#cite_note-2>
>>> > <https://en.wikipedia.org/wiki/Mad_Max_%28franchise%29#cite_note-3>"
>>> >
>>> > I would like it to match "Mad Max" but not "Mad" or "Max" separately, and
>>> > likewise "George Miller", "global market" ...
>>> >
>>> > I've tried the KeywordTokenizer but it didn't work. I suppose it's ok for
>>> > the index time but not query time (in this specific case)
>>> >
>>> > I had a look at Luwak but it's not what I'm looking for
>>> > (http://www.flax.co.uk/blog/2013/12/06/introducing-luwak-a-library-for-high-performance-stored-queries/)
>>> >
>>> > The typical name search doesn't seem to work either,
>>> > https://dzone.com/articles/tips-name-search-solr
>>> >
>>> > I was thinking this problem must have already been solved...or?
>>> >
>>> > Remi
>>> >
>>>
