lucene-solr-user mailing list archives

From elisabeth benoit <elisaelisael...@gmail.com>
Subject Re: ngrams with position
Date Thu, 10 Mar 2016 11:02:47 GMT
That's the use case, yes: find Amsterdam when the user types Asmtreadm.

And yes, we only do approximate search when we get 0 results.

I don't quite get why pf2 and pf3 are not a good solution.

We're actually testing a solution close to phonetic matching: some kind of
word reduction.

Thanks for the suggestion (and the link); this makes me think that phonetic
matching may be the right solution.

Thanks for your help,
Elisabeth
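
To illustrate why the fuzzy operator and the spellchecker mentioned further
down the thread cannot cover this case, here is a minimal sketch that computes
the plain Levenshtein distance between the two spellings from the example. The
function name levenshtein and the constant MAX_FUZZY_EDITS are made up for this
illustration and are not Solr APIs; the limit of 2 is the one quoted in the
thread. Implementations that count a transposition as a single edit may report
a smaller number than this one.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


# Hypothetical constant: the edit-distance limit quoted in the thread.
MAX_FUZZY_EDITS = 2

distance = levenshtein("amsterdam", "asmtreadm")
within_fuzzy_limit = distance <= MAX_FUZZY_EDITS
print(distance, within_fuzzy_limit)
```

With three swapped letter pairs, the misspelling ends up beyond the limit of 2,
which is why a different matching strategy is being discussed here.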

2016-03-10 11:32 GMT+01:00 Alessandro Benedetti <abenedetti@apache.org>:

> Hmm, if I followed correctly, your use case is:
>
> I type Asmtreadm and I want documents matching Amsterdam (even if the edit
> distance is greater than 2).
> First of all, I hope you do this only when you get 0 results; otherwise
> the overhead can be large and you will lose a lot of precision,
> confusing the customer.
>
> pf2 and pf3 work on n-grams of whitespace-separated tokens (word bigrams
> and trigrams), issuing partial phrase queries that affect the scoring.
> They are not a good fit for your problem.
>
> Rather than n-grams, have you considered some sort of phonetic matching?
> Could this help:
> https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
>
> Cheers
>
> On 10 March 2016 at 08:47, elisabeth benoit <elisaelisaelisa@gmail.com>
> wrote:
>
> > I am trying to do approximate search with Solr. We've tried fuzzy search
> > and spellcheck search. They work OK, but the edit distance is limited (to 2
> > for DirectSolrSpellChecker in Solr 4.10.1). With the fuzzy operator we've
> > had performance issues, and I don't think you can have an edit distance
> > greater than 2.
> >
> > What we used to do with a database was more efficient: storing trigrams
> > with their position, and then searching around that position (not exactly
> > at that position, since it's approximate search).
> >
> > The position is there so that a trigram like ams (amsterdam) does not match
> > words where the same trigram occurs, for instance, at the end of the word.
> > I would like answers with the same relative position between trigrams to
> > score higher. Maybe using edismax's pf2 and pf3 is a way to do this. I
> > don't see any other way. Please tell me if you do.
> >
> > From your answer, I get that position is stored, but I don't understand
> > how I can preserve the relative order between trigrams, apart from using
> > pf2 and pf3.
> >
> > Best regards,
> > Elisabeth
> >
> > 2016-03-10 0:02 GMT+01:00 Alessandro Benedetti <abenedetti@apache.org>:
> >
> > > If you store the positions for your tokens (and they are stored by
> > > default unless you omit them), you have the relative position in the
> > > index. [1]
> > > I attach a blog post of mine that describes the Lucene internals in a
> > > little more detail.
> > >
> > > Apart from that, can you explain the problem you are trying to solve?
> > > What is the high-level user experience?
> > > What kind of search/autocompletion/relevancy tuning are you trying to
> > > achieve?
> > > Maybe we can help better if we start from the problem :)
> > >
> > > Cheers
> > >
> > > [1] http://alexbenedetti.blogspot.co.uk/2015/07/exploring-solr-internals-lucene.html
> > >
> > > On 9 March 2016 at 15:02, elisabeth benoit <elisaelisaelisa@gmail.com>
> > > wrote:
> > >
> > > > Hello Alessandro,
> > > >
> > > > You may be right. What would you use to keep the relative order between,
> > > > for instance, the grams
> > > >
> > > > __a
> > > > _am
> > > > ams
> > > > mst
> > > > ste
> > > > ter
> > > > erd
> > > > rda
> > > > dam
> > > > am_
> > > >
> > > > of amsterdam? pf2 and pf3? That's all I can think of. Please let me
> > > > know if you have more insights.
> > > >
> > > > Best regards,
> > > > Elisabeth
> > > >
> > > > 2016-03-08 17:46 GMT+01:00 Alessandro Benedetti <abenedetti@apache.org>:
> > > >
> > > > > Elisabeth,
> > > > > out of curiosity, could we know what you are trying to solve with
> > > > > that complex way of tokenisation?
> > > > > Solr is really good at storing positions along with tokens, so I am
> > > > > curious to know why you are mixing things up.
> > > > >
> > > > > Cheers
> > > > >
> > > > > On 8 March 2016 at 10:08, elisabeth benoit <elisaelisaelisa@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thanks for your answer Emir,
> > > > > >
> > > > > > I'll check that out.
> > > > > >
> > > > > > Best regards,
> > > > > > Elisabeth
> > > > > >
> > > > > > 2016-03-08 10:24 GMT+01:00 Emir Arnautovic <emir.arnautovic@sematext.com>:
> > > > > >
> > > > > > > Hi Elisabeth,
> > > > > > > I don't think there is such a token filter, so you would have to
> > > > > > > create your own token filter that takes a token and emits n-gram
> > > > > > > tokens of a specific length. It should not be too hard to create
> > > > > > > such a filter. You can take a look at how the ngram filter is
> > > > > > > coded; yours should be simpler than that.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Emir
> > > > > > >
> > > > > > >
> > > > > > > On 08.03.2016 08:52, elisabeth benoit wrote:
> > > > > > >
> > > > > > >> Hello,
> > > > > > >>
> > > > > > >> I'm using Solr 4.10.1. I'd like to index words as n-grams of fixed
> > > > > > >> length, with a position at the end.
> > > > > > >>
> > > > > > >> For instance, with fixed length 3, Amsterdam would be something like:
> > > > > > >>
> > > > > > >>
> > > > > > >> a0 (two spaces added at beginning)
> > > > > > >> am1
> > > > > > >> ams2
> > > > > > >> mst3
> > > > > > >> ste4
> > > > > > >> ter5
> > > > > > >> erd6
> > > > > > >> rda7
> > > > > > >> dam8
> > > > > > >> am9 (one more space in the end)
> > > > > > >>
> > > > > > >> The number at the end being the position.
> > > > > > >>
> > > > > > >> Does anyone have a clue how to achieve this?
> > > > > > >>
> > > > > > >> Best regards,
> > > > > > >> Elisabeth
> > > > > > >>
> > > > > > >>
> > > > > > > --
> > > > > > > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > > > > > Solr & Elasticsearch Support * http://sematext.com/
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > --------------------------
> > > > >
> > > > > Benedetti Alessandro
> > > > > Visiting card : http://about.me/alessandro_benedetti
> > > > >
> > > > > "Tyger, tyger burning bright
> > > > > In the forests of the night,
> > > > > What immortal hand or eye
> > > > > Could frame thy fearful symmetry?"
> > > > >
> > > > > William Blake - Songs of Experience -1794 England
> > > > >
> > > >
> > >
> > >
> > >
> > >
> >
>
>
>
>
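
The positional trigram idea from the original question can be sketched outside
Solr as follows. This is only an illustration of the scheme described in the
thread, not a Solr token filter: the function names positional_trigrams and
positional_score, the underscore padding character, and the window parameter
are all assumptions made for this example. Each word is padded with two
underscores at the start and one at the end, every trigram is paired with its
offset, and a query trigram counts as a hit only if the candidate word contains
the same trigram at a nearby offset.

```python
from collections import defaultdict


def positional_trigrams(word, n=3):
    """Return (gram, position) pairs for a padded, lower-cased word."""
    padded = "_" * (n - 1) + word.lower() + "_"
    return [(padded[i:i + n], i) for i in range(len(padded) - n + 1)]


def positional_score(query, candidate, window=2, n=3):
    """Fraction of query grams found in candidate within +/- window positions."""
    positions = defaultdict(list)
    for gram, pos in positional_trigrams(candidate, n):
        positions[gram].append(pos)

    query_grams = positional_trigrams(query, n)
    hits = sum(
        1
        for gram, pos in query_grams
        if any(abs(pos - p) <= window for p in positions.get(gram, ()))
    )
    return hits / len(query_grams)


# Same grams and positions as listed in the thread: __a at 0 ... am_ at 9.
print(positional_trigrams("Amsterdam"))

# One swapped letter pair still shares most grams at nearby positions.
print(positional_score("amstredam", "amsterdam"))

# "ams" occurs at the end of "dreams", too far from its position in the query.
print(positional_score("amsterdam", "dreams"))
```

The window is what separates this from a plain bag of trigrams: with a very
large window, "dreams" would pick up a hit for the shared gram ams, while a
small window rejects it because the gram sits at a different place in the word.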
