lucene-solr-user mailing list archives

From Susheel Kumar <susheel2...@gmail.com>
Subject Re: catch alls and nuances
Date Wed, 03 Feb 2016 00:12:11 GMT
Hi John - You can take a closer look at the different options for
WordDelimiterFilterFactory at
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
to see if they meet your requirements, and use the Analysis tab in the
Solr Admin UI. If you still have questions, share the exact search
requirement(s) you are trying to meet and what your current analysis for
text_general looks like; then perhaps someone can suggest something or
help you out.
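
For reference, a minimal fieldType using that filter might look like the
sketch below (illustrative only; the attribute values here are
assumptions, not your actual schema):

  <fieldType name="text_custom" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1"
              catenateWords="1" catenateNumbers="1"
              splitOnCaseChange="1" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>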

Thnx
Susheel

On Tue, Feb 2, 2016 at 5:21 PM, Erick Erickson <erickerickson@gmail.com>
wrote:

> bq: Have now begun writing my own.
>
> I hope by that you mean defining your own <fieldType>,
> at least until you're sure that none of the zillion things
> you can do with an analysis chain suits your needs.
>
> If you haven't already, look _seriously_ at the admin/analysis
> page (you have to choose a core to have it available). Fuzzy
> matching won't help you with the 1234-LT example at all.
>
> BTW, you (perhaps unintentionally) changed the problem:
> 1234LT as input is vastly different from 1234-LT. The latter
> will be made into two tokens by some tokenizers, whereas
> 1234LT is always passed through the tokenizers as a single
> "word", _then_ broken up by WordDelimiterFilterFactory if
> it's a filter in the analysis chain.
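>
> To make that concrete, here is roughly what happens (assuming
> StandardTokenizer vs. WhitespaceTokenizer, with
> WordDelimiterFilterFactory set to split on letter/number transitions;
> your actual settings may differ):
>
>   StandardTokenizer:   "1234-LT" -> "1234", "LT" (split at the hyphen)
>   WhitespaceTokenizer: "1234-LT" -> "1234-LT"    (WDF splits it later)
>   either tokenizer:    "1234LT"  -> "1234LT"     (WDF then splits it at
>                                                   the letter/digit change)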
>
> Do note that when I use "tokenizer" I'm referring to the
> specific class that breaks the incoming stream up. The
> simplest example is WhitespaceTokenizer, which... you
> guessed it, breaks up the stream on whitespace.
>
> Once something gets through the one and only tokenizer
> in an analysis chain, each token passes through 0
> or more "Filters", and WordDelimiterFilterFactory is
> one of these.
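>
> Schematically, purely as an illustration:
>
>   raw text -> tokenizer (exactly one) -> filter -> filter -> ...
>               e.g. WhitespaceTokenizer   e.g. WordDelimiterFilterFactory,
>                                          LowerCaseFilterFactory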
>
> Pardon me for being somewhat pedantic here, but unless the
> analysis chain is understood, you'll go through endless
> thrashing. This is where the admin/analysis page is
> invaluable.
>
> Best,
> Erick
>
> On Tue, Feb 2, 2016 at 12:49 PM, John Blythe <john@curvolabs.com> wrote:
> > I had been using text_general at the time of my email's writing. Have
> > tried a couple of the other stock ones (text_en, text_en_splitting,
> > _tight). Have now begun writing my own. I began to wonder if simply
> > doing one of the above, such as text_general, with a fuzzy distance
> > (probably just ~1) would be best suited. Another example would be an
> > indexed value of "Phasaix" (which is a typo in the original data) being
> > searched for with the correct spelling of "Phasix" and returning
> > nothing. Adding ~1 in that case works. For some reason it doesn't in
> > the case of the 1234-L and 1234-LT example.
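> >
> > (For concreteness, against the catch-all "text" field from my original
> > message below, the working query is text:Phasix~1; it matches the
> > indexed "Phasaix" because the two are one edit apart, the extra 'a'.)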
> >
> > Thanks for any insight-
> >
> > --
> > *John Blythe*
> > Product Manager & Lead Developer
> >
> > 251.605.3071 | john@curvolabs.com
> > www.curvolabs.com
> >
> > 58 Adams Ave
> > Evansville, IN 47713
> >
> > On Mon, Feb 1, 2016 at 3:30 PM, Erick Erickson <erickerickson@gmail.com>
> > wrote:
> >
> >> Likely you also have WordDelimiterFilterFactory in
> >> your fieldType, that's what will split on alphanumeric
> >> transitions.
> >>
> >> So you should be able to use wildcards here, i.e. 1234L*
> >>
> >> However, that'll only work if you have preserveOriginal set in
> >> WordDelimiterFilterFactory in your indexing chain.
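> >>
> >> As an illustration (assuming your chain also lowercases, so tokens are
> >> indexed in lower case; adjust to your actual filters):
> >>
> >>   "1234LT" indexed with preserveOriginal="1" -> tokens 1234, lt, 1234lt
> >>   the query 1234L* is lowercased to 1234l* and matches 1234lt
> >>   without preserveOriginal the tokens are just 1234 and lt,
> >>   and 1234l* matches nothing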
> >>
> >> And just to make life "interesting", there are some peculiarities
> >> with parsing wildcards at query time, so be sure to see the
> >> admin/analysis page....
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Feb 1, 2016 at 12:20 PM, John Blythe <john@curvolabs.com>
> >> wrote:
> >> > Hi there
> >> >
> >> > I have a catch-all field called 'text' that I copy my item
> >> > description, manufacturer name, and the item's catalog number into.
> >> > I'm having an issue with keeping the broadness of the tokenizers in
> >> > place whilst still allowing some good precision in the case of very
> >> > specific queries.
> >> >
> >> > The results are generally good. But, for instance, the products
> >> > named 1234L and 1234LT aren't behaving how I would like. If I search
> >> > 1234 they both show. If I search 1234L only the first one is
> >> > returned. I'm guessing this is due to the splitting of the numeric
> >> > and string portions. The "1234" and the "L" both hit in the first
> >> > case ("1234" and "L") but the L is of no value in the "1234" and
> >> > "LT" indexed item.
> >> >
> >> > What is the best way around this so that a small Levenshtein
> >> > distance, for instance, is picked up?
> >>
>
