lucenenet-user mailing list archives

From Alex Davidson <alex.david...@bluewire-technologies.com>
Subject RE: Avoid letter searches
Date Thu, 29 Dec 2016 19:41:40 GMT
Hi,

If you want to be able to efficiently search for documents containing terms matching a given
prefix, n-grams are what you need. Generally speaking, you want your query to reference as
few terms as possible in order to reduce the number of bitsets which need to be loaded and
processed.

In the system I helped develop, for the term ‘index’ we also generate ‘i’, ‘in’,
‘ind’, and ‘inde’ when indexing the document. This means that the query becomes a
lot simpler and faster (only needs to look up a single term), but you lose the ability to
force whole-word matches on the field. The index may also be considerably larger, depending
on the type of data being indexed.
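
A minimal sketch of that indexing side, if you don't want to write your own filter: the contrib EdgeNGramTokenFilter does the prefix generation. This is not the filter we actually used, and the namespaces assume the Lucene.Net.Contrib.Analyzers 3.0.3 package:

    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.NGram;

    public class PrefixAnalyzer : Analyzer
    {
        public override TokenStream TokenStream(string fieldName, TextReader reader)
        {
            TokenStream t = new WhitespaceTokenizer(reader);
            t = new LowerCaseFilter(t);
            // 'index' -> 'i', 'in', 'ind', 'inde', 'index'
            // (depending on the port version, Side may be the nested
            // EdgeNGramTokenFilter.Side enum)
            t = new EdgeNGramTokenFilter(t, Side.FRONT, 1, 20);
            return t;
        }
    }

A prefix search then only needs a single TermQuery (e.g. new TermQuery(new Term(field, "ind"))) instead of a PrefixQuery that expands to every matching term.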

By adding a payload to indicate the number of letters removed from the original term (e.g.
2 for the ‘ind’ prefix of ‘index’), you can implement a custom scorer which down-weights
matches on short prefixes of long terms, so that shorter words (closer matches) are preferred.
We did some additional tweaking to the scoring so that we don’t need to store payloads for
every prefix, which reduces index size and improves search speed. In practice, search
time is related to the total number of matches more than anything else, so consider e.g. skipping
single-letter prefixes altogether.
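
As a rough illustration of the scoring side (a simplified sketch, not our production scorer, and I'm assuming Lucene.NET 3.0's Similarity.ScorePayload signature here): write the prefix length as a one-byte payload at indexing time, search with a PayloadTermQuery, and override ScorePayload to apply the penalty:

    // Down-weights a match in proportion to how many letters the matched
    // prefix dropped from the original term. Assumes the payload is a
    // single byte written by the indexing-side filter.
    public class PrefixPenaltySimilarity : DefaultSimilarity
    {
        public override float ScorePayload(int docId, string fieldName,
            int start, int end, byte[] payload, int offset, int length)
        {
            if (payload == null || length == 0)
                return 1.0f; // no payload stored: treat as a whole-word match

            int lettersRemoved = payload[offset];
            return 1.0f / (1.0f + lettersRemoved); // 'index' via 'ind' scores 1/3
        }
    }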

You might also look into using Solr instead of coding against Lucene directly, since it
probably already has features for this.

Alex Davidson
Bluewire Technologies

From: Koga, Diego
Sent: 29 December 2016 18:14
To: user@lucenenet.apache.org
Subject: Re: Avoid letter searches

LOL....

Let's go back. This is the first time I am using Lucene, and this is an
old project that I am refactoring precisely because the way it was done
was really odd.

It might still have a lot of things wrong. The problem is that I don't
know the correct way to do much of this; I found some good practices by
googling, but could not find much.

I really appreciate any insights that you guys could give me.

I did not know the answer to your question either, Itamar, and I was
checking with my co-workers. The idea is to search like a regular SQL
SELECT (WHERE field LIKE 'wor%'); it used to use two wildcards
(WHERE field LIKE '%wor%') in order to get results like
"word" and "sword".

Well, talking about results we also need to talk about analyzers.

I created a custom analyzer because they only want to tokenize on
whitespace and hyphens, and all tokens should be lowercased.

    public class CustomAnalyzer : Analyzer
    {
        public override TokenStream TokenStream(string fieldName, TextReader reader)
        {
            // Split on whitespace and hyphens, then lowercase every token.
            TokenStream t = new CustomTokenizer(reader);
            t = new LowerCaseFilter(t);

            return t;
        }
    }


    public class CustomTokenizer : WhitespaceTokenizer
    {
        public CustomTokenizer(TextReader @in) : base(@in)
        {
        }

        public CustomTokenizer(AttributeSource source, TextReader @in)
            : base(source, @in)
        {
        }

        public CustomTokenizer(AttributeFactory factory, TextReader @in)
            : base(factory, @in)
        {
        }

        protected override bool IsTokenChar(char c)
        {
            // Treat '-' as a token separator in addition to whitespace.
            return c != '-' && base.IsTokenChar(c);
        }
    }
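
For reference, a hypothetical variant of the analyzer that would also drop single-letter tokens using the LengthFilter Itamar linked below (the bounds 2 and int.MaxValue are just an example):

    public class CustomAnalyzer : Analyzer
    {
        public override TokenStream TokenStream(string fieldName, TextReader reader)
        {
            TokenStream t = new CustomTokenizer(reader);
            t = new LowerCaseFilter(t);
            // Discard tokens shorter than 2 characters so stray single
            // letters never reach the index.
            t = new LengthFilter(t, 2, int.MaxValue);
            return t;
        }
    }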


What do you guys say?

Thanks in advance




Att.,
------------------
Koga, Diego


On Thu, Dec 29, 2016 at 11:58 AM, Itamar Syn-Hershko <itamar@code972.com> wrote:
> Diego, what are you trying to do? It looks like you are using Lucene in an
> incorrect way. You shouldn't be using wildcards all around.
>
> --
>
> Itamar Syn-Hershko
> Freelance Developer & Consultant
> Elasticsearch Consulting Partner
> Microsoft MVP | Lucene.NET PMC
> http://code972.com | @synhershko <https://twitter.com/synhershko>
> http://BigDataBoutique.co.il/
>
> On Thu, Dec 29, 2016 at 6:49 PM, Koga, Diego <dikoga@gmail.com> wrote:
>
>> But that is to avoid indexing, isn't it?
>>
>> If so, I'll still have the problem because my searches use the
>> wildcard "*" at the end.
>>
>> Or does it filter also the query when it parses:
>>
>>             var parser = new MultiFieldQueryParser(
>>                 Lucene.Net.Util.Version.LUCENE_30, fieldsToSearch, _analyzer);
>>
>>             keyword = keyword.Replace("-", " ");
>>
>>             keyword = QueryParser.Escape(keyword);
>>
>>             var main = parser.Parse(string.Join(" ",
>>                 keyword.Trim().Split(' ')
>>                     .Where(x => !string.IsNullOrEmpty(x))
>>                     .Select(x => x.Trim() == "*" ? x.Trim() : x.Trim() + "*")));
>>
>>
>>
>>
>> Att.,
>> ------------------
>> Koga, Diego
>>
>>
>> On Thu, Dec 29, 2016 at 11:26 AM, Itamar Syn-Hershko <itamar@code972.com>
>> wrote:
>> > Yes,
>> > https://lucene.apache.org/core/4_5_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html
>> >
>> > https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Core/Analysis/LengthFilter.cs
>> >
>> > --
>> >
>> > Itamar Syn-Hershko
>> > http://code972.com | @synhershko <https://twitter.com/synhershko>
>> > Freelance Developer & Consultant
>> > Lucene.NET committer and PMC member
>> >
>> > On Thu, Dec 29, 2016 at 6:17 PM, Koga, Diego <dikoga@gmail.com> wrote:
>> >
>> >> Guys,
>> >>
>> >> I am facing an issue when the search is single letters like: a b c d e f g.
>> >>
>> >> These letters appear everywhere, which causes a high amount of
>> >> processing, and the matches don't mean anything in the end.
>> >>
>> >> Is there any way to avoid this other than splitting on spaces and
>> >> checking the length of each string?
>> >>
>> >>
>> >> Thanks,
>> >>
>> >>
>> >> Att.,
>> >> ------------------
>> >> Koga, Diego
>> >>
>>

