lucenenet-user mailing list archives

From "Koga, Diego" <dik...@gmail.com>
Subject Re: Avoid letter searches
Date Thu, 29 Dec 2016 18:14:10 GMT
LOL....

Let's step back. This is my first time using Lucene, and this is an
old project that I am refactoring precisely because the way it was
written was really odd.

It probably still has a lot of things wrong. The problem is that I
don't know much about the correct way. I picked up some good practices
by googling, but could not find much.

I really appreciate any insights that you guys could give me.

I did not know the answer to your question either, Itamar, so I was
checking here with my co-workers. The idea is to search like a
regular SQL select (where field like 'wor%'). It used to use two
wildcards (where field like '%wor%') in order to get results such as
"word" and "sword".

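A minimal sketch of that prefix-style search on the query side, assuming
single-letter terms should simply be dropped before the trailing "*" is
appended. KeywordQueryBuilder, BuildPrefixQuery and the minLength
threshold are hypothetical names used for illustration only; nothing
here is part of Lucene.NET.

```csharp
using System;
using System.Linq;

public static class KeywordQueryBuilder
{
    // Hypothetical helper: turns raw user input into a prefix-style
    // query string ("wor" -> "wor*") and skips very short terms.
    public static string BuildPrefixQuery(string keyword, int minLength = 2)
    {
        if (string.IsNullOrWhiteSpace(keyword))
        {
            return string.Empty;
        }

        var terms = keyword
            .Replace('-', ' ')                   // treat hyphens as separators
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(t => t.Length >= minLength)   // drop "a", "b", "c", ...
            .Select(t => t + "*");               // trailing wildcard, like 'wor%'

        return string.Join(" ", terms);
    }
}
```

With the default threshold, BuildPrefixQuery("a b wor sword") returns
"wor* sword*". Each term would still need QueryParser.Escape applied
before the "*" is appended, and the result can be empty when every term
is too short, so the caller has to handle that case.
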
Well, talking about results we also need to talk about analyzers.

I created a custom analyzer because they want to tokenize only on
whitespace and hyphens, and all tokens should be lowercase.

    // Splits on whitespace and hyphens, then lowercases every token.
    public class CustomAnalyzer : Analyzer
    {
        public override TokenStream TokenStream(string fieldName, TextReader reader)
        {
            TokenStream t = new CustomTokenizer(reader);
            t = new LowerCaseFilter(t);

            return t;
        }
    }


    public class CustomTokenizer : WhitespaceTokenizer
    {
        public CustomTokenizer(TextReader @in) : base(@in)
        {
        }

        public CustomTokenizer(AttributeSource source, TextReader @in)
            : base(source, @in)
        {
        }

        public CustomTokenizer(AttributeFactory factory, TextReader @in)
            : base(factory, @in)
        {
        }

        // A hyphen ends the current token, in addition to the whitespace
        // characters that WhitespaceTokenizer already splits on.
        protected override bool IsTokenChar(char c)
        {
            if (c == '-')
            {
                return false;
            }

            return base.IsTokenChar(c);
        }
    }
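
A minimal sketch of how the LengthFilter suggested earlier in the thread
could be added to this chain. It assumes Lucene.NET 3.0.3, where
LengthFilter lives in Lucene.Net.Analysis and takes (TokenStream, min,
max), and it reuses the CustomTokenizer defined above. The class name
CustomLengthAnalyzer and both length limits are assumptions chosen for
illustration.

```csharp
using System.IO;
using Lucene.Net.Analysis;

// Sketch only: same chain as CustomAnalyzer plus a length filter.
public class CustomLengthAnalyzer : Analyzer
{
    private const int MinTokenLength = 2;   // assumed: drops single letters
    private const int MaxTokenLength = 255; // assumed upper bound

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        TokenStream stream = new CustomTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        // Discard tokens shorter than MinTokenLength or longer than MaxTokenLength.
        stream = new LengthFilter(stream, MinTokenLength, MaxTokenLength);
        return stream;
    }
}
```

Note that the classic QueryParser does not run wildcard or prefix terms
such as "a*" through the analyzer, so this filter only affects what gets
indexed and plain (non-wildcard) query terms. A length check on the
query string would still be needed for searches that append "*".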


What do you guys say?

Thanks in advance




Att.,
------------------
Koga, Diego


On Thu, Dec 29, 2016 at 11:58 AM, Itamar Syn-Hershko <itamar@code972.com> wrote:
> Diego, what are you trying to do? It looks like you are using Lucene in an
> incorrect way. You shouldn't be using wildcards all around.
>
> --
>
> Itamar Syn-Hershko
> Freelance Developer & Consultant
> Elasticsearch Consulting Partner
> Microsoft MVP | Lucene.NET PMC
> http://code972.com | @synhershko <https://twitter.com/synhershko>
> http://BigDataBoutique.co.il/
>
> On Thu, Dec 29, 2016 at 6:49 PM, Koga, Diego <dikoga@gmail.com> wrote:
>
>> But that is to avoid indexing, isn't it?
>>
>> If so, I'll still have the problem because my searches use the
>> wildcard "*" at the end.
>>
>> Or does it also filter the query when it parses:
>>
>>             var parser = new MultiFieldQueryParser(
>>                 Lucene.Net.Util.Version.LUCENE_30, fieldsToSearch, _analyzer);
>>
>>             keyword = keyword.Replace("-", " ");
>>
>>             keyword = QueryParser.Escape(keyword);
>>
>>             var main = parser.Parse(string.Join(" ",
>>                 keyword.Trim().Split(' ')
>>                     .Where(x => !string.IsNullOrEmpty(x))
>>                     .Select(x => x.Trim() == "*" ? x.Trim() : x.Trim() + "*")));
>>
>>
>>
>>
>> Att.,
>> ------------------
>> Koga, Diego
>>
>>
>> On Thu, Dec 29, 2016 at 11:26 AM, Itamar Syn-Hershko <itamar@code972.com>
>> wrote:
>> > Yes,
>> > https://lucene.apache.org/core/4_5_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html
>> >
>> > https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Core/Analysis/LengthFilter.cs
>> >
>> > --
>> >
>> > Itamar Syn-Hershko
>> > http://code972.com | @synhershko <https://twitter.com/synhershko>
>> > Freelance Developer & Consultant
>> > Lucene.NET committer and PMC member
>> >
>> > On Thu, Dec 29, 2016 at 6:17 PM, Koga, Diego <dikoga@gmail.com> wrote:
>> >
>> >> Guys,
>> >>
>> >> I am facing an issue if the search is letters like: a b c d e f g.
>> >>
>> >> These letters are everywhere which causes high amounts of processing
>> >> and does not mean anything at the end.
>> >>
>> >> Is there any way to avoid it other than split by spaces and check the
>> >> length of the string?
>> >>
>> >>
>> >> Thanks,
>> >>
>> >>
>> >> Att.,
>> >> ------------------
>> >> Koga, Diego
>> >>
>>
