lucenenet-user mailing list archives

From Gustavo Poll <gkp...@gmail.com>
Subject Re: [Lucene.Net] How to index/search a file name
Date Thu, 08 Sep 2011 17:42:37 GMT
Just to give some feedback, in case anyone is interested:

The ModifiedStandardAnalyzer class seems to work exactly like StandardAnalyzer,
but accent insensitive. The only other difference is in the last character (ß
is folded to "ss"), but it does not belong to the Portuguese alphabet, so I
don't think ignoring it is a problem in my case.

Thanks Digy!

Test results:

(tokenizing the expression  "Name.Surname@gmail.com 123.456 3,5 AT&T João
Avião Calção ğüşıöç%ĞÜŞİÖÇ$ΑΒΓΔΕΖ#АБВГДЕ SSß")

StandardAnalyzer:

[name.surname@gmail.com] [123.456] [3,5] [at&t] [joão] [avião] [calção]
[güsıöç] [güsiöç] [aß?de?] [??????] [ssß]

ModifiedStandardAnalyzer: (accent insensitive)

[name.surname@gmail.com] [123.456] [3,5] [at&t] [joao] [aviao] [calcao]
[gusioc] [gusioc] [aß?de?] [??????] [ssss]
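
For anyone who wants to see the idea behind accent folding outside Lucene.Net, here is a minimal sketch that uses only the .NET base class library. It is not how ASCIIFoldingFilter works internally: it swaps in Unicode decomposition (NormalizationForm.FormD) and drops the combining marks, so characters with no decomposition, such as ß or the dotless ı, pass through unchanged. The names AccentFolding and RemoveDiacritics are made up for illustration.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper for illustration; not part of Lucene.Net.
public static class AccentFolding
{
    // Decomposes each character (e.g. "ã" becomes "a" + combining tilde)
    // and drops the combining marks, leaving only the base letters.
    public static string RemoveDiacritics(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Combining accents are reported as NonSpacingMark.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }

        // Recompose what is left so the result is in a standard form.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Calling AccentFolding.RemoveDiacritics("João Avião Calção") returns "Joao Aviao Calcao". Case is left alone, so lowercasing would still need a separate step, as LowerCaseFilter does in the analyzer chain.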

Thanks,
Gustavo Poll

2011/9/6 Gustavo Poll <gkpoll@gmail.com>

> thanks, I'll do it...
>
> 2011/9/6 Digy <digydigy@gmail.com>
>
>> This can be a starting point (just play around a little with the
>> tokenizers & filters):
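>>
>>    using Lucene.Net.Analysis;
>>    using Lucene.Net.Analysis.Standard;
>>
>>    // Tokenizer, StandardFilter and LowerCaseFilter as in StandardAnalyzer
>>    // (no stop filter here), with ASCII folding added at the end.
>>    public class ModifiedStandardAnalyzer : Analyzer
>>    {
>>        public override TokenStream TokenStream(System.String fieldName, System.IO.TextReader reader)
>>        {
>>            StandardTokenizer tokenStream = new StandardTokenizer(reader, true);
>>            TokenStream result = new StandardFilter(tokenStream);
>>            result = new LowerCaseFilter(result);
>>            // Fold accented characters to ASCII equivalents (ã -> a, ç -> c).
>>            result = new ASCIIFoldingFilter(result);
>>            return result;
>>        }
>>    }
>>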
>> DIGY
>>
>>
>>
>> -----Original Message-----
>> From: Gustavo Poll [mailto:gkpoll@gmail.com]
>> Sent: Tuesday, September 06, 2011 10:06 PM
>> To: lucene-net-user@lucene.apache.org
>> Subject: Re: [Lucene.Net] How to index/search a file name
>>
>> Thanks again... OK, it is not.
>>
>> StandardAnalyzer:
>>
>> [name.surname@gmail.com] [123.456] [3,5] [at&t] [güsıöç] [güsiöç] [aß?de?]
>> [??????] [ssß]
>>
>> UnaccentedWordAnalyzer:
>>
>> [name] [surname] [gmail] [com] [123] [456] [3] [5] [at] [t] [gusioc]
>> [gusioc] [aß?de?] [??????] [ssss]
>>
>> StandardAnalyzer would be perfect for my application if it were accent
>> insensitive. Can anyone please tell me the easiest way to code such an
>> analyzer (an accent-insensitive StandardAnalyzer)?
>>
>> I have heard it is not a good idea to write a class that inherits from
>> StandardAnalyzer, because StandardAnalyzer is meant to be a final class.
>> Is that correct?
>>
>> I'd appreciate any help.
>>
>> Gustavo Poll
>>
>> 2011/9/6 Digy <digydigy@gmail.com>
>>
>> > A function is worth a thousand words :)
>> >
>> >        void Test()
>> >        {
>> >            Analyzer[] analyzers = new Analyzer[] { new StandardAnalyzer(),
>> >                new Lucene.Net.Analysis.Ext.UnaccentedWordAnalyzer() };
>> >
>> >            string input = "Name.Surname@gmail.com 123.456 3,5 AT&T ğüşıöç%ĞÜŞİÖÇ$ΑΒΓΔΕΖ#АБВГДЕ SSß";
>> >
>> >            foreach (Analyzer analyzer in analyzers)
>> >            {
>> >                TokenStream ts = analyzer.TokenStream("", new StringReader(input));
>> >                Lucene.Net.Analysis.Token t = ts.Next();
>> >                while (t != null)
>> >                {
>> >                    Console.Write("[" + t.TermText() + "] ");
>> >                    t = ts.Next();
>> >                }
>> >                Console.WriteLine(); Console.WriteLine();
>> >            }
>> >        }
>> >
>> > DIGY
>> >
>> > -----Original Message-----
>> > From: Gustavo Poll [mailto:gkpoll@gmail.com]
>> > Sent: Tuesday, September 06, 2011 9:00 PM
>> > To: lucene-net-user@lucene.apache.org
>> > Subject: Re: [Lucene.Net] How to index/search a file name
>> >
>> > Thanks DIGY, I'm interested in that too... Let me see if I understood:
>> >
>> > UnaccentedWordAnalyzer is like StandardAnalyzer, but accent insensitive?
>> >
>> > Thanks!
>> >
>> > Gustavo Poll
>> >
>> > 2011/9/6 digy digy <digydigy@gmail.com>
>> >
>> > > That may help
>> > >
>> > > UnaccentedWordAnalyzer @
>> > > https://svn.apache.org/repos/asf/incubator/lucene.net/trunk/src/contrib/Core/Analysis/Ext/Analysis.Ext.cs
>> > >
>> > > DIGY
>> > >
>> > > On Tue, Sep 6, 2011 at 12:31 PM, Floyd Wu <floyd.wu@gmail.com> wrote:
>> > >
>> > > > Hi everyone,
>> > > >
>> > > > I have a problem that has been bothering me for a while. I need to
>> > > > index file names and make them searchable by partial file name.
>> > > >
>> > > > Example: 2009&2010Q2_ABCD_Report.xls (the file name)
>> > > >
>> > > > When I run these queries:
>> > > >
>> > > > filename:ABCD             no match returned
>> > > > filename:2010Q2_ABCD      match
>> > > > filename:Report*          match
>> > > >
>> > > > I'm using StandardAnalyzer and Lucene.Net version 2.9.3. The filename
>> > > > field is currently set to tokenized/indexed/stored.
>> > > >
>> > > > What I want is for Lucene.Net to match when the user types any part
>> > > > of the file name (a string like 2009 or 2010Q2 or ABCD or Report or
>> > > > xls or Report.xls).
>> > > >
>> > > > Please help with this, or kindly point me to a way to solve it.
>> > > >
>> > > > Floyd
>> > > >
>> >
>> > -----
>> > No virus found in this message.
>> > Checked by AVG - www.avg.com
>> > Version: 2012.0.1796 / Virus Database: 2082/4480 - Release Date: 06.09.2011
>> >
>
