lucene-java-user mailing list archives

From "380382856@qq.com" <380382...@qq.com>
Subject Re: Re: any analyzer will keep punctuation?
Date Wed, 08 Mar 2017 06:30:12 GMT
I think Ahmet is right. The whitespace tokenizer will split the document into tokens, and
a custom filter applied after it can then remove whichever punctuation you do not want to keep.
Implementing a custom filter is not very difficult.
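
To make the two steps concrete, here is a minimal sketch in plain Java. It deliberately
does not use the Lucene API; the class name, the method names and the "keep" set are all
made up for illustration. In a real setup the second step would live in a TokenFilter
subclass, or be handled by the word delimiter filter Ahmet mentioned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: plain JDK code, not Lucene classes.
public class PunctuationKeepingAnalyzerSketch {

    // Step 1: split on whitespace only, so punctuation survives tokenization.
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 2: keep letters, digits and the configured punctuation; drop the rest.
    static String stripPunctuation(String token, Set<Character> keep) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetterOrDigit(c) || keep.contains(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Tokenize, filter, lowercase, and skip tokens that end up empty.
    static List<String> analyze(String text, Set<Character> keep) {
        List<String> out = new ArrayList<>();
        for (String token : whitespaceTokenize(text)) {
            String cleaned = stripPunctuation(token, keep).toLowerCase(Locale.ROOT);
            if (!cleaned.isEmpty()) {
                out.add(cleaned);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<Character> keep = Set.of('!', '$');
        // prints [p!nk, ke$ha, live]
        System.out.println(analyze("P!nk, Ke$ha (live).", keep));
    }
}
```

Note that this sketch only deletes unwanted punctuation and never splits a token on it,
so "foo-bar" becomes "foobar" rather than "foo" and "bar". If you need the splitting
behaviour of the standard analyzer, the filter has to emit several tokens per input token.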



380382856@qq.com
 
From: Yonghui Zhao
Sent: 2017-03-08 12:22
To: Ahmet Arslan
Cc: java-user@lucene.apache.org
Subject: Re: any analyzer will keep punctuation?
Hi Ahmet,
 
Thanks for your reply, but I didn't quite get your idea.
I want an analyzer like the standard analyzer, but with customizable
punctuation handling.
I think one way is to write a custom analyzer with a custom tokenizer
modeled on StandardTokenizer.
For my tokenizer I would have to rewrite StandardTokenizerImpl, which seems
a little complicated.
I don't understand how "a customised word delimiter filter factory" works
with the tokenizer.
 
 
2017-03-06 22:26 GMT+08:00 Ahmet Arslan <iorixxx@yahoo.com>:
 
> Hi Zhao,
>
> A whitespace tokeniser followed by a customised word delimiter filter
> factory would be a solution.
> Please see the types attribute of the word delimiter filter for customising
> how individual characters are treated.
>
> ahmet
>
>
>
> On Monday, March 6, 2017 12:22 PM, Yonghui Zhao <zhaoyonghui@gmail.com>
> wrote:
> Yes, the whitespace analyzer will keep punctuation, but it only breaks
> words on spaces.
>
>
> I didn’t explain my requirement clearly.
>
> I want an analyzer like the standard analyzer, but one that can keep some
> configured punctuation.
>
>
> 2017-03-06 18:03 GMT+08:00 Ahmet Arslan <iorixxx@yahoo.com.invalid>:
>
> > Hi,
> >
> > Whitespace analyser/tokenizer for example.
> >
> > Ahmet
> >
> >
> >
> > On Monday, March 6, 2017 10:21 AM, Yonghui Zhao <zhaoyonghui@gmail.com>
> > wrote:
> > Lucene's standard analyzer removes almost all punctuation.
> > In some cases we want to keep some punctuation. For example, in music
> > search, some singer and album names contain punctuation.
> >
> > Is there any analyzer that lets us customize which punctuation is removed?
> >