lucene-java-user mailing list archives

From Jerome Lanneluc <jerome_lanne...@fr.ibm.com>
Subject Re: Japanese analyzer
Date Fri, 18 Jan 2013 15:17:18 GMT
Thanks Dawid, that was it. I'm now using an empty stoptags set and I'm 
seeing all the expected tokens.

Jerome



From:   Dawid Weiss <dawid.weiss@gmail.com>
To:     java-user@lucene.apache.org, 
Date:   01/18/2013 02:52 PM
Subject:        Re: Japanese analyzer



Jerome,

Perhaps some of the tokens are removed because their part-of-speech tags
are in the stoptags file? That's my guess at least. You can always copy
and paste the Japanese analyzer and change the token stream components:

  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new JapaneseTokenizer(reader, userDict, true, mode);
    TokenStream stream = new JapaneseBaseFormFilter(tokenizer);
    // This is the thing I was talking about: it drops tokens whose
    // part-of-speech tag is in stoptags.
    stream = new JapanesePartOfSpeechStopFilter(true, stream, stoptags);
    stream = new CJKWidthFilter(stream);
    stream = new StopFilter(matchVersion, stream, stopwords);
    stream = new JapaneseKatakanaStemFilter(stream);
    stream = new LowerCaseFilter(matchVersion, stream);
    return new TokenStreamComponents(tokenizer, stream);
  }
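
To illustrate the effect of the stoptags set, here is a minimal stand-alone sketch. It does not use Lucene at all: the class PosStopFilterSketch, the TaggedToken type, the filter method, and the sample terms and tags are all hypothetical names made up for this example, and the tags are placeholders rather than real Kuromoji part-of-speech tags. It only models the idea that a token is dropped when its tag is in the set, so an empty set lets every token through.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-in for part-of-speech stop filtering; this is NOT Lucene code.
class PosStopFilterSketch {

    // A term paired with its part-of-speech tag (hypothetical type for this sketch).
    static final class TaggedToken {
        final String term;
        final String posTag;

        TaggedToken(String term, String posTag) {
            this.term = term;
            this.posTag = posTag;
        }
    }

    // Keep only the terms whose part-of-speech tag is not in stoptags.
    static List<String> filter(List<TaggedToken> tokens, Set<String> stoptags) {
        List<String> kept = new ArrayList<String>();
        for (TaggedToken token : tokens) {
            if (!stoptags.contains(token.posTag)) {
                kept.add(token.term);
            }
        }
        return kept;
    }

    // Made-up sample input; terms and tags are placeholders, not analyzer output.
    static List<TaggedToken> sampleTokens() {
        return Arrays.asList(
            new TaggedToken("watashi", "pronoun"),
            new TaggedToken("wa", "particle"),
            new TaggedToken("nihonjin", "noun"),
            new TaggedToken("desu", "auxiliary-verb"));
    }

    public static void main(String[] args) {
        Set<String> someStoptags =
            new HashSet<String>(Arrays.asList("particle", "auxiliary-verb"));
        // With stop tags configured, the tagged particles and auxiliaries disappear.
        System.out.println(filter(sampleTokens(), someStoptags));
        // With an empty set, every token survives this stage.
        System.out.println(filter(sampleTokens(), Collections.<String>emptySet()));
    }
}
```

In the real analyzer the other filters in the chain (for example the StopFilter on stopwords) can still remove tokens, so an empty stoptags set only disables this one stage.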

Dawid

On Fri, Jan 18, 2013 at 2:46 PM, Jerome Lanneluc
<jerome_lanneluc@fr.ibm.com> wrote:
> Thanks for your answer.
>
> No, those words are not part of the stop word file (I'm using the one that
> comes with the Japanese analyzer in lucene-kuromoji-3.6.1.jar).
>
> My Japanese contact told me that the first sentence means "I am Japanese"
> and the second one is a unit of length.
>
> Jerome
>
>
>
> From:   Swapnil Patil <ping.swapnil@gmail.com>
> To:     java-user@lucene.apache.org,
> Date:   01/18/2013 02:33 PM
> Subject:        Re: Japanese analyzer
>
>
>
> Hi,
>
> I just translated these words, using google translate look like Japanese
> I [
> Can you check whether these words are in your stop word file?
> If these words exist in your stop word file, then you will not get them in
> the token stream.
>
> -Swapnil
>
> On Fri, Jan 18, 2013 at 6:58 PM, Jerome Lanneluc
> <jerome_lanneluc@fr.ibm.com> wrote:
>
>> [˽ ձ
>
>
>
> Sauf indication contraire ci-dessus:/ Unless stated otherwise above:
> Compagnie IBM France
> Siège Social : 17 avenue de l'Europe, 92275 Bois-Colombes Cedex
> RCS Nanterre 552 118 465
> Forme Sociale : S.A.S.
> Capital Social : 653.242.306,20 
> SIREN/SIRET : 552 118 465 03644 - Code NAF 6202A
