lucene-java-user mailing list archives

From "Che Dong" <ched...@hotmail.com>
Subject Re: CJK Analyzer indexing japanese word document
Date Wed, 17 Mar 2004 05:53:32 GMT
Please check the Java I/O byte-stream to character-stream conversion (ByteStream ==> CharacterStream); make sure the bytes are decoded with the right charset.
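A minimal sketch of that conversion, assuming the source bytes are Shift-JIS; the file name is only illustrative:

    import java.io.*;

    public class SjisToUnicode {
        public static void main(String[] args) throws IOException {
            // Wrap the raw byte stream in a Reader with an explicit charset,
            // so the Shift-JIS bytes are decoded into Unicode characters.
            Reader in = new InputStreamReader(
                    new FileInputStream("japanese.txt"), "Shift_JIS");
            StringBuffer text = new StringBuffer();
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
            in.close();
            // text.toString() now holds Unicode text ready for the analyzer.
            System.out.println(text.toString());
        }
    }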

Che Dong
----- Original Message ----- 
From: "Chandan Tamrakar" <chandan@ccnep.com.np>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Wednesday, March 17, 2004 12:37 PM
Subject: Re: CJK Analyzer indexing japanese word document


> Thanks, Smith.  How do I convert SJIS-encoded text into Unicode?  As far
> as I know, Java converts ASCII and Latin-1 into Unicode by default.
> Which XML parser are you using to translate to Unicode?
> 
> ----- Original Message -----
> From: "Scott Smith" <ssmith@mainstreamdata.com>
> To: "Lucene Users List" <lucene-user@jakarta.apache.org>
> Sent: Wednesday, March 17, 2004 4:27 AM
> Subject: RE: CJK Analyzer indexing japanese word document
> 
> 
> > I have used this analyzer with Japanese and it works fine.  In fact, I'm
> > currently doing English, several Western European languages, traditional
> > and simplified Chinese, and Japanese.  I throw them all into the same index
> > and have had no problem, other than that my users wanted searches limited
> > by language.  I solved that by adding a keyword field to the Document that
> > holds the 2-letter language code; I then automatically add a term for that
> > language as an additional constraint when the user runs a search.
> >
> > You do need to be sure that the Shift-JIS gets converted to Unicode
> > before you put it in the Document (and pass it to the analyzer).
> > Internally, I believe Lucene wants everything in Unicode (as any good
> > Java program would).  Originally, I had problems with Asian languages and
> > eventually determined my XML parser wasn't translating my Shift-JIS,
> > Big5, etc. to Unicode.  Once I fixed that, life was good.
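> > One way to guarantee that translation with a SAX parser is to decode the
> > bytes yourself and hand the parser a character stream, so a wrong or
> > missing encoding declaration cannot mislead it.  A sketch (the charset
> > and file name are illustrative; it assumes a surrounding method that
> > declares throws Exception):
> >
> >     import java.io.*;
> >     import javax.xml.parsers.SAXParser;
> >     import javax.xml.parsers.SAXParserFactory;
> >     import org.xml.sax.InputSource;
> >     import org.xml.sax.helpers.DefaultHandler;
> >
> >     // Decode explicitly, then let the parser work on characters, not bytes.
> >     Reader xml = new InputStreamReader(new FileInputStream("doc.xml"), "Big5");
> >     SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
> >     parser.parse(new InputSource(xml), new DefaultHandler() { /* your handler */ });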
> >
> > -----Original Message-----
> > From: Che Dong [mailto:chedong@hotmail.com]
> > Sent: Tuesday, March 16, 2004 8:31 AM
> > To: Lucene Users List
> > Subject: Re: CJK Analyzer indexing japanese word document
> >
> > Some Korean friends tell me they use it successfully for Korean, so I
> > think it also works for Japanese.  Usually the problem is the locale and
> > encoding settings.
> >
> > Please check the weblucene project for XML indexing samples:
> > http://sourceforge.net/projects/weblucene/
> >
> > Che Dong
> > ----- Original Message -----
> > From: "Chandan Tamrakar" <chandan@ccnep.com.np>
> > To: <lucene-user@jakarta.apache.org>
> > Sent: Tuesday, March 16, 2004 4:31 PM
> > Subject: CJK Analyzer indexing japanese word document
> >
> >
> > >
> > > I am using the CJKAnalyzer from the Apache sandbox and have set the
> > > Java file.encoding property to SJIS.  I am able to index and search a
> > > Japanese HTML page, and the index dumps look as I expected.  However,
> > > when I index a Word document containing Japanese characters, it is not
> > > indexed as expected.  Do I need to change anything in the CJKTokenizer
> > > and CJKAnalyzer classes?  I have been able to index a Word document
> > > with the StandardAnalyzer.
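> > > (For reference, a minimal sketch of the setup described above, assuming
> > > the sandbox CJKAnalyzer and decoding Shift-JIS explicitly rather than
> > > relying on file.encoding; the paths and file names are illustrative:)
> > >
> > >     import java.io.*;
> > >     import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > >     import org.apache.lucene.document.Document;
> > >     import org.apache.lucene.document.Field;
> > >     import org.apache.lucene.index.IndexWriter;
> > >
> > >     // Open (or create) the index with the CJK analyzer.
> > >     IndexWriter writer = new IndexWriter("/tmp/index", new CJKAnalyzer(), true);
> > >
> > >     // Read the Japanese source as Shift-JIS so it reaches Lucene as Unicode.
> > >     Reader body = new InputStreamReader(new FileInputStream("page.html"), "Shift_JIS");
> > >     Document doc = new Document();
> > >     doc.add(Field.Text("contents", body));
> > >     writer.addDocument(doc);
> > >     writer.close();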
> > >
> > > Thanks in advance,
> > > chandan