lucene-java-user mailing list archives

Subject Memo: Re: Asian languages
Date Thu, 27 May 2004 08:05:52 GMT

Hi Christophe,

We're currently indexing Chinese pages with little difficulty. You can use
the standard analyzer to index the documents; it will tokenize the content
into individual characters. If you want a list of 'stop' words, you will
need to create your own analyzer and supply it with the list of Unicode
characters to stop. We index HTML pages using a spider to traverse the site,
and have subclassed Document into HTML_Document. This lets us set the
content encoding for the input stream reader, which enables it to process
the Unicode characters correctly. We need this because our system default
is ISO-8859-1, in common with most Western machines. You may need to do
this too.
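
To make the two points above concrete, here is a minimal sketch in plain Java. It is not our HTML_Document code and does not use the Lucene API: the class name CjkIndexingSketch and the methods decode and tokenize are hypothetical names for illustration only. It assumes the page bytes are UTF-8 and, as a simplification, emits only ideographic characters as tokens.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not Lucene code and not the HTML_Document class.
final class CjkIndexingSketch {

    // Decode a byte stream with an explicit charset rather than the
    // platform default (ISO-8859-1 on many Western machines).
    static String decode(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Emit each ideograph as its own token and drop any code point listed
    // in stopChars. Non-ideographic characters are skipped (a simplification).
    static List<String> tokenize(String text, Set<Integer> stopChars) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .filter(Character::isIdeographic)
            .filter(cp -> !stopChars.contains(cp))
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    public static void main(String[] args) {
        // Four Chinese characters, written as escapes so the source file
        // encoding does not matter; U+7684 is a very common particle.
        byte[] page = "\u6211\u7684\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);

        Set<Integer> stopChars = new HashSet<>();
        stopChars.add(0x7684);

        String good = decode(new ByteArrayInputStream(page), StandardCharsets.UTF_8);
        String bad = decode(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1);

        System.out.println(tokenize(good, stopChars).size());
        System.out.println(tokenize(bad, stopChars).size());
    }
}
```

Decoding the same bytes as ISO-8859-1 produces only characters in the range U+0000 to U+00FF, so no ideographs survive and nothing useful would reach the index. With the correct charset, the sketch yields one token per character minus the stop character.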

Hope this helps


"Christophe Lombart" on 26 May 2004 19:16

Please respond to "Lucene Users List"

To: "Lucene Users List"

Subject: Asian languages

Which Asian languages are supported by Lucene?
What about Korean, Japanese, Thai, ...?
If they are not yet supported, what do I need to do?


To unsubscribe, e-mail:
For additional commands, e-mail:



