lucene-java-user mailing list archives

From alex.bou...@hsbcam.com
Subject Memo: Re: Asian languages
Date Thu, 27 May 2004 08:44:54 GMT




Sorry Christophe,

I misinformed you. We did NOT subclass Document; we simply created an
HTMLDocument class with methods that return Lucene Documents with the
required fields added, and that is where the content encoding was set.
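To illustrate the encoding part, here is a minimal sketch using only the
JDK, not our actual HTMLDocument code. The class and method names
(EncodingAwareReader, open, decode) are made up for this example. The point
is simply that the reader is built with an explicit charset instead of the
platform default; the resulting Reader would then be handed to whatever code
builds the Lucene Document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: not part of Lucene and not the original HTMLDocument class.
final class EncodingAwareReader {

    // Wraps a byte stream in a Reader that decodes with the given charset
    // rather than the platform default (ISO-8859-1 on the machines discussed here).
    public static Reader open(InputStream in, Charset charset) {
        return new InputStreamReader(in, charset);
    }

    // Reads the whole stream into a String, to make the effect of the charset visible.
    public static String decode(byte[] bytes, Charset charset) {
        try (Reader reader = open(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters, stored as UTF-8 (three bytes each).
        byte[] page = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);

        String right = decode(page, StandardCharsets.UTF_8);
        String wrong = decode(page, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 2: the two original characters
        System.out.println(wrong.length()); // 6: one bogus character per byte
    }
}
```

With the wrong charset every byte becomes its own character, so the indexed
tokens would not match what users type into a search box.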

Alex.




Alex BOURNE/IBEU/HSBC@HSBC on 27 May 2004 09:05

Please respond to "Lucene Users List" <lucene-user@jakarta.apache.org>

To:    "Lucene Users List" <lucene-user@jakarta.apache.org>
cc:
bcc:

Subject:    Re: Asian languages






Hi Christophe,

We're currently indexing Chinese pages with little difficulty. You can use
the standard analyzer to index the documents and it will tokenize the
content into individual characters. If you want a list of 'stop' words,
you will need to create your own analyzer and supply it with a list of
Unicode characters to stop. We are indexing HTML pages using a spider to
traverse the site and have subclassed Document into HTML_Document. This
allows us to set the content encoding for the input stream reader, which
enables it to correctly process the Unicode characters; our system default
is ISO-8859-1, in common with most Western machines. You may need to do
this too.
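As a rough illustration of the stop-character idea, the sketch below uses
plain Java rather than the Lucene analyzer API. CharStopTokenizer and its
tokenize method are hypothetical names, and treating every letter or digit
as its own token is a simplification that only mirrors the per-character
behaviour described above for Chinese text. A real analyzer would apply the
same filtering inside its token stream.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a custom analyzer: one token per character,
// minus any character that appears in the stop set.
final class CharStopTokenizer {

    public static List<String> tokenize(String text, Set<Integer> stopCodePoints) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp); // step over surrogate pairs correctly
            if (!Character.isLetterOrDigit(cp)) {
                continue; // skip whitespace and punctuation
            }
            if (stopCodePoints.contains(cp)) {
                continue; // skip characters on the stop list
            }
            tokens.add(new String(Character.toChars(cp)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Example stop list with a single entry, U+7684, a very common Chinese particle.
        Set<Integer> stops = new HashSet<>();
        stops.add(0x7684);

        // Three characters: U+6211, U+7684, U+4E66.
        List<String> tokens = tokenize("\u6211\u7684\u4e66", stops);
        System.out.println(tokens.size()); // 2: the stop character is dropped
    }
}
```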

Hope this helps

Alex.




"Christophe Lombart" <christophe.lombart@sword-technologies.com> on 26 May
2004 19:16

Please respond to "Lucene Users List" <lucene-user@jakarta.apache.org>

To:    "Lucene Users List" <lucene-user@jakarta.apache.org>
cc:
bcc:

Subject:    Asian languages


Which Asian languages are supported by Lucene?
What about Korean, Japanese, Thai, ...?
If they are not yet supported, what do I need to do?

Thanks,
Christophe

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org



******************************************************************
 This message originated from the Internet. Its originator may or
 may not be who they claim to be and the information contained in
 the message and any attachments may or may not be accurate.
******************************************************************








_____________________________________________________

This transmission has been issued by a member of the HSBC Group
("HSBC") for the information of the addressee only and should not be
reproduced and / or distributed to any other person. Each page
attached hereto must be read in conjunction with any disclaimer which
forms part of it. This transmission is neither an offer nor the solicitation
of an offer to sell or purchase any investment. Its contents are based
on information obtained from sources believed to be reliable but HSBC
makes no representation and accepts no responsibility or liability as to
its completeness or accuracy.

