james-mime4j-dev mailing list archives

From "Sharma, Ashish" <ashish.shar...@hp.com>
Subject RE: Character corruption with Traditional chinese
Date Tue, 14 Feb 2012 11:53:37 GMT
Oleg,

I am using 'mime4j' as follows:

		// Default parser configuration
		MimeConfig mime4jParserConfig = new MimeConfig();
		BodyDescriptorBuilder bodyDescriptorBuilder = new DefaultBodyDescriptorBuilder();
		MimeStreamParser mime4jParser = new MimeStreamParser(
				mime4jParserConfig, DecodeMonitor.SILENT, bodyDescriptorBuilder);
		// Let mime4j undo the Content-Transfer-Encoding (base64, quoted-printable) of body parts
		mime4jParser.setContentDecoding(true);
		mime4jParser.setContentHandler(contentHandler);

		mime4jParser.parse(rawEmailFile);

		return ((CustomContentHandler) contentHandler).getEmail();

As you can see, I am using the content decoding that mime4j provides for email body
parts.

The contentHandler I am using just listens for the basic events and looks like
this:

	public class CustomContentHandler extends AbstractContentHandler {

		public void field(Field field) throws MimeException {}

		public void body(BodyDescriptor bd, InputStream is) throws MimeException, IOException {
			// setCharset(...) is my own addition to MaximalBodyDescriptor, not stock mime4j
			((MaximalBodyDescriptor) bd).setCharset(getFallbackCharset(bd.getCharset()));
		}

		...

As you hinted, I modified 'MaximalBodyDescriptor' so that my contentHandler can set the
charset on it.

This arrangement solved my character corruption problem.

The problem is that for the above code to work I have to modify mime4j itself, which I
want to avoid.

Can you suggest a workaround?

Thanks
Ashish

-----Original Message-----
From: Oleg Kalnichevski [mailto:olegk@apache.org] 
Sent: Tuesday, February 14, 2012 2:42 AM
To: mime4j-dev@james.apache.org
Subject: RE: Character corruption with Traditional chinese

On Mon, 2012-02-13 at 14:58 +0000, Sharma, Ashish wrote:
> Hi,
> 
> Since I have no control over the email clients sending the mails, kindly suggest suitable
> measures that I can take on my end to mitigate the character corruption.
> 
> I think modifying the charset during email body decoding would work for such cases. Can
> somebody point me to the relevant mime4j API hooks for this idea (and is it feasible at
> all)?
> 
> Thanks
> Ashish
> 

I am not sure I understand the problem you are having. MimeStreamParser
passes an instance of BodyDescriptor for each body part it encounters.
BodyDescriptor contains the charset of the body part (if specified)
among other things. It is up to the individual ContentHandler
implementation to decide whether or not that charset is valid. A
ContentHandler can always choose to decode with a different charset
than the one specified by the BodyDescriptor.
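
A minimal sketch of that idea follows. It uses only the JDK, so it runs without
mime4j on the classpath. The names CharsetFallback, resolve and readBody are
hypothetical, and the gb2312-to-GB18030 mapping is an assumption based on the
sample mail discussed in this thread, not something mime4j prescribes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper: picks the charset used for decoding a body part,
// independently of what the descriptor claims.
final class CharsetFallback {

    /** Maps the declared charset name to the charset actually used for decoding. */
    static Charset resolve(String declared) {
        if (declared == null) {
            return StandardCharsets.US_ASCII;
        }
        String name = declared.trim();
        // Assumption: mails labelled gb2312 may contain characters outside GB 2312,
        // so decode them with the GB18030 superset when the JVM provides it.
        if (name.toLowerCase(Locale.ROOT).equals("gb2312") && Charset.isSupported("GB18030")) {
            return Charset.forName("GB18030");
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return StandardCharsets.US_ASCII;
        }
    }

    /** Reads the whole stream as text using the resolved charset. Does not close the stream. */
    static String readBody(InputStream is, String declared) throws IOException {
        Reader reader = new InputStreamReader(is, resolve(declared));
        StringBuilder text = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            text.append(buf, 0, n);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Traditional Chinese sample text, written with Unicode escapes.
        String original = "\u7e41\u9ad4\u4e2d\u6587";
        byte[] raw = original.getBytes(Charset.forName("GB18030"));
        // The stream is labelled gb2312 but is decoded as GB18030.
        String decoded = readBody(new ByteArrayInputStream(raw), "gb2312");
        System.out.println(original.equals(decoded));
    }
}
```

Inside a handler's body(BodyDescriptor bd, InputStream is) callback, one would
then call something like CharsetFallback.readBody(is, bd.getCharset()) instead
of changing the descriptor, so no mime4j class has to be modified.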

Oleg 

> -----Original Message-----
> From: Tze-Kei Lee [mailto:chikei@gmail.com] 
> Sent: Monday, February 13, 2012 5:45 PM
> To: mime4j-dev@james.apache.org
> Subject: Re: Character corruption with Traditional chinese
> 
> Hi,
> 
> It looks like the email client that composed the email made a mistake
> when picking the charset.
> 
> GB 2312 contains only Simplified Chinese, while CP 936 (GBK) and GB 18030
> are extended to include Traditional Chinese (and Japanese, Korean), and
> the first sentence in the email uses the extended code points.
> 
> Best Regards
> 
> Tze-Kei
> 
> On Mon, Feb 13, 2012 at 7:32 PM, Sharma, Ashish <ashish.sharma3@hp.com> wrote:
> > Hi,
> >
> > I use mime4j 0.7.2 for email parsing.
> >
> > I am running into character corruption with Traditional Chinese characters.
> >
> > Sample email that is creating problems is at:
> >
> > http://pastebin.com/Q38VXsLb
> >
> > Here I noticed that when the email is parsed with the default charset encoding (the
> > charset encoding that was received from the email server) of:
> >
> > charset="gb2312"
> >
> > I get the character corruption, while if I manually change this charset encoding
> > in the email stream to:
> >
> > charset="gb18030"
> >
> > and then parse it via mime4j, there is no character corruption.
> >
> > Can somebody please explain why I am getting this behavior?
> >
> > Moreover, is there a way in mime4j to substitute character sets for specific
> > cases like the one above?
> >
> > Thanks
> > Ashish
> >
> >
> >

