james-mime4j-dev mailing list archives

From Oleg Kalnichevski <ol...@apache.org>
Subject RE: Character corruption with Traditional chinese
Date Tue, 14 Feb 2012 12:19:49 GMT
On Tue, 2012-02-14 at 11:53 +0000, Sharma, Ashish wrote:
> Oleg,
> 
> I am using 'mime4j' as follows:
> 
> 		MimeConfig mime4jParserConfig = new MimeConfig();
> 		BodyDescriptorBuilder bodyDescriptorBuilder = new DefaultBodyDescriptorBuilder();
> 		MimeStreamParser mime4jParser = new MimeStreamParser(mime4jParserConfig,DecodeMonitor.SILENT,bodyDescriptorBuilder);
> 		mime4jParser.setContentDecoding(true);
> 		mime4jParser.setContentHandler(contentHandler);		
> 		
> 		mime4jParser.parse(rawEmailFile);
> 		
> 		return ((CustomContentHandler)contentHandler).getEmail();
> 
> Here, as you can see, I am using the content decoding provided by mime4j for email body parts.
> 
> The contentHandler that I am using just listens for basic events and is of the following type:
> 
> 	public class CustomContentHandler extends AbstractContentHandler {
> 
> 		public void field(Field field) throws MimeException {}
> 
> 		public void body(BodyDescriptor bd, InputStream is) throws MimeException, IOException {
> 			((MaximalBodyDescriptor)bd).setCharset(getFallbackCharset(bd.getCharset()));
> 		}
> 
> 		...
> 
> I modified the code in 'MaximalBodyDescriptor' so that my contentHandler can set the charset, as you hinted.
> 

There is absolutely no need or good reason for modifying
MaximalBodyDescriptor. Just use a different charset when processing body
content.

Oleg
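
A minimal sketch of the advice above, not code from mime4j itself. It assumes content decoding is enabled on the parser (as in the quoted setup), so the stream handed to the handler carries the raw charset bytes. The class BodyCharsetFallback, its methods, and the gb2312-to-GB18030 mapping are hypothetical names and choices made for illustration. Inside the handler's body(BodyDescriptor bd, InputStream is) method one could call BodyCharsetFallback.readBody(is, bd.getCharset()) and leave MaximalBodyDescriptor untouched.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper, not part of mime4j: picks the charset to decode with.
class BodyCharsetFallback {

    // Assumption: a declared "gb2312" is decoded as its superset GB18030.
    static Charset fallbackCharset(String declared) {
        if (declared == null) {
            return Charset.forName("US-ASCII");
        }
        if (declared.toLowerCase(Locale.ROOT).equals("gb2312")) {
            return Charset.forName("GB18030");
        }
        try {
            return Charset.forName(declared);
        } catch (IllegalArgumentException e) {
            // Unknown or malformed charset name: fall back to US-ASCII.
            return Charset.forName("US-ASCII");
        }
    }

    // Decodes already-read body bytes using the fallback charset.
    static String decode(byte[] raw, String declared) {
        return new String(raw, fallbackCharset(declared));
    }

    // Drains the body stream, then decodes it with the fallback charset.
    static String readBody(InputStream is, String declared) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = is.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return decode(buf.toByteArray(), declared);
    }
}
```

The descriptor stays read-only in this arrangement: the handler only reads the declared charset and decides for itself which one to decode with.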


> This arrangement solved my problem of character corruption.
> 
> But the problem is that, for the above code to work, I need to modify the code in 'mime4j', which I want to avoid.
> 
> Can you suggest some workaround here?
> 
> Thanks
> Ashish
> 
> -----Original Message-----
> From: Oleg Kalnichevski [mailto:olegk@apache.org] 
> Sent: Tuesday, February 14, 2012 2:42 AM
> To: mime4j-dev@james.apache.org
> Subject: RE: Character corruption with Traditional chinese
> 
> On Mon, 2012-02-13 at 14:58 +0000, Sharma, Ashish wrote:
> > Hi,
> > 
> > Since I have no control over the email clients sending the mails, kindly suggest suitable measures that I can take on my end to mitigate the problem of character corruption.
> > 
> > I think modifying the charset during email body decoding will work for such cases. Can somebody post the relevant API hooks of mime4j that I can use for this idea (and is it feasible)?
> > 
> > Thanks
> > Ashish
> > 
> 
> I am not sure I understand the problem you are having. MimeStreamParser
> passes an instance of BodyDescriptor for each body part it encounters.
> BodyDescriptor contains the charset of the body part (if specified)
> among other things. It is up to the individual ContentHandler implementation
> to decide whether or not that charset is valid. ContentHandler can
> always choose to use a different charset encoding instead of the one
> specified by the BodyDescriptor.
> 
> Oleg 
> 
> > -----Original Message-----
> > From: Tze-Kei Lee [mailto:chikei@gmail.com] 
> > Sent: Monday, February 13, 2012 5:45 PM
> > To: mime4j-dev@james.apache.org
> > Subject: Re: Character corruption with Traditional chinese
> > 
> > Hi,
> > 
> > It looks like the email client that composed the email made a mistake when
> > picking the charset.
> > 
> > GB 2312 contains only Simplified Chinese, while CP 936 (GBK) and GB 18030 are
> > extended to include Traditional Chinese (and Japanese, Korean), and
> > the first sentence in the email uses the extended code points.
> > 
> > Best Regards
> > 
> > Tze-Kei
> > 
> > On Mon, Feb 13, 2012 at 7:32 PM, Sharma, Ashish <ashish.sharma3@hp.com> wrote:
> > > Hi,
> > >
> > > I use mime4j 0.7.2 for email parsing.
> > >
> > > I am running into character corruption with Traditional Chinese characters.
> > >
> > > Sample email that is creating problems is at:
> > >
> > > http://pastebin.com/Q38VXsLb
> > >
> > > Here I noticed that when the email is parsed with the default charset encoding (the charset encoding that was received from the email server) of:
> > >
> > > charset="gb2312"
> > >
> > > I get the character corruption, while if I manually change this charset encoding in the email stream to:
> > >
> > > charset="gb18030"
> > >
> > > and then parse it via mime4j, there is no character corruption.
> > >
> > > Can somebody please explain why I am getting this behavior?
> > >
> > > Moreover, is there a way in mime4j to substitute character sets for specific cases like the one above?
> > >
> > > Thanks
> > > Ashish
> > >
> > >
> > >
> 
> 
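
The charset mismatch reported in the original message at the bottom of this thread can be reproduced without mime4j. The sketch below is an illustration only: the class name Gb2312VersusGb18030 and the sample character are assumptions, not taken from the sample email. It encodes one Traditional Chinese character as GB18030 bytes and decodes those bytes twice, once as GB18030 and once as GB2312.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, independent of mime4j.
class Gb2312VersusGb18030 {

    // Encodes text as GB18030 bytes, then decodes them with the named charset.
    static String roundTrip(String text, String decodeAs) {
        byte[] raw = text.getBytes(Charset.forName("GB18030"));
        return new String(raw, Charset.forName(decodeAs));
    }

    public static void main(String[] args) {
        // U+9AD4 is a Traditional Chinese character that GB2312 does not define.
        String original = "\u9AD4";
        // Prints whether each decoding reproduces the original text.
        System.out.println(original.equals(roundTrip(original, "GB18030")));
        System.out.println(original.equals(roundTrip(original, "GB2312")));
    }
}
```

Java's GB2312 decoder is strict and does not accept the extended byte ranges, so the second decoding does not reproduce the original character. That matches the corruption seen when the body is declared as gb2312 but actually uses extended code points.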


