poi-dev mailing list archives

From MSB <markbrd...@tiscali.co.uk>
Subject Re: DO NOT REPLY [Bug 47875] New: reading word written in Chinese, paragraph nums is not correct.
Date Tue, 22 Sep 2009 06:17:42 GMT

This is a guess, and a highly speculative one since I have not looked at
the source code for HWPF, but there may be some confusion surrounding the
paragraph mark character.

Each paragraph is terminated with a 'special' control character that Word
refers to as a paragraph mark. It could be (and I stress could, bearing in
mind that HWPF is very immature at this point) that once the document is
written in Chinese, HWPF has trouble detecting the paragraph mark
correctly.
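
To illustrate the kind of confusion I have in mind, here is a small sketch.
It is not HWPF's code, which I have not read, and the class and method names
are made up for the example. Word's paragraph mark is the carriage return,
character 13 (0x0D). If text held as 16-bit Unicode were scanned one byte at
a time, any character whose encoding happens to contain a 0x0D byte, such as
U+4E0D, would be mistaken for a paragraph mark.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only illustrates the speculation above.
class ParagraphMarkDemo {

    // Word terminates each paragraph with a carriage return (0x0D).
    static final char PARAGRAPH_MARK = '\r';

    // Naive approach: treat every 0x0D byte as a paragraph mark.
    static int countMarksInBytes(byte[] raw) {
        int count = 0;
        for (byte b : raw) {
            if (b == 0x0D) {
                count++;
            }
        }
        return count;
    }

    // Decode the bytes as UTF-16LE first, then look for the mark character.
    static int countMarksInText(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_16LE);
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == PARAGRAPH_MARK) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Two paragraphs each; the Chinese sample contains U+4E0D, whose
        // UTF-16LE encoding is the byte pair 0x0D 0x4E.
        byte[] english = "no\ryes\r".getBytes(StandardCharsets.UTF_16LE);
        byte[] chinese = "\u4E0D\r\u662F\r".getBytes(StandardCharsets.UTF_16LE);

        System.out.println("English, byte scan:    " + countMarksInBytes(english));
        System.out.println("English, decoded scan: " + countMarksInText(english));
        System.out.println("Chinese, byte scan:    " + countMarksInBytes(chinese));
        System.out.println("Chinese, decoded scan: " + countMarksInText(chinese));
    }
}
```

Both scans report 2 for the English sample. For the Chinese sample the byte
scan reports 3 and the decoded scan reports 2, which is the sort of mismatch
that would only show up with non-English text.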

One easy way to check would be to see where HWPF is failing to detect the
end of a paragraph. Does it always have problems if the paragraph ends with
the same character, for example? Bearing in mind HWPF's immaturity, there
could also be problems associated with character encoding and the way the
application converts the raw bytes read from the file into Unicode
characters. Aside from that, I am sorry to say that I do not have anything
concrete to contribute to the discussion.
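
A rough sketch of that check follows. It assumes the strings returned by
getParagraphText() are passed in; the class and helper names are
hypothetical, and the sample array in main merely stands in for real
extractor output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting the strings an extractor returns.
class ParagraphEndReport {

    // Describes the final character of each paragraph by its code point,
    // so that any pattern in how paragraphs end is easy to spot.
    static List<String> describeEndings(String[] paragraphs) {
        List<String> report = new ArrayList<>();
        for (int i = 0; i < paragraphs.length; i++) {
            String p = paragraphs[i];
            if (p.isEmpty()) {
                report.add(i + ": <empty>");
                continue;
            }
            char last = p.charAt(p.length() - 1);
            report.add(i + ": U+" + String.format("%04X", (int) last)
                    + " (length " + p.length() + ")");
        }
        return report;
    }

    // Counts carriage returns that occur before the trailing line-ending
    // characters. A non-zero result suggests that several paragraphs were
    // merged into one string.
    static int embeddedMarks(String paragraph) {
        int end = paragraph.length();
        while (end > 0 && (paragraph.charAt(end - 1) == '\r'
                || paragraph.charAt(end - 1) == '\n')) {
            end--;
        }
        int count = 0;
        for (int i = 0; i < end; i++) {
            if (paragraph.charAt(i) == '\r') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in data; in practice this would be extractor.getParagraphText().
        String[] paras = { "First\r", "\u4E0D\r\u662F\r" };
        for (String line : describeEndings(paras)) {
            System.out.println(line);
        }
        for (int i = 0; i < paras.length; i++) {
            System.out.println(i + ": embedded marks = " + embeddedMarks(paras[i]));
        }
    }
}
```

If the merged paragraphs always break down at the same character, that
would point towards a problem with how that character is decoded.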

Yours

Mark B


Bugzilla from bugzilla@apache.org wrote:
> 
> https://issues.apache.org/bugzilla/show_bug.cgi?id=47875
> 
>            Summary: reading word written in Chinese, paragraph nums is not
>                     correct.
>            Product: POI
>            Version: 3.2-FINAL
>           Platform: PC
>         OS/Version: Windows XP
>             Status: NEW
>           Severity: normal
>           Priority: P2
>          Component: HWPF
>         AssignedTo: dev@poi.apache.org
>         ReportedBy: inthendsun@gmail.com
> 
> 
> FileInputStream fileIn = new FileInputStream("D:\\111.doc");
> WordExtractor extractor = new WordExtractor(fileIn);
>
> String[] paras = extractor.getParagraphText();
> System.out.println(paras.length);
> 
> 
> Why is the paragraph count not correct? Reading an English document seems
> to work without problems, but my Word document is written in Chinese.
> 
> thanks!
> 
> -- 
> Configure bugmail:
> https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are the assignee for the bug.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
> For additional commands, e-mail: dev-help@poi.apache.org
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/DO-NOT-REPLY--Bug-47875--New%3A-reading-word-written-in-Chinese%2C-paragraph-nums-is-not-correct.-tp25519112p25530585.html
Sent from the POI - Dev mailing list archive at Nabble.com.

