poi-dev mailing list archives

From bugzi...@apache.org
Subject [Bug 60936] Figure out charset in Word 6.0 files
Date Thu, 30 Mar 2017 16:11:32 GMT
https://bz.apache.org/bugzilla/show_bug.cgi?id=60936

--- Comment #1 from Tim Allison <tallison@mitre.org> ---
I can't easily figure out how the code page is encoded.

On a quick look, I could only find Windows-1252-encoded docs in Tika's
regression corpus.

I was able to generate a Windows-1250 file via OpenOffice, which I'll attach
shortly.

From that file, it looks like the codepage _might_ be encoded in 2 ways.

1) (pure guess) in the font information, the value "EE" at offset 133B is the
code for Windows-1250.

2) "0504" at offsets 0F5E-0F5F specifies the Czech language
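For reference, here is a rough sketch of how those two values could be read
from a byte buffer. The class and method names are hypothetical and are not
POI code. It assumes the language ID is a little-endian 16-bit Windows LCID
(so the bytes 05 04 read as 0x0405, Czech) and that "EE" is a Windows font
charset byte (0xEE = 238, EASTEUROPE_CHARSET). The offsets above come from one
test file only, so the sketch takes the offset as a parameter.

```java
public class Word6HeaderPeek {

    /** Reads an unsigned 16-bit little-endian value, e.g. bytes 05 04 -> 0x0405. */
    static int readUInt16LE(byte[] data, int offset) {
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    /** Names the few Windows language IDs discussed here; anything else is "unknown". */
    static String describeLcid(int lcid) {
        switch (lcid) {
            case 0x0405: return "Czech";
            case 0x0409: return "U.S. English";
            default:     return "unknown";
        }
    }

    /** Names the two Windows font charset bytes discussed here. */
    static String describeFontCharset(int charsetByte) {
        switch (charsetByte & 0xFF) {
            case 0x00: return "ANSI_CHARSET";
            case 0xEE: return "EASTEUROPE_CHARSET";
            default:   return "unknown";
        }
    }

    public static void main(String[] args) {
        // Stand-in bytes for the two fields; not a real Word 6.0 file.
        byte[] sample = { 0x05, 0x04, (byte) 0xEE };
        int lcid = readUInt16LE(sample, 0);
        System.out.println(describeLcid(lcid));
        System.out.println(describeFontCharset(sample[2]));
    }
}
```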


To test my guesses, I tried modifying each.

1) If I modify the "EE" to "00" (the default, ANSI), the text is still
correctly rendered in Word.

2) However, if I modify the 0504 to 0409 (U.S. English), the text is corrupted.

This means that Word and OpenOffice are inferring the code page from the
language, and preferring that information to the codepage...unless I'm wrong
about "EE".
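A minimal sketch of that inferred behavior, assuming the language ID wins and
the font charset byte is only a fallback. The class name, method names, and
the two-entry mapping are hypothetical illustrations, not POI code and not a
complete LCID table.

```java
import java.nio.charset.Charset;

public class Word6CharsetGuess {

    static final Charset CP1250 = Charset.forName("windows-1250");
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Maps the language IDs discussed here to a code page; null if unknown. */
    static Charset fromLcid(int lcid) {
        switch (lcid) {
            case 0x0405: return CP1250; // Czech
            case 0x0409: return CP1252; // U.S. English
            default:     return null;
        }
    }

    /** Prefers the language ID; falls back to the font charset byte, then to cp1252. */
    static Charset guess(int lcid, int fontCharset) {
        Charset fromLanguage = fromLcid(lcid);
        if (fromLanguage != null) {
            return fromLanguage;
        }
        if ((fontCharset & 0xFF) == 0xEE) { // EASTEUROPE_CHARSET
            return CP1250;
        }
        return CP1252;
    }

    public static void main(String[] args) {
        byte[] text = { (byte) 0xE8 }; // one byte that differs between the two code pages
        // Czech language, charset byte zeroed: still decoded as Windows-1250.
        System.out.println(new String(text, guess(0x0405, 0x00)));
        // U.S. English language, charset byte left at EE: decoded as Windows-1252.
        System.out.println(new String(text, guess(0x0409, 0xEE)));
    }
}
```

With this ordering, the byte E8 decodes to "č" when the language is Czech and
to "è" once the language is switched to U.S. English, which is the kind of
corruption seen in the second experiment.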

I propose opening a half-step issue (60942) to avoid the Unicode check for Word
6.0.  This at least prevents quite a few exceptions in our test corpus.
