poi-user mailing list archives

From Chris Gioran <himi...@gmail.com>
Subject Re: hslf way of getting slideshow text language
Date Sat, 24 Feb 2007 18:04:22 GMT
Regards everyone,

It appears that I have some useful results on extracting content language
information from ppt documents using hslf. Note that everything below is
derived from reverse engineering, so its correctness is open to question
until more experimentation is done.

The data set I used consisted of many ppt documents from different versions of
MS PowerPoint in Greek, French, English and Italian, and I have noticed the
following patterns:

4008 atoms are followed by a 4010 atom that stores language information. This
is also true for 4000 atoms that have Unicode text *with* non-Unicode text (my
guess is that the system's default language is stored as Unicode and all
others as ASCII; more to find out on that). Now, the 4010 atoms that
correspond to 4008 atoms have a fairly consistent appearance, that is:

first 4 bytes as known (record type and code)
next 4 bytes length (also known)
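
For anyone who wants to poke at the raw bytes, here is a minimal sketch of
reading that 8-byte header in Java. It assumes the usual little-endian layout
of PowerPoint records (2 bytes of version/instance bits, 2 bytes of record
type, 4 bytes of length). The class and method names are hypothetical and are
not part of hslf.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class RecordHeaderSketch {

    /** Hypothetical holder for the 8-byte record header. */
    static final class Header {
        final int versionAndInstance; // first 2 bytes
        final int type;               // next 2 bytes, e.g. 4010
        final long length;            // last 4 bytes: size of the record body

        Header(int versionAndInstance, int type, long length) {
            this.versionAndInstance = versionAndInstance;
            this.type = type;
            this.length = length;
        }
    }

    /** Reads the header that starts at the given offset (assumes little-endian). */
    static Header readHeader(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data, offset, 8).order(ByteOrder.LITTLE_ENDIAN);
        int verInst = buf.getShort() & 0xFFFF;
        int type = buf.getShort() & 0xFFFF;
        long length = buf.getInt() & 0xFFFFFFFFL;
        return new Header(verInst, type, length);
    }

    public static void main(String[] args) {
        // Made-up sample bytes: type 0x0FAA (4010), body length 14.
        byte[] sample = {0x00, 0x00, (byte) 0xAA, 0x0F, 0x0E, 0x00, 0x00, 0x00};
        Header h = readHeader(sample, 0);
        System.out.println("type=" + h.type + " length=" + h.length);
    }
}
```

The sample bytes in main are invented for illustration and do not come from
any of the documents in my data set.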

what follows are records that hold information regarding language ID and 
spelling info in the following format:

first, the number of characters this record applies to (4 bytes)

The next bytes are a bit more complicated. So far I have encountered 2 types
of data: either the value 0x00000006 or 0x000k00000007 (the lengths are
correct, that is, they *are* different). These certainly carry spelling
information, which is apparent from the value changing from the second form
to the first when the "ignore spelling" option is selected for that text. The
k above is a value that varies and that I have so far been unable to
attribute to any property, presumably because the text I used was too simple.

After that comes the language information (2 bytes, as known for MS formats)
and then a trailer value of 2 bytes that is always 0x0000.
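
The layout described above can be sketched in Java roughly as follows. This is
only an illustration of the observations so far, not a verified parser. It
assumes little-endian byte order, and it assumes that the 6-byte form is a
4-byte value of 7 followed by 2 bytes holding k; that reading is my guess at
how the bytes sit in the file. All class, method and field names are
hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LanguageRunSketch {

    /** Hypothetical holder for one run; the field names are not hslf's. */
    static final class Run {
        final long charCount;  // number of characters the run applies to
        final int flags;       // 6 or 7 in the files seen so far
        final int k;           // the varying value, or -1 when absent
        final int languageId;  // 2-byte language ID
        final int trailer;     // 2 bytes, 0x0000 in the files seen so far

        Run(long charCount, int flags, int k, int languageId, int trailer) {
            this.charCount = charCount;
            this.flags = flags;
            this.k = k;
            this.languageId = languageId;
            this.trailer = trailer;
        }
    }

    /** Reads one run from the buffer's current position. */
    static Run readRun(ByteBuffer buf) {
        long charCount = buf.getInt() & 0xFFFFFFFFL;
        int flags = buf.getInt();
        int k = -1;
        if (flags == 7) {
            // Assumption: the longer form carries 2 extra bytes (the "k" value).
            k = buf.getShort() & 0xFFFF;
        } else if (flags != 6) {
            // Only the two observed values are handled in this sketch.
            throw new IllegalArgumentException("unexpected flag value: " + flags);
        }
        int languageId = buf.getShort() & 0xFFFF;
        int trailer = buf.getShort() & 0xFFFF;
        return new Run(charCount, flags, k, languageId, trailer);
    }

    public static void main(String[] args) {
        // Made-up body: 12 characters, flags 6, language 0x0408 (Greek), trailer 0.
        byte[] sample = {
            0x0C, 0, 0, 0,
            0x06, 0, 0, 0,
            0x08, 0x04,
            0, 0
        };
        ByteBuffer buf = ByteBuffer.wrap(sample).order(ByteOrder.LITTLE_ENDIAN);
        Run run = readRun(buf);
        System.out.printf("chars=%d lang=0x%04X%n", run.charCount, run.languageId);
    }
}
```

The buffer is expected to be positioned just past the 8-byte record header.
Any flag value other than 6 or 7 is rejected on purpose, since nothing else
has been observed yet.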

The atom ends with bytes whose appearance I have so far been unable to
control. They are short, say 8 or 10 bytes, and mostly zero. They could be
information that applies to the slide as a whole, but I have nothing on that
yet.

I hope this information is of some value towards more general results. Time
constraints do not allow me to work on this much, however, so some feedback
is, as always, appreciated. I will post any further findings I come up with.

Chris Gioran
