poi-dev mailing list archives

From Maxim Valyanskiy <max...@jet.msk.su>
Subject Re: requested document- poi question
Date Thu, 02 Dec 2010 20:45:33 GMT
Hello!

On 02.12.2010 at 19:11, randeel wimalagunarathne wrote:

> Hi Max,
> 
> yes, that's what I am trying to do. Can you help me with that?
> How did you find that there are 2 xlsx files and one xls file?
> Thank you for providing me the help.
> 

Word stores embedded objects in the "ObjectPool" directory entry. Each embedded object is a
child entry whose name starts with the "_" symbol.

If such a "_" entry contains a "Package" entry, then the embedded object is an OOXML-based
document stored as a raw (ZIP) stream (you can use DocumentInputStream to read that binary
stream). Otherwise it is some OLE-based format, or some binary embedded in an Ole10Native stream.
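
To illustrate the rule above, here is a minimal sketch in plain Java. It is not code from POI or Tika. The class EmbeddedObjectKind and its methods are hypothetical names, and it assumes you have already collected the child entry names of one "_" entry (for example by iterating it with POI). It also assumes the Ole10Native stream is named with a leading U+0001 control character, which is how POI refers to it.

```java
import java.util.Set;

// Hypothetical helper: decides what kind of embedded object a "_" entry
// under "ObjectPool" holds, based only on the names of its child entries.
final class EmbeddedObjectKind {

    public enum Kind { OOXML_PACKAGE, OLE10_NATIVE, OTHER_OLE }

    // Assumed stream name: "Ole10Native" prefixed with the control character U+0001.
    public static final String OLE10_NATIVE = "\u0001Ole10Native";

    // Embedded objects live in child entries whose names start with "_".
    public static boolean isEmbeddedObjectEntry(String name) {
        return name.startsWith("_");
    }

    // "Package" present: raw OOXML (ZIP) stream.
    // Ole10Native present: some binary wrapped in that stream.
    // Anything else: treat the entry itself as an OLE-based document.
    public static Kind classify(Set<String> childNames) {
        if (childNames.contains("Package")) {
            return Kind.OOXML_PACKAGE;
        }
        if (childNames.contains(OLE10_NATIVE)) {
            return Kind.OLE10_NATIVE;
        }
        return Kind.OTHER_OLE;
    }

    // Optional sanity check on the first bytes read from the "Package" stream:
    // a ZIP local file header starts with 'P', 'K', 0x03, 0x04.
    public static boolean looksLikeZip(byte[] head) {
        return head.length >= 4
                && head[0] == 'P' && head[1] == 'K'
                && head[2] == 3 && head[3] == 4;
    }

    public static void main(String[] args) {
        // Example child names; real names would come from the document.
        System.out.println(classify(Set.of("Package", "\u0001CompObj")));
        System.out.println(classify(Set.of("Workbook", "\u0001CompObj")));
    }
}
```

For an entry classified as OOXML_PACKAGE you would then open the "Package" stream and hand it to the matching OOXML parser. For OTHER_OLE the "_" entry itself is the root of the embedded OLE document.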

I recommend looking at these two source files from the Apache Tika project:

1) parse function in WordExtractor:
https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java

2) handleEmbeddedOfficeDoc in AbstractPOIFSExtractor:
https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractPOIFSExtractor.java

best wishes, Max

