Added: poi/site/publish/hdgf/index.html URL: http://svn.apache.org/viewvc/poi/site/publish/hdgf/index.html?rev=1423805&view=auto ============================================================================== --- poi/site/publish/hdgf/index.html (added) +++ poi/site/publish/hdgf/index.html Wed Dec 19 09:27:20 2012 @@ -0,0 +1,263 @@ + + + + + + + + + +POI-HDGF - Java API To Access Microsoft Visio Format Files + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+ + + + + + + + + + +
+
+
+
+
+
+
+

POI-HDGF - Java API To Access Microsoft Visio Format Files

+
+
+ + + + + +
+

Overview

+
+ + + +

HDGF is the POI Project's pure Java implementation of the Visio file format.

+ +

Currently, HDGF provides a low-level, read-only API for accessing Visio documents. It also provides a way to extract the textual content from a file.

+ +

At this time, there is no usermodel API or similar, only low-level access to the streams, chunks and chunk commands. Users are advised to check the unit tests to see how everything works. They are also well advised to read the documentation supplied with vsdump to get a feel for how Visio files are structured.

+ +

To get a feel for the contents of a file, and to track down where data of interest is stored, HDGF comes with VSDDumper to print out the contents of the file. Users should also make use of vsdump to probe the structure of files.

+ +
+
Note
+
+ This code currently lives in the scratchpad area of the POI SVN repository. Ensure that you have the scratchpad jar or the scratchpad build area in your classpath before experimenting with this code.
+
+ + + +
+

Steps required for write support

+
+ + +

Currently, HDGF is only able to read Visio files; it is not able to write them back out again. We believe the following steps would need to be taken to implement write support.

+ +
  1. Re-write the decompression support in LZW4HDGF as HDGFLZW, which will be much better documented, and also under the ASL. Completed October 2007
  2. Add compression support to HDGFLZW. In progress - works for small streams but encoding goes wrong on larger ones
  3. Have HDGF just write back the raw bytes it read in, and have a test to ensure the file is un-changed.
  4. Have HDGF generate the bytes to write out from the Stream stores, using the compressed data as appropriate, without re-compressing. Plus test to ensure file is un-changed.
  5. Have HDGF generate the bytes to write out from the Stream stores, re-compressing any streams that were decompressed. Plus test to ensure file is un-changed.
  6. Have HDGF re-generate the offsets in pointers for the locations of the streams. Plus test to ensure file is un-changed.
  7. Have HDGF re-generate the bytes for all the chunks, from the chunk commands. Tests to ensure the chunks are serialized properly, and then that the file is un-changed.
  8. Alter the data of one command, but keep it the same length, and check Visio can open the file when written out.
  9. Alter the data of one command, to a new length, and check that Visio can open the file when written out.
+ + + + +
by Nick Burch
+
+
+
+
+ + + + + + Propchange: poi/site/publish/hdgf/index.html ------------------------------------------------------------------------------ svn:executable = * Added: poi/site/publish/hmef/index.html URL: http://svn.apache.org/viewvc/poi/site/publish/hmef/index.html?rev=1423805&view=auto ============================================================================== --- poi/site/publish/hmef/index.html (added) +++ poi/site/publish/hmef/index.html Wed Dec 19 09:27:20 2012 @@ -0,0 +1,430 @@ + + + + + + + + + +POI-HMEF - Java API To Access Microsoft Transport Neutral Encoding Files (TNEF) + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+ + + + + + + + + + +
+
+
+
+
+
+
+

POI-HMEF - Java API To Access Microsoft Transport Neutral Encoding Files (TNEF)

+
+
+ + + + + +
+

Overview

+
+ + + +

HMEF is the POI Project's pure Java implementation of Microsoft's TNEF (Transport Neutral Encoding Format), aka winmail.dat, which is used by Outlook and Exchange in some situations.

+ +

Currently, HMEF provides a read-only api for accessing common + message and attachment attributes, including the message body + and attachment files. In addition, it's possible to have + read-only access to all of the underlying TNEF and MAPI + attributes of the message and attachments.

+ +

HMEF also provides a command line tool for extracting out + the message body and attachment files from a TNEF (winmail.dat) + file.

+ + +
+
Note
+
+ This code currently lives in the scratchpad area of the POI SVN repository. Ensure that you have the scratchpad jar or the scratchpad build area in your classpath before experimenting with this code.
+
+ +
+
Note
+
+ This code is a new POI feature, and the first release that will + contain it will be POI 3.8 beta 2. Until then, you will need to + build your own jars from a svn + checkout. +
+
+ + + + +
+

Using HMEF to access TNEF (winmail.dat) files

+
+ + + + +
+

Easy extraction of message body and attachment files

+
+ + + +

The class org.apache.poi.hmef.extractor.HMEFContentsExtractor + provides both command line and Java extraction. It allows the + saving of the message body (an RTF file), and all of the + attachment files, to a single directory as specified.

+ + +

From the command line, simply call the class specifying the + TNEF file to extract, and the directory to place the extracted + files into, eg:

+ +
+              java -classpath poi-3.8-FINAL.jar:poi-scratchpad-3.8-FINAL.jar org.apache.poi.hmef.extractor.HMEFContentsExtractor winmail.dat /tmp/extracted/
+           
+ + +

From Java, there are two method calls on the class, one to + extract the message body RTF to a file, and the other to extract + all the attachments to a directory. A typical use would be:

+ +
+public void extract(String winmailFilename, String directoryName) throws Exception {
+   HMEFContentsExtractor ext = new HMEFContentsExtractor(new File(winmailFilename));
+      
+   File dir = new File(directoryName);
+   File rtf = new File(dir, "message.rtf");
+   if(! dir.exists()) {
+       throw new FileNotFoundException("Output directory " + dir.getName() + " not found");
+   }
+      
+   System.out.println("Extracting...");
+   ext.extractMessageBody(rtf);
+   ext.extractAttachments(dir);
+   System.out.println("Extraction completed");
+}
+           
+ + + + +
+

Attachment attributes and contents

+
+ + + +

To get at your attachments, simply call the + getAttachments() method on a HMEFMessage + instance, and you'll receive a list of all the attachments.

+ +

When you have an org.apache.poi.hmef.Attachment object, there are several helper methods available. Each returns the value of the corresponding underlying attachment attribute, or null if for some reason the attribute isn't present in your file.

+ +
  • getFilename() - returns the name of the attachment file, possibly in 8.3 format
  • getLongFilename() - returns the full name of the attachment file
  • getExtension() - returns the extension of the attachment file, including the "."
  • getModifiedDate() - returns the date that the attachment file was last edited on
  • getContents() - returns a byte array of the contents of the attached file
  • getRenderedMetaFile() - returns a byte array of a windows meta file representation of the attached file
+ + + + +
+

Message attributes and message body

+
+ + + +

An org.apache.poi.hmef.HMEFMessage instance is created from an InputStream of the underlying TNEF (winmail.dat) file.

+ +

From a HMEFMessage, there are three main methods of + interest to call:

+ +
  • getBody() - returns a String containing the RTF contents of the message body.
  • getSubject() - returns the message subject
  • getAttachments() - returns the list of Attachment objects for the message
+ + + + +
+

Low level attribute access

+
+ + + +

Both Messages and Attachments contain two kinds of attributes. + These are TNEFAttribute and MAPIAttribute.

+ +

TNEFAttribute is specific to TNEF files in terms of the available types and properties. In general, Attachments have a few more useful ones of these than Messages.

+ +

MAPIAttributes hold standard MAPI properties and values, and work in a similar way to those in HSMF (Outlook). There are typically many of these on both Messages and Attachments. Note: see the limitations section.

+ +

Both HMEFMessage and Attachment support two different ways of getting to attributes of interest. Firstly, they support list getters, to return all attributes (either TNEF or MAPI). Secondly, they support specific getters by TNEF or MAPI property.

+ +
+HMEFMessage msg = new HMEFMessage(new FileInputStream(file));
+// List getters: all TNEF and MAPI attributes of the message
+for(TNEFAttribute attr : msg.getMessageAttributes()) {
+   System.out.println("TNEF : " + attr);
+}
+for(MAPIAttribute attr : msg.getMessageMAPIAttributes()) {
+   System.out.println("MAPI : " + attr);
+}
+// Specific getter: a single MAPI attribute by property
+System.out.println("Subject is " + msg.getMessageMAPIAttribute(MAPIProperty.CONVERSATION_TOPIC));
+
+for(Attachment attach : msg.getAttachments()) {
+   for(TNEFAttribute attr : attach.getAttributes()) {
+      System.out.println("A.TNEF : " + attr);
+   }
+   for(MAPIAttribute attr : attach.getMAPIAttributes()) {
+      System.out.println("A.MAPI : " + attr);
+   }
+   System.out.println("Filename is " + attach.getAttribute(TNEFProperty.CID_ATTACHTITLE));
+   System.out.println("Extension is " + attach.getMAPIAttribute(MAPIProperty.ATTACH_EXTENSION));
+}
+           
+ + + + + +
+

Investigating a TNEF file

+
+ + + +

To get a feel for the contents of a file, and to track down + where data of interest is stored, HMEF comes with + HMEFDumper + to print out the contents of the file.

+ + + + +
+

Limitations

+
+ + + +

HMEF is currently a work-in-progress, and not everything + works yet. The current limitations are:

+ +
  • Non-standard MAPI properties from the range 0x8000 to 0x8fff may not be turned into attributes quite correctly. The values show up, but the name and type may not always be correct.
  • All testing so far has been performed on a small number of English documents. We think we're correctly turning bytes into Java unicode strings, but we need a few non-English sample files in the test suite to verify this!
+ + + +
by Nick Burch
+
+
+
+
+ + + + + + Propchange: poi/site/publish/hmef/index.html ------------------------------------------------------------------------------ svn:executable = * Added: poi/site/publish/howtobuild.html URL: http://svn.apache.org/viewvc/poi/site/publish/howtobuild.html?rev=1423805&view=auto ============================================================================== --- poi/site/publish/howtobuild.html (added) +++ poi/site/publish/howtobuild.html Wed Dec 19 09:27:20 2012 @@ -0,0 +1,431 @@ + + + + + + + + + +Apache POI - How To Build + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+ + + + + + + + + + +
+
+
+
+
+
+
+

Apache POI - How To Build

+
+
+ + + + +
+

JDK Version

+
+ + +

+ POI 3.5 and later require JDK 1.5 or later. Versions prior to 3.5 require JDK 1.4 or later.

+ + + +
+

Install Apache Ant

+
+ + +

+ The POI build system requires Apache Ant + +

+ +

+ Specifically, the build has been tested to work with Ant version 1.7.1. To install the product, download the distribution and follow the instructions.

+ +

+ Remember to set the ANT_HOME environment variable and add ANT_HOME/bin + to your shell's PATH. +

+ + + +
+

Install JUnit

+
+ + +

+ Running unit tests and building a distribution requires JUnit. +

+ +

+ Just pick the latest versions of the jars from + SourceForge and place + them in ANT_HOME/lib. Make sure that optional.jar is in ANT_HOME/lib. +

+ + + +
+

Install Apache Forrest

+
+ + +

+ The POI build system requires Apache Forrest to build the documentation. +

+ +

+ Specifically the build has been tested to work with Forrest 0.5. This is an old release which is available + here. +

+ +

+ Remember to set the FORREST_HOME environment variable. +

+ + + +
+

Building Targets with Ant

+
+ + +

+ The main targets of interest to our users are: +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Ant Target    Description
clean         Erase all build work products (ie. everything in the build directory)
compile       Compiles all files from main, ooxml and scratchpad
test          Run all unit tests from main, ooxml and scratchpad
jar           Produce jar files
assemble      Produce .zip and tar.gz distribution packages
docs          Generate all documentation (Requires Apache Forrest)
+ + + + +
by Glen Stampoultzis, Tetsuya Kitahata, David Fisher
+
+
+
+
+ + + + + + Propchange: poi/site/publish/howtobuild.html ------------------------------------------------------------------------------ svn:executable = * Added: poi/site/publish/hpbf/file-format.html URL: http://svn.apache.org/viewvc/poi/site/publish/hpbf/file-format.html?rev=1423805&view=auto ============================================================================== --- poi/site/publish/hpbf/file-format.html (added) +++ poi/site/publish/hpbf/file-format.html Wed Dec 19 09:27:20 2012 @@ -0,0 +1,374 @@ + + + + + + + + + +POI-HPBF - A Guide to the Publisher File Format + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+ + + + + + + + + + +
+
+
+
+
+
+
+

POI-HPBF - A Guide to the Publisher File Format

+
+
+ + + + + +
+

Document Streams

+
+ +

+ The file is made up of a number of POIFS streams. A typical + file will be made up as follows: +

+ +
+Root Entry -
+  Objects -
+    (no children)
+  SummaryInformation <(0x05)SummaryInformation>
+  DocumentSummaryInformation <(0x05)DocumentSummaryInformation>
+  Escher -
+    EscherStm
+    EscherDelayStm
+  Quill -
+    QuillSub -
+      CONTENTS
+      CompObj <(0x01)CompObj>
+  Envelope
+  Contents
+  Internal <(0x03)Internal>
+  CompObj <(0x01)CompObj>
+  VBA -
+    (no children)
+
+ + + +
+

Changing Text

+
+ +

If you make a change to the text of a file, but do not change how much text there is, then the CONTENTS stream will undergo a small change, and the Contents stream will undergo a large change.

+ +

If you make a change to the text of a file, and change the + amount of text there is, then both the Contents and + the CONTENTS streams change.

+ + + +
+

Changing Shapes

+
+ +

If you alter the size of a textbox, but make no text changes, + then both Contents and CONTENTS streams + change. There are no changes to the Escher streams.

+ +

If you set the background colour of a textbox, but make + no changes to the text, (to finish off)

+ + + +
+

Structure of CONTENTS

+
+ +

First we have "CHNKINK ", followed by 24 bytes.

+ +

Next we have 20 sequences of 24 bytes each. If the first two bytes are 0x1800 (that is, the bytes 18 00), then that sequence entry exists, but if they are 0x0000 then the entry doesn't exist. If it does exist, we then have 4 bytes of upper case ASCII text, followed by three little endian shorts. The first of these seems to be the count of that type, the second is usually 1, the third is usually zero. Then we have another 4 bytes of upper case ASCII text, normally but not always the same as the first text. Finally, we have an unsigned little endian 32 bit offset to the start of the data for this, then an unsigned little endian 32 bit length of this section.

+ +

Normally, the first sequence entry is for TEXT, and the text data + will start at 0x200. After that is normally two or three STSH entries + (so the first short has values 0, then 1, then 2). After that it + seems to vary.

+ +

At 0x200 we have the text, stored as little endian 16 bit unicode.

+ +

After the text comes all sorts of other stuff, presumably as + described by the sequences.

+ +

For a contents stream of length 7168 / 0x1c00 bytes, the start + looks something like:

+ +
+CHNKINK       // "CHNKINK "
+04 00 07 00   // Normally 04 00 07 00
+13 00 00 03   // Normally ## 00 00 03
+00 02 00 00   // Normally 00 ## 00 00
+00 1c 00 00   // Normally length of the stream
+f8 01 13 00   // Normally f8 01 11/13 00
+ff ff ff ff   // Normally seems to be ffffffff
+
+18 00 
+TEXT 00 00 01 00 00 00       // TEXT 0 1 0
+TEXT 00 02 00 00 d0 03 00 00 // TEXT from: 200 (512), len: 3d0 (976)
+18 00 
+STSH 00 00 01 00 00 00       // STSH 0 1 0
+STSH d0 05 00 00 1e 00 00 00 // STSH from: 5d0 (1488), len: 1e (30)
+18 00 
+STSH 01 00 01 00 00 00       // STSH 1 1 0
+STSH ee 05 00 00 b8 01 00 00 // STSH from: 5ee (1518), len: 1b8 (440)
+18 00 
+STSH 02 00 01 00 00 00       // STSH 2 1 0
+STSH a6 07 00 00 3c 00 00 00 // STSH from: 7a6 (1958), len: 3c (60)
+18 00 
+FDPP 00 00 01 00 00 00       // FDPP 0 1 0
+FDPP 00 08 00 00 00 02 00 00 // FDPP from: 800 (2048), len: 200 (512)
+18 00 
+FDPC 00 00 01 00 00 00       // FDPC 0 1 0
+FDPC 00 0a 00 00 00 02 00 00 // FDPC from: a00 (2560), len: 200 (512)
+18 00 
+FDPC 01 00 01 00 00 00       // FDPC 1 1 0
+FDPC 00 0c 00 00 00 02 00 00 // FDPC from: c00 (3072), len: 200 (512)
+18 00 
+SYID 00 00 01 00 00 00       // SYID 0 1 0
+SYID 00 0e 00 00 20 00 00 00 // SYID from: e00 (3584), len: 20 (32)
+18 00 
+SGP  00 00 01 00 00 00       // SGP  0 1 0
+SGP  20 0e 00 00 0a 00 00 00 // SGP  from: e20 (3616), len: a (10)
+18 00 
+INK  00 00 01 00 00 00       // INK  0 1 0
+INK  2a 0e 00 00 04 00 00 00 // INK  from: e2a (3626), len: 4 (4)
+18 00 
+BTEP 00 00 01 00 00 00       // BTEP 0 1 0
+PLC  2e 0e 00 00 18 00 00 00 // PLC  from: e2e (3630), len: 18 (24)
+18 00 
+BTEC 00 00 01 00 00 00       // BTEC 0 1 0
+PLC  46 0e 00 00 20 00 00 00 // PLC  from: e46 (3654), len: 20 (32)
+18 00 
+FONT 00 00 01 00 00 00       // FONT 0 1 0
+FONT 66 0e 00 00 48 03 00 00 // FONT from: e66 (3686), len: 348 (840)
+18 00 
+TCD  03 00 01 00 00 00       // TCD  3 1 0
+PLC  ae 11 00 00 24 00 00 00 // PLC  from: 11ae (4526), len: 24 (36)
+18 00 
+TOKN 04 00 01 00 00 00       // TOKN 4 1 0
+PLC  d2 11 00 00 0a 01 00 00 // PLC  from: 11d2 (4562), len: 10a (266)
+18 00 
+TOKN 05 00 01 00 00 00       // TOKN 5 1 0
+PLC  dc 12 00 00 2a 01 00 00 // PLC  from: 12dc (4828), len: 12a (298)
+18 00 
+STRS 00 00 01 00 00 00       // STRS 0 1 0
+PLC  06 14 00 00 46 00 00 00 // PLC  from: 1406 (5126), len: 46 (70)
+18 00 
+MCLD 00 00 01 00 00 00       // MCLD 0 1 0
+MCLD 4c 14 00 00 16 06 00 00 // MCLD from: 144c (5196), len: 616 (1558)
+18 00 
+PL   00 00 01 00 00 00       // PL   0 1 0
+PL   62 1a 00 00 48 00 00 00 // PL   from: 1a62 (6754), len: 48 (72)
+00 00                        // Blank entry follows
+00 00 00 00 00 00
+00 00 00 00 00 00 00 00 
+00 00 00 00 00 00 00 00
+
+(the text will then start)
+
+ +

We think that the first 4 bytes of text describe the function of the data at the offset. The first short is then the count of that type, eg the 2nd will have 1. We think that the second 4 bytes of text describe the format of the data block at the offset. The format of the text block is easy, but we're still trying to figure out the others.
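The table layout described in this section can be sketched in plain Java as follows. This is only an illustration of the layout as currently understood, not code from POI: the class QuillSequenceTable, its Entry holder and all field names are hypothetical, and it assumes the table always starts straight after the 32 byte header ("CHNKINK " plus 24 bytes).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: reads the 20-slot table that follows the CONTENTS header. */
final class QuillSequenceTable {
    static final int HEADER_LENGTH = 8 + 24; // "CHNKINK " plus 24 bytes
    static final int SLOT_COUNT = 20;
    static final int SLOT_LENGTH = 24;

    /** One populated slot; the field names are illustrative guesses. */
    static final class Entry {
        final String function; // first 4 ASCII bytes, e.g. "TEXT"
        final int count;       // first of the three shorts
        final String format;   // second 4 ASCII bytes, e.g. "PLC "
        final long offset;     // start of the data within CONTENTS
        final long length;     // length of the data in bytes

        Entry(String function, int count, String format, long offset, long length) {
            this.function = function;
            this.count = count;
            this.format = format;
            this.offset = offset;
            this.length = length;
        }
    }

    static List<Entry> parse(byte[] contents) {
        ByteBuffer buf = ByteBuffer.wrap(contents).order(ByteOrder.LITTLE_ENDIAN);
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < SLOT_COUNT; i++) {
            int base = HEADER_LENGTH + i * SLOT_LENGTH;
            if (base + SLOT_LENGTH > contents.length) {
                break; // stream too short to hold this slot
            }
            // Bytes 18 00 mark a populated slot; 00 00 marks a blank one
            if (contents[base] != 0x18 || contents[base + 1] != 0x00) {
                continue;
            }
            String function = new String(contents, base + 2, 4, StandardCharsets.US_ASCII);
            int count = buf.getShort(base + 6) & 0xffff;
            // The shorts at base + 8 and base + 10 (usually 1 and 0) are skipped
            String format = new String(contents, base + 12, 4, StandardCharsets.US_ASCII);
            long offset = buf.getInt(base + 16) & 0xffffffffL;
            long length = buf.getInt(base + 20) & 0xffffffffL;
            entries.add(new Entry(function, count, format, offset, length));
        }
        return entries;
    }
}
```

The 4 byte names are kept untrimmed, so short names such as "PLC " retain their trailing space.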

+ + + +
+

Structure of TEXT bit

+
+ +

This is very simple. All the text for the document is + stored in a single bit of the Quill CONTENTS. The text + is stored as little endian 16 bit unicode strings.
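Decoding the text needs nothing beyond the JDK. The sketch below assumes the offset and length of the TEXT bit are already known (for example from the table near the start of CONTENTS); the class and method names are hypothetical and not part of POI.

```java
import java.nio.charset.StandardCharsets;

/** Sketch only: decodes the TEXT bit of a Quill CONTENTS stream. */
final class QuillText {
    /**
     * @param contents the raw bytes of the CONTENTS stream
     * @param offset   where the TEXT bit starts
     * @param length   how many bytes the TEXT bit occupies
     */
    static String decode(byte[] contents, int offset, int length) {
        // The text is stored as little endian 16 bit unicode
        return new String(contents, offset, length, StandardCharsets.UTF_16LE);
    }
}
```

For the sample stream shown earlier, the call would be QuillText.decode(contents, 0x200, 0x3d0).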

+ + + +
+

Structure of PLC bit

+
+ +

The first four bytes seem to hold the count of the entries in the bit, and the second four bytes seem to hold the type. There is then some pre-data, and then data for each of the entries, the exact format dependent on the type.

+ +

Type 0 has four 2 byte unsigned ints, then a pair of 2 byte unsigned ints for each entry.

+ +

Type 4 has four 2 byte unsigned ints, then a pair of 4 byte unsigned ints for each entry.

+ +

Type 8 has seven 2 byte unsigned ints, then a pair of 4 byte unsigned ints for each entry.

+ +

Type 12 holds hyperlinks, and is very much more complex. + See org.apache.poi.hpbf.model.qcbits.QCPLCBit + for our best guess as to how the contents match up.
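For types 0, 4 and 8, the layout described above can be sketched as below. This is a guess that follows the description literally, not the QCPLCBit implementation: the class PlcBitSketch and its fields are hypothetical, it assumes all values are little endian like the rest of the stream, and it deliberately rejects type 12.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Sketch only: reads a PLC bit of type 0, 4 or 8 as described in the text. */
final class PlcBitSketch {
    final long type;
    final int[] preData;  // the 2 byte unsigned ints before the entries
    final long[][] pairs; // one pair of unsigned values per entry

    private PlcBitSketch(long type, int[] preData, long[][] pairs) {
        this.type = type;
        this.preData = preData;
        this.pairs = pairs;
    }

    static PlcBitSketch parse(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        long count = buf.getInt() & 0xffffffffL; // first four bytes: entry count
        long type = buf.getInt() & 0xffffffffL;  // second four bytes: type

        int preDataCount;
        int valueWidth; // width in bytes of each value in a pair
        if (type == 0) {
            preDataCount = 4;
            valueWidth = 2;
        } else if (type == 4) {
            preDataCount = 4;
            valueWidth = 4;
        } else if (type == 8) {
            preDataCount = 7;
            valueWidth = 4;
        } else {
            throw new IllegalArgumentException("Type " + type + " is not covered by this sketch");
        }

        int[] preData = new int[preDataCount];
        for (int i = 0; i < preDataCount; i++) {
            preData[i] = buf.getShort() & 0xffff;
        }

        if (count * 2 * valueWidth > buf.remaining()) {
            throw new IllegalArgumentException("Entry count " + count + " does not fit in the data");
        }
        long[][] pairs = new long[(int) count][2];
        for (int i = 0; i < count; i++) {
            for (int j = 0; j < 2; j++) {
                pairs[i][j] = (valueWidth == 2)
                        ? (buf.getShort() & 0xffff)
                        : (buf.getInt() & 0xffffffffL);
            }
        }
        return new PlcBitSketch(type, preData, pairs);
    }
}
```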

+ + + + +
by Nick Burch
+
+
+
+
+ + + + + + Propchange: poi/site/publish/hpbf/file-format.html ------------------------------------------------------------------------------ svn:executable = * Added: poi/site/publish/hpbf/file-format.xml URL: http://svn.apache.org/viewvc/poi/site/publish/hpbf/file-format.xml?rev=1423805&view=auto ============================================================================== Binary file - no diff available. Propchange: poi/site/publish/hpbf/file-format.xml ------------------------------------------------------------------------------ svn:executable = * Propchange: poi/site/publish/hpbf/file-format.xml ------------------------------------------------------------------------------ svn:mime-type = application/xml Added: poi/site/publish/hpbf/index.html URL: http://svn.apache.org/viewvc/poi/site/publish/hpbf/index.html?rev=1423805&view=auto ============================================================================== --- poi/site/publish/hpbf/index.html (added) +++ poi/site/publish/hpbf/index.html Wed Dec 19 09:27:20 2012 @@ -0,0 +1,215 @@ + + + + + + + + + +POI-HPBF - Java API To Access Microsoft Publisher Format Files + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+ + + + + + + + + + +
+
+
+
+
+
+
+

POI-HPBF - Java API To Access Microsoft Publisher Format Files

+
+
+ + + + + +
+

Overview

+
+ + + +

HPBF is the POI Project's pure Java implementation of the + Publisher file format.

+ +

Currently, HPBF is in an early stage, whilst we try to + figure out the file format. So far, we have basic text + extraction support, and are able to read some parts within + the file. Writing is not yet supported, as we are unable + to make sense of the Contents stream, which we think has + lots of offsets to other parts of the file.

+ +

Our initial aim is to provide a text extractor for the format (now done), and be able to extract hyperlinks from within the document (partly supported). Additional low level code to process the file format may follow, if demand and developer interest warrant it.

+ +

Text Extraction is available via the + org.apache.poi.hpbf.extractor.PublisherTextExtractor + class.

+ +

At this time, there is no usermodel api or similar. + There is only low level support for certain parts of + the file, but by no means all of it.

+ +

Our current understanding of the file format is documented + here.

+ +
+
Note
+
+ This code currently lives in the scratchpad area of the POI SVN repository. Ensure that you have the scratchpad jar or the scratchpad build area in your classpath before experimenting with this code.
+
+ + + +
by Nick Burch
+
+
+
+
+ + + + + + Propchange: poi/site/publish/hpbf/index.html ------------------------------------------------------------------------------ svn:executable = *