poi-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Extracting embedded files from HWPF docs
Date Fri, 07 Jun 2013 13:05:49 GMT
I just tried a PDF embedded within a .doc, and Tika extracted it.  I didn't test an MP3, so
your mileage may vary.

You might want to use Tika (instructions below), or dive into its guts for inspiration on using
HWPF directly: see org.apache.tika.parser.microsoft.WordExtractor and
org.apache.tika.parser.microsoft.AbstractPOIFSExtractor, both in the parsers jar.

If you are going the Tika route:
Create a class that implements EmbeddedResourceHandler and implement "handle" with something
like this:

    @Override
    public void handle(String embeddedFileName, MediaType mediaType, InputStream is) {
        System.err.println("in handle: " + mediaType);
        if (embeddedFileName == null || embeddedFileName.equals("")) {
            // num is assumed to be a counter field that your handler maintains
            embeddedFileName = "unnamed_file_" + num;
        }
        // in case the "embeddedFileName" comes with path information,
        // make sure to take just the name
        String actualName = new File(embeddedFileName).getName();
        File outFile = //figure out what you want to call the file
        System.out.println("about to extract " + outFile);
        OutputStream os = null;
        try {
            os = new FileOutputStream(outFile);
            IOUtils.copy(is, os);
            os.flush();
        } catch (IOException e) {
            /* add logging */
        } finally {
            if (os != null) {
                try {
                    os.close();
                } catch (IOException e) {
                    // swallow
                }
            }
        }
    }
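
For what it's worth, the naming and copying part of that handler can be sketched with nothing but the JDK. This is only an illustration, not Tika or POI code: the class name EmbeddedFileSaver, the methods safeName and save, and the outputDir and index parameters are all made up for the example. It strips path information using either separator style (new File(...).getName() only knows the separator of the platform it runs on) and falls back to a numbered name when none is supplied.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Tika or POI.
final class EmbeddedFileSaver {

    private EmbeddedFileSaver() {}

    // Drops any leading path ('/' or '\') and falls back to a numbered
    // placeholder when no usable name is left.
    static String safeName(String embeddedFileName, int index) {
        String name = embeddedFileName == null ? "" : embeddedFileName;
        int cut = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
        name = name.substring(cut + 1).trim();
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            name = "unnamed_file_" + index;
        }
        return name;
    }

    // Copies the embedded stream into outputDir and returns the written path.
    // The stream is not closed here; the caller that opened it owns it.
    static Path save(String embeddedFileName, InputStream is, Path outputDir, int index)
            throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(safeName(embeddedFileName, index));
        Files.copy(is, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }
}
```

Inside "handle" you would then call something like EmbeddedFileSaver.save(embeddedFileName, is, outputDir, num++) and log any IOException, since "handle" itself does not declare one. Note that REPLACE_EXISTING means two embedded files with the same name overwrite each other; add the counter to the name if that matters to you.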

Then call Tika like this (assuming you've named your EmbeddedResourceHandler "WithinDirectoryEmbeddedHandler"):

    TikaInputStream is = TikaInputStream.get(f);
    ParserContainerExtractor containerExtractor = new ParserContainerExtractor();
    containerExtractor.extract(is, new ParserContainerExtractor(), new WithinDirectoryEmbeddedHandler(f));

From: Chris Bamford [mailto:cbamford@mimecast.com]
Sent: Friday, June 07, 2013 8:32 AM
To: POI Users List
Subject: Extracting embedded files from HWPF docs

Hi guys,

Is there a way to extract files embedded into Word docs (.doc, not .docx), using the HWPF

I understand that I can extract Pictures with


But I am specifically interested in non-picture files too (e.g. MP3).


- Chris



