poi-dev mailing list archives

From Nick Burch <nick.bu...@alfresco.com>
Subject Re: How to Convert Doc or Docx File to HTML?
Date Sun, 29 Jan 2012 12:42:41 GMT
On Sun, 29 Jan 2012, abc wrote:
> I was able to reuse the XWPFWordExtractorDecorator class, but it is just
> giving me text. How do I read the XHTML? Here is what I did:

You probably don't want to call that directly; instead, you should be using
Tika in the normal way. This is taken from the Tika unit tests and should
give you an idea:

         StringWriter sw = new StringWriter();
         SAXTransformerFactory factory = (SAXTransformerFactory)
                  SAXTransformerFactory.newInstance();
         TransformerHandler handler = factory.newTransformerHandler();
         handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
         handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
         handler.setResult(new StreamResult(sw));

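         // Try with a document containing various tables and formatting.
         // Note: parser is a Tika Parser created elsewhere in the test (for
         // example an AutoDetectParser), and XMLResult is a small helper
         // class from the Tika tests, so neither is defined in this snippet.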
         InputStream input = new FileInputStream("file.docx");
         try {
             Metadata metadata = new Metadata();
             parser.parse(input, handler, metadata, new ParseContext());
             return new XMLResult(sw.toString(), metadata);
         } finally {
             input.close();
         }
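
To see what the handler part of this does on its own, here is a minimal sketch that uses only the JDK. It is not Tika code: the class name HandlerSketch and the render method are made up for illustration, and the hand-written SAX events stand in for the events a Tika parser would fire at the handler. Real Tika output is in the XHTML namespace, which this sketch leaves out for brevity.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical class name, for illustration only.
class HandlerSketch {

    // Serialises a tiny html/body/p document containing the given text.
    static String render(String text) throws Exception {
        StringWriter sw = new StringWriter();
        SAXTransformerFactory factory = (SAXTransformerFactory)
                SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(
                OutputKeys.OMIT_XML_DECLARATION, "yes");
        // The result must be set before any SAX events arrive.
        handler.setResult(new StreamResult(sw));

        // Hand-written SAX events, standing in for a parser's output.
        AttributesImpl noAttrs = new AttributesImpl();
        char[] chars = text.toCharArray();
        handler.startDocument();
        handler.startElement("", "html", "html", noAttrs);
        handler.startElement("", "body", "body", noAttrs);
        handler.startElement("", "p", "p", noAttrs);
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "p", "p");
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();

        // By now the StringWriter holds the serialised markup.
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("Hello from a fake parser"));
    }
}
```

The point is that the markup, not just the plain text, ends up in the StringWriter because the handler serialises every element event it receives. With Tika, the parser.parse(...) call takes the place of the hand-written events.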

Nick


