poi-user mailing list archives

From Teresa Kim <teresa....@linguamatics.com.INVALID>
Subject Apache POI doc/docx parser
Date Sun, 06 Oct 2019 04:48:43 GMT
Hi


I have documents (either 'doc' or 'docx') that contain a special character 
for 'greater than or equal to'. When I convert them with the 
'WordToHtmlConverter' code below, those characters come out as '('.

I tried this with the latest Apache POI release, 4.1.0.


My Java code is:


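import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.WordToHtmlUtils;
import org.w3c.dom.Document;

public class TestWordtoHtmlConverter {

     public static void main(String[] args) {
         // try-with-resources closes the input file even if conversion fails
         try (InputStream in = new FileInputStream(args[0])) {
             HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(in);

             WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                     DocumentBuilderFactory.newInstance().newDocumentBuilder()
                             .newDocument());

             wordToHtmlConverter.processDocument(wordDocument);
             Document htmlDocument = wordToHtmlConverter.getDocument();
             ByteArrayOutputStream out = new ByteArrayOutputStream();
             DOMSource domSource = new DOMSource(htmlDocument);
             StreamResult streamResult = new StreamResult(out);

             TransformerFactory tf = TransformerFactory.newInstance();
             Transformer serializer = tf.newTransformer();
             serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
             serializer.setOutputProperty(OutputKeys.INDENT, "yes");
             serializer.setOutputProperty(OutputKeys.METHOD, "html");
             serializer.transform(domSource, streamResult);

             // Decode with the same charset the serializer was told to write
             String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
             System.out.println(result);
         } catch (Exception e) {
             // Report failures instead of swallowing them silently
             e.printStackTrace();
         }
     }
}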

Is there any way I can correctly identify these symbols?


In the sample document, I am interested in getting 'bad one'.
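
One direction that might be worth checking, stated here only as a sketch under assumptions: if the symbol was inserted from Word's Symbol font, the binary 'doc' format may store a '(' placeholder in the text and keep the real character code separately, often in the private-use range U+F000 to U+F0FF. HWPF's CharacterRun appears to expose isSymbol(), getSymbolCharacter() and getSymbolFont() for such runs, but that should be verified against the POI version in use. The helper below is hypothetical and not part of Apache POI; it only shows how a Symbol-font code could be mapped to a standard Unicode character once it has been read, and it covers just a handful of codes from the Symbol encoding.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Apache POI): maps a few Symbol-font
// character codes to their standard Unicode equivalents.
class SymbolFontMapper {

    // Partial table for the Symbol font encoding; extend as needed.
    private static final Map<Integer, Character> SYMBOL_TO_UNICODE = new HashMap<>();
    static {
        SYMBOL_TO_UNICODE.put(0xA3, '\u2264'); // less than or equal to
        SYMBOL_TO_UNICODE.put(0xB1, '\u00B1'); // plus-minus
        SYMBOL_TO_UNICODE.put(0xB3, '\u2265'); // greater than or equal to
        SYMBOL_TO_UNICODE.put(0xB4, '\u00D7'); // multiplication sign
        SYMBOL_TO_UNICODE.put(0xB8, '\u00F7'); // division sign
        SYMBOL_TO_UNICODE.put(0xB9, '\u2260'); // not equal to
    }

    // Assumption: the input is a Symbol-font code, either as a raw byte
    // value (0x00..0xFF) or shifted into the private-use range
    // U+F000..U+F0FF. Unmapped characters are returned unchanged.
    static char toUnicode(char c) {
        int code = c;
        if (code >= 0xF000 && code <= 0xF0FF) {
            code -= 0xF000;
        }
        Character mapped = SYMBOL_TO_UNICODE.get(code);
        return mapped != null ? mapped : c;
    }

    public static void main(String[] args) {
        // 0xF0B3 is the private-use form of Symbol code 0xB3
        char mapped = toUnicode((char) 0xF0B3);
        System.out.println(Integer.toHexString(mapped)); // prints "2265"
    }
}
```

If the runs in the sample document do report themselves as symbols, the character returned for each such run could be passed through toUnicode before the text is written to HTML. Whether this applies depends on how the 'greater than or equal to' character was entered in the document.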


Thanks

T.




