poi-user mailing list archives

From Dominik Stadler <dominik.stad...@gmx.at>
Subject Re: Apache POI doc/docx parser
Date Mon, 07 Oct 2019 16:24:04 GMT
Hi,

It seems the document did not make it through the list for some reason.
Can you report an issue at https://bz.apache.org/bugzilla/ and attach it
there? That way we also have a better trail of work on the problem.

Dominik.

On Mon, Oct 7, 2019 at 6:33 AM Teresa Kim
<teresa.kim@linguamatics.com.invalid> wrote:

> Hi Dominik
>
>
> Sure, I attached the symbol_test.doc document to my previous email.
>
> Perhaps I cannot attach documents to email?
>
> Is there any way I can share the document?
>
>
> Thanks
>
> T.
>
> On 06/10/2019 16:29, Dominik Stadler wrote:
> > Hi,
> >
> > can you share an example document which shows the behavior?
> >
> > Thanks... Dominik.
> >
> >
> > On Sun, Oct 6, 2019 at 6:48 AM Teresa Kim
> > <teresa.kim@linguamatics.com.invalid> wrote:
> >
> >> Hi
> >>
> >>
> >> I have documents (either 'doc' or 'docx') that contain the special character
> >> for 'greater than or equal to'. When I convert them with 'WordToHtmlConverter',
> >> I see that those characters are converted into '('.
> >>
> >> I tried with the latest Apache POI release, 4.1.0.
> >>
> >>
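> >> My Java code is:
> >>
> >>
> >> import java.io.ByteArrayOutputStream;
> >> import java.io.FileInputStream;
> >> import java.nio.charset.StandardCharsets;
> >> import javax.xml.parsers.DocumentBuilderFactory;
> >> import javax.xml.transform.OutputKeys;
> >> import javax.xml.transform.Transformer;
> >> import javax.xml.transform.TransformerFactory;
> >> import javax.xml.transform.dom.DOMSource;
> >> import javax.xml.transform.stream.StreamResult;
> >> import org.apache.poi.hwpf.HWPFDocumentCore;
> >> import org.apache.poi.hwpf.converter.WordToHtmlConverter;
> >> import org.apache.poi.hwpf.converter.WordToHtmlUtils;
> >> import org.w3c.dom.Document;
> >>
> >> public class TestWordtoHtmlConverter {
> >>
> >>     public static void main(String[] args) {
> >>         // Load the Word file named by the first argument
> >>         try (FileInputStream in = new FileInputStream(args[0])) {
> >>             HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(in);
> >>
> >>             WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
> >>                     DocumentBuilderFactory.newInstance().newDocumentBuilder()
> >>                             .newDocument());
> >>
> >>             // Convert the Word document into an HTML DOM
> >>             wordToHtmlConverter.processDocument(wordDocument);
> >>             Document htmlDocument = wordToHtmlConverter.getDocument();
> >>             ByteArrayOutputStream out = new ByteArrayOutputStream();
> >>             DOMSource domSource = new DOMSource(htmlDocument);
> >>             StreamResult streamResult = new StreamResult(out);
> >>
> >>             // Serialize the DOM as indented UTF-8 HTML
> >>             TransformerFactory tf = TransformerFactory.newInstance();
> >>             Transformer serializer = tf.newTransformer();
> >>             serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
> >>             serializer.setOutputProperty(OutputKeys.INDENT, "yes");
> >>             serializer.setOutputProperty(OutputKeys.METHOD, "html");
> >>             serializer.transform(domSource, streamResult);
> >>             out.close();
> >>
> >>             // Decode with the same charset that was used for serialization
> >>             String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
> >>             System.out.println(result);
> >>         } catch (Exception e) {
> >>             // Report failures instead of silently swallowing them
> >>             e.printStackTrace();
> >>         }
> >>     }
> >> }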
> >>
> >> Is there any way I can correctly identify these symbols?
> >>
> >>
> >> In the sample document, I am interested in getting 'bad one'.
> >>
> >>
> >> Thanks
> >>
> >> T.
> >>
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> >> For additional commands, e-mail: user-help@poi.apache.org

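
When reporting a character-conversion problem like the one in this thread, it can help to state exactly which code points end up in the converted output, for example whether the text contains a literal '(' (U+0028), the expected '≥' (U+2265), or a private-use code point of the kind symbol fonts sometimes use. The following is only a minimal sketch for that diagnosis. `CodePointDump`, `describe`, and `hasPrivateUse` are hypothetical names that are not part of Apache POI, and the sketch assumes the converted text is already available as a Java `String`.

```java
import java.util.StringJoiner;

// Hypothetical diagnostic helper (not part of Apache POI):
// lists the Unicode code points contained in a string.
final class CodePointDump {

    // Returns the code points of text as space-separated "U+XXXX" tokens.
    static String describe(String text) {
        StringJoiner joiner = new StringJoiner(" ");
        text.codePoints().forEach(cp -> joiner.add(String.format("U+%04X", cp)));
        return joiner.toString();
    }

    // True if text contains a code point in the BMP Private Use Area (U+E000..U+F8FF).
    static boolean hasPrivateUse(String text) {
        return text.codePoints().anyMatch(cp -> cp >= 0xE000 && cp <= 0xF8FF);
    }

    public static void main(String[] args) {
        // Use the first argument if given, otherwise a small sample containing U+2265.
        String sample = args.length > 0 ? args[0] : "x \u2265 5";
        System.out.println(describe(sample));
        System.out.println("private use: " + hasPrivateUse(sample));
    }
}
```

Running the same dump on a string taken from the converted HTML shows which of these cases applies, which is useful detail to include in a Bugzilla report alongside the sample document.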