poi-user mailing list archives

From teressa kim <teressaki...@hotmail.com>
Subject RE: Missing greek character of mu from doc extraction
Date Thu, 25 Jun 2015 08:42:04 GMT
Hi Dominik

Here is my Java code, and I have attached a Word document for you to look at.
It contains three Greek mu symbols. The one in the first line, next to the "5", is
missing from the HTML output. The other two are converted correctly.

Thank you
Teresa.


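import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.WordToHtmlUtils;
import org.w3c.dom.Document;

public class TestWordtoHtmlConverter {

    public static void main(String[] args) {
        // args[0] is the path to the .doc file to convert
        try (InputStream in = new FileInputStream(args[0])) {
            HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(in);

            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                    DocumentBuilderFactory.newInstance().newDocumentBuilder()
                            .newDocument());

            wordToHtmlConverter.processDocument(wordDocument);
            Document htmlDocument = wordToHtmlConverter.getDocument();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DOMSource domSource = new DOMSource(htmlDocument);
            StreamResult streamResult = new StreamResult(out);

            // Serialize the DOM as indented UTF-8 HTML
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer serializer = tf.newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            serializer.setOutputProperty(OutputKeys.INDENT, "yes");
            serializer.setOutputProperty(OutputKeys.METHOD, "html");
            serializer.transform(domSource, streamResult);

            // Decode with the same charset the serializer wrote, not the platform default
            String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
            System.out.println(result);
        } catch (Exception e) {
            // Report failures instead of silently swallowing them
            e.printStackTrace();
        }
    }
}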
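
For reference, here is a minimal sketch of one possible post-processing step. It assumes the missing mu ends up in the extracted text as the private-use code point U+F06D, as the link in the original report suggests, and that this code point is the Symbol-font byte 0x6D offset by 0xF000. The class name SymbolFontMapper and its partial lookup table are hypothetical and not part of Apache POI. Whether the converter ever exposes U+F06D for this document is not confirmed by the thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Apache POI): maps Symbol-font private-use characters to Unicode. */
final class SymbolFontMapper {

    // Partial table, for illustration only: Symbol-font byte value -> Unicode code point.
    private static final Map<Integer, Integer> SYMBOL_TO_UNICODE = new HashMap<>();
    static {
        SYMBOL_TO_UNICODE.put(0x61, 0x03B1); // alpha
        SYMBOL_TO_UNICODE.put(0x62, 0x03B2); // beta
        SYMBOL_TO_UNICODE.put(0x6D, 0x03BC); // mu
    }

    /** Replaces code points in U+F000..U+F0FF that have a table entry; leaves everything else unchanged. */
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp >= 0xF000 && cp <= 0xF0FF) {
                Integer mapped = SYMBOL_TO_UNICODE.get(cp - 0xF000);
                if (mapped != null) {
                    cp = mapped;
                }
            }
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed input: "5", a space, the Symbol-font mu (U+F06D), then "m"
        String fixed = normalize("5 \uF06Dm");
        // Print the code point at index 2 in hex, as in the original report
        System.out.println(Integer.toHexString(fixed.codePointAt(2)).toUpperCase()); // prints 3BC
    }
}
```

Characters outside U+F000..U+F0FF, and private-use characters with no table entry, pass through untouched, so the step could be applied to the whole extracted text without affecting ordinary characters such as "(".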


----------------------------------------
> Date: Sat, 20 Jun 2015 11:47:13 +0200
> Subject: Re: Missing greek character of mu from doc extraction
> From: dominik.stadler@gmx.at
> To: user@poi.apache.org
>
> Hi,
>
> Can you provide a sample document and the java code that you are using
> so it is easier to try to reproduce this?
>
> Thanks... Dominik.
>
> On Thu, Jun 4, 2015 at 10:19 AM, teressa kim <teressakim70@hotmail.com> wrote:
>> Hi
>>
>> I have observed that the third Greek mu character "μ" in a Word .doc file is not
>> extracted when converting to an HTML file using the WordToHtmlConverter class. The mu
>> character is http://www.scarfboy.com/coding/unicode-tool?s=U%2BF06D
>>
>> Further, when I applied the following statement to the mu character, I got "0028",
>> which I believe is the code for "(" (left parenthesis).
>>
>> String hexCode = Integer.toHexString(paragraph.text().codePointAt(index)).toUpperCase();
>>
>> Could you please help me extract this mu character from the .doc document?
>>
>> Thanks
>> T.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
>> For additional commands, e-mail: user-help@poi.apache.org
>>