From: teressa kim
To: POI Users List
Subject: RE: Missing greek character of mu from doc extraction
Date: Thu, 25 Jun 2015 08:42:04 +0000

Hi Dominik

Here is my Java code, and I have attached a Word document for you to look at.
The document contains three Greek mu symbols. The one in the first line, next to "5", is missing from the HTML output.
The other two symbols are converted correctly.

Thank you
Teresa.


import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.WordToHtmlUtils;
import org.w3c.dom.Document;

public class TestWordtoHtmlConverter {

    public static void main(String[] args) {
        try {
            // Load the .doc file passed as the first command-line argument
            HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(new FileInputStream(args[0]));

            // Convert the Word document into an in-memory HTML DOM
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                    DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
            wordToHtmlConverter.processDocument(wordDocument);
            Document htmlDocument = wordToHtmlConverter.getDocument();

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DOMSource domSource = new DOMSource(htmlDocument);
            StreamResult streamResult = new StreamResult(out);

            // Serialize the DOM as indented UTF-8 HTML
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer serializer = tf.newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            serializer.setOutputProperty(OutputKeys.INDENT, "yes");
            serializer.setOutputProperty(OutputKeys.METHOD, "html");
            serializer.transform(domSource, streamResult);
            out.close();

            // Decode with the same charset the serializer wrote
            String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
            System.out.println(result);
        } catch (Exception e) {
            // Report failures instead of silently swallowing them
            e.printStackTrace();
        }
    }
}


----------------------------------------
> Date: Sat, 20 Jun 2015 11:47:13 +0200
> Subject: Re: Missing greek character of mu from doc extraction
> From: dominik.stadler@gmx.at
> To: user@poi.apache.org
>
> Hi,
>
> Can you provide a sample document and the java code that you are using
> so it is easier to try to reproduce this?
>
> Thanks... Dominik.
>
> On Thu, Jun 4, 2015 at 10:19 AM, teressa kim wrote:
>> Hi
>>
>> I have observed that the third Greek mu character "μ" in a Word doc file is not extracted when converting to an HTML file using the WordToHtmlConverter class. The mu character is http://www.scarfboy.com/coding/unicode-tool?s=U%2BF06D
>>
>> Further, I also noticed that when I applied the following statement to the mu character, I got "0028", which I believe is the code for the "(" left bracket.
>>
>> String hexCode = Integer.toHexString(paragraph.text().codePointAt(index)).toUpperCase();
>>
>> Could you please help me extract this mu character from the doc document?
>>
>> Thanks
>> T.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
>> For additional commands, e-mail: user-help@poi.apache.org
>>
>
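
The code point mentioned above, U+F06D, lies in the Unicode private use area. The range U+F000 to U+F0FF is commonly used for characters from symbol fonts, and in the Symbol font the byte 0x6D (the position of "m") is drawn as a Greek mu. A minimal sketch of how such a code point could be mapped back to the standard Greek letter U+03BC follows. The class name SymbolPuaMapper and its small lookup table are hypothetical and hand-written for illustration; they are not part of POI. The sketch only helps when the extracted text really contains U+F06D. If the text holds only a placeholder (as the "0028" result may suggest), there is nothing for this mapping to work on.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative helper (not part of POI): maps Symbol-font private-use code points to Unicode. */
final class SymbolPuaMapper {

    // Partial, hand-written table (an assumption for this sketch):
    // Symbol-font byte value -> standard Unicode code point.
    private static final Map<Integer, Integer> SYMBOL_TO_UNICODE = new HashMap<>();
    static {
        SYMBOL_TO_UNICODE.put(0x61, 0x03B1); // a -> alpha
        SYMBOL_TO_UNICODE.put(0x62, 0x03B2); // b -> beta
        SYMBOL_TO_UNICODE.put(0x64, 0x03B4); // d -> delta
        SYMBOL_TO_UNICODE.put(0x6D, 0x03BC); // m -> mu
        SYMBOL_TO_UNICODE.put(0x70, 0x03C0); // p -> pi
    }

    /** Returns the mapped code point, or the input unchanged when no mapping is known. */
    static int toUnicode(int codePoint) {
        if (codePoint >= 0xF000 && codePoint <= 0xF0FF) {
            Integer mapped = SYMBOL_TO_UNICODE.get(codePoint - 0xF000);
            if (mapped != null) {
                return mapped;
            }
        }
        return codePoint;
    }

    /** Applies toUnicode to every code point of the given text. */
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> sb.appendCodePoint(toUnicode(cp)));
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "5 \uF06Dm";          // "5", space, private-use mu, "m"
        String fixed = normalize(raw);
        // Prints 3BC, the code point of the Greek small letter mu
        System.out.println(Integer.toHexString(fixed.codePointAt(2)).toUpperCase());
    }
}
```

Code points without an entry in the table are passed through unchanged, so the helper can be applied to a whole paragraph of extracted text without affecting ordinary characters.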