From: "Doppelhofer Andreas"
Subject: Character encoding for each character in Word Document
Date: Mon, 18 Jan 2010 12:29:00 +0100

Hi all,

I have a question about getting the character encoding for each character (ASCII, Unicode, ISO-8859-5, ...) in a Word document.

The following code snippet extracts the text and encodes it into a byte buffer using a "hard-coded" charset. Is there a way to determine the correct character encoding dynamically? Say, the first character "a" is ISO-8859-1, the second is a Russian character (e.g. ISO-8859-5), and so on.

    import java.io.FileInputStream;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.usermodel.Paragraph;
    import org.apache.poi.hwpf.usermodel.Range;
    import org.apache.poi.poifs.filesystem.POIFSFileSystem;

    POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("test.doc"));
    HWPFDocument mydoc = new HWPFDocument(fs);
    Range myrange = mydoc.getRange();
    for (int i = 0; i < myrange.numParagraphs(); i++) {
        Paragraph myparagraph = myrange.getParagraph(i);
        String mytext = myparagraph.text();
        Charset charset = Charset.forName("ISO-8859-5"); // "hard coded" :-(
        CharsetEncoder encoder = charset.newEncoder();
        // encode() throws CharacterCodingException on unmappable characters
        ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(mytext));
        // do something with bbuf
    }

Thx
dops

--
Salomon Automation GmbH - Friesachstrasse 15 - A-8114 Friesach bei Graz
Sitz der Gesellschaft: Friesach bei Graz
UID-NR: ATU28654300 - Firmenbuchnummer: 49324 K
Firmenbuchgericht: Landesgericht fur Zivilrechtssachen Graz
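
A note on the question, offered as a minimal sketch rather than a definitive answer: the snippet stores the paragraph text in a Java String, and a Java String holds UTF-16 code units, so its characters are not individually tagged with a legacy charset. One thing that can be done per character is to probe which candidate charsets are able to represent it, using CharsetEncoder.canEncode, and to look up its Unicode script. The class name CharsetProbe, the CANDIDATES list and the method encodableIn are hypothetical names chosen for illustration; they are not part of POI.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.ArrayList;
import java.util.List;

class CharsetProbe {
    // Candidate charsets to probe. This list is an assumption for illustration;
    // only US-ASCII and ISO-8859-1 are guaranteed on every Java platform.
    static final String[] CANDIDATES = {"US-ASCII", "ISO-8859-1", "ISO-8859-5"};

    /** Returns the names of the supported candidate charsets that can represent c. */
    static List<String> encodableIn(char c) {
        List<String> result = new ArrayList<>();
        for (String name : CANDIDATES) {
            if (!Charset.isSupported(name)) {
                continue; // skip charsets this JVM does not provide
            }
            CharsetEncoder encoder = Charset.forName(name).newEncoder();
            if (encoder.canEncode(c)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for a paragraph text: Latin 'a' followed by Cyrillic 'zhe'.
        String text = "a\u0436";
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Print the Unicode script and the candidate charsets that fit.
            System.out.println(Character.UnicodeScript.of(c) + " " + encodableIn(c));
        }
    }
}
```

Several charsets can usually represent the same character (plain ASCII letters fit in all three candidates here), so this narrows the choice but does not identify a single "correct" encoding.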