poi-user mailing list archives

From "Doppelhofer Andreas" <Andreas.Doppelho...@salomon.at>
Subject Character encoding for each character in Word Document
Date Mon, 18 Jan 2010 11:29:00 GMT
Hi all,
I have a question about getting the character encoding of each character
(ASCII, Unicode, ISO-8859-5, ...) in a Word document.
The following code snippet extracts the text and converts it into a byte
buffer using a hard-coded charset.
Is there a way to get the correct character encoding dynamically?
Say, the first character "a" is ISO-8859-1 and the second is a Russian
character (like ISO-8859-5), and so on.
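import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// Open the .doc file and get the range covering the whole document
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("test.doc"));
HWPFDocument mydoc = new HWPFDocument(fs);
Range myrange = mydoc.getRange();

for (int i = 0; i < myrange.numParagraphs(); i++) {
  Paragraph myparagraph = myrange.getParagraph(i);
  String mytext = myparagraph.text();

  Charset charset = Charset.forName("ISO-8859-5");  // "hard coded" :-(
  // Turning chars into bytes needs an encoder (not a decoder)
  CharsetEncoder encoder = charset.newEncoder();

  // Throws CharacterCodingException if a character is not in the charset
  ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(mytext));

  // do something with bbuf
}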
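
One possible way to approach the question, offered as a sketch and not as
a POI feature: the String returned by text() is a Java String, which is
always UTF-16, so a character in it does not carry an encoding of its own.
What can be checked per character is which charsets are able to encode it.
The class name CharsetProbe, the method firstCharsetFor, the candidate list
and the "UTF-8" fallback are all assumptions made for illustration; only
Charset, CharsetEncoder.canEncode and Character.UnicodeBlock are standard
Java APIs.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class CharsetProbe {
    // Hypothetical candidate list, tried in order; adjust as needed.
    static final String[] CANDIDATES = {"US-ASCII", "ISO-8859-1", "ISO-8859-5"};

    // Returns the name of the first candidate charset that can encode c,
    // or "UTF-8" as an assumed fallback when none of them can.
    // Works on single chars only; supplementary characters (surrogate
    // pairs) would need to be handled as code points instead.
    static String firstCharsetFor(char c) {
        for (String name : CANDIDATES) {
            if (!Charset.isSupported(name)) {
                continue; // skip charsets this JVM does not provide
            }
            CharsetEncoder encoder = Charset.forName(name).newEncoder();
            if (encoder.canEncode(c)) {
                return name;
            }
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        String text = "a\u0416"; // Latin 'a' followed by Cyrillic 'Zhe'
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // UnicodeBlock reports the script range, independent of any charset
            System.out.println("U+" + Integer.toHexString(c)
                + " -> " + firstCharsetFor(c)
                + " (" + Character.UnicodeBlock.of(c) + ")");
        }
    }
}
```

In the paragraph loop, firstCharsetFor(mytext.charAt(j)) could be called for
each index j. Note that many characters are encodable in several charsets,
so the result depends on the order of the candidate list.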

Thx dops


