lucene-java-user mailing list archives

From "Bowesman Antony" <...@teamware.com>
Subject OT: Parsing Russian text from RTF
Date Fri, 16 May 2008 02:52:00 GMT
Not directly Lucene related, but I'm out of ideas and I'm not a Russian speaker...

I'm extracting text from RTF to pump into Lucene.  I'm using the original 
RTFEditorKit() code shown in LIA, p252 (actually, it's Nutch's RTFParser).

I have an RTF document, which starts with

---
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\froman\fprq2\fcharset204{\*\fname 
Times New Roman;}Times New Roman CYR;}{\f1\fswiss\fprq2\fcharset0 Arial;}}
{\colortbl ;\red0\green0\blue128;\red0\green0\blue0;}
\viewkind4\uc1\pard\tx360\cf1\f0\fs20\'c1\'ee\'eb\'fc\'f8\'e8\'ed\'f1\'f2\'e2\'ee
---

which should be 'Большинство', but the RTFReader translationTable always maps 
the RTF bytes to chars using Latin-1, and the correct translationTable is never 
set.  The "fcharset204" is Russian, apparently CP1251, but there's a lovely line 
in the RTFReader class:

/* TODO: per-font font encodings ( \fcharset control word ) ? */
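
As a sanity check on the CP1251 guess, here is a minimal sketch that pulls the 
\'xx escapes out of the fragment above and decodes the bytes with a chosen 
charset.  It is independent of RTFEditorKit and is not a real RTF parser; the 
class and method names (Cp1251Check, decodeHexEscapes) are made up for 
illustration, and it assumes the JRE ships the "windows-1251" charset.  Decoding 
as windows-1251 should give 'Большинство', while ISO-8859-1 gives the garbage 
the parser currently produces.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: not part of RTFEditorKit or Nutch.
class Cp1251Check {
    // Matches an RTF hex escape such as \'c1 and captures the two hex digits.
    private static final Pattern HEX_ESCAPE =
            Pattern.compile("\\\\'([0-9a-fA-F]{2})");

    // Collects the bytes of every \'xx escape in the fragment (ignoring all
    // other RTF content) and decodes them with the named charset.
    static String decodeHexEscapes(String rtfFragment, String charsetName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Matcher m = HEX_ESCAPE.matcher(rtfFragment);
        while (m.find()) {
            bytes.write(Integer.parseInt(m.group(1), 16));
        }
        return new String(bytes.toByteArray(), Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The escaped run from the document above.
        String fragment =
                "\\'c1\\'ee\\'eb\\'fc\\'f8\\'e8\\'ed\\'f1\\'f2\\'e2\\'ee";
        // Assumption: \fcharset204 corresponds to windows-1251.
        System.out.println(decodeHexEscapes(fragment, "windows-1251"));
        // What a Latin-1 translation table would produce instead.
        System.out.println(decodeHexEscapes(fragment, "ISO-8859-1"));
    }
}
```

If that holds, the bytes in the file look fine and the problem is purely which 
charset gets applied to them.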

Does anyone know if the RTF above is correct?  The only place the translation 
table is set during the parse is when the 'ansi' keyword is encountered.

Other than that, anyone have any ideas about getting the text out of the RTF 
properly?

Thanks
Antony



