lucene-general mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: indexing html and pdf files of Russian language
Date Fri, 12 Dec 2008 13:33:37 GMT
Hi ppuyen,

Sounds like an encoding issue.  How are you reading in/parsing the HTML and PDF files?
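
To illustrate the kind of encoding issue meant here, below is a minimal sketch using only the JDK. The class name `EncodingSketch` and the helper `readAll` are hypothetical names made up for this example, and the sample bytes are assumed to be UTF-8; your files may well use a different encoding such as windows-1251 or KOI8-R. It shows the same bytes decoded once with the matching charset and once with a wrong one, which is roughly what "unreadable characters" in a debugger tends to look like. It covers plain byte-to-text decoding only; PDF files need a separate text extraction step that is not shown.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {

    // Hypothetical helper: decode a byte stream with an explicitly named charset
    // instead of relying on the platform default.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // The Russian word "Привет", written with escapes so the source file's
        // own encoding does not matter.
        String original = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // Assumption for this sketch: the file on disk is UTF-8.
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset recovers the text.
        String right = readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);

        // Decoding with a mismatched charset yields garbage characters.
        String wrong = readAll(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // prints "true"
        System.out.println(wrong.equals(original)); // prints "false"
    }
}
```

If the text handed to the analyzer is already garbled at this stage, no analyzer setting will repair it, so the charset used when reading the file is the first thing to check.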

Steve

On 12/11/2008 at 8:28 PM, ppuyen wrote:
> 
> Hi, everybody.
> 
> I have a problem. So far I have managed to:
> 1. Index Russian-language text files successfully.
> 2. Index PDF, HTML, and PPT files with StandardAnalyzer successfully.
> Now I need to index Russian-language files in HTML, PDF, and similar
> formats, but indexing only works for text files, not for HTML or PDF.
> (When I debug the indexing of a Russian-language HTML file, it shows
> unreadable characters.)
> 
> Can anyone tell me why, and how I can index Russian-language HTML and
> PDF files?
> 
> Thanks a lot.
> -- View this message in context:
> http://www.nabble.com/indexing-html-and-pdf-files-of-Russian-language-tp20968404p20968404.html
> Sent from the Lucene - General mailing list archive at Nabble.com.
