lucene-dev mailing list archives

From svirid <templates2...@gmail.com>
Subject Lucene Indexer Encoding problem
Date Mon, 13 Oct 2008 17:49:08 GMT

Good day guys,

I hope you can help me. I am trying to index French and Russian documents with
Lucene and am having no luck. I am new to Java, so I really need your
help.

I was able to extract the text from the PDFs. When I save it to a file it is
all fine: I can clearly see the Russian characters in the txt file. But when I
add the text to the index, it is all ??? or other garbage.

Here is what I do: 

First I use PDFBox to extract the text.

[CODE]
String textFile = "c:/java/faq.txt";
String pdfFile = "c:/java/faq.pdf";

// First, extract the text from the PDF
PDDocument document = PDDocument.load(pdfFile);

Writer output = new OutputStreamWriter(new FileOutputStream(textFile), "UTF-8");
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(20);

// This saves the text into the txt file; the txt file is completely fine
stripper.writeText(document, output);
output.close();

// But when I get the text like this to add it to the index...
String textData = stripper.getText(document);
document.close();

Analyzer analyzer = new StandardAnalyzer();
Directory directory = FSDirectory.getDirectory("c:/java/collection");
IndexWriter iwriter = new IndexWriter(directory, analyzer, new IndexWriter.MaxFieldLength(250));
Document doc = new Document();

// Store the extracted text as a single, non-analyzed field
doc.add(new Field("fieldname", textData, Field.Store.YES, Field.Index.NOT_ANALYZED));
iwriter.addDocument(doc);
iwriter.optimize();
iwriter.close();
[/CODE]

The code above correctly saves the extracted text to the txt file, which I
don't really need. What I want is to get the text and add it to the index right
away. When I open the index files in Notepad, I see garbage instead of Russian
characters.
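
To separate a viewer problem from a real encoding problem, a check like the following might help. It is only a sketch using the standard library, assuming Java 8 or later; the names EncodingCheck, containsCyrillic and roundTrip are made up for illustration and are not part of Lucene or PDFBox. It tests whether a Java String still holds Cyrillic characters, and shows how encoding through a charset that cannot represent them turns them into question marks.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingCheck {

    // True if the string holds at least one character from the Cyrillic Unicode block.
    static boolean containsCyrillic(String s) {
        return s.codePoints()
                .anyMatch(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.CYRILLIC);
    }

    // Encode the string to bytes and decode it again with the same charset.
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // A short Russian word, written with escapes so the source file encoding does not matter.
        String sample = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // UTF-8 can represent Cyrillic, so the text survives the round trip.
        System.out.println(containsCyrillic(roundTrip(sample, StandardCharsets.UTF_8)));

        // ISO-8859-1 cannot, so every letter is replaced and no Cyrillic is left.
        System.out.println(containsCyrillic(roundTrip(sample, StandardCharsets.ISO_8859_1)));
    }
}
```

Calling containsCyrillic on the string returned by stripper.getText(document), and again on the field value read back from the index, would show at which step (if any) the characters are lost, independently of what Notepad displays.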

Please help. Thank you
-- 
View this message in context: http://www.nabble.com/Lucene-Indexer-Encoding-problem-tp19959504p19959504.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

