lucene-java-user mailing list archives

From Grant Ingersoll
Subject Re: Problem using Lucene on Ubuntu
Date Mon, 18 Feb 2008 13:44:45 GMT
I'm not sure about WordExtractor; does it also take a Reader?  If so, I would try:

Reader input = new InputStreamReader(new FileInputStream(file), ENCODING);
WordExtractor extractor = new WordExtractor(input);
content = extractor.getText();

Note: ENCODING is the name of the character encoding the file is in,
for example "UTF-8".  If you don't know the encoding, you will need to
add some kind of character encoding detection tool.  Search the web for
one; I know there are some out there, but I don't know any names
offhand.
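
As a rough illustration of the encoding point above, here is a minimal sketch for plain-text files. It checks for a leading byte-order mark (BOM) and otherwise falls back to an encoding you supply. The names EncodingSketch, charsetFromBom and readAll are made up for this example. BOM sniffing is only a crude stand-in for a real detection tool, since many files carry no BOM at all, and it does not apply to binary formats such as Word documents.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingSketch {

    // Guess the charset from a leading byte-order mark; return the fallback if none is found.
    // Limitation: a UTF-32LE BOM (FF FE 00 00) would be mistaken for UTF-16LE here.
    static Charset charsetFromBom(byte[] head, Charset fallback) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return fallback;
    }

    // Decode the whole stream with the given charset, dropping a leading BOM character if present.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        if (sb.length() > 0 && sb.charAt(0) == '\uFEFF') {
            sb.deleteCharAt(0);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample bytes without a BOM, so the fallback (assumed UTF-8 here) is used.
        byte[] bytes = "Gr\u00fc\u00dfe".getBytes(StandardCharsets.UTF_8);
        Charset cs = charsetFromBom(bytes, StandardCharsets.UTF_8);
        System.out.println(cs + ": " + readAll(new ByteArrayInputStream(bytes), cs));
    }
}
```

In practice you would read the first few bytes of each file, pass them to charsetFromBom, and then reopen the file with the resulting charset. If the guess is wrong, the decoded text will contain garbage characters, which is why a dedicated detection tool is the better choice for mixed collections.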

Bottom line, it sounds like you need to figure out how to load your  
files based on their encoding.  That problem is not really core to  
Lucene, but you should be able to search the archives here to find  
others with similar questions.


On Feb 18, 2008, at 8:13 AM, kratoras wrote:

> No problem about the misunderstanding.
> I am using
> InputStream input = new URL("file:///" + file.getAbsolutePath()).openStream();
> WordExtractor extractor = new WordExtractor(input);
> content = extractor.getText();
> where WordExtractor is org.apache.poi.hwpf.extractor.WordExtractor.
> The WordExtractor takes an InputStream as an argument. Should I
> determine the encoding of the InputStream, and if so, how?

Grant Ingersoll
