lucene-java-user mailing list archives

From <kelvin-li...@relevanz.com>
Subject RE: Using lucene with HSSF from Apache
Date Sun, 04 May 2003 05:38:34 GMT
It's doable: if you open any MS Office document in a text editor,
you'll see that the text is all there, surrounded by binary
characters and other mumbo-jumbo. The biggest drawback is that
you'll exceed the maximum number of terms in a hurry, even if you set
it to something like a million, once you hit a reasonably large file
(>5 MB).

What I do is filter out all the unreadable stuff using some regex
filters. The only drawback is that it's somewhat slower because of the
line-by-line processing, but I'm sure it's _much_ faster than
attempting to add all that nonsense data.
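
One way such a regex filter could look, as a minimal sketch only. The
class name ReadableTextFilter, the pattern, and the four-character
minimum run length are illustrative assumptions, not the actual
filters described above. The file is decoded as ISO-8859-1 so that
every byte maps to a character and arbitrary binary input cannot cause
a decoding error.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: keeps runs of readable text and drops binary noise.
final class ReadableTextFilter {

    // Assumed definition of "readable": a letter or digit followed by at
    // least three more letters, digits, spaces, or common punctuation marks.
    private static final Pattern READABLE =
            Pattern.compile("[A-Za-z0-9][A-Za-z0-9 .,'-]{3,}");

    // Returns the readable runs found in raw, joined by single spaces.
    static String extractReadable(String raw) {
        StringBuilder out = new StringBuilder();
        Matcher m = READABLE.matcher(raw);
        while (m.find()) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(m.group().trim());
        }
        return out.toString();
    }

    // Processes the file line by line and returns only the readable text,
    // one output line per input line that contained any.
    static String filterFile(Path file) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in =
                Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = in.readLine()) != null) {
                String kept = extractReadable(line);
                if (!kept.isEmpty()) {
                    if (text.length() > 0) {
                        text.append('\n');
                    }
                    text.append(kept);
                }
            }
        }
        return text.toString();
    }
}
```

The filtered string, rather than the raw file stream, would then be
handed to the indexer as the content field. Note that text stored as
two-byte characters inside the file has a zero byte between letters
and would be missed by this particular pattern.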

HTH

On Fri, 2 May 2003 08:30:15 -0700 (PDT), Shoba Ramachandran wrote:
>Hi Michel,
>
>Are you able to index and search xls and doc files
>with just Lucene using SimpleAnalyzer????
>There is no need for POI?
>With Lucene, you are able to extract the xls content
>as text?
>
>Let me try as you explained.
>Thanks very much for your reply.
>Shoba
>
>--- MMachado@LEVI.com wrote:
>>Hi,
>>I did it, but I use only Lucene. You need to create
>>an IndexWriter with SimpleAnalyzer, open an
>>InputStream as a new FileInputStream, and create a
>>Document with two Fields: one contains the file
>>path and one contains the file's content.
>>That's all.
>>Michel
>>
>>-----Original Message-----
>>From: Shoba Ramachandran
>>[mailto:shoba_duruvan@yahoo.com]
>>Sent: Wednesday, April 30, 2003 6:10 PM
>>To: lucene-user@jakarta.apache.org
>>Subject: Using lucene with HSSF from Apache
>>
>>Hi,
>>
>>Has anyone tried to index xls and doc files?
>>I'm trying to do it with HSSF from Apache, using
>>Lucene 1.2.
>>
>>This code returns binary data, and printing it out
>>gives junk characters. A file indexed like this
>>returns nothing upon search.
>>
>>public static byte[] parse(File file) throws Exception
>>{
>>    POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream(file));
>>    HSSFWorkbook wb = new HSSFWorkbook(fs);
>>    byte[] xlsInfo = wb.getBytes();
>>    System.out.println("xls content :  " + xlsInfo.toString());
>>    return xlsInfo;
>>}
>>
>>Thanks in advance for your help
>>Shoba
>>
>>
>>__________________________________
>>Do you Yahoo!?
>>The New Yahoo! Search - Faster. Easier. Bingo.
>>http://search.yahoo.com
>>
>>
>---------------------------------------------------------------------

>>To unsubscribe, e-mail:
>>lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail:
>>lucene-user-help@jakarta.apache.org
>
>