lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kelvin Tan <kelvin-li...@relevanz.com>
Subject RE: Using lucene with HSSF from Apache
Date Tue, 06 May 2003 00:49:41 GMT
Perhaps others can be more helpful here. I haven't had difficulty using a 
million terms, but perhaps its just my usage.

On Mon, 5 May 2003 07:38:05 -0700 (PDT), Shoba Ramachandran said:
>Thanks very much HTH,
>What would be the good number for max.no. of terms for
>a very huge document say 20MB PDF file.
>
>Thanks
>Shoba
>
>--- kelvin-lists@relevanz.com wrote:
>>It's doable, because if you open any ms office
>>document in your text
>>editor, you'll see that text is all there,
>>surrounded with binary
>>characters and other kinds of mumbo-jumbo. The
>>biggest minus is that
>>you'll exceed the max number of terms in a hurry,
>>even if you set it
>>to like a million, once you hit a reasonably large
>>file (>5MB).
>>
>>What I do is filter out all the unreadable stuff
>>using some regex
>>filters. Only minus is that its somewhat slower coz
>>of the line by
>>line processing, but I'm sure its _much_ faster than
>>attempting to
>>add all those nonsense data.
>>
>>HTH
>>
>>On Fri, 2 May 2003 08:30:15 -0700 (PDT), Shoba
>>Ramachandran wrote:
>>>Hi Michel,
>>>
>>>Are you able to index and search xls and doc files
>>>with just Lucene using SimpleAnalyzer????
>>>There is no need for POI?
>>>With Lucene, you are able to extract the xls
>>content
>>>as text?
>>>
>>>Let me try as you explained.
>>>Thanks very much for your reply.
>>>Shoba
>>>
>>>--- MMachado@LEVI.com wrote:
>>>>Hi,
>>>>I did it, but I use only lucene. You need to
>>create
>>>>an IndexWriter with
>>>>SimpleAnalyzer, an InputStream as new
>>>>FileInputStream, create Document with
>>>>two Fields: one contains the file path and one
>>>>contains the file's content).
>>>>That's all.
>>>>Michel
>>>>
>>>>-----Original Message-----
>>>>From: Shoba Ramachandran
>>>>[mailto:shoba_duruvan@yahoo.com]
>>>>Sent: Wednesday, April 30, 2003 6:10 PM
>>>>To: lucene-user@jakarta.apache.org
>>>>Subject: Using lucene with HSSF from Apache
>>>>
>>>>Hi,
>>>>
>>>>Has anyone tried to index xls and doc files?
>>>>I'm trying to do with HSSF from apache and using
>>>>lucene1.2
>>>>
>>>>This code returns me binary and printing it out
>>>>gives
>>>>junk chracters. File indexed like this returns
>>>>nothing
>>>>upon search.
>>>>
>>>>public static byte[] parse(File file) throws
>>>>Exception
>>>>{
>>>>POIFSFileSystem fs = new POIFSFileSystem(new
>>>>FileInputStream(file));
>>>>HSSFWorkbook wb = new HSSFWorkbook(fs);
>>>>byte[] xlsInfo = wb.getBytes();
>>>>System.out.println("xls content :  "+
>>>>xlsInfo.toString());
>>>>return xlsInfo;
>>>>}
>>>>
>>>>Thanks in advance for your help
>>>>Shoba
>>>>
>>>>
>>>>__________________________________
>>>>Do you Yahoo!?
>>>>The New Yahoo! Search - Faster. Easier. Bingo.
>>>>http://search.yahoo.com
>>>>
>>>>
>>
>>--------------------------------------------------------------------
>>-
>>
>>>>To unsubscribe, e-mail:
>>>>lucene-user-unsubscribe@jakarta.apache.org
>>>>For additional commands, e-mail:
>>>>lucene-user-help@jakarta.apache.org
>>>>
>>>>
>>
>>--------------------------------------------------------------------
>>-
>>
>>>>To unsubscribe, e-mail:
>>>>lucene-user-unsubscribe@jakarta.apache.org
>>>>For additional commands, e-mail:
>>>>lucene-user-help@jakarta.apache.org
>>>>
>>>
>>>
>>>__________________________________
>>>Do you Yahoo!?
>>>The New Yahoo! Search - Faster. Easier. Bingo.
>>>http://search.yahoo.com
>>>
>>
>>--------------------------------------------------------------------
>>-
>>
>>>To unsubscribe, e-mail:
>>lucene-user-unsubscribe@jakarta.apache.org
>>>For additional commands, e-mail:
>>lucene-user-help@jakarta.apache.org
>>>
>>
>>
>>
>>
>>
>---------------------------------------------------------------------
>>To unsubscribe, e-mail:
>>lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail:
>>lucene-user-help@jakarta.apache.org
>>
>
>
>__________________________________
>Do you Yahoo!?
>The New Yahoo! Search - Faster. Easier. Bingo.
>http://search.yahoo.com
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message