lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shoba Ramachandran <shoba_duru...@yahoo.com>
Subject RE: Using lucene with HSSF from Apache
Date Mon, 05 May 2003 14:38:05 GMT
Thanks very much HTH,
What would be the good number for max.no. of terms for
a very huge document say 20MB PDF file.

Thanks
Shoba

--- kelvin-lists@relevanz.com wrote:
> It's doable, because if you open any ms office
> document in your text
> editor, you'll see that text is all there,
> surrounded with binary
> characters and other kinds of mumbo-jumbo. The
> biggest minus is that
> you'll exceed the max number of terms in a hurry,
> even if you set it
> to like a million, once you hit a reasonably large
> file (>5MB).
> 
> What I do is filter out all the unreadable stuff
> using some regex
> filters. Only minus is that its somewhat slower coz
> of the line by
> line processing, but I'm sure its _much_ faster than
> attempting to
> add all those nonsense data.
> 
> HTH
> 
> On Fri, 2 May 2003 08:30:15 -0700 (PDT), Shoba
> Ramachandran wrote:
> >Hi Michel,
> >
> >Are you able to index and search xls and doc files
> >with just Lucene using SimpleAnalyzer????
> >There is no need for POI?
> >With Lucene, you are able to extract the xls
> content
> >as text?
> >
> >Let me try as you explained.
> >Thanks very much for your reply.
> >Shoba
> >
> >--- MMachado@LEVI.com wrote:
> >>Hi,
> >>I did it, but I use only lucene. You need to
> create
> >>an IndexWriter with
> >>SimpleAnalyzer, an InputStream as new
> >>FileInputStream, create Document with
> >>two Fields: one contains the file path and one
> >>contains the file's content).
> >>That's all.
> >>Michel
> >>
> >>-----Original Message-----
> >>From: Shoba Ramachandran
> >>[mailto:shoba_duruvan@yahoo.com]
> >>Sent: Wednesday, April 30, 2003 6:10 PM
> >>To: lucene-user@jakarta.apache.org
> >>Subject: Using lucene with HSSF from Apache
> >>
> >>Hi,
> >>
> >>Has anyone tried to index xls and doc files?
> >>I'm trying to do with HSSF from apache and using
> >>lucene1.2
> >>
> >>This code returns me binary and printing it out
> >>gives
> >>junk chracters. File indexed like this returns
> >>nothing
> >>upon search.
> >>
> >>public static byte[] parse(File file) throws
> >>Exception
> >>{
> >>POIFSFileSystem fs = new POIFSFileSystem(new
> >>FileInputStream(file));
> >>HSSFWorkbook wb = new HSSFWorkbook(fs);
> >>byte[] xlsInfo = wb.getBytes();
> >>System.out.println("xls content :  "+
> >>xlsInfo.toString());
> >>return xlsInfo;
> >>}
> >>
> >>Thanks in advance for your help
> >>Shoba
> >>
> >>
> >>__________________________________
> >>Do you Yahoo!?
> >>The New Yahoo! Search - Faster. Easier. Bingo.
> >>http://search.yahoo.com
> >>
> >>
>
>---------------------------------------------------------------------
> 
> >>To unsubscribe, e-mail:
> >>lucene-user-unsubscribe@jakarta.apache.org
> >>For additional commands, e-mail:
> >>lucene-user-help@jakarta.apache.org
> >>
> >>
>
>---------------------------------------------------------------------
> 
> >>To unsubscribe, e-mail:
> >>lucene-user-unsubscribe@jakarta.apache.org
> >>For additional commands, e-mail:
> >>lucene-user-help@jakarta.apache.org
> >>
> >
> >
> >__________________________________
> >Do you Yahoo!?
> >The New Yahoo! Search - Faster. Easier. Bingo.
> >http://search.yahoo.com
> >
>
>---------------------------------------------------------------------
> 
> >To unsubscribe, e-mail:
> lucene-user-unsubscribe@jakarta.apache.org
> >For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
> >
> 
> 
> 
> 
>
---------------------------------------------------------------------
> To unsubscribe, e-mail:
> lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
> 


__________________________________
Do you Yahoo!?
The New Yahoo! Search - Faster. Easier. Bingo.
http://search.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message