lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Honey George <honey_geo...@yahoo.com>
Subject Re: Existing Parsers
Date Mon, 13 Sep 2004 07:38:34 GMT
Hi Chris,
   I do not have a stats but I think the performance
is reasonable. I use xpdf for PDF & wvWare for DOC.
The size of my index is ~2GB (this is not limited to
only pdf & doc). For avoiding memory problems, I have
set an upperbound to the size of the documents that
can be indexed. For example in my case I do not index
documents if the size is more that 4MB. You could try
something like that.

Thanks & Regards,
   George

 --- Chris Fraschetti <fraschetti@gmail.com> wrote: 
> Some of the tools listed use cmd line execs to
> output a doc of some
> sort to text and then I grab the text and add it to
> a lucene doc, etc
> etc...
> 
> Any stats on the scalability of that? In large scale
> applications, I'm
> assuming this will cause some serious issues...
> anyone have any input
> on this?
> 
> -Chris Fraschetti
> 
> 
> On Thu, 09 Sep 2004 09:54:43 -0700, David Spencer
> <dave-lucene-user@tropo.com> wrote:
> > Honey George wrote:
> > 
> > > Hi,
> > >   I know some of them.
> > > 1. PDF
> > >  + http://www.pdfbox.org/
> > >  + http://www.foolabs.com/xpdf/download.html
> > >    - I am using this and found good. It even
> supports
> > 
> > My dated experience from 2 years ago was that (the
> evil, native code)
> > foolabs pdf parser was the best, but obviously
> things could have changed.
> > 
> >
>
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg02912.html
> > 
> > >      various languages.
> > > 2. word
> > >   + http://sourceforge.net/projects/wvware
> > > 3. excel
> > >   +
> http://www.jguru.com/faq/view.jsp?EID=1074230
> > >
> > > -George
> > >  --- dhatcher@webtads.com wrote:
> > >
> > >>Anyone know of any reliable parsers out there
> for
> > >>pdf word
> > >>excel or powerpoint?
> > 
> > For powerpoint it's not easy. I've been using this
> and it has worked
> > fine util recently and seems to sometimes go into
> an infinite loop now
> > on some recent PPTs. Native code and a package
> that seems to be dormant
> > but to some extent it does the job. The file
> "ppthtml" does the work.
> > 
> > http://chicago.sourceforge.net/xlhtml
> > 
> > 
> > 
> > >>
> > >>
> > >
> > >
>
---------------------------------------------------------------------
> > >
> > >>To unsubscribe, e-mail:
> > >>lucene-user-unsubscribe@jakarta.apache.org
> > >>For additional commands, e-mail:
> > >>lucene-user-help@jakarta.apache.org
> > >>
> > >>
> > >
> > >
> > >
> > >
> > >
> > >
> > >
>
___________________________________________________________ALL-NEW
> Yahoo! Messenger - all new features - even more fun!
>  http://uk.messenger.yahoo.com
> > >
> > >
>
---------------------------------------------------------------------
> > > To unsubscribe, e-mail:
> lucene-user-unsubscribe@jakarta.apache.org
> > > For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
> > >
> > 
> >
>
---------------------------------------------------------------------
> > To unsubscribe, e-mail:
> lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
> > 
> >
> 
>
---------------------------------------------------------------------
> To unsubscribe, e-mail:
> lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
> 
>  


	
	
		
___________________________________________________________ALL-NEW Yahoo! Messenger - all
new features - even more fun!  http://uk.messenger.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message