lucene-java-user mailing list archives

From "Andrew C. Oliver" <>
Subject Re: PDF / Word document parsers
Date Wed, 01 May 2002 02:24:42 GMT

For Word, it is possible, if you're willing to hack, to use Ryan's
prototype code that is being refactored into HDF.  It converts DOC->FOP,
and, well, we have XML parsers.

Obviously, HSSF (the Excel part of POI) is pretty robust at this stage.

Document summary (HPSF) information is read-only at this stage, but that
should be fine for your needs.

So essentially you can grab what you need via POI.  HDF is going to
be the most work.

Check out the POI project for more details.

On Fri, 2002-04-19 at 02:25, Kelvin Tan wrote:
> Anita,
> I've had moderate success using Etymon for PDF parsing.
> It does consume quite a lot of memory for larger PDF documents, but otherwise
> it's OK. What difficulties are you facing?
> For MS Word parsing, the Jakarta POI project is working something out, but
> in the meantime I've managed to search MS Word documents by reading the
> file and stripping out nonsense characters. It's a hack, I think, but if I
> increase the IndexWriter's maxFieldLength to about a million, I can search
> 13-15MB Word documents with ease.
> Kelvin
> ----- Original Message -----
> From: "Anita Srinivas" <>
> To: "Lucene Users List" <>
> Sent: Friday, April 19, 2002 2:13 PM
> Subject: PDF / Word document parsers
> Hi...
> I have been looking for PDF and Word document parsers.  I have tried the
> contributions page on the Lucene site as suggested by a Lucene User. The
> PJEtymon does not have a Windows version.  The XPDF does not do the parsing
> very well.
> Can someone suggest better Word document or PDF parsers than the
> ones I mentioned here?
> Thanks
> Anita Srinivas
> --
> To unsubscribe, e-mail:   <>
> For additional commands, e-mail: <>
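
For what it's worth, the "read the file and strip out nonsense characters"
hack Kelvin describes above could look roughly like the sketch below. This is
only a guess at the approach, not his actual code: the class name
DocTextStripper, the strip() method and the MIN_RUN threshold are all made up
for illustration. It keeps runs of printable ASCII bytes and throws away
everything else, so it will miss text that Word stores as 16-bit characters.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of POI or Lucene): pulls runs of printable
// ASCII out of a binary file so the result can be fed to an indexer.
class DocTextStripper {
    // Assumed threshold: runs shorter than this are treated as binary noise.
    static final int MIN_RUN = 4;

    static String strip(InputStream in) throws IOException {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (b >= 0x20 && b <= 0x7E) {
                // Printable ASCII: extend the current run.
                run.append((char) b);
            } else {
                // Anything else ends the run.
                flush(run, out);
            }
        }
        flush(run, out);
        return out.toString();
    }

    // Appends the run to the output if it is long enough, then resets it.
    private static void flush(StringBuilder run, StringBuilder out) {
        if (run.length() >= MIN_RUN) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(run);
        }
        run.setLength(0);
    }
}
```

The returned string would then go into the indexed field, with the
maxFieldLength raised as Kelvin mentions so that large documents are not cut
short.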
--
- port of Excel/Word/OLE 2 Compound format to Java
- fix Java generics!
The avalanche has already started. It is too late for the pebbles to
vote.
-Ambassador Kosh
