lucene-java-user mailing list archives

From "Pete Lewis" <>
Subject Re: RE : Parsers
Date Thu, 29 May 2003 08:05:38 GMT
Hi Victor


In the past I have used the Inso OutsideIn filters and found them very good;
however, I'd like to come up with a pure Java solution, so if there is a Java
equivalent to the Inso filters I'd be grateful for any details.  Failing that,
I plan to start with individual parsers, using the file extension to select
the correct one, and later add a file type recogniser for files without
extensions.  Hence my request: does anyone know of good parsers, particularly
for the most common formats?
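
To illustrate the idea of selecting a parser by extension first, with a
recogniser as a later fallback, here is a minimal sketch.  Everything in it
is an assumption made for illustration: the ParserRegistry class and its
method names are hypothetical, the extractors are stand-in functions rather
than real PDFBox or POI calls, and the recogniser only knows two well-known
signatures ("%PDF" for PDF, and the OLE2 compound-document header shared by
legacy Word, Excel and PowerPoint files).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical registry that picks a text extractor for a file. */
class ParserRegistry {
    // OLE2 compound-document signature used by legacy .doc/.xls/.ppt files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] PDF = {'%', 'P', 'D', 'F'};

    // Key: lower-cased extension (or sniffed type); value: bytes -> plain text.
    private final Map<String, Function<byte[], String>> extractors = new HashMap<>();

    void register(String key, Function<byte[], String> extractor) {
        extractors.put(key.toLowerCase(Locale.ROOT), extractor);
    }

    /** Lower-cased extension without the dot, or "" when there is none. */
    static String extensionOf(String path) {
        int slash = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        String name = path.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        if (dot <= 0 || dot == name.length() - 1) {
            return ""; // no dot, a leading dot only, or a trailing dot
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Very small file type recogniser based on the leading bytes. */
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "pdf";
        if (startsWith(head, OLE2)) return "ole2"; // cannot tell doc/xls/ppt apart
        return "";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    /** Try the extension first; fall back to sniffing the leading bytes. */
    Optional<Function<byte[], String>> select(String path, byte[] head) {
        Function<byte[], String> byExt = extractors.get(extensionOf(path));
        if (byExt != null) return Optional.of(byExt);
        return Optional.ofNullable(extractors.get(sniff(head)));
    }
}
```

A caller would register one extractor per extension (for example "pdf",
"doc", "xls") and call select(path, firstBytes).  Because the OLE2 header is
common to several Office formats, a real recogniser would need to look
inside the container to tell them apart.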

That being said, has anyone come across a PowerPoint parser?


----- Original Message -----
From: "Victor Hadianto" <>
To: "Lucene Users List" <>
Sent: Thursday, May 29, 2003 12:01 AM
Subject: Re: RE : Parsers

> > The text extractors work very well for Word and pdf
> > documents.
> > They use both PDFBox and POI.
> >
> > For Excel, using POI directly is very easy. Tell me if you want to see
> > code samples.
> >
> > I'm looking myself for a Powerpoint text extractor, if you know one...
> Another solution is to use Microsoft Office itself. You can set up a server
> that serves requests to convert Microsoft Office documents. There are many
> ways of doing this, for example using Python to call Office directly and
> then putting your Python script behind a web server.
> Or you can set up a .NET conversion server and call it as a Web Service;
> there are many other interesting techniques too.
> victor
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
