lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pete Lewis" <p...@uptima.co.uk>
Subject Re: RE : Parsers
Date Thu, 29 May 2003 08:05:38 GMT
Hi Victor

Thanks.

In the past I have used the Inso OutsideIn filters and found them very good;
however I'd like to come up with a pure Java solution, so if there is a Java
equivalent to the Inso filters I be grateful for any details.  Failing that,
I thought that I'd go for individual parsers initially using the file
extensions to select the correct parser but in the future adding a file type
recogniser for files without extensions.  Hence my request for anyone
knowing of good parsers particularly for the most common formats.

That being said, has anyone come across a Powerpoint parser?

Pete

----- Original Message -----
From: "Victor Hadianto" <victorh@nuix.com.au>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Thursday, May 29, 2003 12:01 AM
Subject: Re: RE : Parsers


> > The www.textmining.org text extractors work very well for Word and pdf
> > documents.
> > They use both PDFBox and POI.
> >
> > For Excel, using POI directly is very easy. Tell me if you want to see
> > code samples.
> >
> > I'm looking myself for a Powerpoint text extractor, if you know one...
>
> Another solution is to use Microsoft Office itself. You can setup a server
> that serve request to convert Microsoft Office doc. There are many ways of
> doing this, for example using Python to directly call Office then put your
> python script in a webserver.
>
> Or you can set a .Net conversion server and you can call this .Net service
> using a Web Service, and many other interesting technique.
>
> victor
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message