lucene-java-user mailing list archives

From David Spencer <dave-lucene-u...@tropo.com>
Subject ppt text extraction - Re: SearchBlox J2EE Search Component Version 1.2 released
Date Tue, 17 Feb 2004 13:53:11 GMT
Eric Jain wrote:

>>- Support for PowerPoint documents
>
>May I ask how you extract text from PowerPoint documents? Any open
>source tool, or your own code?

FYI, I recently discovered "ppthtml" in this package:
http://chicago.sourceforge.net/xlhtml/

Also, "antiword" seems to work well for Word docs.

And I use a utility from xpdf (http://www.foolabs.com/xpdf/) for PDF
text extraction.
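
For anyone who wants to call tools like these from Java before handing
the text to Lucene, here is a minimal sketch, not a tested integration.
It assumes the binaries are on the PATH, that "pdftotext" is the xpdf
utility in question (invoked with "-" so the text goes to stdout), and
that antiword and ppthtml write to stdout when given a file name. The
class name ExternalTextExtractor and its methods are made up for
illustration. Note that ppthtml emits HTML, so its output would still
need the markup stripped before indexing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: runs an external converter and returns what it prints.
class ExternalTextExtractor {

    // Picks a command line from the file extension.
    // Assumption: pdftotext, antiword and ppthtml are installed and on the PATH.
    static List<String> commandFor(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        String path = file.toString();
        if (name.endsWith(".pdf")) {
            return List.of("pdftotext", path, "-"); // "-" sends the text to stdout
        }
        if (name.endsWith(".doc")) {
            return List.of("antiword", path);
        }
        if (name.endsWith(".ppt")) {
            return List.of("ppthtml", path);        // output is HTML, not plain text
        }
        throw new IllegalArgumentException("No extractor configured for " + file);
    }

    // Runs the command, discards stderr, and returns stdout as a string.
    // Assumption: the tool's output is UTF-8; adjust if your setup differs.
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException(command.get(0) + " exited with code " + exitCode);
        }
        return new String(output, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        for (String arg : args) {
            System.out.println(run(commandFor(Path.of(arg))));
        }
    }
}
```

The returned string can then go into a Lucene field as usual. Reading
stdout fully before calling waitFor() keeps the child process from
blocking on a full pipe.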

When you get down to it, I have found that the "portable C" tools above
work better than the pure Java ones available. To be fair, however, I
have found that POI does work fine for XLS docs.

 - Dave

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org