jackrabbit-dev mailing list archives

From Ján Halaša <halasal...@aura.cz>
Subject Text filters for binary documents
Date Fri, 27 May 2005 12:15:07 GMT

Hi everybody,

I have converted some text filters (for extracting text content from 
binary files) from the Jakarta Slide project so that they implement the 
TextFilter interface. The Slide sources are under the same Apache 2.0 
license. I used the existing TextPlainTextFilter class as a template, so 
they do not accept multi-valued properties.
I would be glad if someone could take a look and integrate them into 
Jackrabbit somehow.

http://www.halasa.com/jackrabbit/ApplicationMsExcelTextFilter.java
http://www.halasa.com/jackrabbit/ApplicationMsWordTextFilter.java
http://www.halasa.com/jackrabbit/ApplicationPdfTextFilter.java

You will need these extra libraries:
PDFBox-0.7.1.jar (http://www.pdfbox.org/)
poi-2.5.1-final-20040804.jar (http://jakarta.apache.org/poi/)
tm-extractors-0.4.jar (http://www.textmining.org/)
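
As a rough illustration of the filter idea described above, here is a 
minimal, self-contained sketch in plain Java. It is not the actual 
Jackrabbit TextFilter interface and not the code in the linked files: 
SimpleTextFilter, PlainTextFilter and FilterRegistry are hypothetical 
names made up for this example, and it handles only text/plain so that 
it needs none of the extra libraries. A PDF or Word filter would follow 
the same shape, with the body of extractText delegating to PDFBox or POI.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified filter contract for illustration only;
// this is NOT the real Jackrabbit TextFilter interface.
interface SimpleTextFilter {
    // True if this filter knows how to handle the given MIME type.
    boolean canFilter(String mimeType);

    // Extracts plain text from the binary content.
    String extractText(InputStream data, String encoding) throws IOException;
}

// Filter for text/plain: decodes the bytes with the given encoding.
class PlainTextFilter implements SimpleTextFilter {
    public boolean canFilter(String mimeType) {
        return "text/plain".equalsIgnoreCase(mimeType);
    }

    public String extractText(InputStream data, String encoding) throws IOException {
        // Assumption: fall back to UTF-8 when no encoding is supplied.
        String charset = (encoding != null) ? encoding : "UTF-8";
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(data, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}

// Hypothetical registry that picks the first filter accepting a MIME type.
class FilterRegistry {
    private final List<SimpleTextFilter> filters = new ArrayList<>();

    void register(SimpleTextFilter filter) {
        filters.add(filter);
    }

    // Returns the extracted text, or an empty string if no filter matches.
    String extract(String mimeType, InputStream data, String encoding) throws IOException {
        for (SimpleTextFilter f : filters) {
            if (f.canFilter(mimeType)) {
                return f.extractText(data, encoding);
            }
        }
        return "";
    }
}

class FilterDemo {
    public static void main(String[] args) throws IOException {
        FilterRegistry registry = new FilterRegistry();
        registry.register(new PlainTextFilter());
        InputStream in = new ByteArrayInputStream(
                "hello from a text filter".getBytes(StandardCharsets.UTF_8));
        System.out.println(registry.extract("text/plain", in, "UTF-8"));
    }
}
```

In this sketch, content with an unrecognised MIME type simply yields an 
empty string, so one unsupported document does not abort the rest of 
the extraction.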

Jan
