lucene-java-user mailing list archives

From Karl Wettin <karl.wet...@gmail.com>
Subject Re: Text extraction from ms word doc
Date Mon, 11 Jan 2010 21:12:04 GMT
Have you tried antiword?

http://www.winfield.demon.nl/


       karl

On 11 Jan 2010, at 21:04, maxSchlein wrote:

>
> I was looking for an option for Text extraction from a word doc.
>
> Currently I am using POI; however, when there is a table in the doc, for
> each column POI brings back a non-printable character. The whitespace
> analyzer is not filtering out this character, so whatever word or phrase
> is last within a table column is not found during searching. That is, if
> the word dog is the only word in a column, a search for the word dog would
> return nothing because the term that was indexed was "dog" followed by
> that character.
>
> I can create a filter to fix this, using Apache's
> StringUtils.isAsciiPrintable, but I would rather not.
>
> Any and all help is welcome and thanked.
> -- 
> View this message in context: http://old.nabble.com/Text-extraction-from-ms-word-doc-tp27116739p27116739.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
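
For reference, here is a minimal sketch of the kind of filter the quoted message describes, applied to the extracted text before it is handed to the analyzer. The thread does not say which character POI emits, so this assumes it is a control character (Word table cell marks are commonly reported as U+0007, but treat that as an assumption). The class name ControlCharStripper and the method strip are hypothetical, not part of POI or Lucene. It uses java.lang.Character.isISOControl rather than StringUtils.isAsciiPrintable so that non-ASCII letters survive.

```java
public final class ControlCharStripper {

    private ControlCharStripper() {}

    /**
     * Replaces every ISO control character except tab, line feed and
     * carriage return with a single space. A space is used instead of
     * deleting the character so that the last word of one table cell
     * does not run into the first word of the next cell.
     */
    public static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean keep = !Character.isISOControl(c)
                    || c == '\t' || c == '\n' || c == '\r';
            out.append(keep ? c : ' ');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed shape of the extracted table text: each cell ends in U+0007.
        String extracted = "dog\u0007cat\u0007";
        System.out.println(strip(extracted).trim());
    }
}
```

With the text cleaned this way, a whitespace-based analyzer would see "dog" and "cat" as separate plain tokens, so no custom TokenFilter is needed.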

