poi-user mailing list archives

From MSB <markbrd...@tiscali.co.uk>
Subject Re: WordExtractor.getText() returns [0x15] on word docs.
Date Mon, 11 Jan 2010 16:24:59 GMT

I accept that this is far from a proper solution, but it could offer you a
short-term fix. Simply use the replace() method that is defined on the
java.lang.String class to replace all of those characters/character strings
with something that makes more sense in this case; I would guess with

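Something along these lines might do as a stopgap. This is only a rough sketch, not anything taken from POI itself: the class name ControlCharCleaner and the clean() helper are made up for illustration. It assumes the unwanted character is U+0015 (going by the %15 in the archived subject line) and that a plain space is an acceptable replacement for your analyzer. Check what your own extracted text actually contains and adjust the constant accordingly.

```java
public class ControlCharCleaner {

    // Assumption: the unwanted character is U+0015. Change this constant
    // if the text you get back from getText() contains something else.
    private static final char UNWANTED = '\u0015';

    // Hypothetical helper: swaps every occurrence of the unwanted character
    // for a plain space, using String.replace(char, char).
    static String clean(String extractedText) {
        return extractedText.replace(UNWANTED, ' ');
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by wordExt.getText().
        String bodyText = "cell one" + UNWANTED + "cell two" + UNWANTED;
        System.out.println(clean(bodyText));
    }
}
```

In your own code that would amount to calling clean(wordExt.getText()) before handing the text to Lucene.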

Mark B

maxSchlein wrote:
> I tried what you suggested:
>           WordExtractor wordExt = new WordExtractor(is);
>           String bodyText = WordExtractor.stripFields(wordExt.getText());
> But the [0x15] character is still in the text.
> maxSchlein wrote:
>> It appears that when I use WordExtractor.getText() and there are tables
>> in the document, it returns [0x15] for every table column. Is there a way to
>> have this filtered out other than looping through the returned text? Or is
>> there something else I should be doing? Thanks in advance for the
>> help...
>> The reason this is an issue is that I am using Lucene's WhitespaceAnalyzer
>> and it is not treating this [0x15] as whitespace, so a search for a given
>> word/phrase that happens to be next to one of these [0x15] characters is
>> not found.

