poi-user mailing list archives

From maxSchlein <m_schl...@hotmail.com>
Subject Re: WordExtractor.getText() returns ^U on word docs.
Date Mon, 11 Jan 2010 16:47:39 GMT

I can, and will, create a bug for this, but I would think that someone else
out there has already run into this issue with POI.

Here is the workaround I ended up with, which uses StringUtils from Apache
Commons Lang:

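// "is" is the InputStream of the .doc file.
WordExtractor wordExt = new WordExtractor(is);
String bodyText = WordExtractor.stripFields(wordExt.getText());

StringBuffer dirtyString = new StringBuffer(bodyText);

// Repeat until only printable ASCII (codes 32-126) is left.
// Note that this also turns tabs and line breaks into spaces.
while (!StringUtils.isAsciiPrintable(dirtyString.toString()))
{
  // Find the first character that is not printable ASCII. The outer
  // loop condition guarantees that there is one.
  int index = 0;
  char c = dirtyString.charAt(index);
  while (StringUtils.isAsciiPrintable(String.valueOf(c)))
  {
    index++;
    c = dirtyString.charAt(index);
  }
  // Replace every occurrence of that character with a space.
  // replace(char, char) is used so the character is never read as a regex.
  dirtyString = new StringBuffer(dirtyString.toString().replace(c, ' '));
}

return dirtyString.toString();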
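
The same clean-up can probably be done in one pass over the text instead of rebuilding the buffer once per offending character. What follows is only a minimal sketch under some assumptions: ControlCharCleaner and clean() are hypothetical names (they are not part of POI or Commons Lang), and the sketch deliberately keeps tabs and line breaks, which the loop above turns into spaces.

```java
final class ControlCharCleaner {

    // Hypothetical helper, not part of POI or Commons Lang.
    // Replaces every character outside printable ASCII (codes 32-126)
    // with a space, but leaves tab, line feed and carriage return alone.
    static String clean(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean printable = c >= 32 && c < 127;
            boolean whitespace = c == '\t' || c == '\n' || c == '\r';
            out.append(printable || whitespace ? c : ' ');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escape below is control character 0x15, the ^U from the subject line.
        System.out.println(clean("Total: \u0015 42\nnext line"));
    }
}
```

With POI this would be called as ControlCharCleaner.clean(WordExtractor.stripFields(wordExt.getText())). Like the StringUtils version, it also replaces non-ASCII letters, so it only fits documents that are expected to be plain ASCII.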



Nick Burch-11 wrote:
> 
> On Mon, 11 Jan 2010, maxSchlein wrote:
>> I tried what you suggested:
>>
>>          WordExtractor wordExt = new WordExtractor(is);
>>          String bodyText = WordExtractor.stripFields(wordExt.getText());
>>
>> But the ^U is still in the text.
> 
> Can you create a new bug on bugzilla, and upload a sample file that shows 
> this behaviour? In the mean time, you'll need to go with Mark's suggestion 
> of manually removing them though
> 
> Cheers
> Nick
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
> 
> 
> 

-- 
View this message in context: http://old.nabble.com/WordExtractor.getText%28%29-returns-%15-on-word-docs.-tp27111308p27113524.html
Sent from the POI - User mailing list archive at Nabble.com.

