poi-user mailing list archives

From Andrei Khveras <andrei.sur...@gmail.com>
Subject Re: Tab symbols parsing in WORD document issue: org.apache.poi.hwpf.extractor.WordExtractor
Date Tue, 10 Jan 2012 16:16:23 GMT
Hi. Thank you for the prompt reply. Unfortunately, the getParagraph() function
fetches not paragraphs but single lines, each ending with a
newline <http://en.wikipedia.org/wiki/Newline> symbol. I consider a
paragraph to be any text preceded by a
newline <http://en.wikipedia.org/wiki/Newline> or
page break <http://en.wikipedia.org/wiki/Page_break> symbol. Anyhow,
although I'm not sure I can read complicated
professional code, I will try to analyse it, do my best to
understand it, and find a way to resolve the issue above.
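
The paragraph definition above (text delimited by newline or page break symbols) could be sketched in plain Java, independent of POI. This is only a minimal sketch under assumptions: the class name ParagraphSplitter and the method split are hypothetical, and it assumes the extracted text marks paragraph ends with CR or LF and page breaks with a form feed (0x0C), which should be checked against the actual extracted text. Tab characters are left untouched inside each piece.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Apache POI.
final class ParagraphSplitter {
    // Assumed delimiters: CR, LF, and form feed (page break).
    private static final String DELIMITERS = "[\\r\\n\\f]+";

    // Splits raw text into non-empty pieces; tabs and other
    // control characters inside a piece are preserved.
    static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String piece : text.split(DELIMITERS)) {
            if (!piece.isEmpty()) {
                paragraphs.add(piece);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String sample = "First\tline\rSecond line\fAfter page break\r";
        for (String p : split(sample)) {
            // Show tabs visibly so it is clear they survive the split.
            System.out.println("[" + p.replace("\t", "\\t") + "]");
        }
    }
}
```

The input string would come from whatever extraction step is used; only the splitting rule is shown here, so the delimiter set can be adjusted without touching the rest.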

On Tue, Jan 10, 2012 at 6:56 PM, Nick Burch <nick.burch@alfresco.com> wrote:

> On Tue, 10 Jan 2012, Andrei Khveras wrote:
>> I'm trying to use the class org.apache.poi.hwpf.extractor.WordExtractor,
>> which I downloaded as a part of Apache POI
>> <http://poi.apache.org/download.html>.
>> *Could somebody, please*, kindly help me to resolve this little issue. My
>> goal is to get MS Word file contents as one single String, containing all
>> control characters. I need it for further (hand-made!) splitting text into
>> paragraphs, words, etc.
> Why not fetch the paragraphs directly then? That'd give you full control
> over which bit of text is in which paragraph, and will let you decide if
> you want to display or hide control characters etc
> I'd suggest you look at the code for WordExtractor to get an idea of how
> to go about doing it, then do your own version that implements your
> required logic
> Nick

Best regards

ICQ: 229-507-907 <http://wwp.icq.com/scripts/contact.dll?msgto=229507907>
Skype: tenety

BOOKRIVER.RU <http://bookriver.ru>

