# poi-user mailing list archives

From MSB <markbrd...@tiscali.co.uk>
Subject Re: font styles and equations in word doc
Date Mon, 13 Apr 2009 06:56:13 GMT

I must admit that I always thought images would be the best - maybe the only
way - to deal with complex mathematical/scientific formulae, and that is why I
asked what sort of formulae you were dealing with and whether Word had been
extended with add-ons such as Rapid-Pi, MS Equation Editor or something
similar. I harboured doubts about HWPF's ability to handle files produced by
a modified version of Word - even though it could extract the text and any
images from an OLE2CDF file for you - as I could not see how the files would
have been modified to accommodate formulae. I am guessing that the formula
add-ons produce images that can be inserted into a Word document and that
they - the add-ons - can retrieve and edit.

If you are dealing with simple symbols - by this I mean symbols that occupy
just a single line as does Pi for example - then it seems to be a question
of Font with iText. For example, I found the following two lines of code;

bfTimes = BaseFont.createFont("c:\\windows\\fonts\\times.ttf",
BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
font = new com.lowagie.text.Font(bfTimes, 20);

gave me a Font that would happily render the pi r squared formula to a pdf
document.
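
When it is unclear whether a symbol really is one of these simple single-character cases, it can help to list the code points held in the Java String before handing it to iText. The following is a minimal sketch using only the JDK; the class name CodePointDump and the method describe are made up for illustration and are not part of POI or iText, and the formula string is just one assumed way of writing pi r squared.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointDump {

    // Returns one "U+XXXX NAME" entry per Unicode code point in the text.
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp ->
                out.add(String.format("U+%04X %s", cp, Character.getName(cp))));
        return out;
    }

    public static void main(String[] args) {
        // Assumed spelling of "pi r squared", written with Unicode escapes
        // so that the encoding of the source file does not matter.
        String formula = "\u03C0r\u00B2";
        describe(formula).forEach(System.out::println);
    }
}
```

For this string the listing names GREEK SMALL LETTER PI, LATIN SMALL LETTER R and SUPERSCRIPT TWO, which confirms that each symbol is an ordinary single character and that the remaining question is whether the chosen font and encoding can show it.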

Whereas, I could not find the correct encoding to make this line of code do
the same;

font = FontFactory.getFont(FontFactory.TIMES_ROMAN , "UTF-8",
BaseFont.EMBEDDED);
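
One possible explanation, and this is an assumption rather than something confirmed in this thread, is that the built-in Times-Roman font only works with a single-byte encoding that has no slot for pi. The sketch below uses only the JDK to show which characters a given charset cannot represent; EncodingCheck and unencodable are hypothetical names, and ISO-8859-1 merely stands in for whatever single-byte encoding the font actually uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class EncodingCheck {

    // Returns the characters of text that the given charset cannot encode.
    static String unencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder missing = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (!encoder.canEncode(s)) {
                missing.append(s);
            }
        });
        return missing.toString();
    }

    public static void main(String[] args) {
        // Assumed spelling of "pi r squared" using Unicode escapes.
        String formula = "\u03C0r\u00B2";
        // Characters that a single-byte Latin charset cannot hold.
        System.out.println(unencodable(formula, StandardCharsets.ISO_8859_1));
        // A Unicode charset should report nothing missing.
        System.out.println(unencodable(formula, StandardCharsets.UTF_8).isEmpty());
    }
}
```

If a check like this reports pi as missing for the encoding in use, that would be consistent with the BaseFont.IDENTITY_H approach working where the FontFactory call did not.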

Thankfully, your tutor has offered you the flexibility to examine other
approaches to solving the problem. All the best.

nikhil n-2 wrote:
>
> Yeah, we explored iText and found that there aren't any classes to handle
> such chars. The user can input his research paper in two ways. One is the
> way that I told you about earlier, i.e. uploading a doc file. The other way
> is to copy and paste every paragraph of the paper into the text boxes
> provided by us. If there are any equations, then we ask the user to extract
> the entire equation from the paper as an image and upload it in the same
> way he does with any image. So, if I want to write equations onto a pdf
> file using iText, I have to upload them as images. Thanks for all the
> information provided to me.
>
> On Sat, Apr 11, 2009 at 6:44 PM, MSB <markbrdsly@tiscali.co.uk> wrote:
>
>>
>> Now it's getting interesting!!!
>>
>> HWPF is entirely blameless in all of this, it seems to me. It is recovering
>> all of the characters from the Word document, as I was able to demonstrate
>> when I successfully created an rtf document. The problem comes when trying
>> to create the pdf document from the Java String that we get through HWPF.
>> It seems that iText cannot use such a String to create pdf files and as yet
>> I do not know why. If there is an iText forum - and I think that there is -
>> it may be worthwhile posting there to see if anyone has any suggestions. In
>> the meantime, I am going to have a dig around on the internet to see if
>> there are any suggestions.
>>
>> PS OpenOffice will successfully convert Word documents into pdf files -
>> special characters included. That could offer an alternative approach for
>> you if the HWPF/iText combination cannot be persuaded to work - though I
>> cannot think why it would not.
>>
>>
>>
>> nikhil n-2 wrote:
>> >
>> > So, are there any classes which can retrieve these types of chars from
>> > the doc file? Sorry for the late reply.
>> >
>> > On Thu, Apr 9, 2009 at 12:47 AM, MSB <markbrdsly@tiscali.co.uk> wrote:
>> >
>> >>
>> >> Yes, I know the sort of thing you mean now - when using Word I remember
>> >> having the option to open a complicated-looking dialog box that allowed
>> >> me to insert characters like the copyright and trademark symbols. I
>> >> would have expected that if they could be placed into a Word document
>> >> then they are encoded somewhere and available to us. My only doubts here
>> >> surround Word's use of Unicode - if it uses Unicode then everything
>> >> should be OK.
>> >>
>> >> Also, I made another discovery tonight whilst playing with some code. If
>> >> you remember my previous post, I got the CharacterRun(s) from the
>> >> document's high-level Range object. This does not have to be the case.
>> >> You can do something like this;
>> >>
>> >>
>> >> HWPFDocument doc = new HWPFDocument(new FileInputStream(new
>> >> File("C:\\temp\\test.doc")));
>> >> Range range = doc.getRange();
>> >> int numParagraphs = range.numParagraphs();
>> >> // Walk each paragraph, then each character run inside it
>> >> for(int i = 0; i < numParagraphs; i++) {
>> >>   Paragraph para = range.getParagraph(i);
>> >>   int numCharRuns = para.numCharacterRuns();
>> >>   for(int j = 0; j < numCharRuns; j++) {
>> >>      CharacterRun charRun = para.getCharacterRun(j);
>> >>      ..........
>> >>   }
>> >> }
>> >>
>> >> That would allow you to create new paragraphs in the pdf file when you
>> >> need to - if I remember correctly, pdf files contain marked-up text
>> >> organised into paragraphs with the /par tag - and build each from the
>> >> contents of the character runs.
>> >>
>> >>
>> >> nikhil n-2 wrote:
>> >> >
>> >> > Thanks a lot, sir, for all the information. Chars that may be present
>> >> > in an equation in a research paper are Greek letters like pi, sigma,
>> >> > epsilon etc. They can be created in a Microsoft Word document, as it
>> >> > provides options to insert such chars. But my doubt is how I can
>> >> > retrieve those chars from the doc file by using HWPF. Even if I am
>> >> > successful in retrieving them, I should be able to write them to a pdf
>> >> > file using iText. Once again, thank you.
>> >> >
>> >> > On Wed, Apr 8, 2009 at 9:01 PM, MSB <markbrdsly@tiscali.co.uk> wrote:
>> >> >
>> >> >>
>> >> >> Thanks for the reply, I understand what you are after a little better
>> >> >> now.
>> >> >>
>> >> >> As far as I am aware, formatting information is not exposed by the
>> >> >> Paragraph class but by the CharacterRun -
>> >> >> org.apache.poi.hwpf.usermodel.CharacterRun - class. By no means am I an
>> >> >> expert, but I think that as the Word document is parsed by HWPF, if and
>> >> >> when the formatting applied to a piece of text changes then it - the
>> >> >> text - will be encapsulated within an instance of the CharacterRun
>> >> >> class. That class provides methods that allow you to get at the colour
>> >> >> of the text, the name and size of the font used, and so on. To get at
>> >> >> the CharacterRun(s) in the document you would do something like this;
>> >> >>
>> >> >> HWPFDocument doc = new HWPFDocument(new FileInputStream(new
>> >> >> File("C:\\temp\\test.doc")));
>> >> >> Range range = doc.getRange();
>> >> >> // The character runs are counted and fetched through the Range
>> >> >> int numCharRuns = range.numCharacterRuns();
>> >> >> CharacterRun charRun = null;
>> >> >> for(int i = 0; i < numCharRuns; i++) {
>> >> >>   charRun = range.getCharacterRun(i);
>> >> >> }
>> >> >>
>> >> >> Then once you have the CharacterRun, you should be able to interrogate
>> >> >> that object for lots of information - have a look at the javadoc to
>> >> >> see all of the available methods. After obtaining the info, you ought
>> >> >> to be able to use iText to create the pdf file for you. My only
>> >> >> concern is whether working through the document in this manner will
>> >> >> allow you to accurately re-create it using iText; I guess that only a
>> >> >> test will tell us this.
>> >> >>
>> >> >> The reason I asked about the nature of the research paper was that I
>> >> >> wanted to get some idea of the sort of characters that are included.
>> >> >> Forgive me please as I am 'mathematically challenged' and do not know
>> >> >> the terms to describe the sort of operators found in mathematical
>> >> >> expressions, but I feared that we may be dealing with those - knowing
>> >> >> that the research paper is plain text removes that fear.
>> >> >>
>> >> >> Have a run with this and see how it works for you - I hope it may be
>> >> >> able to return some of the characters you were not seeing before. If
>> >> >> not, we may need to look at other options. Should this fail again, is
>> >> >> it possible for you to let me have a copy - assuming there is no
>> >> >> proprietary information contained within it that should not be seen by
>> >> >> anyone outside of your institution - of the sort of document you are
>> >> >> working with? That way, I can experiment with it myself; for example,
>> >> >> I have OpenOffice on my PC and NetBeans configured so that I can
>> >> >> create and run applications that use Universal Network Objects
>> >> >> (OpenOffice's API).
>> >> >>
>> >> >>
>> >> >> nikhil n-2 wrote:
>> >> >> >
>> >> >> > Hi,
>> >> >> >
>> >> >> > I am new to HWPF. I am working on a project where I am supposed to
>> >> >> > read a research paper in IEEE format from a doc file and convert it
>> >> >> > into a pdf file in a customized format.
>> >> >> > To do that I need to know the font size variations in the text. I am
>> >> >> > unable to read chars like pi, sigma etc. present in equations.
>> >> >> >
>> >> >> > Thank you.
>> >> >> >
>> >> >> >
>> >> >>
>> >> >> --
>> >> >> View this message in context:
>> >> >> http://www.nabble.com/font-styles-and-equations-in-word-doc-tp22927872p22953001.html
>> >> >> Sent from the POI - User mailing list archive at Nabble.com.
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
>> >> >> For additional commands, e-mail: user-help@poi.apache.org
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >>
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/font-styles-and-equations-in-word-doc-tp22927872p22957496.html
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/font-styles-and-equations-in-word-doc-tp22927872p23001069.html
>
>

--
View this message in context: http://www.nabble.com/font-styles-and-equations-in-word-doc-tp22927872p23018668.html