poi-dev mailing list archives

From MSB <markbrd...@tiscali.co.uk>
Subject Re: How to compare 2 word doc (OLE2CDF or OpenXML).
Date Mon, 27 Jul 2009 08:01:09 GMT

That should be do-able, with one caveat: images are the only thing I am
unsure about at this point. The complicating factor will be the depth of the
comparison. Your first task IMO would be to decide exactly how the
comparison should proceed; to sketch out an algorithm that will determine
'differentness'. Imagine that we have the first paragraph from document one,
what should we compare it to in document two? Should we only compare
corresponding paragraphs, i.e. only compare paragraph one in document one
with paragraph one in document two? What happens if a new paragraph was
inserted into document two so that now paragraph one in document one matches
paragraph two in document two?
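
One possible way to cope with an inserted or deleted paragraph is to align
the two paragraph lists on their longest common subsequence instead of
comparing them by position. What follows is only a minimal sketch under that
assumption, in plain Java with no POI calls; the names ParagraphDiff, Edit
and diff are hypothetical, and two paragraphs count as equal only if their
text matches exactly, so a modified paragraph is reported as a deletion plus
an insertion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: aligns two lists of strings (paragraphs or words)
// using a longest-common-subsequence table and returns an edit script.
final class ParagraphDiff {

    /** One entry of the edit script. */
    static final class Edit {
        final char op;      // ' ' = in both, '-' = only in first, '+' = only in second
        final String text;

        Edit(char op, String text) {
            this.op = op;
            this.text = text;
        }

        @Override
        public String toString() {
            return op + " " + text;
        }
    }

    static List<Edit> diff(List<String> a, List<String> b) {
        int n = a.size();
        int m = b.size();

        // lcs[i][j] = length of the LCS of a[i..n) and b[j..m)
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = a.get(i).equals(b.get(j))
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk the table from the start, emitting keep/delete/insert steps.
        List<Edit> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (a.get(i).equals(b.get(j))) {
                out.add(new Edit(' ', a.get(i)));
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add(new Edit('-', a.get(i)));
                i++;
            } else {
                out.add(new Edit('+', b.get(j)));
                j++;
            }
        }
        while (i < n) {
            out.add(new Edit('-', a.get(i++)));
        }
        while (j < m) {
            out.add(new Edit('+', b.get(j++)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Document two has a new paragraph inserted in front of the others.
        List<String> one = Arrays.asList("Intro", "Details");
        List<String> two = Arrays.asList("New opening", "Intro", "Details");
        for (Edit e : diff(one, two)) {
            System.out.println(e);
        }
    }
}
```

The same routine could be applied to words instead of paragraphs by
splitting each text on whitespace first. For "The brown fox jumps from lazy
dog." against "The fox jumps from lazy donkey." it reports "brown" and
"dog." as removed and "donkey." as added. The table needs memory
proportional to the product of the two list lengths, which should be
acceptable for paragraph counts but may not be for very long word lists.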

If you have a good search around the list, there is code that demonstrates
how to extract the text from a document along with the tables. I am guessing
that you will need to get at the tables 'in line' so to speak as a change in
the position of the table within the document will be a change as far as
your algorithm is concerned. At this time, I cannot offer to help any
further as I am about to leave for the 'office' - a damp, rainy nature
reserve in actuality. If I have the time tonight, I will try to put
something together but would suggest that you search through the posts to
the list to track down some code that will allow you to get at the document's
contents as a starting point; I am confident that there is code there that
demonstrates how to get at the tables' contents in-line. As always though, I
cannot promise anything - I am grappling with other Word 'issues' that are
absorbing quite a bit of time - but will help out where I can. Finally, XSSF
is still a bit of a mystery to me, I have not done any 'real' work with the


Mark B

bihag wrote:
> Hi Mark,
> Thanks for the reply ...
> What I want is to compare two versions of the same document and note any
> changes that have been made.
> If I can get image and table changes, that's really great ... but if I can
> only get text changes, that's more than enough for the current requirement ...
> What I will do is pass 2 documents to a function; that function should
> create a new document with both files' content and the changes, like MS Word
> does with the compare option in its menu.
> ex. 
> File A.doc contains:- The brown fox jumps from lazy dog.
> File B.doc contains:- The fox jumps from lazy donkey.
> File generated after compare A.doc and B.doc contains :- this image file
>  http://www.nabble.com/file/p24674962/compare.jpg 
> MSB wrote:
>> This could very well be possible; I have certainly had some success
>> creating new Word documents using the API. Merging one document into
>> another is more tricky and not something I would try to do myself with
>> the API just yet. The first thing to be clear on is exactly how you
>> wish to compare the documents. Are you saying that you want to compare
>> two versions of the same document and note any changes that have been
>> made? Are you looking just at the text and not at any formatting applied
>> to the text?
>> If so, then you could use the WordExtractor class to get at the text of
>> the two documents. This class can return an array of String(s) where each
>> element maps to a paragraph (I think) in the source document. Next, you
>> could compare the elements within the arrays to determine if a paragraph
>> had been deleted, added, moved, modified, etc. If you found a difference
>> and identified what it was, then that paragraph could be written away to a
>> new 'results' document. To be completely honest, I have never tried to do
>> much work with the formatting of the text and I cannot claim sole
>> authorship of this code because I got a start from an example I found on
>> the 'net. Anyway, here is some very simple code to create a Word
>> document;
>> import java.io.FileInputStream;
>> import java.io.FileOutputStream;
>> import org.apache.poi.hpsf.CustomProperties;
>> import org.apache.poi.hpsf.DocumentSummaryInformation;
>> import org.apache.poi.hwpf.HWPFDocument;
>> import org.apache.poi.hwpf.usermodel.CharacterRun;
>> import org.apache.poi.hwpf.usermodel.Paragraph;
>> import org.apache.poi.hwpf.usermodel.ParagraphProperties;
>> import org.apache.poi.hwpf.usermodel.Range;
>> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
>>
>> // open the empty template document (created in Word beforehand)
>> POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("..empty file.."));
>> HWPFDocument doc = new HWPFDocument(fs);
>> // centered paragraph with large font size
>> Range range = doc.getRange();
>> Paragraph par1 = range.insertAfter(new ParagraphProperties(), 0);
>> par1.setSpacingAfter(200);
>> // justification: 0=left, 1=center, 2=right, 3=left and right
>> par1.setJustification((byte) 1);
>> CharacterRun run1 = par1.insertAfter("one");
>> run1.setFontSize(2 * 18);
>> // paragraph with bold typeface
>> Paragraph par2 = run1.insertAfter(new ParagraphProperties(), 0);
>> par2.setSpacingAfter(200);
>> CharacterRun run2 = par2.insertAfter(
>>     "two two two two two two two two two two two two two");
>> run2.setBold(true);
>> // paragraph with italic typeface and a line indent in the first line
>> Paragraph par3 = run2.insertAfter(new ParagraphProperties(), 0);
>> par3.setFirstLineIndent(200);
>> par3.setSpacingAfter(200);
>> CharacterRun run3 = par3.insertAfter(
>>     "three three three three three three three three three "
>>     + "three three three three three three three "
>>     + "three three three three three three three "
>>     + "three three three three three three three "
>>     + "three three three three three three three");
>> run3.setItalic(true);
>> // add a custom document property (needs POI 3.5; POI 3.2 doesn't save
>> // custom properties)
>> DocumentSummaryInformation dsi = doc.getDocumentSummaryInformation();
>> CustomProperties cp = dsi.getCustomProperties();
>> if (cp == null) {
>>     cp = new CustomProperties();
>> }
>> cp.put("myProperty", "prop prop prop");
>> dsi.setCustomProperties(cp);
>> // write the modified document out to a new file
>> doc.write(new FileOutputStream("..final file.."));
>> The key wrinkle is that HWPF cannot actually create a new, empty Word
>> document; you will need to use Word itself to create a new file that can
>> be used as the input to this process - I have called it the empty file in
>> the code above. All you need to do is open Word, select New->Document and
>> then save this away. Use this empty file as the input to the process and
>> you should be away.
>> There is a setColor() method defined on the CharacterRun class but I have
>> never used it myself. The only advice I can offer is to play with it and
>> see what the effect is on a simple bit of code such as this one. You will
>> have access to the usual effects such as strikethrough again using the
>> CharacterRun class.
>> Yours
>> Mark B
>> bihag wrote:
>>> Hi All,
>>> We want to compare two documents and highlight whatever is not common,
>>> with some color or in any other way ... So I think we have to merge the
>>> documents or create a new document which has the content of both
>>> documents, and show the differences with some color, like deleted in
>>> red, newly added in blue ...
>>> Mainly we are looking for an OLE2CDF doc compare solution ...
>>> Please provide a code snippet if possible ...
>>> Thanking you in advance ...

View this message in context: http://www.nabble.com/How-to-compare-2-word-doc-%28OLE2CDF-or-OpenXML%29.-tp24673506p24675255.html
Sent from the POI - Dev mailing list archive at Nabble.com.
