poi-dev mailing list archives

From Glen Thomas <glen_tho...@hotmail.co.uk>
Subject Problem extracting Word 2007 Docx content
Date Fri, 15 Oct 2010 10:35:16 GMT

Hello,
 
I am working on an urgent project using Tika and am having problems with POI's extraction
of content from Word docx files.
 
When POI finds a <w:p> tag in the document.xml file it adds a "\n" to the string output,
as expected. But when POI finds a <w:br> tag it does nothing, so words that should be
on separate lines are merged together in the extracted text.
 
I have located the source of the problem in versions 3.6 and 3.7beta3, but I am not a very
experienced developer and could use some help with this.
 
In POI 3.7beta3 the problem can be fixed in the toString method of the XWPFRun class.
 
I think this code:
if ("w:cr".equals(tagName)) {
    text.append("\n");
}
 
...should read:
if ("w:br".equals(tagName)) {
    text.append("\n");
}
 
As far as I know the docx format does not contain a <w:cr> tag, so I believe this is an error.
If a <w:cr> tag does exist, then the check for <w:br> should be added to the method alongside it.
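 
To make the second option concrete, here is a minimal, self-contained sketch that treats both tags as line breaks. It does not use the real POI classes: RunTextSketch, textForTag and extract are hypothetical names made up for illustration, and the run's children are modelled as plain strings rather than XML objects.

```java
import java.util.Arrays;
import java.util.List;

public class RunTextSketch {

    // Hypothetical helper (not part of POI): returns the text that a
    // single child tag of a run would contribute to the extracted output.
    static String textForTag(String tagName) {
        if ("w:tab".equals(tagName)) {
            return "\t";
        }
        // Handle both tags, in case <w:cr> turns out to be valid as well.
        if ("w:br".equals(tagName) || "w:cr".equals(tagName)) {
            return "\n";
        }
        return "";
    }

    // Simplified stand-in for the run's toString logic: anything starting
    // with "w:" is treated as a tag name, everything else as literal text.
    static String extract(List<String> items) {
        StringBuilder text = new StringBuilder();
        for (String item : items) {
            if (item.startsWith("w:")) {
                text.append(textForTag(item));
            } else {
                text.append(item);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String out = extract(Arrays.asList("first", "w:br", "second"));
        // The two words now end up on separate lines instead of merged.
        System.out.println(out.equals("first\nsecond"));
    }
}
```

With the <w:br> check in place, "first" and "second" are separated by a newline. Without it, the same input would come out as "firstsecond", which is the merging described above.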
 
In POI 3.6 the problem can be fixed within the XWPFParagraph class.
 
The constructor method builds the output string from the docx tags but does not account for
the <w:br> tags.
 
After this piece of code:
if (o instanceof CTPTab) {
    text.append("\t");
}
 
...another if statement should be added, something along the lines of:
if (o instanceof CTPBr) {
    text.append("\n");
}
 
I need this fix quite quickly, so I would be very grateful if somebody could help me add it
to POI and compile it with Tika. This is my first attempt at contributing to an open-source
project, so I am not familiar with how the process works.
 
Thanks,
 
Glen Thomas