poi-user mailing list archives

From jfmn...@free.fr
Subject Re: Extract text office 2010
Date Tue, 29 Mar 2011 09:23:41 GMT


Here is the Office 2010 textbox.
There are some new tags:
   <mc:AlternateContent>
	<mc:Choice Requires="wps">
	</mc:Choice>
	<mc:Fallback>
	</mc:Fallback>
   </mc:AlternateContent>


[Code] 
<w:p w:rsidR="00423106" w:rsidRDefault="00733140">
	<w:r>
		<mc:AlternateContent>
			<mc:Choice Requires="wps">
				<w:drawing>
					<wp:anchor ...>
						...
						<a:graphic>
							<a:graphicData uri="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
								<wps:wsp>
									...
									<wps:txbx>
										<w:txbxContent>
											<w:p w:rsidR="00423106" w:rsidRPr="00423106" w:rsidRDefault="00423106">
												<w:r>
													<w:t>Togodo</w:t>
												</w:r>
											</w:p>
										</w:txbxContent>
									</wps:txbx>
									...
								 </wps:wsp>
							</a:graphicData>
						</a:graphic>
					</wp:anchor>
				</w:drawing>
			</mc:Choice>
			<mc:Fallback>
				<w:pict>
					...
					<v:shape id="Text Box 2" o:spid="_x0000_s1026" type="#_x0000_t202" style="position:absolute;margin-left:0;...">
						<v:textbox style="mso-fit-shape-to-text:t">
							<w:txbxContent>
								<w:p w:rsidR="00423106" w:rsidRPr="00423106" w:rsidRDefault="00423106">
									<w:r>
										<w:t>Togodo</w:t>
									</w:r>
								</w:p>
							</w:txbxContent>
						</v:textbox>
					</v:shape>
				</w:pict>
			</mc:Fallback>
		</mc:AlternateContent>
	</w:r>
 </w:p>
[/Code]


Here is the Office 2007 textbox.
[Code] 
 <w:p w:rsidR="00423106" w:rsidRDefault="00423106">
	<w:r w:rsidRPr="00FB2EC2">
                ...
		<w:pict>
                ...
			<v:shape id="_x0000_s1026" type="#_x0000_t202" style="position:absolute;margin-left:...">
				<v:textbox style="mso-fit-shape-to-text:t">
					<w:txbxContent>
						<w:p w:rsidR="00423106" w:rsidRPr="00423106" w:rsidRDefault="00423106">
							<w:proofErr w:type="spellStart" /> 
							<w:r>
								<w:t>Togodo</w:t> 
							</w:r>
							<w:proofErr w:type="spellEnd" /> 
						</w:p>
					</w:txbxContent>
				</v:textbox>
			</v:shape>
		</w:pict>
	</w:r>
  </w:p>
[/Code] 
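
In the snippet that uses mc:AlternateContent, the same w:txbxContent appears twice: once under mc:Choice (inside wps:txbx) and once under mc:Fallback (inside v:textbox). The other snippet has only the w:pict form. Below is a minimal sketch in Python (not POI code) of how an extractor could walk this XML and return each textbox once. The function name textbox_texts and the trimmed SAMPLE_2010 string are illustrative assumptions; the parts elided with "..." above are simply left out.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by the snippets above.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"
WPS = "http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
V = "urn:schemas-microsoft-com:vml"

W_T = f"{{{W}}}t"
W_TXBX = f"{{{W}}}txbxContent"
MC_ALT = f"{{{MC}}}AlternateContent"
MC_CHOICE = f"{{{MC}}}Choice"
MC_FALLBACK = f"{{{MC}}}Fallback"


def textbox_texts(elem):
    """Return the text of every textbox under elem, one string per textbox.

    For mc:AlternateContent, the mc:Choice branch is preferred and
    mc:Fallback is read only when mc:Choice yields no textbox, so the
    duplicated text is not reported twice.
    """
    if elem.tag == W_TXBX:
        return ["".join(t.text or "" for t in elem.iter(W_T))]
    if elem.tag == MC_ALT:
        found = [s for c in elem.findall(MC_CHOICE) for s in textbox_texts(c)]
        if found:
            return found
        return [s for f in elem.findall(MC_FALLBACK) for s in textbox_texts(f)]
    return [s for child in elem for s in textbox_texts(child)]


# Trimmed-down version of the Word 2010 paragraph (illustrative only).
SAMPLE_2010 = f"""
<w:p xmlns:w="{W}" xmlns:mc="{MC}" xmlns:wps="{WPS}" xmlns:v="{V}">
  <w:r>
    <mc:AlternateContent>
      <mc:Choice Requires="wps">
        <wps:wsp><wps:txbx><w:txbxContent>
          <w:p><w:r><w:t>Togodo</w:t></w:r></w:p>
        </w:txbxContent></wps:txbx></wps:wsp>
      </mc:Choice>
      <mc:Fallback>
        <w:pict><v:shape><v:textbox><w:txbxContent>
          <w:p><w:r><w:t>Togodo</w:t></w:r></w:p>
        </w:txbxContent></v:textbox></v:shape></w:pict>
      </mc:Fallback>
    </mc:AlternateContent>
  </w:r>
</w:p>
""".strip()

root = ET.fromstring(SAMPLE_2010)
print(textbox_texts(root))              # one entry per textbox
print(len(list(root.iter(W_TXBX))))     # a naive scan sees both copies
```

Preferring mc:Choice is only one possible policy. A reader that does not understand the wps namespace would instead take the mc:Fallback branch, which holds the same w:txbxContent in the w:pict form.
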
----- Original Message -----
From: "Nick Burch" <nick.burch@alfresco.com>
To: "POI Users List" <user@poi.apache.org>
Sent: Monday, 28 March 2011 20:46:49 GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna
Subject: Re: Extract text office 2010

On Fri, 25 Mar 2011, jfmnews@free.fr wrote:
> Can POI 3.7 extract text from an Office 2010 document?

Generally it ought to be able to, but there's no explicit support for any 
new 2010 features that go beyond what 2007 did

> I can extract the text of the 2007 docx, but not all of the text of
> the Word 2010 docx: the text of the textbox is missing

Can you identify how the xml differs? I'd suggest you try unzipping the 
two .docx files (they're a zip of xml) and see if you can see what's done 
differently for the text boxes

Nick
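
The comparison suggested above (unzip both .docx files and look at how the XML differs) can be roughly automated. This is a minimal sketch in Python, not part of POI, and it only compares which element tags occur in word/document.xml, not their nesting. The function names and the file names word2007.docx and word2010.docx are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET


def element_tags(docx_path, part="word/document.xml"):
    """Return the set of element tags used in one XML part of a .docx file."""
    # A .docx file is a zip archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(docx_path) as archive:
        with archive.open(part) as stream:
            return {elem.tag for elem in ET.parse(stream).iter()}


def tags_only_in_second(first_docx, second_docx):
    """Sorted list of tags that occur in second_docx but not in first_docx."""
    return sorted(element_tags(second_docx) - element_tags(first_docx))


if __name__ == "__main__":
    # Hypothetical file names: the same textbox saved from Word 2007 and 2010.
    for tag in tags_only_in_second("word2007.docx", "word2010.docx"):
        print(tag)
```

ElementTree reports tags as {namespace-uri}localname, so a tag such as mc:AlternateContent shows up under its full markup-compatibility namespace URI instead of the mc: prefix.
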

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org



