poi-commits mailing list archives

From n...@apache.org
Subject svn commit: r1615818 - /poi/site/src/documentation/content/xdocs/text-extraction.xml
Date Mon, 04 Aug 2014 22:51:25 GMT
Author: nick
Date: Mon Aug  4 22:51:25 2014
New Revision: 1615818

URL: http://svn.apache.org/r1615818
Log:
Since 3.5 was so long ago now, update the docs to show the ooxml text extractors as standard

Modified:
    poi/site/src/documentation/content/xdocs/text-extraction.xml

Modified: poi/site/src/documentation/content/xdocs/text-extraction.xml
URL: http://svn.apache.org/viewvc/poi/site/src/documentation/content/xdocs/text-extraction.xml?rev=1615818&r1=1615817&r2=1615818&view=diff
==============================================================================
--- poi/site/src/documentation/content/xdocs/text-extraction.xml (original)
+++ poi/site/src/documentation/content/xdocs/text-extraction.xml Mon Aug  4 22:51:25 2014
@@ -59,16 +59,15 @@
       <em>org.apache.poi.POIOLE2TextExtractor</em>. This additionally
       provides common methods to get at the <link href="hpfs/">HPFS
       document metadata</link>.</p>
-     <p>All OOXML based text extractors (available in POI 3.5 and later) 
-      also extend from
+     <p>All OOXML based text extractors also extend from
       <em>org.apache.poi.POIOOXMLTextExtractor</em>. This additionally
       provides common methods to get at the OOXML metadata.</p>
     </section>
 
     <section><title>Text Extractor Factory</title>
-     <p>As part of the addition of OOXML support in Apache POI 3.5, there
-      is a common class to select the appropriate POI text extractor for 
-      you. <em>org.apache.poi.extractor.ExtractorFactory</em> provides a
+     <p>POI provides a a common class to select the appropriate text extractor 
+      for you, based on the supplied document's contents. 
+      <em>org.apache.poi.extractor.ExtractorFactory</em> provides a
       similar function to WorkbookFactory. You simply pass it an
       InputStream, a File, a POIFSFileSystem or a OOXML Package. It
       figures out the correct text extractor for you, and returns it.</p>
@@ -81,16 +80,19 @@
      <p>For .xls files, there is 
       <em>org.apache.poi.hssf.extractor.ExcelExtractor</em>, which will 
       return text, optionally with formulas instead of their contents. 
-      Those using POI 3.5 can also use 
-      <em>org.apache.poi.xssf.extractor.XSSFExcelExtractor</em>, to perform
-      a similar task for .xlsx files.</p>
-     <p>In addition, there is a second text extractor for .xls files,
-      <em>org.apache.poi.hssf.extractor.EventBasedExcelExtractor</em>. This
-      is based on the streaming EventUserModel code, and will generally
+      Similarly, for .xlsx files there is
+      <em>org.apache.poi.xssf.extractor.XSSFExcelExtractor</em>, which 
+      provides the same functionality.</p>
+     <p>For those working in constrained memory footprints, there are
+      two more Excel text extractors available. For .xls files, it's
+      <em>org.apache.poi.hssf.extractor.EventBasedExcelExtractor</em>,
+      based on the streaming EventUserModel code, and will generally
       deliver a lower memory footprint for extraction. However, it will
       have problems correctly outputting more complex formulas, as it 
       works with records as they pass, and so doesn't have access to all
-      parts of complex and shared formulas.</p>
+      parts of complex and shared formulas. For .xlsx files the equivalent is
+      <em>org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor</em>, 
+      which is based on the XSSF SAX Event codebase.</p>
     </section>
 
     <section><title>Word</title>
@@ -100,18 +102,16 @@
      <p>Those using POI 3.7 can also extract simple textual content from
       older Word 6 and Word 95 files, using the scratchpad class
       <em>org.apache.poi.hwpf.extractor.Word6Extractor</em>.</p>
-     <p>Since POI 3.5, it is possible to use
-      <em>org.apache.poi.xwpf.extractor.XPFFWordExtractor</em>, to perform
-      text extraction for .docx files.</p> 
+     <p>For .docx files, the relevant class is 
+      <em>org.apache.poi.xwpf.extractor.XPFFWordExtractor</em></p>
     </section>
 
     <section><title>PowerPoint</title>
      <p>For .ppt files, in scratchpad there is 
       <em>org.apache.poi.hslf.extractor.PowerPointExtractor</em>, which 
       will return text for your slideshow, optionally restricted to just
-      slides text or notes text. Those using POI 3.5 can also use 
-      <em>org.apache.poi.xslf.extractor.XSLFPowerPointExtractor</em>, to 
-      perform a similar task for .pptx files.</p>
+      slides text or notes text. For .pptx files, the class to use is
+      <em>org.apache.poi.xslf.extractor.XSLFPowerPointExtractor</em></p>
     </section>
 
     <section><title>Publisher</title>




