incubator-odf-dev mailing list archives

From Devin Han <devin...@apache.org>
Subject Tika is waiting for ODFToolkit to improve ODF file format processing
Date Mon, 24 Oct 2011 08:54:57 GMT
I saw this issue in Tika, "OpenOffice parser: master footer text isn't
extracted":
https://issues.apache.org/jira/browse/TIKA-736

Tika's current ODF parser reads only the meta and content parts; it does not
touch the styles part or embedded documents. The Tika developers are waiting
for the first ODF Toolkit incubating release and then plan to switch to a
full-featured parser, much as they have for the POI-powered ones.
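
For illustration only: an ODF file is a ZIP package, and master page headers
and footers are stored in styles.xml rather than content.xml, which is why a
parser that reads only meta and content misses them. The sketch below uses
only the Python standard library, not Tika or ODF Toolkit. The function names
extract_master_page_text and build_sample_package are made up for this
example, and the code ignores embedded documents and encrypted packages.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# ODF namespace URIs used in styles.xml.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_master_page_text(odf_bytes):
    """Return (kind, text) pairs for headers and footers found in styles.xml.

    Hypothetical helper for illustration; not part of Tika or ODF Toolkit.
    """
    with zipfile.ZipFile(io.BytesIO(odf_bytes)) as package:
        if "styles.xml" not in package.namelist():
            return []
        root = ET.fromstring(package.read("styles.xml"))

    found = []
    for kind in ("header", "footer"):
        # style:header and style:footer elements sit under style:master-page.
        for element in root.iter("{%s}%s" % (STYLE_NS, kind)):
            text = "".join(element.itertext()).strip()
            if text:
                found.append((kind, text))
    return found


def build_sample_package():
    """Build a tiny in-memory package that contains only a styles.xml part."""
    styles = (
        '<office:document-styles xmlns:office="%s" xmlns:style="%s" '
        'xmlns:text="%s">'
        '<office:master-styles><style:master-page style:name="Standard">'
        '<style:footer><text:p>Master footer text</text:p></style:footer>'
        '</style:master-page></office:master-styles>'
        '</office:document-styles>'
    ) % (OFFICE_NS, STYLE_NS, TEXT_NS)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("styles.xml", styles)
    return buffer.getvalue()


if __name__ == "__main__":
    print(extract_master_page_text(build_sample_package()))
```

A real parser built on ODF Toolkit would go through its document model
instead of reading the XML by hand; the point here is only where the footer
text lives in the package.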

The first release is coming, and there will be no code updates before it. So
I suggest we start discussing how to use ODF Toolkit to implement this, based
on the current snapshot.

This feature concerns ODFDOM and the Simple ODF API. We already cover text
extraction in the cookbook and demo; see:

http://incubator.apache.org/odftoolkit/simple/document/cookbook/TextExtractor.html
http://incubator.apache.org/odftoolkit/simple/demo/demo2.html

Questions we need to answer:
(1) What are Tika's detailed requirements?
(2) Can the existing features of ODF Toolkit cover Tika's requirements?
(3) How do we use ODF Toolkit to implement them?

CC'ing the Tika dev list in case people there are interested in this issue.
-- 
-Devin
