incubator-odf-dev mailing list archives

From Rob Weir <>
Subject Re: Tika is waiting for ODFToolkit to improve ODF file format processing
Date Mon, 24 Oct 2011 13:17:52 GMT
On Mon, Oct 24, 2011 at 4:54 AM, Devin Han <> wrote:
> I saw this issue in Tika: "OpenOffice parser: master footer text isn't
> extracted".
> The current ODF parser in Tika doesn't touch the styles part or the embedded
> documents, only meta and content. They are waiting for the first ODF Toolkit
> incubating release, and will then switch to a full-featured parser, much as
> they have for the POI-powered ones.
> The first release is coming and we will have no code updates before it. So I
> suggest we start discussing how to use ODF Toolkit to implement this, based
> on the snapshot.

In that JIRA thread Uwe talks about the desire for a
streaming/SAX-like API for scanning ODF documents.  I agree.  The
DOM approach we use in ODF Toolkit is necessary when you need
random, read/write access to a document, but you pay a performance
penalty (mainly heap memory) for that flexibility.  If you can
organize your program logic into a single-pass, read-only approach,
then a streaming approach can -- in theory -- perform much better for
that restricted use case.  I still wonder, though, how well the underlying
ZipInputStream implementation actually manages to stream the deflate
algorithm when it unzips ODF's ZIP package....
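
To make the single-pass, read-only idea concrete, here is a minimal sketch
that uses only JDK classes (java.util.zip and SAX). It is not ODF Toolkit
or Tika API: the class name OdfStreamingTextSketch and the method
extractText are hypothetical, and treating only text:p and text:h as text
containers is a simplifying assumption. It scans content.xml and styles.xml
(where master-page footer text lives) without building a DOM.

```java
import java.io.FilterInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: single-pass, read-only text extraction from an ODF package.
final class OdfStreamingTextSketch {
    // ODF text namespace, as defined by the OpenDocument specification.
    private static final String TEXT_NS =
            "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    public static String extractText(InputStream odfPackage) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

        StringBuilder out = new StringBuilder();
        try (ZipInputStream zip = new ZipInputStream(odfPackage)) {
            ZipEntry entry;
            // Walk the package front to back; entries are never revisited.
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("content.xml") || name.equals("styles.xml")) {
                    // Shield the zip stream so the parser cannot close it
                    // before the remaining entries have been read.
                    InputStream shielded = new FilterInputStream(zip) {
                        @Override
                        public void close() {
                            // intentionally left open
                        }
                    };
                    factory.newSAXParser().parse(shielded, new ParagraphHandler(out));
                }
            }
        }
        return out.toString();
    }

    // Appends character data found inside text:p / text:h, one line per element.
    private static final class ParagraphHandler extends DefaultHandler {
        private final StringBuilder out;
        private int depth; // > 0 while inside a paragraph or heading

        ParagraphHandler(StringBuilder out) {
            this.out = out;
        }

        private static boolean isTextBlock(String uri, String localName) {
            return TEXT_NS.equals(uri)
                    && ("p".equals(localName) || "h".equals(localName));
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            if (isTextBlock(uri, localName)) {
                depth++;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                out.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (isTextBlock(uri, localName)) {
                depth--;
                out.append('\n');
            }
        }
    }
}
```

Calling OdfStreamingTextSketch.extractText(new FileInputStream(path)) on an
.odt file would return the paragraph and heading text in package order.
Memory use is bounded by the extracted text rather than by the document
tree, which is the trade-off discussed above; anything that needs random
or write access still calls for the DOM.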

In any case, this is something I'd be interested in working on after
we get our initial ODF Toolkit release out: a memory-optimized
streaming API for read-only, single-pass uses.

> This feature concerns ODFDOM and the Simple ODF API. We have covered text
> extraction in the cookbook and demo, see:
> The work we need to do:
> (1) What are Tika's detailed requirements?
> (2) Can the existing features of ODF Toolkit cover Tika's
> requirements?
> (3) How do we use ODF Toolkit to implement it?
> CC'ing the Tika dev list in case people on that list are interested in this
> issue.
> --
> -Devin
