tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-509) Container contents extraction
Date Fri, 10 Sep 2010 17:30:33 GMT

    [ https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908103#action_12908103 ]

Nick Burch commented on TIKA-509:
---------------------------------

Support is now in place for .doc, .docx, .xls and .xlsx.

There are a couple more office formats to support, and unit tests are needed for the general
container formats that are supported via the PackageExtractor.

> Container contents extraction
> -----------------------------
>
>                 Key: TIKA-509
>                 URL: https://issues.apache.org/jira/browse/TIKA-509
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>            Priority: Minor
>         Attachments: 0001-TIKA-509-Container-contents-extraction.patch
>
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3Calpine.DEB.1.10.1009010000250.5637@urchin.earth.li%3E
> This service will operate in push mode, using streaming where possible (not all container
> formats will support that). Users can control recursion, and will be given the chance to process
> each embedded file in turn. It's up to them whether they process a file or skip it.
> It will work similarly to the current Parser code, with each container having its own extractor
> in the parsers package, and the interface defined in the core package. There will be an Auto
> extractor in the core package, configured with a list of parser extractors, just like AutoDetectParser.
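
For illustration only, here is a minimal sketch of the push model described in the issue. It
is not the code from the attached patch and not Tika's API: the names EmbeddedHandler, extract
and Recorder are hypothetical, and java.util.zip stands in for a real container format. The
extractor streams through the container and pushes each embedded file to a handler, which
decides whether to skip it, process it, or recurse into it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ContainerSketch {

    /** Hypothetical callback: the extractor pushes each embedded file to it. */
    interface EmbeddedHandler {
        /** Return false to skip the embedded file without reading it. */
        boolean accept(String name);

        /** Called for each accepted file; content is only valid during the call. */
        void handle(String name, InputStream content) throws IOException;
    }

    /** Hypothetical ZIP extractor that streams through the container entry by entry. */
    static void extract(InputStream container, EmbeddedHandler handler) throws IOException {
        // Deliberately not closed: the caller owns the underlying stream.
        ZipInputStream zip = new ZipInputStream(container);
        for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
            if (!e.isDirectory() && handler.accept(e.getName())) {
                handler.handle(e.getName(), zip);
            }
            zip.closeEntry(); // discard whatever the handler left unread
        }
    }

    /** Example handler: records names, skips *.bin, recurses into *.zip up to a depth limit. */
    static final class Recorder implements EmbeddedHandler {
        private final List<String> seen;
        private final String prefix;
        private final int depthLeft;

        Recorder(List<String> seen, String prefix, int depthLeft) {
            this.seen = seen;
            this.prefix = prefix;
            this.depthLeft = depthLeft;
        }

        @Override
        public boolean accept(String name) {
            return !name.endsWith(".bin");
        }

        @Override
        public void handle(String name, InputStream content) throws IOException {
            seen.add(prefix + name);
            if (name.endsWith(".zip") && depthLeft > 0) {
                // Recursion is the handler's decision, not the extractor's.
                extract(content, new Recorder(seen, prefix + name + "/", depthLeft - 1));
            }
        }
    }

    /** Builds a small in-memory ZIP so the sketch needs no files on disk. */
    static byte[] zipOf(String[] names, byte[][] bodies) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (int i = 0; i < names.length; i++) {
                out.putNextEntry(new ZipEntry(names[i]));
                out.write(bodies[i]);
                out.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Extracts a ZIP holding a.txt, skip.bin and a nested inner.zip (which holds c.txt). */
    static List<String> demo(int maxDepth) throws IOException {
        byte[] text = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] inner = zipOf(new String[] {"c.txt"}, new byte[][] {text});
        byte[] outer = zipOf(
                new String[] {"a.txt", "skip.bin", "inner.zip"},
                new byte[][] {text, text, inner});
        List<String> seen = new ArrayList<>();
        extract(new ByteArrayInputStream(outer), new Recorder(seen, "", maxDepth));
        return seen;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo(1));
    }
}
```

Keeping the skip and recurse decisions in the handler, rather than in the extractor, is one
way to give users the control over recursion that the issue asks for. A real implementation
would presumably dispatch on the detected container type instead of assuming ZIP.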

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

