incubator-general mailing list archives

From Thilo Goetz <twgo...@gmx.de>
Subject Re: [PROPOSAL] Tika, a content analysis toolkit
Date Wed, 07 Mar 2007 19:12:46 GMT
+1 (non-binding)

Sounds like a very interesting proposal, and as you mention, I could 
imagine some interaction/collaboration with UIMA.  I would follow such a 
project closely and possibly contribute.

--Thilo

Jukka Zitting wrote:
> Hi,
> 
> [Cross-posting to announce the Tika proposal, please use
> general@incubator.apache.org for followup discussion.]
> 
> This is a proposal to start a content analysis toolkit project in the
> Apache Incubator. The live version of the proposal is available at
> http://wiki.apache.org/incubator/TikaProposal.
> 
> Comments and questions are welcome. There is also a vacant place for a
> third mentor. Once people are satisfied with the proposal I will first
> call a vote on the Lucene PMC to sponsor the proposal and then a vote
> on the Incubator PMC to accept the project for incubation.
> 
> PS. Based on quick Google and USPTO searches there doesn't seem to be
> anything that would cause trouble with the "Tika" name.
> 
> BR,
> 
> Jukka Zitting
> 
> ================================
> Tika, a content analysis toolkit
> ================================
> 
> Abstract
> --------
> 
> Tika is a toolkit for detecting and extracting metadata and structured
> text content from various documents using existing parser libraries.
> 
> Proposal
> --------
> 
> The Tika content analysis toolkit will include features for detecting
> the content types, character encodings, languages, and other
> characteristics of existing documents and for extracting structured
> text content from the documents.
> 
> The toolkit is targeted especially at search engines and other content
> indexing and analysis tools, but will also be useful for other applications
> that need to extract meaningful information from documents that might
> be presented as nothing more than binary streams.
> 
> Instead of implementing its own document parsers, Tika will use existing
> parser libraries like Jakarta POI [1] and PDFBox [2].
> 
> Background
> ----------
> 
> The initial idea for the Tika project was voiced in April 2006 by
> Jérôme Charron and Chris A. Mattmann on the Nutch mailing list. The Nutch
> parser framework and other content analysis features were seen as
> value-added components that would also benefit other projects. The idea
> received positive feedback, but lacked momentum.
> 
> The idea was revisited in August 2006 when Jukka Zitting from the
> Jackrabbit project contacted Nutch for possible cooperation with similar
> ideas. The original Tika idea gained extra momentum and a Google Code
> project was set up as a staging area for prototype code before deciding
> how to best handle the setup of a new project. After a few initial
> commits the activity again declined.
> 
> In January 2007 the idea started gaining more momentum when Rida Benjelloun
> offered to contribute the Lius project [3] to Apache Lucene and when Mark
> Harwood also started looking for a generic toolkit like Tika.
> 
> This proposal is the result of the above efforts and related discussions
> both in private and on various public forums. Some alternatives to
> incubation, like Apache Labs [4] or Jakarta Commons [5], came up during
> the discussions but we believe that taking the project to the Incubator
> is the best way to start growing a viable community to sustain the Tika
> toolkit.
> 
> Rationale
> ---------
> 
> There is ever more demand for tools that automatically analyze and index
> documents in various formats. Search engines, content repositories, and
> other tools often need to extract metadata and text content from documents
> given as little more than a simple octet stream. While there
> are a number of existing parser libraries for various document types,
> each of them comes with a custom API and there are no generic tools for
> automatically determining which parser to use for which documents.
> Currently many projects end up creating their custom content analysis
> and extraction tools.
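As an illustration of the gap described above (no generic way to pick a parser for an unlabelled octet stream), here is a minimal Java sketch. It is not taken from the proposal or from any existing codebase: the class names `ContentSniffer` and `ParserRegistry`, the three byte signatures chosen, and the string-returning parser functions are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: guesses a media type from the leading bytes of a stream.
final class ContentSniffer {
    // A few well-known leading byte signatures, checked in insertion order.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("application/zip", new byte[] {'P', 'K', 3, 4});
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
    }

    static String detect(byte[] head) {
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        // Fallback when no signature matches.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}

// Hypothetical sketch: one uniform entry point that hides per-format parser APIs.
final class ParserRegistry {
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    String extractText(byte[] document) {
        String type = ContentSniffer.detect(document);
        Function<byte[], String> parser = parsers.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.apply(document);
    }
}
```

A caller would register one function per media type, each wrapping whichever existing parser library handles that format, and then call `extractText` without knowing the format in advance.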
> 
> The Tika project attempts to remove this duplication of efforts. We
> believe that by pooling the efforts of multiple projects we will be able
> to create a generic toolkit that exceeds the capabilities and quality of
> the custom solutions of any single project. A generic toolkit project
> will also provide common ground for the developers of parser libraries
> and content applications to interact.
> 
> Initial Goals
> -------------
> 
> The initial goals of the proposed project are:
> 
>    * Viable community around the Tika codebase
> 
>    * Active relationships and possible cooperation with related
>      projects and communities
> 
>    * Generic parser API for extracting structured text content from
>      various document formats
> 
>    * Flexible metadata detection and extraction API
> 
>    * Java implementations of the metadata standards mentioned below
> 
> 
> Current Status
> ==============
> 
> Meritocracy
> -----------
> 
> All the initial committers are familiar with the meritocracy principles
> of Apache, and have already worked on the various source codebases. We will
> also follow the normal meritocracy rules with other potential contributors.
> 
> Community
> ---------
> 
> There is not yet a clear Tika community. Instead we have a number of people
> and related projects with an understanding that a shared toolkit project
> would best serve everyone's interests. The primary goal of the incubating
> project is to build a self-sustaining community around this shared vision.
> 
> Core Developers
> ---------------
> 
> The initial set of developers comes from various backgrounds, with
> different but compatible needs for the proposed project.
> 
> Alignment
> ---------
> 
> As a generic toolkit, Tika will likely be widely used by various open
> source and commercial projects both together with and independent of other
> Apache tools like Lucene Java or Jakarta POI. Other Apache projects like
> Nutch and Jackrabbit are potential candidates for using Tika as an
> embedded component.
> 
> Known Risks
> ===========
> 
> Orphaned products
> -----------------
> 
> There are a number of projects at various stages of maturity that implement
> a subset of the proposed features in Tika. For many potential users the
> existing tools are already enough, which reduces the demand for a more
> generic toolkit. This can also be seen in the slow progress of this
> proposal over the past year.
> 
> However, once the project gets started we can quickly reach the feature
> level of existing tools based on seed code from sources mentioned below.
> After that we believe we can quickly grow the developer and user
> communities based on the benefits of a generic toolkit over custom
> alternatives.
> 
> Inexperience with Open Source
> -----------------------------
> 
> All the initial developers have worked on open source before and many are
> committers and PMC members within other Apache projects.
> 
> Homogeneous Developers
> ----------------------
> 
> The initial developers come from a variety of backgrounds and with a
> variety of needs for the proposed toolkit.
> 
> Reliance on Salaried Developers
> -------------------------------
> 
> Some of the developers are paid to work on this or related projects,
> but the proposed project is not the primary task for anyone.
> 
> Relationships with Other Apache Products
> ----------------------------------------
> 
> Tika is related to at least the following Apache projects. None of
> the projects is a direct competitor for Tika, but there are many cases
> of potential overlap in functionality.
> 
>    * Apache Lucene [http://lucene.apache.org/java/]
>      The analysis part of Lucene contains code that might overlap with
>      some of the potential Tika functionality. There might also be some
>      overlap regarding the Document model in Lucene.
> 
>    * Lucene Nutch [http://lucene.apache.org/nutch/]
>      The Nutch project already contains a parser framework that does
>      many of the things that Tika is designed to do.
> 
>    * Apache Jackrabbit [http://jackrabbit.apache.org/]
>      The Jackrabbit project contains a text extraction component that
>      also implements a subset of the proposed Tika features.
> 
>    * Apache UIMA [http://incubator.apache.org/uima/]
>      The UIMA project provides a framework and pluggable tools for
>      analyzing text content and extracting information. Example tools
>      include language identification, sentence boundary detection and
>      "entity extraction" - finding references to people, places and
>      organisations. Tika could be used by UIMA to parse text but Tika
>      should be careful not to duplicate the subsequent text analysis
>      features UIMA offers.
> 
> An Excessive Fascination with the Apache Brand
> ----------------------------------------------
> 
> All of us are familiar with Apache and we have participated in Apache
> projects as contributors, committers, and PMC members. We feel that the
> Apache Software Foundation is a natural home for a project like this.
> 
> Documentation
> =============
> 
> There are bits and pieces of design discussions and other documentation
> around, see for example the following:
> 
>    * August 2006 nutch-dev: Parser design
>      http://thread.gmane.org/gmane.comp.search.nutch.devel/9685
> 
>    * September 2006 nutch-dev: Content type detection
>      http://thread.gmane.org/gmane.comp.search.nutch.devel/9969
> 
>    * October 2006 Lius tutorial
>      http://www.doculibre.com/lius/doc-1.0_en.html
> 
>    * February 2007 Tika wiki: Design discussion
>      http://code.google.com/p/tika/wiki/DesignDiscussion
> 
> Standards and conventions related to Tika include the Dublin Core [6]
> metadata set, the Shared MIME information draft [7] specification from
> freedesktop.org [8], and of course RFCs 2046 [9] and 3066 [10] for
> identifying media types and languages.
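To make the two identifier conventions mentioned above concrete, the following Java sketch normalises a media type value and checks the general shape of an RFC 3066 language tag. The class and method names are hypothetical, and the checks are deliberately loose: they validate syntax only and do not consult any registry of types or languages.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the proposal or of any existing API.
final class DocumentTags {
    // RFC 3066 shape: a primary subtag of 1-8 letters, followed by zero or
    // more "-"-separated subtags of 1-8 letters or digits.
    private static final Pattern LANGUAGE_TAG =
            Pattern.compile("[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*");

    static boolean isLanguageTag(String tag) {
        return LANGUAGE_TAG.matcher(tag).matches();
    }

    // Returns the lower-cased "type/subtype" part of a media type value,
    // dropping any parameters such as "; charset=UTF-8".
    static String baseMediaType(String value) {
        int semicolon = value.indexOf(';');
        String base = (semicolon < 0 ? value : value.substring(0, semicolon)).trim();
        int slash = base.indexOf('/');
        if (slash <= 0 || slash == base.length() - 1) {
            throw new IllegalArgumentException("Not a type/subtype pair: " + value);
        }
        return base.toLowerCase(Locale.ROOT);
    }
}
```

Lower-casing the base media type reflects that type and subtype names are matched case-insensitively, so `Text/HTML` and `text/html` end up as the same lookup key.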
> 
> See also the potential parser libraries listed below for details on the
> various document formats that Tika plans to support.
> 
> Initial Source
> ==============
> 
> Tika will start with a combination of seed code from the efforts listed
> below:
> 
>    * The Apache Nutch project that contains a parser framework and
>      various content analysis tools
> 
>    * The Lius project, an indexing framework for Apache Lucene
> 
>    * The Apache Jackrabbit project that contains a text extraction
>      component
> 
> No existing codebase is selected as "the" starting point of Tika to avoid
> inheriting the world view and design limitations of any single project.
> 
> Source and Intellectual Property Submission Plan
> ================================================
> 
> All seed code and other contributions will be handled through the normal
> Apache contribution process.
> 
> We will also contact other related efforts for possible cooperation
> and contributions.
> 
> External Dependencies
> =====================
> 
> Tika will depend on a number of external parser libraries with various
> licensing conditions. An initial list of potential dependencies is shown
> below.
> 
>    Library      URL                               License
>    ---------------------------------------------------------------------
>    Jakarta POI  http://jakarta.apache.org/poi/    ASLv2
>    PDFBox       http://www.pdfbox.org/            BSD
>    NekoHTML     http://people.apache.org/~andyc/neko/doc/html/index.html
>                                                   CyberNeko (like ASL)
>    JTidy        http://jtidy.sourceforge.net/     W3C
>     
> There are also some LGPL parser libraries that would be useful. Whether
> and how such dependencies could be handled will be discussed during
> incubation. No such dependencies will be added to the project before
> the legal implications have been cleared.
> 
> Cryptography
> ============
> 
> Tika itself will not use cryptography, but it is possible that some of
> the external parser libraries will include cryptographic code to handle
> features like DRM in various document formats.
> 
> Required Resources
> ==================
> 
> Mailing lists
> 
>    * tika-dev@incubator.apache.org
>    * tika-commits@incubator.apache.org
>    * tika-private@incubator.apache.org
> 
> Subversion Directory
> 
>    * https://svn.apache.org/repos/asf/incubator/tika
> 
> Issue Tracking
> 
>    * JIRA Tika (TIKA)
> 
> Other Resources
> 
>    * none
> 
> Initial Committers
> ==================
> 
>    Name               Email                                     CLA
>    ----------------------------------------------------------------
>    Rida Benjelloun    rida dot benjelloun at doculibre dot com  yes
>    Mark Harwood       mharwood at apache dot org                yes
>    Chris A. Mattmann  mattmann at apache dot org                yes
>    Sami Siren         siren at apache dot org                   yes
>    Jukka Zitting      jukka at apache dot org                   yes
> 
> Affiliations
> ============
> 
>    Name               Affiliation
>    -------------------------------------------------
>    Rida Benjelloun    Doculibre inc.
>    Chris A. Mattmann  NASA Jet Propulsion Laboratory
>    Jukka Zitting      Day Management AG
> 
> Sponsors
> ========
> 
> Champion
> 
>    * Jukka Zitting (jukka at apache dot org)
> 
> Nominated Mentors
> 
>    * Doug Cutting (cutting at apache dot org)
>    * Jukka Zitting (jukka at apache dot org)
> 
> Sponsoring Entity
> 
>    * TODO: Apache Lucene (vote needed)
> 
> 
> [1]  http://jakarta.apache.org/poi/
> [2]  http://www.pdfbox.org/
> [3]  http://sourceforge.net/projects/lius/
> [4]  http://labs.apache.org/
> [5]  http://jakarta.apache.org/commons/
> [6]  http://dublincore.org/
> [7]  http://freedesktop.org/wiki/Standards_2fshared_2dmime_2dinfo_2dspec
> [8]  http://freedesktop.org/
> [9]  http://www.ietf.org/rfc/rfc2046.txt
> [10] http://www.ietf.org/rfc/rfc3066.txt


