lucene-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: [jira] Lius into apache incubator
Date Thu, 01 Mar 2007 18:46:12 GMT
Hi,

On 3/1/07, Rida Benjelloun <rida.benjelloun@doculibre.com> wrote:
> On 3/1/07, Jukka Zitting <jukka.zitting@gmail.com> wrote:
> > Would there be interest within the Lucene PMC in sponsoring a proposal
> > along such lines? I can volunteer to put together the proposal and act
> > as the champion and mentor of the project.
>
> -- >> We can put together the proposal and you can be the mentor of the
> project.

See below for a quick first draft (filled with TODOs).

PS. Will people mind if we use this list for fleshing out the details?
I've created a Google Group for Tika where we could also take the
discussion if that's preferred.

BR,

Jukka Zitting


Tika Proposal
=============

This is an early draft of a possible proposal for a Tika project
within the Apache Incubator. See
http://incubator.apache.org/guides/proposal.html for a description of
the proposal template.

Abstract
--------

Tika is a toolkit for detecting and extracting metadata and text
content from various documents using existing parser libraries.

Proposal
--------

The Tika content analysis toolkit will include features for detecting
the content types, character encodings, languages, and other
characteristics of existing documents and for extracting structured
text content from the documents.

The toolkit is aimed primarily at search engines and other content
indexing and analysis tools, but will also be useful for other
applications that need to extract meaningful information from
documents that might be presented as nothing more than binary streams.

Instead of implementing its own document parsers, Tika will use
existing parser libraries like Jakarta POI and PDFBox.
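
To make the "binary stream" case concrete, here is a rough sketch of
how a content type could be guessed from the first few bytes of a
stream before handing the document to a suitable parser library. The
class and method names are hypothetical and are not part of any
existing Tika code; only the PDF and ZIP signatures are checked, and a
real detector would need many more rules.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not an existing Tika API.
class MagicSniffer {

    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04}; // "PK\003\004"

    // Guess a media type from the leading bytes of the stream.
    static String detect(InputStream in) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        // A single read() may return fewer bytes than requested, so loop.
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // end of stream
            }
            n += r;
        }
        if (startsWith(head, n, PDF)) {
            return "application/pdf";
        }
        if (startsWith(head, n, ZIP)) {
            return "application/zip";
        }
        return "application/octet-stream"; // unknown binary content
    }

    // True if the first len bytes of data begin with prefix.
    private static boolean startsWith(byte[] data, int len, byte[] prefix) {
        if (len < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would pick a parser library based on the returned type, for
example PDFBox for "application/pdf", and fall back to skipping the
document when the type is unknown.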

Background
----------

The need for tools that automatically analyze and index content is
increasing as ever more information becomes available.

TODO: Discuss the various related projects and the lack of a common
analysis toolkit. Note how many of the existing tools have grown as
ad-hoc solutions to specific needs, and are often tightly bound to a
specific application or a parser library.

Rationale
---------

TODO

Initial Goals
-------------

TODO

Current Status
--------------

TODO

Meritocracy
-----------

TODO

Community
---------

TODO

Core Developers
---------------

TODO

Alignment
---------

TODO

Known Risks
-----------

TODO: There has been on-and-off interest in something like this for
quite a while already. How can we make sure that the current increase
in interest doesn't fade away?

Orphaned products
-----------------

TODO: See the comment above

Inexperience with Open Source
-----------------------------

TODO: Many of the interested participants have an open source background.

Homogenous Developers
---------------------

TODO: There is no central company behind the proposal.

Reliance on Salaried Developers
-------------------------------

TODO: Some of us are salaried for this, others are not.

Relationships with Other Apache Products
----------------------------------------

TODO: Lucene, Nutch, Jackrabbit, Droids, ...

An Excessive Fascination with the Apache Brand
----------------------------------------------

TODO

Documentation
-------------

TODO

Initial Source
--------------

TODO: Tika, Lius, Nutch?, ...

Source and Intellectual Property Submission Plan
------------------------------------------------

TODO

External Dependencies
---------------------

TODO: Some of the potential parser libraries will be GPL-licensed or
otherwise troublesome for an ASF project. How to best handle such
cases?

Cryptography
------------

TODO: Some of the document formats involve encryption and features
like DRM. While Tika itself will probably not include any
cryptographic code, the parser dependencies will most likely include
such code.

Required Resources
------------------

Mailing lists

  * tika-dev@incubator.apache.org

Subversion Directory

  * https://svn.apache.org/repos/asf/incubator/tika

Issue Tracking

  * JIRA TIKA

Other Resources

  * none

Initial Committers
------------------

TODO

Affiliations
------------

TODO

Sponsors
--------

Champion

TODO (I can volunteer)

Nominated Mentors

TODO (Three mentors is the recommendation; I can volunteer as one)

Sponsoring Entity

TODO (Apache Lucene?)
