incubator-cvs mailing list archives

From Apache Wiki <>
Subject [Incubator Wiki] Update of "October2007" by ChrisMattmann
Date Mon, 08 Oct 2007 21:14:00 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The following page has been changed by ChrisMattmann:

The comment on the change is:
Added Tika Project Report for Oct07

  === Tika ===
+ Tika is a toolkit for detecting and extracting metadata and structured
+ text content from various documents using existing parser
+ libraries. Tika entered incubation on March 22nd, 2007.
+ ==== Community ====
+ There have been a number of positive developments within Tika during the
+ last few months. Traffic on the Tika mailing list has increased significantly
+ (typically 2 or 3 questions and 1 or 2 commits every day or every other
+ day), and there have been many recent inquiries from external projects
+ wanting to collaborate with Tika (including Aperture, PDFBox, and the
+ developer of a JSON library currently hosted at Google Code). In addition,
+ Tika's architecture has recently become a topic of active discussion (as
+ we'll see below).
+ We recently elected Keith Bennett as a new committer to Tika. Keith has been
+ spearheading many of the new patches committed to Tika, as well as
+ participating in discussions about the architecture and future direction of
+ the project.
+ Tika will be represented at the "Fast Feather" track at ApacheCon US by
+ Jukka Zitting. The rest of the community is helping to create the content
+ for the presentation. The abstract is listed below:
- ----
+ -----
+ ''Tika is a new content analysis framework born from the desire to factor out
+ commonality from the Apache Nutch search engine framework. Tika provides a
+ MIME detection framework, an extensible parsing framework and metadata
+ environment for content analysis. Though in its nascent stages, progress on
+ Tika has recently taken shape and the project is nearing a stable 0.1
+ release. In this talk, we'll describe the core APIs of Tika and discuss its use in
+ several distinct domains including search engines, scientific data
+ dissemination and an industrial setting.''
+ -----
+ ==== Development ====
+ There has been a flurry of JIRA issues and code activity [1] (including 47
+ issues currently in JIRA, with 32 resolved issues, 14 closed issues, and 2
+ open major/minor issues in progress).
+ Tika's Parser interface (one of its key components) has just undergone a
+ major overhaul led by Jukka Zitting, and Chris Mattmann has recently
+ contributed a MimeType system (with help from fellow Apache Nutch committer
+ Jerome Charron) to Tika. We also cleaned up and refactored large parts of
+ the rest of the code (removing references to LiusLite and branding the
+ project wherever possible with the Tika name), in preparation for an
+ upcoming 0.1 release.
+ Chris Mattmann has led an effort to carve out the existing MimeType
+ detection system in Apache Nutch [2] and replace it with Tika's improved
+ MimeType detection system. There is a patch sitting in JIRA right now [3],
+ and barring objections, Nutch will rely on Tika for its MimeType detection
+ abilities.
+ Also active recently were committers Bertrand Delacretaz, Sami Siren and
+ Rida Benjelloun, committing patches and improvements wherever needed.
+ ==== Issues before graduation ====
+ No changes since our last report: the Tika project is still at an
+ early stage of incubation. We need to continue bringing in the initial
+ codebases and are targeting an initial incubating release (0.1) probably
+ within the next month. We also need to work on growing the community and
+ figuring out how to best interact with external parser projects.
+  1.
+  2.
+  3.
  === TripleSoup ===
