www-announce mailing list archives

From Sally Khudairi ...@apache.org>
Subject The Apache Software Foundation Announces Apache Tika™ v1.0
Date Wed, 09 Nov 2011 13:20:06 GMT
[this announcement is also available online at http://s.apache.org/N0I]

Standards-based, Content and Metadata Detection and Analysis Toolkit Powers Large-scale, Multi-lingual,
Multi-format Repositories at Adobe, the Internet Archive, NASA Jet Propulsion Laboratory,
and more.

9 November 2011 —FOREST HILL, MD— The Apache Software Foundation (ASF), the all-volunteer
developers, stewards, and incubators of nearly 150 Open Source projects and initiatives, today
announced Apache Tika v1.0, an embeddable, lightweight toolkit for content detection and analysis. 

"The Apache Tika v1.0 release is five years in the making, providing numerous improvements
and new parsing formats," said Chris Mattmann, Apache Tika Vice President, Senior Computer
Scientist at NASA Jet Propulsion Laboratory, and University of Southern California Adjunct
Assistant Professor of Computer Science. "From a toolkit perspective, it's easy to integrate,
and provides maximum functionality with little configuration."

With the increasing amount of information available on the Internet today, automatic information
processing and retrieval is urgently needed to understand content across cultures, languages,
and continents.

Apache Tika is a one-stop shop for identifying, retrieving, and parsing text and metadata
from over 1,200 file formats including HTML, XML, Microsoft Office, OpenOffice/OpenDocument,
PDF, images, ebooks/EPUB, Rich Text, compression and packaging formats, text/audio/image/video,
Java class files and archives, email/mbox, and more. 

Tika entered the Apache Incubator in 2007, became a sub-project of Apache Lucene in 2008,
and graduated as an ASF Top-level Project (TLP) in April 2010. Apache Tika has been tested
extensively in repositories exceeding 500 million documents across a variety of applications
in industry, academia and government labs.

"At NASA, we leverage Apache Tika on several of our Earth science data system projects," explained
Dan Crichton, Program Manager and Principal Computer Scientist, NASA Jet Propulsion Laboratory.
"Tika helps us process hundreds of terabytes of scientific data in myriad formats and their
associated metadata models. Using Tika with other Apache technologies such as OODT, Lucene,
and Solr, we are able to automate, virtualize and increase the efficiency of NASA's science
data processing pipeline."

Users and software applications use Apache Tika to explore the information landscape through
flexible interfaces in Java, from the command line, through RESTful Web services, and from
a multitude of other programming languages, including Python, .NET and C++. Tika defines a
standard application programming interface (API) and makes use of existing parser libraries
such as Apache POI and PDFBox to detect and extract metadata and structured text content
from various documents.
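
As a rough illustration of the idea described above (detect a document's type, then hand it
to a format-specific parser behind one common interface), here is a minimal Java sketch. It
is not Tika's code or API: the class DetectionSketch, the TextExtractor interface, and the
small magic-byte table are all hypothetical names invented for this example, and the plain-text
fallback is deliberately crude. A real toolkit would delegate to libraries such as Apache POI
or PDFBox where this sketch registers a single trivial extractor.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: none of these names come from Apache Tika.
final class DetectionSketch {

    // Hypothetical common interface that every format-specific extractor implements.
    interface TextExtractor {
        String extract(byte[] data);
    }

    // Media type -> leading "magic" bytes, checked in insertion order.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    // Media type -> extractor. Only plain text is registered in this sketch.
    private static final Map<String, TextExtractor> EXTRACTORS = new LinkedHashMap<>();

    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("application/zip", new byte[] {'P', 'K', 3, 4});
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        EXTRACTORS.put("text/plain", data -> new String(data, StandardCharsets.UTF_8));
    }

    // Returns a media type guessed from the leading bytes of the content.
    static String detect(byte[] data) {
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(data, entry.getValue())) {
                return entry.getKey();
            }
        }
        // Crude fallback (an assumption of this sketch): a NUL byte means binary.
        for (byte b : data) {
            if (b == 0) {
                return "application/octet-stream";
            }
        }
        return "text/plain";
    }

    // Dispatches to the extractor registered for the detected type, if any.
    static String extract(byte[] data) {
        TextExtractor extractor = EXTRACTORS.get(detect(data));
        return extractor == null ? "" : extractor.extract(data);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        byte[] text = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(pdfHeader)); // prints "application/pdf"
        System.out.println(extract(text));     // prints "hello"
    }
}
```

The point of the single TextExtractor interface is that calling code never needs to know which
underlying library handles a given format; supporting a new format in this sketch means adding
one entry to each map.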


"We've used Apache Tika extensively for a wide range of content extraction tasks, including
parsing almost 600 million pages and documents from a large web crawl," said Ken Krugler,
Founder and President of Scale Unlimited. "It's proven invaluable as a simple yet robust solution
to the challenges of extracting text and metadata from the jungle of formats you find on the
web."

"Hippo CMS 7 uses Apache Jackrabbit to index content repositories containing as many as 500,000
documents," explained Arjé Cahn, CTO of Hippo. "We are exploring ways that Apache Tika can
enhance access to metadata in our faceted navigation feature, which may result in a possible
future patch."


Availability and Oversight
As with all Apache products, Apache Tika software is released under the Apache License v2.0,
and is overseen by a self-selected team of active contributors to the project. A Project Management
Committee (PMC) guides the Project’s day-to-day operations, including community development
and product releases. Apache Tika source code, documentation, and related resources are available
at http://tika.apache.org/.

Apache Tika in Action!
Apache Tika v1.0 will be featured at ApacheCon's Content Technologies track on 10 November
2011. PMC Chair Mattmann will describe the modern genesis of the project and its ecosystem,
as well as the newly launched Manning Publications book, “Tika in Action,” co-authored
by Mattmann and Zitting.

About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees nearly one hundred fifty leading
Open Source projects, including Apache HTTP Server — the world's most popular Web server
software. Through the ASF's meritocratic process known as "The Apache Way," more than 350
individual Members and 3,000 Committers successfully collaborate to develop freely available
enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions
are distributed under the Apache License; and the community actively participates in ASF mailing
lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings,
and expo. The ASF is a US 501(c)(3) not-for-profit charity, funded by individual donations
and corporate sponsors including AMD, Basis Technology, Cloudera, Facebook, Google, IBM, HP,
Matt Mullenweg, Microsoft, PSW Group, SpringSource/VMware, and Yahoo!. For more information,
visit http://www.apache.org/.

"Apache", "Apache Tika", and "ApacheCon" are trademarks of The Apache Software Foundation.
All other brands and trademarks are the property of their respective owners.

# # #



