incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "TikaProposal" by JukkaZitting
Date Thu, 01 Mar 2007 20:51:37 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The following page has been changed by JukkaZitting:
http://wiki.apache.org/incubator/TikaProposal

The comment on the change is:
Added links to related projects and discussions

------------------------------------------------------------------------------
  = Tika, a content analysis toolkit =
  
- ''This is a draft of a potential Tika proposal. Feel free to edit as you see fit; see also
the discussion section at the end of the page. You can "Subscribe" to this page to be notified
whenever changes are made. See the [http://incubator.apache.org/guides/proposal.html proposal
guide] for a description of the expected proposal content.''
+ This is a draft version of the Tika proposal. Feel free to edit as you see fit; see also
the discussion section at the end of the page. You can "Subscribe" to this page to be notified
whenever changes are made. See the [http://incubator.apache.org/guides/proposal.html proposal
guide] for a description of the expected proposal content.
+ 
+ See also the [http://thread.gmane.org/gmane.comp.search.nutch.devel/9684/focus=9693 earlier
proposal draft] by Chris Mattmann and Jerome Charron.
  
  == Abstract ==
  
@@ -10, +12 @@

  
  == Proposal ==
  
+ The Tika content analysis toolkit will include features for detecting the content types,
character encodings, languages, and other characteristics of existing documents and for extracting
structured text content from the documents.
- The Tika content analysis toolkit will include features for detecting the content types,
character encodings, languages, and other
- characteristics of existing documents and for extracting structured text content from the
documents.
  
+ The toolkit is targeted especially at search engines and other content indexing and analysis
tools, but will also be useful for other applications that need to extract meaningful information
from documents that might be presented as nothing more than binary streams.
- The toolkit is targeted especially at search engines and other content indexing and analysis
tools, but will also be useful for other
- applications that need to extract meaningful information from documents that might be presented
as nothing more than binary streams.
  
  Instead of implementing its own document parsers, Tika will use existing parser libraries
like Jakarta POI and PDFBox.
  
@@ -22, +22 @@

  
  The need for tools that automatically analyze and index content is increasing as ever more
information becomes available.
  
- ''TODO: Discuss the various related projects and the lack of a common analysis toolkit.
Note how many of the existing tools have grown as
- ad-hoc solutions to specific needs, and are often tightly bound to a specific application
or a parser library.''
+ ''TODO: Discuss the various related projects and the lack of a common analysis toolkit.
Note how many of the existing tools have grown as ad-hoc solutions to specific needs, and
are often tightly bound to a specific application or a parser library.''
+ 
+ Related discussions and events:
+ 
+  * July 2006 [http://thread.gmane.org/gmane.comp.search.nutch.devel/9373 nutch-dev: Text
extraction library]
+  * August 2006 [http://code.google.com/p/tika/ Tika project at Google Code]
+  * August 2006 [http://thread.gmane.org/gmane.comp.search.nutch.devel/9684 nutch-dev: Tika
update]
+  * August 2006 [http://thread.gmane.org/gmane.comp.search.nutch.devel/9685 nutch-dev: Parser
design]
+  * September 2006 [http://thread.gmane.org/gmane.comp.search.nutch.devel/9969 nutch-dev:
Content type detection]
+  * January 2007 [http://thread.gmane.org/gmane.comp.jakarta.lucene.devel/16888 java-dev:
Lius into apache incubator]
+  * February 2007 [http://thread.gmane.org/gmane.comp.jakarta.lucene.devel/8297 java-dev:
Lucene contribution]
+  * February 2007 [http://code.google.com/p/tika/wiki/DesignDiscussion Tika wiki: Design
discussion]
+  * ''TODO: What else?''
  
  == Rationale ==
  
@@ -55, +66 @@

  
  == Orphaned products ==
  
- ''TODO: There has been on-and-off interest in something like this for quite a while already.
How can we make sure that the current increase
+ ''TODO: There has been on-and-off interest in something like this for quite a while already.
How can we make sure that the current increase in interest doesn't fade away?''
- in interest doesn't fade away?''
  
  == Inexperience with Open Source ==
  
@@ -72, +82 @@

  
  == Relationships with Other Apache Products ==
  
- ''TODO: Lucene, Nutch, Jackrabbit, Droids, ...''
+ Tika is related to at least the following Apache projects. None of the projects is a direct
competitor for Tika, but there are many cases of potential overlap in functionality.
+ 
+  * [http://lucene.apache.org/java/ Apache Lucene] - The analysis part of Lucene contains
code that might overlap with some of the potential Tika functionality. There might also be
some overlap regarding the Document model in Lucene.
+  * [http://lucene.apache.org/nutch/ Lucene Nutch] - The Nutch project already contains a
parser framework that does many of the things that Tika is designed to do.
+  * [http://jackrabbit.apache.org/ Apache Jackrabbit] - The Jackrabbit project contains a
text extraction component that also implements a subset of the proposed Tika features.
+  * ''TODO: Other projects? Solr? The Droids lab?''
  
  == An Excessive Fascination with the Apache Brand ==
  
@@ -84, +99 @@

  
  = Initial Source =
  
- ''TODO: Tika, Lius, Nutch?, ...''
+  * http://code.google.com/p/tika/
+  * http://www.bibl.ulaval.ca/lius/
+  * http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/parse/ (?)
+  * ''TODO: What else?''
  
  = Source and Intellectual Property Submission Plan =
  
@@ -92, +110 @@

  
  = External Dependencies =
  
- ''TODO: Some of the potential parser libraries will be GPL-licensed or otherwise troublesome
for an ASF project. How to best handle such
+ ''TODO: Some of the potential parser libraries will be GPL-licensed or otherwise troublesome
for an ASF project. How to best handle such cases?''
- cases?''
+ 
+  * [http://jakarta.apache.org/poi/ Jakarta POI]
+  * [http://www.pdfbox.org/ PDFBox]
+  * ''TODO: Many others...''
  
  = Cryptography =
  
- ''TODO: Some of the document formats involve encryption and features like DRM. While
Tika itself will probably not include any
+ ''TODO: Some of the document formats involve encryption and features like DRM. While
Tika itself will probably not include any cryptographic code, the parser dependencies will
most likely include such code.''
- cryptographic code, the parser dependencies will most likely include such code.''
  
  = Required Resources =
  
@@ -120, +140 @@

  
  = Initial Committers =
  
- ''TODO''
+  * Rida Benjelloun (?)
+  * Mark Harwood (?)
+  * Jukka Zitting
+  * ''TODO: Who's interested?''
  
  = Affiliations =
  

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org

