Subject: svn commit: r1304707 - in /tika/site/src/site: apt/1.1/ apt/1.1/detection.apt apt/1.1/formats.apt apt/1.1/gettingstarted.apt apt/1.1/index.apt apt/1.1/parser.apt apt/1.1/parser_guide.apt apt/download.apt apt/index.apt site.xml Date: Sat, 24 Mar 2012 05:30:15 -0000 To: commits@tika.apache.org From: mattmann@apache.org Reply-To: dev@tika.apache.org Message-Id: <20120324053015.C8D4D23889BB@eris.apache.org> Author: mattmann Date: Sat Mar 24 05:30:14 2012 New Revision: 1304707 URL: http://svn.apache.org/viewvc?rev=1304707&view=rev 
Log: - update Tika website for 1.1 Added: tika/site/src/site/apt/1.1/ tika/site/src/site/apt/1.1/detection.apt tika/site/src/site/apt/1.1/formats.apt tika/site/src/site/apt/1.1/gettingstarted.apt tika/site/src/site/apt/1.1/index.apt tika/site/src/site/apt/1.1/parser.apt tika/site/src/site/apt/1.1/parser_guide.apt Modified: tika/site/src/site/apt/download.apt tika/site/src/site/apt/index.apt tika/site/src/site/site.xml Added: tika/site/src/site/apt/1.1/detection.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.1/detection.apt?rev=1304707&view=auto ============================================================================== --- tika/site/src/site/apt/1.1/detection.apt (added) +++ tika/site/src/site/apt/1.1/detection.apt Sat Mar 24 05:30:14 2012 @@ -0,0 +1,152 @@ + ----------------- + Content Detection + ----------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Content Detection + + This page gives you information on how content and language detection + works with Apache Tika, and how to tune the behaviour of Tika. 
+ +%{toc|section=1|fromDepth=1} + +* {The Detector Interface} + + The + {{{./api/org/apache/tika/detect/Detector.html}org.apache.tika.detect.Detector}} + interface is the basis for most of the content type detection in Apache + Tika. All the different ways of detecting content implement the + same method: + +--- +MediaType detect(java.io.InputStream input, + Metadata metadata) throws java.io.IOException +--- + + The <<<detect>>> method takes the stream to inspect, and a + <<<Metadata>>> object that holds any additional information on + the content. The detector will return a + {{{./api/org/apache/tika/mime/MediaType.html}MediaType}} object describing + its best guess as to the type of the file. + + In general, only two keys on the Metadata object are used by Detectors. + These are <<<Metadata.RESOURCE_NAME_KEY>>> which should hold the name + of the file (where known), and <<<Metadata.CONTENT_TYPE>>> which should + hold the advertised content type of the file (e.g. from a webserver or + a content repository). + + +* {Mime Magic Detection} + + By looking for special ("magic") patterns of bytes near the start of + the file, it is often possible to detect the type of the file. For + some file types, this is a simple process. For others, typically + container based formats, the magic detection may not be enough. (More + detail on detecting container formats below.) + + Tika is able to make use of a mime magic info file, in the + {{{http://www.freedesktop.org/standards/shared-mime-info}Freedesktop MIME-info}} + format, to perform mime magic detection. + + This is provided within Tika by + {{{./api/org/apache/tika/detect/MagicDetector.html}org.apache.tika.detect.MagicDetector}}. It is most commonly accessed via + {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}, + normally sourced from the <<<tika-mimetypes.xml>>> file. 
+ + +* {Resource Name Based Detection} + + Where the name of the file is known, it is sometimes possible to guess + the file type from the name or extension. 
Within the + <<<tika-mimetypes.xml>>> file is a list of patterns which are used to + identify the type from the filename. + + However, because files may be renamed, this method of detection is quick + but not always accurate. + + This is provided within Tika by + {{{./api/org/apache/tika/detect/NameDetector.html}org.apache.tika.detect.NameDetector}}. + + +* {Known Content Type "Detection"} + + Sometimes, the mime type for a file is already known, such as when + downloading from a webserver, or when retrieving from a content store. + This information can be used by detectors, such as + {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}. + + +* {The default Mime Types Detector} + + By default, the mime type detection in Tika is provided by + {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}. + This detector makes use of <<<tika-mimetypes.xml>>> to power + magic based and filename based detection. + + Firstly, magic based detection is used on the start of the file. + If the file is an XML file, then the start of the XML is processed + to look for root elements. Next, if available, the filename + (from <<<Metadata.RESOURCE_NAME_KEY>>>) is + then used to improve the detail of the detection, such as when magic + detects a text file, and the filename hints it's really a CSV. Finally, + if available, the supplied content type (from <<<Metadata.CONTENT_TYPE>>>) + is used to further refine the type. + + +* {Container Aware Detection} + + Several common file formats are actually held within a common container + format. One example is the PowerPoint .ppt and Word .doc formats, which + are both held within an OLE2 container. Another is the Apple iWork formats, + which are actually a series of XML files within a Zip file. + + Using magic detection, it is easy to spot that a given file is an OLE2 + document, or a Zip file. 
Using magic detection alone, it is very difficult + (and often impossible) to tell what kind of file lives inside the container. 
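To make the limitation concrete, here is a minimal sketch of magic-based container detection. It is not Tika's implementation and does not use the Tika API: the class name <<<MagicSketch>>>, the method names and the returned labels are all hypothetical. The only facts it relies on are the widely documented leading signatures of OLE2 compound documents and Zip files.

```java
import java.util.Arrays;

public class MagicSketch {
    // Signature at the start of an OLE2 compound document (.doc, .ppt, ...).
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature at the start of a typical Zip file.
    static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    // True if the first bytes of the file match the given signature.
    static boolean startsWith(byte[] header, byte[] magic) {
        return header.length >= magic.length
                && Arrays.equals(Arrays.copyOf(header, magic.length), magic);
    }

    // Returns a hypothetical container label, not a Tika MediaType.
    static String detectContainer(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return "ole2";
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return "zip";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // A plain Zip archive and a Zip-based document start with the same
        // bytes, so this check alone cannot tell them apart.
        byte[] zipHeader = { 'P', 'K', 0x03, 0x04, 0x14, 0x00 };
        System.out.println(detectContainer(zipHeader));
    }
}
```

The sketch can name the container, but every Zip-based format yields the same label; telling them apart means opening the container and looking at its entries.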
+ + For some use cases, speed is important, so having a quick way to know the + container type is sufficient. For other cases however, you don't mind + spending a bit of time (and memory!) processing the container to get a + more accurate answer on its contents. For these cases, a container + aware detector should be used. + + Tika provides a wrapping detector in the parsers bundle, + {{{./api/org/apache/tika/detect/ContainerAwareDetector.html}org.apache.tika.detect.ContainerAwareDetector}}. + This detector will check for certain known containers, and if found, + will open them and detect the appropriate type based on the contents. + If the file isn't a known container, it will fall back to another + detector for the answer (most commonly the default + <<<MimeTypes>>> detector). + + Because this detector needs to read the whole file to process the + container, it must be used with a + {{{./api/org/apache/tika/io/TikaInputStream.html}org.apache.tika.io.TikaInputStream}}. + If called with a regular <<<InputStream>>>, then all work will be done + by the fallback detector. + + For more information on container formats and Tika, see + {{{http://wiki.apache.org/tika/MetadataDiscussion}}} + + +* {Language Detection} + + Tika is able to help identify the language of a piece of text, which + is useful when extracting text from document formats which do not include + language information in their metadata. 
+ + The language detection is provided by + {{{./api/org/apache/tika/language/LanguageIdentifier.html}org.apache.tika.language.LanguageIdentifier}}. Added: tika/site/src/site/apt/1.1/formats.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.1/formats.apt?rev=1304707&view=auto ============================================================================== --- tika/site/src/site/apt/1.1/formats.apt (added) +++ tika/site/src/site/apt/1.1/formats.apt Sat Mar 24 05:30:14 2012 @@ -0,0 +1,145 @@ + -------------------------- + Supported Document Formats + -------------------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Supported Document Formats + + This page lists all the document formats supported by Apache Tika 1.1. + Follow the links to the various parser class javadocs for more detailed + information about each document format and how it is parsed by Tika. + +%{toc|section=1|fromDepth=1} + +* {HyperText Markup Language} + + The HyperText Markup Language (HTML) is the lingua franca of the web. + Tika uses the {{{http://home.ccil.org/~cowan/XML/tagsoup/}TagSoup}} + library to support virtually any kind of HTML found on the web. 
+ The output from the + {{{api/org/apache/tika/parser/html/HtmlParser.html}HtmlParser}} class + is guaranteed to be well-formed and valid XHTML, and various heuristics + are used to prevent things like inline scripts from cluttering the + extracted text content. + +* {XML and derived formats} + + The Extensible Markup Language (XML) format is a generic format that can + be used for all kinds of content. Tika has custom parsers for some widely + used XML vocabularies like XHTML, OOXML and ODF, but the default + {{{api/org/apache/tika/parser/xml/DcXMLParser.html}DcXMLParser}} + class simply extracts the text content of the document and ignores any XML + structure. The only exceptions to this rule are Dublin Core metadata + elements, which are used for the document metadata. + +* {Microsoft Office document formats} + + Microsoft Office and some related applications produce documents in the + generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The + older OLE 2 format was introduced in Microsoft Office version 97 and was + the default format until Office version 2007 and the new XML-based + OOXML format. The + {{{api/org/apache/tika/parser/microsoft/OfficeParser.html}OfficeParser}} + and + {{{api/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.html}OOXMLParser}} + classes use {{{http://poi.apache.org/}Apache POI}} libraries to support + text and metadata extraction from both OLE2 and OOXML documents. + +* {OpenDocument Format} + + The OpenDocument format (ODF) is used most notably as the default format + of the OpenOffice.org office suite. The + {{{api/org/apache/tika/parser/odf/OpenDocumentParser.html}OpenDocumentParser}} + class supports this format and the earlier OpenOffice 1.0 format on which + ODF is based. + +* {Portable Document Format} + + The {{{api/org/apache/tika/parser/pdf/PDFParser.html}PDFParser}} class + parses Portable Document Format (PDF) documents using the + {{{http://pdfbox.apache.org/}Apache PDFBox}} library. 
+ +* {Electronic Publication Format} + + The {{{api/org/apache/tika/parser/epub/EpubParser.html}EpubParser}} class + supports the Electronic Publication Format (EPUB) used for many digital + books. + +* {Rich Text Format} + + The {{{api/org/apache/tika/parser/rtf/RTFParser.html}RTFParser}} class + uses the standard javax.swing.text.rtf feature to extract text content + from Rich Text Format (RTF) documents. + +* {Compression and packaging formats} + + Tika uses the {{{http://commons.apache.org/compress/}Commons Compress}} + library to support various compression and packaging formats. The + {{{api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}} + class and its subclasses first parse the top level compression or + packaging format and then pass the unpacked document streams to a + second parsing stage using the parser instance specified in the + parse context. + +* {Text formats} + + Extracting text content from plain text files seems like a simple task + until you start thinking of all the possible character encodings. The + {{{api/org/apache/tika/parser/txt/TXTParser.html}TXTParser}} class uses + encoding detection code from the {{{http://site.icu-project.org/}ICU}} + project to automatically detect the character encoding of a text document. + +* {Audio formats} + + Tika can detect several common audio formats and extract metadata + from them. Even text extraction is supported for some audio files that + contain lyrics or other textual content. The + {{{api/org/apache/tika/parser/audio/AudioParser.html}AudioParser}} + and {{{api/org/apache/tika/parser/audio/MidiParser.html}MidiParser}} + classes use standard javax.sound features to process simple audio + formats, and the + {{{api/org/apache/tika/parser/mp3/Mp3Parser.html}Mp3Parser}} class + adds support for the widely used MP3 format. 
+ +* {Image formats} + + The {{{api/org/apache/tika/parser/image/ImageParser.html}ImageParser}} + class uses the standard javax.imageio feature to extract simple metadata + from image formats supported by the Java platform. More complex image + metadata is available through the + {{{api/org/apache/tika/parser/jpeg/JpegParser.html}JpegParser}} class + that uses the metadata-extractor library to support Exif metadata + extraction from JPEG images. + +* {Video formats} + + Currently Tika only supports the Flash video format using a simple + parsing algorithm implemented in the + {{{api/org/apache/tika/parser/flv/FLVParser}FLVParser}} class. + +* {Java class files and archives} + + The {{{api/org/apache/tika/parser/asm/ClassParser}ClassParser}} class + extracts class names and method signatures from Java class files, and + the {{{api/org/apache/tika/parser/pkg/ZipParser.html}ZipParser}} class + also supports jar archives. + +* {The mbox format} + + The {{{api/org/apache/tika/parser/mbox/MboxParser.html}MboxParser}} can + extract email messages from the mbox format used by many email archives + and Unix-style mailboxes. Added: tika/site/src/site/apt/1.1/gettingstarted.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.1/gettingstarted.apt?rev=1304707&view=auto ============================================================================== --- tika/site/src/site/apt/1.1/gettingstarted.apt (added) +++ tika/site/src/site/apt/1.1/gettingstarted.apt Sat Mar 24 05:30:14 2012 @@ -0,0 +1,228 @@ + -------------------------------- + Getting Started with Apache Tika + -------------------------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. 
+~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Getting Started with Apache Tika + + This document describes how to build Apache Tika from sources and + how to start using Tika in an application. + +Getting and building the sources + + To build Tika from sources you first need to either + {{{../download.html}download}} a source release or + {{{../source-repository.html}checkout}} the latest sources from + version control. + + Once you have the sources, you can build them using the + {{{http://maven.apache.org/}Maven 2}} build system. Executing the + following command in the base directory will build the sources + and install the resulting artifacts in your local Maven repository. + +--- +mvn install +--- + + See the Maven documentation for more information about the available + build options. + + Note that you need Java 5 or higher to build Tika. + +Build artifacts + + The Tika 1.1 build consists of a number of components and produces + the following main binaries: + + [tika-core/target/tika-core-1.1.jar] + Tika core library. Contains the core interfaces and classes of Tika, + but none of the parser implementations. Depends only on Java 5. + + [tika-parsers/target/tika-parsers-1.1.jar] + Tika parsers. Collection of classes that implement the Tika Parser + interface based on various external parser libraries. + + [tika-app/target/tika-app-1.1.jar] + Tika application. 
Combines the above libraries and all the external + parser libraries into a single runnable jar with a GUI and a command + line interface. + + [tika-bundle/target/tika-bundle-1.1.jar] + Tika bundle. An OSGi bundle that includes everything you need to use all + Tika functionality in an OSGi environment. + +Using Tika as a Maven dependency + + The core library, tika-core, contains the key interfaces and classes of Tika + and can be used by itself if you don't need the full set of parsers from + the tika-parsers component. The tika-core dependency looks like this: + +--- + <dependency> + <groupId>org.apache.tika</groupId> + <artifactId>tika-core</artifactId> + <version>1.1</version> + </dependency> +--- + + If you want to use Tika to parse documents (instead of simply detecting + document types, etc.), you'll want to depend on tika-parsers instead: + +--- + <dependency> + <groupId>org.apache.tika</groupId> + <artifactId>tika-parsers</artifactId> + <version>1.1</version> + </dependency> +--- + + Note that adding this dependency will introduce a number of + transitive dependencies to your project, including one on tika-core. + You need to make sure that these dependencies won't conflict with your + existing project dependencies. The listing below shows all the + compile-scope dependencies of tika-parsers in the Tika 1.1 release. 
+ +--- ++- org.apache.tika:tika-core:jar:1.1:compile ++- org.gagravarr:vorbis-java-tika:jar:0.1:compile +| \- org.gagravarr:vorbis-java-core:jar:tests:0.1:runtime ++- org.apache.felix:org.apache.felix.scr.annotations:jar:1.6.0:provided ++- edu.ucar:netcdf:jar:4.2-min:compile +| \- org.slf4j:slf4j-api:jar:1.5.6:compile ++- org.apache.james:apache-mime4j-core:jar:0.7:compile ++- org.apache.james:apache-mime4j-dom:jar:0.7:compile ++- org.apache.commons:commons-compress:jar:1.3:compile ++- commons-codec:commons-codec:jar:1.5:compile ++- org.apache.pdfbox:pdfbox:jar:1.6.0:compile +| +- org.apache.pdfbox:fontbox:jar:1.6.0:compile +| +- org.apache.pdfbox:jempbox:jar:1.6.0:compile +| \- commons-logging:commons-logging:jar:1.1.1:compile ++- org.bouncycastle:bcmail-jdk15:jar:1.45:compile ++- org.bouncycastle:bcprov-jdk15:jar:1.45:compile ++- org.apache.poi:poi:jar:3.8-beta5:compile ++- org.apache.poi:poi-scratchpad:jar:3.8-beta5:compile ++- org.apache.poi:poi-ooxml:jar:3.8-beta5:compile +| +- org.apache.poi:poi-ooxml-schemas:jar:3.8-beta5:compile +| | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile +| \- dom4j:dom4j:jar:1.6.1:compile ++- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile ++- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile ++- asm:asm:jar:3.1:compile ++- com.googlecode.mp4parser:isoparser:jar:1.0-beta-5:compile +| \- net.sf.scannotation:scannotation:jar:1.0.2:compile +| \- javassist:javassist:jar:3.6.0.GA:compile ++- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile ++- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile ++- rome:rome:jar:0.9:compile +| \- jdom:jdom:jar:1.0:compile ++- org.gagravarr:vorbis-java-core:jar:0.1:compile ++- junit:junit:jar:4.10:test +| \- org.hamcrest:hamcrest-core:jar:1.1:test ++- org.mockito:mockito-core:jar:1.7:test +| \- org.objenesis:objenesis:jar:1.0:test +\- org.slf4j:slf4j-log4j12:jar:1.5.6:test + \- log4j:log4j:jar:1.2.14:test + +--- + +Using Tika in an Ant project + + Unless you use a 
dependency manager tool like + {{{http://ant.apache.org/ivy/}Apache Ivy}}, to use Tika in your application + you can include the Tika jar files and the dependencies individually. + +--- + + ... + + + + + + + + + + + + + + + + + + + + +--- + + An easy way to gather all these libraries is to run + "mvn dependency:copy-dependencies" in the tika-parsers source directory. + This will copy all Tika dependencies to the <<<target/dependency>>> + directory. + + Alternatively you can simply add the entire tika-app jar to your + classpath to get all of the above dependencies in a single archive. + +Using Tika as a command line utility + + The Tika application jar (tika-app-1.1.jar) can be used as a command + line utility for extracting text content and metadata from all sorts of + files. This runnable jar contains all the dependencies it needs, so + you don't need to worry about classpath settings to run it. + + The usage instructions are shown below. + +--- +usage: java -jar tika-app-1.1.jar [option] [file] + +Options: + -? or --help Print this usage message + -v or --verbose Print debug level messages + -g or --gui Start the Apache Tika GUI + -x or --xml Output XHTML content (default) + -h or --html Output HTML content + -t or --text Output plain text content + -m or --metadata Output only metadata + +Description: + Apache Tika will parse the file(s) specified on the + command line and output the extracted text content + or metadata to standard output. + + Instead of a file name you can also specify the URL + of a document to be parsed. + + If no file name or URL is specified (or the special + name "-" is used), then the standard input stream + is parsed. + + Use the "--gui" (or "-g") option to start + the Apache Tika GUI. You can drag and drop files + from a normal file explorer to the GUI window to + extract text content and metadata from the files. +--- + + You can also use the jar as a component in a Unix pipeline or + as an external tool in many scripting languages. 
+ +--- +# Check if an Internet resource contains a specific keyword +curl http://.../document.doc \ + | java -jar tika-app-1.1.jar --text \ + | grep -q keyword +--- Added: tika/site/src/site/apt/1.1/index.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.1/index.apt?rev=1304707&view=auto ============================================================================== --- tika/site/src/site/apt/1.1/index.apt (added) +++ tika/site/src/site/apt/1.1/index.apt Sat Mar 24 05:30:14 2012 @@ -0,0 +1,176 @@ + --------------- + Apache Tika 1.1 + --------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Apache Tika 1.1 + + + The most notable changes in Tika 1.1 over the previous release are: + + * Link Extraction: The rel attribute is now extracted from links + by the LinkContentHandler. + ({{{http://issues.apache.org/jira/browse/TIKA-824}TIKA-824}}) + + * MP3: Fixed handling of UTF-16 (two byte) ID3v2 tags (previously + the last character in a UTF-16 tag could be corrupted) + ({{{http://issues.apache.org/jira/browse/TIKA-793}TIKA-793}}) + + * Performance: Loading of the default media type registry is now + significantly faster. 
+ ({{{http://issues.apache.org/jira/browse/TIKA-780}TIKA-780}}) + + * PDF: Allow controlling whether overlapping duplicated text should + be removed. Disabling this (the default) can give big speedups to + text extraction and may work around cases where non-duplicated + characters were incorrectly removed + ({{{http://issues.apache.org/jira/browse/TIKA-767}TIKA-767}}). + Allow controlling whether text tokens should be sorted by their x/y + position before extracting text + ({{{http://issues.apache.org/jira/browse/TIKA-612}TIKA-612}}); + this is necessary for certain PDFs. Fixed cases where too many +

 tags appear in the XHTML output, causing NPE when opening + some PDFs with the GUI + ({{{http://issues.apache.org/jira/browse/TIKA-778}TIKA-778}}). + + * RTF: Fixed case where a font change would result in processing + bytes in the wrong font's charset, producing bogus text output + ({{{http://issues.apache.org/jira/browse/TIKA-777}TIKA-777}}). + Don't output whitespace in ignored group states, avoiding + excessive whitespace output + ({{{http://issues.apache.org/jira/browse/TIKA-781}TIKA-781}}). + Binary embedded content (using \bin control word) is now skipped + correctly; previously it could cause the parser to incorrectly + extract binary content as text + ({{{http://issues.apache.org/jira/browse/TIKA-782}TIKA-782}}). + + * CLI: New TikaCLI option "--list-detectors", which displays the + mimetype detectors that are available, similar to the existing + "--list-parsers" option for parsers + ({{{http://issues.apache.org/jira/browse/TIKA-785}TIKA-785}}). + + * Detectors: The order of detectors, as supplied via the service + registry loader, is now controlled. User supplied detectors are + preferred, then Tika detectors (such as the container aware ones), + and finally the core Tika MimeTypes is used as a backup. This + allows for specific, detailed detectors to take preference over + the default mime magic + filename detector. + ({{{http://issues.apache.org/jira/browse/TIKA-786}TIKA-786}}) + + * Microsoft Project (MPP): Filetype detection has been fixed, and + basic metadata (but no text) is now extracted. + ({{{http://issues.apache.org/jira/browse/TIKA-789}TIKA-789}}) + + * Outlook: fixed NullPointerException in TikaGUI when messages with + embedded RTF or HTML content were filtered + ({{{http://issues.apache.org/jira/browse/TIKA-801}TIKA-801}}). + + * Ogg Vorbis and FLAC: Parser added for Ogg Vorbis and FLAC audio + files, which extract audio metadata and tags + ({{{http://issues.apache.org/jira/browse/TIKA-747}TIKA-747}}). 
+ + * MP4: Improved mime magic detection for MP4 based formats (including + QuickTime, MP4 Video and Audio, and 3GPP) + ({{{http://issues.apache.org/jira/browse/TIKA-851}TIKA-851}}). + + * MP4: Basic metadata extracting parser for MP4 files added, which includes + limited audio and video metadata, along with the iTunes media metadata + (such as Artist and Title) + ({{{http://issues.apache.org/jira/browse/TIKA-852}TIKA-852}}). + + * Document Passwords: A new ParseContext object, PasswordProvider, + has been added. This provides a way to supply the password for + a document during processing. Currently, only password protected + PDFs and Microsoft OOXML Files are supported. + ({{{http://issues.apache.org/jira/browse/TIKA-850}TIKA-850}}). + + The following people have contributed to Tika 1.1 by submitting or + commenting on the issues resolved in this release: + + * Alex Ott + + * Alexander Chow + + * Ali Oral + + * Andrzej Bialecki + + * Antoni Mylka + + * Arjohn Kampman + + * Bastian Mathes + + * Chris A. Mattmann + + * Craig Stires + + * David Tran + + * Etienne Jouvin + + * Fabian Lange + + * Geoff Jarrad + + * Jan Høydahl + + * Jerome Lacoste + + * John Mastarone + + * Jukka Zitting + + * Julien Nioche + + * Ken Krugler + + * Lau Brino + + * Markus Jelsma + + * Maxim Valyanskiy + + * Michael McCandless + + * Nick Burch + + * Pablo Queixalos + + * Paul Hill + + * Paul Pearcy + + * peter royal + + * PNS + + * Radek + + * Ray Gauss II + + * Stephan Mühlstrasser + + * Swapna Vuppala + + * Torsten Krah + + * William Seemann + + * Yegor Kozlov + + See {{http://s.apache.org/Jn4}} for more details on these contributions. 
Added: tika/site/src/site/apt/1.1/parser.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.1/parser.apt?rev=1304707&view=auto ============================================================================== --- tika/site/src/site/apt/1.1/parser.apt (added) +++ tika/site/src/site/apt/1.1/parser.apt Sat Mar 24 05:30:14 2012 @@ -0,0 +1,245 @@ + -------------------- + The Parser interface + -------------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +The Parser interface + + The + {{{api/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}} + interface is the key concept of Apache Tika. It hides the complexity of + different file formats and parsing libraries while providing a simple and + powerful mechanism for client applications to extract structured text + content and metadata from all sorts of documents. 
All this is achieved + with a single method: + +--- +void parse( + InputStream stream, ContentHandler handler, Metadata metadata, + ParseContext context) throws IOException, SAXException, TikaException; +--- + + The <<<parse>>> method takes the document to be parsed and related metadata + as input and outputs the results as XHTML SAX events and extra metadata. 
+ The parse context argument is used to specify context information (like + the current locale) that is not related to any individual document. + The main criteria that lead to this design were: + + [Streamed parsing] The interface should require neither the client + application nor the parser implementation to keep the full document + content in memory or spooled to disk. This allows even huge documents + to be parsed without excessive resource requirements. + + [Structured content] A parser implementation should be able to + include structural information (headings, links, etc.) in the extracted + content. A client application can use this information for example to + better judge the relevance of different parts of the parsed document. + + [Input metadata] A client application should be able to include metadata + like the file name or declared content type with the document to be + parsed. The parser implementation can use this information to better + guide the parsing process. + + [Output metadata] A parser implementation should be able to return + document metadata in addition to document content. Many document + formats contain metadata like the name of the author that may be useful + to client applications. + + [Context sensitivity] While the default settings and behaviour of Tika + parsers should work well for most use cases, there are still situations + where more fine-grained control over the parsing process is desirable. + It should be easy to inject such context-specific information to the + parsing process without breaking the layers of abstraction. + + [] + + These criteria are reflected in the arguments of the <<<parse>>> method. + +* Document input stream + + The first argument is an + {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}InputStream}} + for reading the document to be parsed. 
+ + If this document stream cannot be read, then parsing stops and the thrown + {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html}IOException}} + is passed up to the client application. If the stream can be read but + not parsed (for example if the document is corrupted), then the parser + throws a {{{api/org/apache/tika/exception/TikaException.html}TikaException}}. + + The parser implementation will consume this stream but <will not close it>. + Closing the stream is the responsibility of the client application that + opened it in the first place. The recommended pattern for using streams + with the <<<parse>>> method is: + +--- +InputStream stream = ...; // open the stream +try { + parser.parse(stream, ...); // parse the stream +} finally { + stream.close(); // close the stream +} +--- + + Some document formats like the OLE2 Compound Document Format used by + Microsoft Office are best parsed as random access files. In such cases the + content of the input stream is automatically spooled to a temporary file + that gets removed once parsed. A future version of Tika may make it possible + to avoid this extra file if the input document is already a file in the + local file system. See + {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status + of this feature request. + +* XHTML SAX events + + The parsed content of the document stream is returned to the client + application as a sequence of XHTML SAX events. XHTML is used to express + structured content of the document and SAX events enable streamed + processing. Note that the XHTML format is used here only to convey + structural information, not to render the documents for browsing! + + The XHTML SAX events produced by the parser implementation are sent to a + {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}} + instance given to the <<<parse>>> method.
If the content handler + fails to process an event, then parsing stops and the thrown + {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/SAXException.html}SAXException}} + is passed up to the client application. + + The overall structure of the generated event stream is (with indenting + added for clarity): + +--- +<html xmlns="http://www.w3.org/1999/xhtml"> + <head> + <title>...</title> + </head> + <body> + ... + </body> +</html> +--- + + Parser implementations typically use the + {{{apidocs/org/apache/tika/sax/XHTMLContentHandler.html}XHTMLContentHandler}} + utility class to generate the XHTML output. + + Dealing with the raw SAX events can be a bit complex, so Apache Tika + comes with a number of utility classes that can be used to process and + convert the event stream to other representations. + + For example, the + {{{api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}} + class can be used to extract just the body part of the XHTML output and + feed it either as SAX events to another content handler or as characters + to an output stream, a writer, or simply a string. The following code + snippet parses a document from the standard input stream and outputs the + extracted text content to standard output: + +--- +ContentHandler handler = new BodyContentHandler(System.out); +parser.parse(System.in, handler, ...); +--- + + Another useful class is + {{{api/org/apache/tika/parser/ParsingReader.html}ParsingReader}} that + uses a background thread to parse the document and returns the extracted + text content as a character stream: + +--- +InputStream stream = ...; // the document to be parsed +Reader reader = new ParsingReader(parser, stream, ...); +try { + ...; // read the document text using the reader +} finally { + reader.close(); // the document stream is closed automatically +} +--- + +* Document metadata + + The third argument to the <<<parse>>> method is used to pass document + metadata both in and out of the parser. Document metadata is expressed + as an {{{api/org/apache/tika/metadata/Metadata.html}Metadata}} object.
+ + The following are some of the more interesting metadata properties: + + [Metadata.RESOURCE_NAME_KEY] The name of the file or resource that contains + the document. + + A client application can set this property to allow the parser to use + file name heuristics to determine the format of the document. + + The parser implementation may set this property if the file format + contains the canonical name of the file (for example the Gzip format + has a slot for the file name). + + [Metadata.CONTENT_TYPE] The declared content type of the document. + + A client application can set this property based on, for example, an HTTP + Content-Type header. The declared content type may help the parser to + correctly interpret the document. + + The parser implementation sets this property to the content type according + to which the document was parsed. + + [Metadata.TITLE] The title of the document. + + The parser implementation sets this property if the document format + contains an explicit title field. + + [Metadata.AUTHOR] The name of the author of the document. + + The parser implementation sets this property if the document format + contains an explicit author field. + + [] + + Note that metadata handling is still being discussed by the Tika development + team, and it is likely that there will be some (backwards incompatible) + changes in metadata handling before Tika 1.0. + +* Parse context + + The final argument to the <<<parse>>> method is used to inject + context-specific information into the parsing process. This is useful + for example when dealing with locale-specific date and number formats + in Microsoft Excel spreadsheets. Another important use of the parse + context is passing in the delegate parser instance to be used by + two-phase parsers like the + {{{api/org/apache/parser/pkg/PackageParser.html}PackageParser}} subclasses. + Some parser classes allow customization of the parsing process through + strategy objects in the parse context.
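The idea of injecting context-specific information can be sketched without Tika on the classpath as a small container keyed by class. The class TypedContextSketch and its set, get and formatNumber methods are hypothetical stand-ins written for illustration; they are not Tika's ParseContext API and only mirror the idea of looking up a context object (here a java.util.Locale) by its type, with a fallback when the client supplied nothing.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a parse context: a container keyed by class.
// This is NOT Tika's ParseContext; it only illustrates the idea.
final class TypedContextSketch {

    private final Map<Class<?>, Object> entries = new HashMap<>();

    // Register a context object under the given type.
    <T> void set(Class<T> type, T value) {
        entries.put(type, value);
    }

    // Look up a context object, falling back to a default when none was set.
    <T> T get(Class<T> type, T defaultValue) {
        Object value = entries.get(type);
        return value != null ? type.cast(value) : defaultValue;
    }

    // A parser-like routine that formats a number using the injected locale.
    static String formatNumber(double number, TypedContextSketch context) {
        Locale locale = context.get(Locale.class, Locale.ROOT);
        return String.format(locale, "%.2f", number);
    }

    public static void main(String[] args) {
        TypedContextSketch context = new TypedContextSketch();
        // No locale injected yet: the routine falls back to Locale.ROOT.
        System.out.println(formatNumber(1234.5, context));
        // Inject a locale; the same routine now uses it for formatting.
        context.set(Locale.class, Locale.GERMANY);
        System.out.println(formatNumber(1234.5, context));
    }
}
```

Keying the container by class keeps the method signature stable: a new kind of context object can be added by the client without changing the routine that receives the context, and code that does not care about a given object simply never looks it up.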
+ +* Parser implementations + + Apache Tika comes with a number of parser classes for parsing + {{{formats.html}various document formats}}. You can also extend Tika + with your own parsers, and of course any contributions to Tika are + warmly welcome. + + The goal of Tika is to reuse existing parser libraries like + {{{http://www.pdfbox.org/}PDFBox}} or + {{{http://poi.apache.org/}Apache POI}} as much as possible, and so most + of the parser classes in Tika are adapters to such external libraries. + + Tika also contains some general purpose parser implementations that are + not targeted at any specific document formats. The most notable of these + is the {{{apidocs/org/apache/tika/parser/AutoDetectParser.html}AutoDetectParser}} + class that encapsulates all Tika functionality into a single parser that + can handle any type of document. This parser will automatically determine + the type of the incoming document based on various heuristics and will then + parse the document accordingly. Added: tika/site/src/site/apt/1.1/parser_guide.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.1/parser_guide.apt?rev=1304707&view=auto ============================================================================== --- tika/site/src/site/apt/1.1/parser_guide.apt (added) +++ tika/site/src/site/apt/1.1/parser_guide.apt Sat Mar 24 05:30:14 2012 @@ -0,0 +1,135 @@ + -------------------------------------------- + Get Tika parsing up and running in 5 minutes + -------------------------------------------- + Arturo Beltran + -------------------------------------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License.
You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Get Tika parsing up and running in 5 minutes + + This page is a quick start guide showing how to add a new parser to Apache Tika. + By following the simple steps listed below, you can have your new parser running in only 5 minutes. + +%{toc|section=1|fromDepth=1} + +* {Getting Started} + + The {{{gettingstarted.html}Getting Started}} document describes how to + build Apache Tika from sources and how to start using Tika in an application. Pay close attention + and follow the instructions in the "Getting and building the sources" section. + + +* {Add your MIME-Type} + + You first need to modify {{{http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml}tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml}} + so that Tika can map the file extension to its MIME-Type. You should add something like this: + +--- + + + +--- + +* {Create your Parser class} + + Now, you need to create your new parser. This is a class that must implement the Parser interface + offered by Tika. A very simple Tika Parser looks like this: + +--- +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License.
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + * @Author: Arturo Beltran + */ +package org.apache.tika.parser.hello; + +import java.io.IOException; +import java.io.InputStream; +import java.util.Collections; +import java.util.Set; + +import org.apache.tika.exception.TikaException; +import org.apache.tika.metadata.Metadata; +import org.apache.tika.mime.MediaType; +import org.apache.tika.parser.ParseContext; +import org.apache.tika.parser.Parser; +import org.apache.tika.sax.XHTMLContentHandler; +import org.xml.sax.ContentHandler; +import org.xml.sax.SAXException; + +public class HelloParser implements Parser { + + private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("hello")); + public static final String HELLO_MIME_TYPE = "application/hello"; + + public Set<MediaType> getSupportedTypes(ParseContext context) { + return SUPPORTED_TYPES; + } + + public void parse( + InputStream stream, ContentHandler handler, + Metadata metadata, ParseContext context) + throws IOException, SAXException, TikaException { + + metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE); + metadata.set("Hello", "World"); + + XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); + xhtml.startDocument(); + xhtml.endDocument(); + } + + /** + * @deprecated This method will be removed in Apache Tika 1.0.
+ */ + public void parse( + InputStream stream, ContentHandler handler, Metadata metadata) + throws IOException, SAXException, TikaException { + parse(stream, handler, metadata, new ParseContext()); + } +} +--- + + Pay special attention to the definition of the SUPPORTED_TYPES static class + field in the parser class that defines what MIME-Types it supports. + + The "parse" method is where you will do all your work: that is, extract + the information from the resource and then set the metadata. + +* {List the new parser} + + Finally, you should explicitly tell the AutoDetectParser to include your new + parser. This step is only needed if you want to use the AutoDetectParser functionality. + If you figure out the correct parser in a different way, it isn't needed. + + List your new parser in: + {{{http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser}tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser}} + + Modified: tika/site/src/site/apt/download.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/download.apt?rev=1304707&r1=1304706&r2=1304707&view=diff ============================================================================== --- tika/site/src/site/apt/download.apt (original) +++ tika/site/src/site/apt/download.apt Sat Mar 24 05:30:14 2012 @@ -19,19 +19,19 @@ Download Apache Tika - Apache Tika 1.0 is now available. - See the {{{http://www.apache.org/dist/tika/CHANGES-1.0.txt}CHANGES.txt}} + Apache Tika 1.1 is now available. + See the {{{http://www.apache.org/dist/tika/CHANGES-1.1.txt}CHANGES.txt}} file for more information on the list of updates in this initial release.
- * {{{http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.0-src.zip}apache-tika-1.0-src.zip}} - (source archive, {{{http://www.apache.org/dist/tika/apache-tika-1.0-src.zip.asc}PGP signature}})\ - SHA1: <<<203d84b56c5b8879ce04b496e9b7421387ea386e>>>\ - MD5: <<<65e82bb15754bbc9f7122dcaf6813831>>> + * {{{http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.1-src.zip}apache-tika-1.1-src.zip}} + (source archive, {{{http://www.apache.org/dist/tika/apache-tika-1.1-src.zip.asc}PGP signature}})\ + SHA1: <<>>\ + MD5: <<<927134622b1c445b5f814f47495495a1>>> * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.0.jar}tika-app-1.0.jar}} (runnable jar, {{{http://www.apache.org/dist/tika/tika-app-1.0.jar.asc}PGP signature}})\ - SHA1: <<<25c6e1a77b5e88f8e23db6c074ec95b9b24fb7f2>>>\ - MD5: <<<9f94067bab5258e70ffa6a79357c11ef>>> + SHA1: <<<6c442b0b4b4dfa2d80c78ecaa70b9a5be8a86991>>>\ + MD5: <<>> [] Modified: tika/site/src/site/apt/index.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/index.apt?rev=1304707&r1=1304706&r2=1304707&view=diff ============================================================================== --- tika/site/src/site/apt/index.apt (original) +++ tika/site/src/site/apt/index.apt Sat Mar 24 05:30:14 2012 @@ -23,7 +23,7 @@ Apache Tika - a content analysis toolkit structured text content from various documents using existing parser libraries. You can find the latest release on the {{{./download.html}download page}}. See the - {{{./0.10/gettingstarted.html}Getting Started}} guide for instructions on + {{{./0.11/gettingstarted.html}Getting Started}} guide for instructions on how to start using Tika. Tika is a project of the @@ -32,6 +32,14 @@ Apache Tika - a content analysis toolkit Latest News + [23 March 2012: Apache Tika Release] + Apache Tika 1.1 is out the door! We've made a number of improvements to + PDF, RTF and MP3 parsing. We've also provided some new features on the + command line including the ability to list detectors. 
Other bug fixes and + improvements are listed in the {{{http://www.apache.org/dist/tika/CHANGES-1.1.txt}CHANGES.txt}} + file for this release. Have a look at the download page for more information + on the release. + [7 November 2011: Apache Tika Release] Apache Tika 1.0 has been released, just in time for ApacheCon NA 2011! The 1.0 release of Tika removes all deprecated pre 1.0 API methods, makes Modified: tika/site/src/site/site.xml URL: http://svn.apache.org/viewvc/tika/site/src/site/site.xml?rev=1304707&r1=1304706&r2=1304707&view=diff ============================================================================== --- tika/site/src/site/site.xml (original) +++ tika/site/src/site/site.xml Sat Mar 24 05:30:14 2012 @@ -39,7 +39,15 @@ - + + + + + + + + +