Mailing-List: contact solr-commits-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-dev@lucene.apache.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
From: Apache Wiki <wikidiffs@apache.org>
To: Apache Wiki <wikidiffs@apache.org>
Date: Thu, 03 Mar 2011 16:21:36 -0000
Message-ID: <20110303162136.1799.95094@eosnew.apache.org>
Subject: [Solr Wiki] Trivial Update of "ExtractingRequestHandler" by EricPugh

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "ExtractingRequestHandler" page has been changed by EricPugh.
The comment on this change is: fix urls to tika project now it's out of incubation.  Don't deep link to formats page since it is version dependent and tika versions change..
http://wiki.apache.org/solr/ExtractingRequestHandler?action=diff&rev1=66&rev2=67

--------------------------------------------------

  = Introduction =
  <!> [[Solr1.4]]

- A common need of users is the ability to ingest binary and/or structured documents such as Office, Word, PDF and other proprietary formats.  The [[http://incubator.apache.org/tika/|Apache Tika]] project provides a framework for wrapping many different file format parsers, such as PDFBox, POI and others.
+ A common need of users is the ability to ingest binary and/or structured documents such as Office, Word, PDF and other proprietary formats.  The [[http://tika.apache.org/|Apache Tika]] project provides a framework for wrapping many different file format parsers, such as PDFBox, POI and others.

  Solr's !ExtractingRequestHandler uses Tika to allow users to upload binary files to Solr and have Solr extract text from them and then index it.

@@ -17, +17 @@

   * Tika will automatically attempt to determine the input document type (word, pdf, etc.) and extract the content appropriately. If you want, you can explicitly specify a MIME type for Tika with the stream.type parameter
   * Tika does everything by producing an XHTML stream that it feeds to a SAX !ContentHandler.
   * Solr then reacts to Tika's SAX events and creates the fields to index.
-  * Tika produces Metadata information such as Title, Subject, and Author, according to specifications like !DublinCore.  See http://lucene.apache.org/tika/formats.html for the file types supported.
+  * Tika produces Metadata information such as Title, Subject, and Author, according to specifications like !DublinCore.  See http://tika.apache.org/ site for the file types supported.
   * All of the extracted text is added to the "content" field
   * We can map Tika's metadata fields to Solr fields.  We can boost these fields
   * We can also pass in literals for field values.
@@ -224, +224 @@

   * Commit

  = Additional Resources =
- * [[http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source|Lucid Imagination article]] * [[http://tika.apache.org/0.7/formats.html|Supported document formats via Tika (0.7)]]
+ * [[http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source|Lucid Imagination article]] * [[http://tika.apache.org/0.9/formats.html|Supported document formats via Tika (0.9)]]

  = What's in a Name =
  Grant was writing the javadocs for the code and needed an entry for the <title> tag and wrote out "Solr Content Extraction Library", since the contrib directory is named "extraction".  This then led to an "acronym":  Solr CEL, which then gets mashed to: Solr Cell.  Hence, the project name is "Solr Cell".  It's also appropriate because a Solar Cell's job is to convert the raw energy of the Sun to electricity, and this contrib module is responsible for converting the "raw" content of a document to something usable by Solr. http://en.wikipedia.org/wiki/Solar_cell