From: Apache Wiki
To: Apache Wiki
Reply-To: solr-dev@lucene.apache.org
Date: Thu, 03 Mar 2011 16:21:36 -0000
Message-ID: <20110303162136.1799.95094@eosnew.apache.org>
Subject: [Solr Wiki] Trivial Update of "ExtractingRequestHandler" by EricPugh

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "ExtractingRequestHandler" page has been changed by EricPugh.
The comment on this change is: fix urls to tika project now it's out of incubation.
Don't deep link to formats page since it is version dependent and tika versions change.

http://wiki.apache.org/solr/ExtractingRequestHandler?action=diff&rev1=66&rev2=67

--------------------------------------------------

  = Introduction =
  [[Solr1.4]]

- A common need of users is the ability to ingest binary and/or structured documents such as Office, Word, PDF and other proprietary formats. The [[http://incubator.apache.org/tika/|Apache Tika]] project provides a framework for wrapping many different file format parsers, such as PDFBox, POI and others.
+ A common need of users is the ability to ingest binary and/or structured documents such as Office, Word, PDF and other proprietary formats. The [[http://tika.apache.org/|Apache Tika]] project provides a framework for wrapping many different file format parsers, such as PDFBox, POI and others.

  Solr's !ExtractingRequestHandler uses Tika to allow users to upload binary files to Solr and have Solr extract text from it and then index it.

@@ -17, +17 @@

   * Tika will automatically attempt to determine the input document type (word, pdf, etc.) and extract the content appropriately. If you want, you can explicitly specify a MIME type for Tika with the stream.type parameter
   * Tika does everything by producing an XHTML stream that it feeds to a SAX !ContentHandler.
   * Solr then reacts to Tika's SAX events and creates the fields to index.
-  * Tika produces Metadata information such as Title, Subject, and Author, according to specifications like !DublinCore. See http://lucene.apache.org/tika/formats.html for the file types supported.
+  * Tika produces Metadata information such as Title, Subject, and Author, according to specifications like !DublinCore. See http://tika.apache.org/ site for the file types supported.
   * All of the extracted text is added to the "content" field
   * We can map Tika's metadata fields to Solr fields.
     We can boost these fields
   * We can also pass in literals for field values.
@@ -224, +224 @@

   * Commit

  = Additional Resources =
-  * [[http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source|Lucid Imagination article]] * [[http://tika.apache.org/0.7/formats.html|Supported document formats via Tika (0.7)]]
+  * [[http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source|Lucid Imagination article]] * [[http://tika.apache.org/0.9/formats.html|Supported document formats via Tika (0.9)]]

  = What's in a Name =
  Grant was writing the javadocs for the code and needed an entry for the <title> tag and wrote out "Solr Content Extraction Library", since the contrib directory is named "extraction". This then led to an "acronym": Solr CEL which then gets mashed to: Solr Cell. Hence, the project name is "Solr Cell". It's also appropriate because a Solar Cell's job is to convert the raw energy of the Sun to electricity, and this contrib's module is responsible for converting the "raw" content of a document to something usable by Solr. http://en.wikipedia.org/wiki/Solar_cell