lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Trivial Update of "ExtractingRequestHandler" by EricPugh
Date Thu, 03 Mar 2011 16:21:36 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "ExtractingRequestHandler" page has been changed by EricPugh.
The comment on this change is: fix URLs to the Tika project now that it's out of incubation.  Don't
deep link to the formats page since it is version dependent and Tika versions change.
http://wiki.apache.org/solr/ExtractingRequestHandler?action=diff&rev1=66&rev2=67

--------------------------------------------------

  = Introduction =
  <!> [[Solr1.4]]
  
- A common need of users is the ability to ingest binary and/or structured documents such
as Office, Word, PDF and other proprietary formats.  The [[http://incubator.apache.org/tika/|Apache
Tika]] project provides a framework for wrapping many different file format parsers, such
as PDFBox, POI and others.
+ A common need of users is the ability to ingest binary and/or structured documents such
as Office, Word, PDF and other proprietary formats.  The [[http://tika.apache.org/|Apache
Tika]] project provides a framework for wrapping many different file format parsers, such
as PDFBox, POI and others.
  
  Solr's !ExtractingRequestHandler uses Tika to allow users to upload binary files to Solr
and have Solr extract text from them and then index it.
  
@@ -17, +17 @@

   * Tika will automatically attempt to determine the input document type (word, pdf, etc.)
and extract the content appropriately. If you want, you can explicitly specify a MIME type
for Tika with the stream.type parameter.
   * Tika does everything by producing an XHTML stream that it feeds to a SAX !ContentHandler.
   * Solr then reacts to Tika's SAX events and creates the fields to index.
-  * Tika produces Metadata information such as Title, Subject, and Author, according to specifications
like !DublinCore.  See http://lucene.apache.org/tika/formats.html for the file types supported.
+  * Tika produces Metadata information such as Title, Subject, and Author, according to specifications
like !DublinCore.  See http://tika.apache.org/ site for the file types supported.
   * All of the extracted text is added to the "content" field
   * We can map Tika's metadata fields to Solr fields.  We can boost these fields
   * We can also pass in literals for field values.
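
The flow in the list above (an XHTML stream drives SAX events, the events become fields, metadata names are mapped, and literals are added) can be sketched in a few lines. This is only an illustration under stated assumptions, not Solr's or Tika's actual code: it uses Python's standard xml.sax module in place of Tika, models metadata as <meta> elements (Tika hands metadata over through its own API), and the names FieldCollector, build_document, field_map and literals are hypothetical.

```python
import xml.sax


class FieldCollector(xml.sax.ContentHandler):
    """Hypothetical handler: reacts to SAX events from an XHTML stream."""

    def __init__(self):
        super().__init__()
        self.metadata = {}      # metadata name -> value, taken from <meta> tags
        self.text_parts = []    # raw text fragments seen inside <body>
        self._in_body = False

    def startElement(self, name, attrs):
        if name == "meta" and "name" in attrs and "content" in attrs:
            self.metadata[attrs["name"]] = attrs["content"]
        elif name == "body":
            self._in_body = True

    def endElement(self, name):
        if name == "body":
            self._in_body = False
        elif self._in_body:
            # Separate the text of adjacent elements with a space.
            self.text_parts.append(" ")

    def characters(self, content):
        if self._in_body:
            self.text_parts.append(content)


def build_document(xhtml, field_map=None, literals=None):
    """Turn an XHTML string into a dict of index fields (illustrative only)."""
    handler = FieldCollector()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)

    # All extracted text goes to the "content" field, whitespace-normalized.
    doc = {"content": " ".join("".join(handler.text_parts).split())}

    # Map metadata names to target field names; unmapped names pass through.
    mapping = field_map or {}
    for name, value in handler.metadata.items():
        doc[mapping.get(name, name)] = value

    # Literal values supplied by the caller are added last.
    doc.update(literals or {})
    return doc


SAMPLE = (
    '<html><head><title>Report</title>'
    '<meta name="Author" content="Jane Doe"/></head>'
    '<body><p>Hello</p><p>world</p></body></html>'
)

if __name__ == "__main__":
    print(build_document(SAMPLE, field_map={"Author": "author_s"},
                         literals={"id": "doc1"}))
```

Boosting is left out of the sketch on purpose; the point is only the order of operations: text to "content", then mapped metadata, then literals.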
@@ -224, +224 @@

   * Commit
  
  = Additional Resources =
- * [[http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source|Lucid Imagination article]]
- * [[http://tika.apache.org/0.7/formats.html|Supported document formats via Tika (0.7)]]
+ * [[http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source|Lucid Imagination article]]
+ * [[http://tika.apache.org/0.9/formats.html|Supported document formats via Tika (0.9)]]
  
  = What's in a Name =
  Grant was writing the javadocs for the code and needed an entry for the <title> tag
and wrote out "Solr Content Extraction Library", since the contrib directory is named "extraction".
 This then led to an "acronym":  Solr CEL which then gets mashed to: Solr Cell.  Hence, the
project name is "Solr Cell".  It's also appropriate because a Solar Cell's job is to convert
the raw energy of the Sun to electricity, and this contrib's module is responsible for converting
the "raw" content of a document to something usable by Solr. http://en.wikipedia.org/wiki/Solar_cell
