lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "HadoopIndexing" by JasonRutherglen
Date Tue, 19 Jan 2010 23:02:23 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "HadoopIndexing" page has been changed by JasonRutherglen.
http://wiki.apache.org/solr/HadoopIndexing

--------------------------------------------------

New page:
<<TableOfContents>>

= What is Solr + Hadoop Indexing? =

SOLR-1301 implements a contrib module that enables building Solr indexes with Hadoop.  This
approach is useful when indexing to N master servers simply takes too much time, which is
typically the case for systems with terabytes of indexes.  At that scale, even a simple schema
change would overwhelm the master servers, whereas building the shards in parallel (ideally
one Hadoop node per shard) reduces the overall process from days to minutes or hours.

= Architecture =

== CSVIndexer ==

CSVIndexer is provided as an example; for your own application you will need to write your
own CSVIndexer-like class.  CSVIndexer extends org.apache.hadoop.conf.Configured so that it
can be launched from the Hadoop command line.  Your CSVIndexer-like class sets your custom
mapper and sets the output format to SolrOutputFormat.  Use SolrDocumentConverter.setSolrDocumentConverter
to register your custom SolrDocumentConverter.
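A minimal driver sketch follows.  Only SolrOutputFormat and SolrDocumentConverter.setSolrDocumentConverter
come from the description above; MyCsvIndexer, MyCsvMapper, and MyDocumentConverter are
hypothetical placeholder names, the org.apache.solr.hadoop package is assumed, and the exact
SOLR-1301 method signatures may differ from what is shown here:

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.solr.hadoop.SolrDocumentConverter; // assumed package
import org.apache.solr.hadoop.SolrOutputFormat;      // assumed package

// Illustrative CSVIndexer-like driver; not the contrib's actual code.
public class MyCsvIndexer extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), MyCsvIndexer.class);
    conf.setJobName("csv-to-solr-shards");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(MyCsvMapper.class);       // your custom mapper
    conf.setOutputFormat(SolrOutputFormat.class); // writes Solr shards

    // Register the class that turns records into SolrInputDocuments.
    SolrDocumentConverter.setSolrDocumentConverter(MyDocumentConverter.class, conf);

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyCsvIndexer(), args));
  }
}
```

The Tool/ToolRunner pattern is what makes the class launchable via `hadoop jar`, with generic
options (e.g. `-D` properties) parsed into the Configured base class.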

== SolrDocumentConverter == 

Implement this class to convert an object from your own record format into a SolrInputDocument
that can be indexed into Solr.
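A converter for simple CSV lines might look like the sketch below.  The generic parameters,
the `convert` method signature, and the field names are assumptions for illustration; consult
the SOLR-1301 patch for the actual abstract method:

```java
import java.util.Collection;
import java.util.Collections;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.hadoop.SolrDocumentConverter; // assumed package

// Hypothetical converter: one "id,title" record per input line.
public class MyDocumentConverter extends SolrDocumentConverter<LongWritable, Text> {

  @Override
  public Collection<SolrInputDocument> convert(LongWritable key, Text value) {
    String[] fields = value.toString().split(",", 2);
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", fields[0]);
    doc.addField("title", fields.length > 1 ? fields[1] : "");
    // A single record may also expand into several documents.
    return Collections.singletonList(doc);
  }
}
```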

== Heartbeater == 

Hadoop will kill a running task if it does not receive a periodic heartbeat.  For this reason
the HeartBeater class runs in a background thread and continuously notifies Hadoop that the
current task is still executing.  This is necessary with Solr and Lucene because some operations,
such as optimizing a large shard, can take anywhere from several minutes to several hours.

== Number of Reduce Tasks ==

The number of reduce tasks is a positive integer that defines the number of shards created.
Choose it carefully: too few reduce tasks produce shards that are too large, while too many
produce shards that are too small.
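With the old mapred API, the shard count is simply the reduce-task count set on the JobConf
(the value 8 here is an arbitrary example):

```java
// One shard per reduce task: 8 reduce tasks yield 8 shards.
conf.setNumReduceTasks(8);
```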

== Zipped Shards ==

It is recommended that shards be stored in HDFS in zip format.  To turn this setting on,
use SolrOutputFormat.setOutputZipFormat on the Hadoop JobConf.
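Assuming the method takes a boolean flag and the job configuration (the exact signature in
the SOLR-1301 patch may differ), enabling it would look like:

```java
// Store each finished shard in HDFS as a single zip file.
SolrOutputFormat.setOutputZipFormat(true, conf);
```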

== Indexing ==

An EmbeddedSolrServer is created, pointing at a core on the local filesystem (not on HDFS,
which does not support the random-access writes Lucene needs when building an index).  Once
all Solr documents have been written to the embedded core, the index is optimized and then
zipped directly into HDFS.

= Solr + Hadoop Example =
