incubator-blur-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Blur Wiki] Update of "MapReduce" by AaronMcCurry
Date Fri, 14 Jun 2013 00:37:02 GMT

The "MapReduce" page has been changed by AaronMcCurry:
https://wiki.apache.org/blur/MapReduce

New page:
= MapReduce Indexing =

The following example shows typical usage of !BlurOutputFormat. The Blur table must be created before the MapReduce job is started. The setupJob method configures the following:

 * The reducer class to be !DefaultBlurReducer
 * The number of reducers to be equal to the number of shards in the table
 * The output key class to be the standard Text writable from the Hadoop library
 * The output value class to be the !BlurMutate writable from the Blur library
 * The output format to be !BlurOutputFormat
 * The !TableDescriptor, which is stored in the Configuration
 * The output path, which is set to the !TableDescriptor.getTableUri() value
 * The output committer: the job uses the !BlurOutputCommitter class to commit or roll back the MapReduce job

== Example Usage ==

{{{
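// tableName, jobConf, and input are assumed to be defined by the caller.
Iface client = BlurClient.getClient("controller1:40010");

// The table must already exist; fetch its descriptor from the controller.
TableDescriptor tableDescriptor = client.describe(tableName);

Job job = new Job(jobConf, "blur index");
job.setJarByClass(BlurOutputFormatTest.class);
job.setMapperClass(CsvBlurMapper.class);
job.setInputFormatClass(TextInputFormat.class);

FileInputFormat.addInputPath(job, new Path(input));
// Declare column family "cf1" with column "col" for the CSV mapper.
CsvBlurMapper.addColumns(job, "cf1", "col");

// Configure the reducer, output classes, output format, and output path.
BlurOutputFormat.setupJob(job, tableDescriptor);
BlurOutputFormat.setIndexLocally(job, true);
BlurOutputFormat.setOptimizeInFlight(job, false);

job.waitForCompletion(true);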
}}}

== Options ==

 * !BlurOutputFormat.setIndexLocally(Job,boolean)
  * Enabled by default. Indexing is performed locally on the machine where the task is running; when the !RecordWriter closes, the index is copied to the remote destination in HDFS.
 * !BlurOutputFormat.setMaxDocumentBufferSize(Job,int)
  * Sets the maximum number of documents that the buffer holds in memory before overflowing to disk. The default is 1000, which is probably too low for most systems.
 * !BlurOutputFormat.setOptimizeInFlight(Job,boolean)
  * Enabled by default. The index is optimized while it is copied from the local index to the remote destination in HDFS. Used in conjunction with setIndexLocally.
 * !BlurOutputFormat.setReducerMultiplier(Job,int)
  * Multiplies the number of reducers for this job. For example, if the table has 256 shards, the normal number of reducers is 256. If the reducer multiplier is set to 4, the number of reducers is 1024 and each shard gets 4 new segments instead of the normal 1.
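
The reducer multiplier arithmetic described above can be sketched in plain Java. This is only an illustration: the class and method names (ReducerMultiplierExample, totalReducers, newSegmentsPerShard) are hypothetical and are not part of the Blur API. The sketch assumes the reducer count is simply the shard count times the multiplier, as the 256-shard example suggests.

```java
final class ReducerMultiplierExample {

    // Hypothetical helper (not a Blur method): total number of reducers for a
    // table with shardCount shards and the given reducer multiplier.
    public static int totalReducers(int shardCount, int multiplier) {
        if (shardCount < 1 || multiplier < 1) {
            throw new IllegalArgumentException("shardCount and multiplier must be >= 1");
        }
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(shardCount, multiplier);
    }

    // Hypothetical helper: every reducer writes one new segment, so each shard
    // receives as many new segments as the multiplier.
    public static int newSegmentsPerShard(int multiplier) {
        if (multiplier < 1) {
            throw new IllegalArgumentException("multiplier must be >= 1");
        }
        return multiplier;
    }

    public static void main(String[] args) {
        int shards = 256;
        System.out.println(totalReducers(shards, 1));   // prints 256
        System.out.println(totalReducers(shards, 4));   // prints 1024
        System.out.println(newSegmentsPerShard(4));     // prints 4
    }
}
```

A larger multiplier spreads the work over more reducers but leaves more segments in each shard, so the value is a trade-off rather than something to maximize.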
