crunch-user mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: part files in wordcount example
Date Thu, 19 Feb 2015 10:33:59 GMT
Hi Unmesha,

Answers inlined below:

On Thu, Feb 19, 2015 at 11:22 AM, unmesha sreeveni
<unmeshabiju@gmail.com> wrote:
>
> 1. Once I ran this on a 1.8 GB text file, I got 2 part files as
> output, so it means that this program ran with 2 reducers. Where is
> that specified? Or is it done automatically?

In this case, the number of reducers is determined by the size of your
input and the crunch.bytes.per.reduce.task configuration value, which
defaults to 1 GB. With a 1.8 GB input file and 1 GB of data per
reducer, two reducers are used.
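As a rough sketch of that sizing logic (the exact rounding Crunch applies internally is an assumption here, as is the class/method naming):

```java
public class ReducerEstimate {

    // Sketch: derive a reducer count from input size and a
    // bytes-per-reducer setting, analogous to how
    // crunch.bytes.per.reduce.task is used. Ceiling division and the
    // minimum of one reducer are assumptions for illustration.
    static int estimateReducers(long inputBytes, long bytesPerReducer) {
        return (int) Math.max(1, (inputBytes + bytesPerReducer - 1) / bytesPerReducer);
    }

    public static void main(String[] args) {
        long oneGb = 1L << 30;
        long inputSize = (long) (1.8 * oneGb); // the 1.8 GB input file
        System.out.println(estimateReducers(inputSize, oneGb)); // prints 2
    }
}
```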

The Aggregate.count method is also overloaded to allow specifying the
number of partitions (or reducers) to be used without relying on data
size calculations.

> 2. DoFn() is similar to a mapper, reducer, or combiner. In the mapper
> we are only emitting the word, but in MapReduce we emit (word, 1). How
> is this aggregation done?

The underlying function of this aggregation (implemented via
Aggregate.count) is the same -- it outputs (word, 1) and then uses a
combiner and a reducer to arrive at a single count per word.
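In plain Java (no Crunch API, just an illustration of what the combine step does), the (word, 1) emission and summing look roughly like this:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordCountSketch {

    // Sketch of the aggregation behind Aggregate.count: conceptually the
    // map side emits a (word, 1) pair per word, and the combiner/reducer
    // sum those 1s so each word ends up with a single total.
    static Map<String, Long> count(String[] words) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String word : words) {
            counts.merge(word, 1L, Long::sum); // combine: sum the emitted 1s
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = {"hello", "world", "hello"};
        System.out.println(count(words)); // prints {hello=2, world=1}
    }
}
```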

> 3. Where can I find good tutorials?

There is an initial "Getting started" page at
https://crunch.apache.org/getting-started.html -- that's probably the
best place to start. There is also an in-depth user guide at
https://crunch.apache.org/user-guide.html.

- Gabriel
