hadoop-common-user mailing list archives

From Andreas Kostyrka <andr...@kostyrka.org>
Subject Re: Stackoverflow
Date Tue, 03 Jun 2008 13:44:00 GMT
On Tuesday 03 June 2008 08:35:10 Chris Douglas wrote:
> > I have no Java implementation of my job, sorry.
> Since it's all in the map side, IdentityMapper/IdentityReducer is
> fine, as long as both the splits and the number of reduce tasks are
> the same.
> > The data is a representation for loglines, and not exactly small,
> > e.g. the
> > stuff has already been reduced once.
> By "not exactly small," do you mean each line is long or that there
> are many records?

Well, not small in the sense that even if I could get my boss to allow me to 
give you the data, transferring it might be painful. (E.g. the job that 
aborted had about 12M lines with ~2.6GB of data => the lines are not really 
long, but longer than 80 chars.)
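For a rough sense of scale, those figures work out to a bit over 200 bytes per line on average:

```python
# Back-of-the-envelope check of the figures above.
total_bytes = 2.6 * 10**9   # ~2.6 GB of map input
num_lines = 12 * 10**6      # ~12 M lines
avg_line = total_bytes / num_lines
print(f"~{avg_line:.0f} bytes per line")  # prints "~217 bytes per line"
```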

The expected result was that around 8-10M lines would be output by the 
reduce task. (The lines are of two different types: one type means that all 
key/values but the first one can be dropped, and the second one is the more 
classical type where all values need to be added up.)
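That two-type reduce rule could be sketched as a streaming-style Python reducer over sorted input. The record format (`key<TAB>type<TAB>value`) and the type tags are my own assumptions for illustration, not the actual job:

```python
import sys
from itertools import groupby

def reduce_records(lines):
    """Collapse sorted 'key<TAB>type<TAB>value' records: a hypothetical
    'first' type keeps only the first value per key, while 'sum' adds
    all values up, as described above."""
    out = []
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for (key, kind), group in groupby(parsed, key=lambda r: (r[0], r[1])):
        values = [int(r[2]) for r in group]
        if kind == "first":          # drop everything but the first value
            out.append((key, kind, values[0]))
        else:                        # "sum": classical additive reduce
            out.append((key, kind, sum(values)))
    return out

if __name__ == "__main__":
    for key, kind, val in reduce_records(sys.stdin):
        print(f"{key}\t{kind}\t{val}")
```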

Because the stuff has already been reduced in big chunks, I'd expect a 
~20% reduction. Still, that's useful, considering that each of these lines 
turns into at least one SQL statement after it leaves the hadoop cluster.

> > The interesting thing is that it happens inside the last Map task,
> > not in the
> > reducer tasks.
> > As you can see above the mapper cmd is rather on the simple side.
> util.QuickSort is only used on the map side, so this shouldn't have
> anything to do with the reduce. Is it always and only the *last* map

Nope, although sometimes it happens earlier.

> task that fails? If I sent you a patch that would print a trace with
> the partitions, would you mind running it? Do you have any other
> settings that differ from the defaults? -C

If you tell me how to apply it, I'm happy to. (I'm not the biggest Java 
hotshot on this planet; I'm just using the provided 0.17.0 jars. I guess I 
would have to patch the source and run ant. On all nodes or just the 
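One plausible workflow for this, sketched below with assumed paths and a hypothetical patch filename: apply the patch against the 0.17.0 source tree, rebuild the core jar with ant, and then push the rebuilt jar to every node, since the map tasks run cluster-wide.

```shell
# Sketch only: paths and the patch filename are hypothetical.
HADOOP_SRC="${HADOOP_SRC:-$HOME/hadoop-0.17.0}"    # assumed source checkout
PATCH_FILE="${PATCH_FILE:-HADOOP-XXXX.patch}"      # hypothetical patch file

if [ -d "$HADOOP_SRC" ] && [ -f "$PATCH_FILE" ]; then
    cd "$HADOOP_SRC"
    patch -p0 < "$PATCH_FILE"   # Hadoop JIRA patches are rooted for -p0
    ant jar                     # rebuilds build/hadoop-*-core.jar
    # then copy the rebuilt jar to the same location on all nodes
else
    echo "set HADOOP_SRC and PATCH_FILE first" >&2
fi
```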

And no, it's mostly untuned from the default hadoop config, paths and network 
addresses being configured, everything left as is.

OTOH, I would have to try to get enough data into my work queue to have a big 
enough chunk to reproduce it, I guess. Then again, it's not that bad: I still 
have over 1TB of logfiles for May to process, so I would just need to take 
the brakes off hadoop to produce the data needed.



My hadoop-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
  <name>fs.default.name</name>
  <value><!-- site-specific value elided --></value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value><!-- site-specific value elided --></value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value><!-- site-specific value elided --></value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>dfs.replication</name>
  <value><!-- site-specific value elided --></value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.</description>
</property>

</configuration>
