hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: Occasional regionserver crashes following socket errors writing to HDFS
Date Fri, 11 May 2012 03:34:03 GMT
On Thu, May 10, 2012 at 7:46 PM, Michael Segel
<michael_segel@hotmail.com> wrote:
> "Writing, it may make sense to avoid the reduce step and write yourself back into HBase
> from inside your map. You'd do this when your job does not need the sort and collation that
> mapreduce does on the map emitted data; on insert, HBase 'sorts' so there is no point double-sorting
> (and shuffling data around your mapreduce cluster) unless you need to. If you do not need
> the reduce, you might just have your map emit counts of records processed just so the framework's
> report at the end of your job has meaning or set the number of reduces to zero and use TableOutputFormat.
> See example code below. If running the reduce step makes sense in your case, it's usually better
> to have lots of reducers so load is spread across the HBase cluster."
> This isn't 100% true.
> I'd lose the quotes around 'sorts' because the data is sorted on key values. period.

Sounds good.

> I'd ask that you reconsider the following phrase...
> "You'd do this when your job does not need the sort and collation that mapreduce does
> on the map emitted data;"

What would you suggest instead?

> I realize I went to this little midwestern school (tOSU), where ENG meant you were in
> the college of engineering and not an English Major, so I'm not sure if I am parsing that
> statement correctly.


The above phrase is mine. I'm bad at writing, so I need help.

> If you refactor your M/R, HBase can be used for the 'collation'. (If you make your
> Mapper a null writable and manually write the output to HBase within Mapper.map(), you can
> write to N tables without a problem. So you can write the record out, update a table where
> you are keeping counters, stats, etc ...) So I am still at a loss to find an example of
> where you would need a reducer.

Can you make a patch?

I'm for making a stronger statement about reduce: that it's rarely, if
ever, needed. Let's get it in the doc.

> So one has to ask what would cause a write to be blocked?
> GC? Eran says he's already tuned it.
> MSLABS? Eran says that's covered.
> Table splits?
> Eran says that the table's region sizes are 256MB (default) and the other table is 512MB.
> If the table is constantly splitting, then you need to increase the region size. Again
> we don't have enough information to diagnose if this is the issue.
> We don't know things about his cluster like the number of nodes, how much memory on each
> node, as well as which version of HBase.
> I realize that these are all pretty basic issues, but sometimes it's the little things
> that will trip you up.

Above is generally good advice.
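
On the region-size point, a back-of-the-envelope sketch of why a larger
region size means fewer splits. The 100 GB table size is a made-up number
(we don't know Eran's data volume), and RegionCountSketch and estimateRegions
are hypothetical names. The result is a lower bound: freshly split regions
are smaller than the maximum, so a real table would carry more regions than
this.

```java
// Rough arithmetic only; the table size below is a hypothetical figure.
class RegionCountSketch {
    static final long MB = 1024L * 1024L;

    // Fewest regions that can hold tableBytes if no region exceeds maxRegionBytes.
    static long estimateRegions(long tableBytes, long maxRegionBytes) {
        return (tableBytes + maxRegionBytes - 1) / maxRegionBytes; // ceiling division
    }

    public static void main(String[] args) {
        long tableBytes = 100L * 1024L * MB; // assumed 100 GB table

        // Compare the two sizes mentioned in the thread with a larger one.
        for (long sizeMb : new long[] {256, 512, 1024}) {
            System.out.println(sizeMb + " MB regions: at least "
                    + estimateRegions(tableBytes, sizeMb * MB) + " regions");
        }
    }
}
```

Each doubling of the region size halves the number of regions the table has
to split its way up to, which is the reason to raise it when a table is
splitting constantly.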

Thanks Michael.

