hbase-user mailing list archives

From Tim Robertson <timrobertson...@gmail.com>
Subject Re: HBase Performance Improvements?
Date Wed, 09 May 2012 15:23:06 GMT
Hey Something,

We can share everything - even our Ganglia is public [1].  We are just
setting up a new cluster with Puppet, and the HBase master just came up.
The HBase region servers will probably be up tomorrow, and their first task
will be a bulk load of 400M records - we're just finishing our working day
here...  We have some blog posts [2] on our experiences tuning HBase, and
we will re-benchmark it all on the new hardware.  Since Oliver hasn't
committed the code, I can't share it right now, but you can read about the
process and our performance on the wiki [3].  Note that all our
benchmarking was done on the old hardware, which was pretty poor, so we're
excited to see how the new hardware performs.

We use Puppet for all our config, and the full CDH3 Hadoop installation
Puppet scripts will be available shortly [4], which might be of interest
too.  The CDH4 ones will probably follow shortly after CDH4 proper comes
out.

Cheers,
Tim

[1] http://dev.gbif.org/ganglia/?c=hadoop-3&m=load_one&r=hour&s=by%20name&hc=4&mc=2
[2] http://gbif.blogspot.com/2012/02/performance-evaluation-of-hbase.html
[3] http://dev.gbif.org/wiki/display/DEV/Populating+HBase+occurrences+from+MySQL
[4] https://github.com/lfrancke/gbif-puppet

On Wed, May 9, 2012 at 5:08 PM, Something Something <mailinglists19@gmail.com> wrote:

> Hey Oliver,
>
> Thanks a "billion" for the response -:)  I will take any code you can
> provide even if it's a hack!  I will even send you an Amazon gift card -
> not that you care or need it -:)
>
> Can you share some performance statistics?  Thanks again.
>
>
> On Wed, May 9, 2012 at 8:02 AM, Oliver Meyn (GBIF) <omeyn@gbif.org> wrote:
>
> > Heya Something,
> >
> > I had a similar task recently, and by far the best way to go about this
> > is bulk loading after pre-splitting your target table.  As you know,
> > ImportTsv doesn't understand Avro files, so I hacked together my own
> > ImportAvro class to create the HFiles that I eventually moved into HBase
> > with completebulkload.  I haven't committed my class anywhere because
> > it's a pretty ugly hack, but I'm happy to share it with you as a starting
> > point.  Doing billions of puts will just drive you crazy.
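> >
> > For reference, a minimal sketch of that approach against the 0.90.x API
> > (the driver and mapper names below are hypothetical placeholders, not the
> > actual ImportAvro code):
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.hbase.HBaseConfiguration;
> > import org.apache.hadoop.hbase.client.HTable;
> > import org.apache.hadoop.hbase.client.Put;
> > import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> > import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
> > import org.apache.hadoop.mapreduce.Job;
> > import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> > import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> >
> > public class ImportAvroDriver {
> >     public static void main(String[] args) throws Exception {
> >         Configuration conf = HBaseConfiguration.create();
> >         Job job = new Job(conf, "ImportAvro");
> >         job.setJarByClass(ImportAvroDriver.class);
> >         // Hypothetical mapper: emits one (row key, Put) pair per Avro
> >         // record.  The input format for reading Avro is omitted here -
> >         // wiring Avro input into the mapreduce API is the ugly-hack part.
> >         job.setMapperClass(AvroToPutMapper.class);
> >         job.setMapOutputKeyClass(ImmutableBytesWritable.class);
> >         job.setMapOutputValueClass(Put.class);
> >         FileInputFormat.addInputPath(job, new Path(args[0]));
> >         FileOutputFormat.setOutputPath(job, new Path(args[1]));
> >
> >         // Sets the output format, the TotalOrderPartitioner and the
> >         // PutSortReducer, and creates one reducer per region of the
> >         // target table, so each reducer writes HFiles for one region.
> >         HTable table = new HTable(conf, args[2]);
> >         HFileOutputFormat.configureIncrementalLoad(job, table);
> >         System.exit(job.waitForCompletion(true) ? 0 : 1);
> >     }
> > }
> >
> > followed by moving the generated HFiles into the table:
> >
> > hadoop jar hbase-<version>.jar completebulkload <hfile-dir> <table-name>
> >
> > The pre-splitting matters because configureIncrementalLoad creates one
> > reducer per region, so loading into an unsplit table funnels everything
> > through a single reducer.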
> >
> > Cheers,
> > Oliver
> >
> > On 2012-05-09, at 4:51 PM, Something Something wrote:
> >
> > > I ran the following MR job that reads Avro files & puts them into
> > > HBase.  The files have tons of data (billions of records).  We have a
> > > fairly decent sized cluster.  When I ran this MR job, it brought down
> > > HBase.  When I commented out the Puts to HBase, the job completed in
> > > 45 seconds (yes, that's seconds).
> > >
> > > Obviously, my HBase configuration is not ideal.  I am using all the
> > > default HBase configurations that ship with Cloudera's distribution:
> > > 0.90.4+49.
> > >
> > > I am planning to read up on the following two:
> > >
> > > http://hbase.apache.org/book/important_configurations.html
> > > http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/
> > >
> > > But can someone quickly take a look and recommend a list of priorities,
> > > such as "try this first..."?  That would be greatly appreciated.  As
> > > always, thanks for the time.
> > >
> > >
> > > Here's the Mapper (there's no reducer):
> > >
> > > import java.io.IOException;
> > >
> > > import org.apache.avro.generic.GenericData;
> > > import org.apache.avro.mapred.AvroCollector;
> > > import org.apache.avro.mapred.AvroMapper;
> > > import org.apache.hadoop.hbase.client.HTable;
> > > import org.apache.hadoop.hbase.client.Put;
> > > import org.apache.hadoop.hbase.util.Bytes;
> > > import org.apache.hadoop.io.NullWritable;
> > > import org.apache.hadoop.mapred.JobConf;
> > > import org.apache.hadoop.mapred.Reporter;
> > > import org.slf4j.Logger;
> > > import org.slf4j.LoggerFactory;
> > >
> > > public class AvroProfileMapper extends AvroMapper<GenericData.Record,
> > >         NullWritable> {
> > >     private static final Logger logger =
> > >             LoggerFactory.getLogger(AvroProfileMapper.class);
> > >
> > >     private static final String SEPARATOR = "*";
> > >
> > >     private HTable table;
> > >
> > >     private String datasetDate;
> > >     private String tableName;
> > >
> > >     @Override
> > >     public void configure(JobConf jobConf) {
> > >         super.configure(jobConf);
> > >         datasetDate = jobConf.get("datasetDate");
> > >         tableName = jobConf.get("tableName");
> > >
> > >         // Open the table for writing, buffering puts client-side in a
> > >         // 12 MB write buffer instead of flushing on every put.
> > >         try {
> > >             table = new HTable(jobConf, tableName);
> > >             table.setAutoFlush(false);
> > >             table.setWriteBufferSize(1024 * 1024 * 12);
> > >         } catch (IOException e) {
> > >             throw new RuntimeException("Failed table construction", e);
> > >         }
> > >     }
> > >
> > >     @Override
> > >     public void map(GenericData.Record record,
> > >                     AvroCollector<NullWritable> collector,
> > >                     Reporter reporter) throws IOException {
> > >
> > >         String u1 = record.get("u1").toString();
> > >
> > >         GenericData.Array<GenericData.Record> fields =
> > >                 (GenericData.Array<GenericData.Record>) record.get("bag");
> > >         for (GenericData.Record rec : fields) {
> > >             Integer s1 = (Integer) rec.get("s1");
> > >             Integer n1 = (Integer) rec.get("n1");
> > >             Integer c1 = (Integer) rec.get("c1");
> > >             Integer freq = (Integer) rec.get("freq");
> > >             if (freq == null) {
> > >                 freq = 0;
> > >             }
> > >
> > >             String key = u1 + SEPARATOR + n1 + SEPARATOR + c1
> > >                     + SEPARATOR + s1;
> > >             Put put = new Put(Bytes.toBytes(key));
> > >             // Skip the WAL for speed; unflushed data is lost if a
> > >             // region server dies.
> > >             put.setWriteToWAL(false);
> > >             put.add(Bytes.toBytes("info"), Bytes.toBytes("frequency"),
> > >                     Bytes.toBytes(freq.toString()));
> > >             try {
> > >                 table.put(put);
> > >             } catch (IOException e) {
> > >                 throw new RuntimeException("Error while writing to "
> > >                         + tableName + " table.", e);
> > >             }
> > >         }
> > >         logger.info("------------  Finished processing user: " + u1);
> > >     }
> > >
> > >     @Override
> > >     public void close() throws IOException {
> > >         table.close();
> > >     }
> > > }
> >
> >
> > --
> > Oliver Meyn
> > Software Developer
> > Global Biodiversity Information Facility (GBIF)
> > +45 35 32 15 12
> > http://www.gbif.org
> >
> >
>
