hbase-user mailing list archives

From Peter Haidinyak <phaidin...@local.com>
Subject RE: Results from a Map/Reduce
Date Fri, 17 Dec 2010 22:57:39 GMT
Our current process returns aggregated data from a SQL Server database in less
than 5 seconds. The idea was to use HBase/Hadoop and the real logs to return data that the
user could 'filter', keeping the same response time but gaining more functionality. After playing
with one day's data (100 million rows) it became apparent that it's going to be hard to 'query'
several days' data and do it in under five seconds. But then again, if it was easy the guy who
I inherited this project from would have finished it :-)

On a side note, it looks like my IOException was because I was running the client from a Windows
box. 

-Pete

-----Original Message-----
From: Jonathan Gray [mailto:jgray@fb.com] 
Sent: Friday, December 17, 2010 2:25 PM
To: user@hbase.apache.org
Subject: RE: Results from a Map/Reduce

If you do aggregation, your queries will most likely be well under a second.  The aggregates
should reduce the amount of data that needs to be read by several orders of magnitude, no?
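
[Editor's note: a quick back-of-the-envelope check of the "orders of magnitude" claim, in plain Java. The ~100 million rows/day figure comes from the thread; the customer count is an assumption for illustration only.]

```java
public class RollupSavings {
    public static void main(String[] args) {
        // ~100 million raw log rows per day is from the thread;
        // 10,000 customers is an assumed number for illustration.
        long rawRowsPerDay = 100_000_000L;
        long customers = 10_000L;
        long bucketsPerDay = customers * 24; // one roll-up cell per customer-hour
        System.out.println(bucketsPerDay);                 // 240000
        System.out.println(rawRowsPerDay / bucketsPerDay); // 416 -> ~400x fewer rows to read
    }
}
```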

> -----Original Message-----
> From: Peter Haidinyak [mailto:phaidinyak@local.com]
> Sent: Friday, December 17, 2010 1:43 PM
> To: user@hbase.apache.org
> Subject: RE: Results from a Map/Reduce
> 
> So the idea is to aggregate the final result into an HBase table and then
> query that table from the client. I'm going to have to find a quicker method.
> Currently, on my small three-node cluster with 100 million rows, it takes a
> couple of minutes to do a scan that brings back several million rows. My boss
> wants the query to be in the 'less than five second' range.
> 
> Thanks
> 
> -Pete
> 
> -----Original Message-----
> From: Jonathan Gray [mailto:jgray@fb.com]
> Sent: Friday, December 17, 2010 1:19 PM
> To: user@hbase.apache.org
> Subject: RE: Results from a Map/Reduce
> 
> If there's a customer waiting for the query, then you wouldn't want to have
> them wait for an MR job.
> 
> So what you're saying is you want to change this from on-demand scans to
> using MapReduce to aggregate roll-ups ahead of time and serve those?
> 
> In that case, your MR job doesn't need one final output, right?  You could do
> the Map over the entire table (or start/stop rows depending on schema) with
> the appropriate filters.  You would output (customerid + hour bucket) as
> the key and 1 as the value.  You'd get a reduce call for each customerid/hour
> bucket and would write that to HBase.
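
[Editor's note: a rough sketch of the roll-up Jonathan describes, with the map step (emit (customerid + hour bucket) -> 1) and the reduce step (sum per key) simulated in memory rather than in Hadoop. The `Hit` record and key format are illustrative, not from the thread; in the real job, the reducer would write each total to an HBase table.]

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RollupSketch {
    // Hypothetical log record: (customerId, epochMillis). Names are illustrative.
    record Hit(String customerId, long epochMillis) {}

    // Map step: for each log row, emit key = customerId + hour bucket, value = 1.
    // Reduce step: sum the 1s per key. Both are folded together in memory here.
    static Map<String, Long> rollUp(List<Hit> hits) {
        Map<String, Long> counts = new TreeMap<>();
        for (Hit h : hits) {
            long hourBucket = h.epochMillis() / 3_600_000L; // hours since epoch
            String key = h.customerId() + "-" + hourBucket;
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
            new Hit("custA", 0L),          // hour bucket 0
            new Hit("custA", 1_000L),      // still bucket 0
            new Hit("custA", 3_600_000L),  // bucket 1
            new Hit("custB", 0L));         // bucket 0
        System.out.println(rollUp(hits)); // {custA-0=2, custA-1=1, custB-0=1}
    }
}
```

Once the roll-up table exists, the web page's query becomes a short Scan over at most 24 cells per customer per day instead of millions of raw rows.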
> 
> One of the ideas behind coprocessors is you could do the per-customer
> scan/filter/aggregate as a parallel operation inside the RSs (without the
> overhead of MR or cross-JVM) and might be able to increase the number of
> rows you can process within a reasonable amount of time.
> 
> Another approach to these kinds of aggregates, if you care about realtime at
> some level, is to use HBase's increment capabilities and a similar hour-
> bucketed schema but updated on demand instead of in batch.
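
[Editor's note: a minimal sketch of the on-demand increment approach, with an in-memory map standing in for the HBase table. The "customerId-hoursSinceEpoch" row-key design is one illustrative choice, not a schema from the thread; the commented call shows roughly where HBase's atomic increment would go.]

```java
import java.util.HashMap;
import java.util.Map;

public class RealtimeCounters {
    private final Map<String, Long> table = new HashMap<>(); // stands in for an HBase table

    // Build the hour-bucketed row key: "customerId-hoursSinceEpoch" (illustrative).
    static String rowKey(String customerId, long epochMillis) {
        return customerId + "-" + (epochMillis / 3_600_000L);
    }

    // On each hit, bump the counter for that customer/hour on demand.
    // Against HBase this would be a single atomic increment, roughly:
    //   table.incrementColumnValue(Bytes.toBytes(rowKey), family, qualifier, 1L);
    // Returns the new count, as HBase's increment does.
    long recordHit(String customerId, long epochMillis) {
        return table.merge(rowKey(customerId, epochMillis), 1L, Long::sum);
    }
}
```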
> 
> Yeah, this is a "basic" operation but that only means there are 100 ways to
> implement it :)
> 
> JG
> 
> > -----Original Message-----
> > From: Peter Haidinyak [mailto:phaidinyak@local.com]
> > Sent: Friday, December 17, 2010 12:13 PM
> > To: user@hbase.apache.org
> > Subject: RE: Results from a Map/Reduce
> >
> > What I have is basically a query on a log table to return the number
> > of hits per hour for customer X for Y days, with the ability to
> > filter on columns; these results are to be displayed in a web page on demand.
> > Currently, using a Scan, with a popular customer I can get back
> > millions of rows to aggregate into 'Hits per hour' buckets. I wanted
> > to push the aggregation back to a Map/Reduce and then have those
> > results available to send back as a web page.
> > This seems like such a basic operation that I am hoping there are
> > 'Best Practices' or examples on how to accomplish this. I would also like a
> pony too.
> > :-)
> >
> > Thanks
> >
> > -Pete
> >
> > -----Original Message-----
> > From: Jonathan Gray [mailto:jgray@fb.com]
> > Sent: Friday, December 17, 2010 12:01 PM
> > To: user@hbase.apache.org
> > Subject: RE: Results from a Map/Reduce
> >
> > There's not much in the way of examples for coprocessors besides the
> > implementation of Security.  Check out HBASE-2000 and go from there.
> > If you're fairly new to HBase, then wait a couple months and there
> > should be much better support around Coprocessors.
> >
> > I'm unsure of a way to have a final result returned back to the main()
> > method.  What exactly are you trying to do with this result?
> > Available to you to do what with it?  Could the MR job put the result
> > back into HBase or could your reducer contain the logic you need to use
> with the final result?
> >
> > > -----Original Message-----
> > > From: Peter Haidinyak [mailto:phaidinyak@local.com]
> > > Sent: Friday, December 17, 2010 11:56 AM
> > > To: user@hbase.apache.org
> > > Subject: RE: Results from a Map/Reduce
> > >
> > > Does that mean that when job.waitForCompletion(true) returns, I have
> > > the results from the Reducer(s) available to me? I haven't seen much
> > > on coprocessors; can you point me to some examples of their use?
> > >
> > > Thanks
> > > -Pete
> > >
> > > -----Original Message-----
> > > From: Jonathan Gray [mailto:jgray@fb.com]
> > > Sent: Friday, December 17, 2010 11:13 AM
> > > To: user@hbase.apache.org
> > > Subject: RE: Results from a Map/Reduce
> > >
> > > Hey Peter,
> > >
> > > That System.exit line is nothing important, just the main thread
> > > waiting for the tasks to finish before closing.
> > >
> > > You're interested in having the MR job return a single result?  To
> > > do that, you would need to roll-up the processing done in each of
> > > your Map tasks into a single Reduce task.  With one reducer, you can
> > > have a single point to do the final aggregation of the result.
> > >
> > > I'm not sure exactly what kind of aggregation you are doing but
> > > funneling into a single reducer can range from no problem to don't
> > > even try it.  Sounds like you just want a final number or something, so it
> > > shouldn't be an issue.
> > >
> > > You might also consider doing your aggregations with coprocessors if
> > > you're into experimenting on HBase Trunk :)
> > >
> > > As for FirstKeyOnlyFilter:
> > >
> > > /**
> > >  * A filter that will only return the first KV from each row.
> > >  * <p>
> > >  * This filter can be used to more efficiently perform row count operations.
> > >  */
> > >
> > > That's what it does.  If you scan a table, regardless of what you
> > > ask for in the query, the filter will just return whatever the first
> > > KeyValue is on each row and will skip every other
> > > column/version/value of
> > that row except the first.
> > >
> > > Like it says, it's generally useful for doing row counting, but that's
> > > about it.
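
[Editor's note: the filter's effect can be mimicked in plain Java over a scan's (row, column)-sorted stream of KeyValues, which makes the semantics concrete. The `KV` record is a minimal stand-in, not HBase's KeyValue class.]

```java
import java.util.ArrayList;
import java.util.List;

public class FirstKeyOnlySketch {
    // Minimal stand-in for a KeyValue: just (row, column). Illustrative only.
    record KV(String row, String column) {}

    // Mimics FirstKeyOnlyFilter: keep only the first KeyValue of each row,
    // skipping every other column/version/value of that row.
    static List<KV> firstKeyOnly(List<KV> sortedKvs) {
        List<KV> out = new ArrayList<>();
        String lastRow = null;
        for (KV kv : sortedKvs) {
            if (!kv.row().equals(lastRow)) {
                out.add(kv);       // first KV seen for this row
                lastRow = kv.row();
            }
        }
        return out;
    }
}
```

For row counting, the win is that the scan never has to ship a row's remaining columns to the client just to establish that the row exists.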
> > >
> > > JG
> > >
> > > > -----Original Message-----
> > > > From: Peter Haidinyak [mailto:phaidinyak@local.com]
> > > > Sent: Friday, December 17, 2010 10:56 AM
> > > > To: user@hbase.apache.org
> > > > Subject: Results from a Map/Reduce
> > > >
> > > > Hi, dumb question again.
> > > >   I have been using a Scan to return a result back to my client
> > > > which works fine except when I am returning a million rows just to
> > > > aggregate the
> > > results.
> > > > The next logical step would be to do the aggregation in a Map/Reduce.
> > > > I've been looking at what samples I could find and they all seem to do
> > > > this...
> > > >
> > > >     System.exit(job.waitForCompletion(true) ? 0 : 1);
> > > >
> > > > My question: is there a way to return a result from the job, similar
> > > > to getting a ResultScanner back and iterating through the results?
> > > >
> > > > Also, is there a good definition of what a 'FirstKeyOnlyFilter' does?
> > > >
> > > > Thanks
> > > >
> > > > -Pete
