hive-user mailing list archives

From Carl Steinbach <>
Subject Re: How long should this take?
Date Tue, 17 Nov 2009 21:58:46 GMT
Hi Andrew,

The map phase may be taking this long because only a single map task is
running on your system. If that is the case, it is probably because the
apachelog table is an external table that references a non-splittable file,
for example a gzipped logfile, which Hadoop has to hand to one mapper in its
entirety. I recommend checking the tasktracker to see how many map tasks are
running. If you see more than one, you can rule this problem out.
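
As a rough sketch of one way to check the gzip case directly: a gzip stream
always begins with the two bytes 0x1f 0x8b, so you can look at the start of
the file behind the table. This assumes the logfile is readable from the
local filesystem; the helper name is_gzipped and the demo file names are
hypothetical, not anything from Hive or Hadoop.

```python
import gzip
import os
import tempfile

# Every gzip stream starts with these two magic bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file at `path` begins with the gzip magic bytes.

    Hypothetical helper for illustration; it inspects only the first two
    bytes and does not validate the rest of the stream.
    """
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


if __name__ == "__main__":
    # Demo on two throwaway files (names are made up for this sketch).
    workdir = tempfile.mkdtemp()
    plain_path = os.path.join(workdir, "access.log")
    gz_path = os.path.join(workdir, "access.log.gz")

    line = b'127.0.0.1 - - [17/Nov/2009] "GET /products/42 HTTP/1.1" 200 512\n'
    with open(plain_path, "wb") as f:
        f.write(line)
    with gzip.open(gz_path, "wb") as f:
        f.write(line)

    print(plain_path, is_gzipped(plain_path))
    print(gz_path, is_gzipped(gz_path))
```

If the check comes back true for your logfile, decompressing it before
pointing the table at it should let the input be split across map tasks.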



On Tue, Nov 17, 2009 at 11:24 AM, Andrew O'Brien <> wrote:

> Hi everyone,
> So I'm evaluating Hive for an Apache access log processing job (who
> isn't? ;) and for testing I've got a logfile that's about 1 million
> lines/245MB.  I've loaded it into a table and now I want to extract
> out some ids from the request urls and filter out any requests without
> any ids.  Here's the query I'm running:
> CREATE TABLE access_with_company_and_product AS
> SELECT * FROM (
>  SELECT ipaddress, ident, user, finishtime,
>    request, returncode, size, referer, agent,
>  regexp_extract(request, '/products/(\\d+)', 1) AS product_id,
>  regexp_extract(request, '/companies/(\\d+)', 1) AS company_id
>  FROM apachelog
> ) hit WHERE hit.product_id IS NOT NULL OR hit.company_id IS NOT NULL;
> It's been going for about 3 hours now and says it's only 2% through
> the map.  So I'm wondering is this the normal rate or am I doing
> something particularly inefficient here?  Or have I missed a
> configuration setting?
> I'm on a 2.53 GHz Core 2 Duo MacBook Pro with 4GB RAM running the
> stock configuration (Hive trunk, I'm pretty sure).  At any one point,
> it appears that only 1 core is really running at full and I've had at
> least a couple hundred MB of memory free the whole time.
> Any advice would be very appreciated.
> –Andrew
