Date: Wed, 25 Apr 2012 20:59:24 -0400 (EDT)
From: Mark Grover <mgrover@oanda.com>
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

I am hoping that other people who have used Map Join can pitch in here...

When the smaller table gets loaded into the mapper's memory, the data is loaded in its uncompressed form, right? If so, is it possible in this case that the compressed size of the smaller table is less than the memory available but the uncompressed size isn't?

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com www: fxtrade.com
e: mgrover@oanda.com

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.

----- Original Message -----
From: "Ruben de Vries"
To: user@hive.apache.org
Sent: Wednesday, April 25, 2012 3:48:46 AM
Subject: RE: When/how to use partitions and buckets usefully?

I already tried running with that set to 400 MB, but it didn't work. That setting is only used when Hive is trying to automatically figure out whether it should do a map join, while I'm forcing it to do a map join with a hint.

From: gemini alex [mailto:gemini5201314@gmail.com]
Sent: Wednesday, April 25, 2012 9:40 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

This should be documented in the wiki on LanguageManual+Joins.
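Mark's question about compressed versus uncompressed size is easy to reproduce in miniature: repetitive row data like members_map compresses very well, so a file that fits a memory budget on disk can exceed it once decompressed. A small sketch, using a hypothetical row shape loosely modelled on the members_map table (not the actual data):

```python
import gzip

# Hypothetical rows shaped like members_map: member_id, gender, birthday,
# tab-separated text. Highly repetitive, so gzip shrinks it a lot.
rows = "\n".join(f"{i}\t{i % 3 + 1}\t1980-01-01" for i in range(100_000)).encode()

compressed = gzip.compress(rows)
ratio = len(rows) / len(compressed)

print(f"uncompressed: {len(rows):,} bytes")
print(f"compressed:   {len(compressed):,} bytes")
print(f"ratio:        {ratio:.1f}x")
```

A table whose compressed file is under a memory threshold can therefore still be several times that size once expanded, before any per-object overhead is even counted.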
On 25 April 2012 at 3:36 PM, gemini alex wrote:

It seems you are using the default Hive configuration; the default map join allows only 25 MB for the small table. Copy your hive-default.xml to hive-site.xml and set hive.mapjoin.smalltable.filesize=300000000.

On 25 April 2012 at 12:09 AM, Ruben de Vries wrote:

I put the (rather big) log in a GitHub gist: https://gist.github.com/2480893
I also attached the plan.xml it was using to the gist.

When loading the members_map (11 million records, 320 MB, ~30 bytes per record), it seems to take about 198 bytes per record in the members_map, resulting in a crash around 7 million records with 1.4 GB loaded.

The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) where
- birthday is a string containing YYYY-MM-DD
- gender is a tinyint: 1, 2 or 3
- member_id is an int, with the highest member_id being 14343249 (14 million)

The log says:
"INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string birthdate}"

I also tried doing the same thing with an empty visit_stats table, with the same effect.

Some of the blogs I read talk about a 25 MB small table, not 300 MB like mine... Can anyone make anything out of this? I'd rather go with this if at all possible; otherwise I have to go the hard way and migrate all the visit_stats into buckets so they can match the members_map on that.

-----Original Message-----
From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
Sent: Tuesday, April 24, 2012 3:58 PM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben

The operation you are seeing in your log is the preparation of the hash table of the smaller table. This hash table file is compressed and loaded into the Distributed Cache, and from there it is used for map-side joins.
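Ruben's observation that ~30-byte text records expand to ~198 bytes each in the hash table can be checked, and projected forward, with back-of-envelope arithmetic from the numbers already in the thread (the per-row blow-up itself comes from Java object headers, boxed fields, and hash-map entry overhead):

```python
# All figures below are taken from the numbers quoted in this thread.
rows_on_disk = 11_000_000
bytes_on_disk = 320 * 1024 * 1024
bytes_per_row_disk = bytes_on_disk / rows_on_disk      # ~30 bytes/row on disk

heap_used = 1_468_378_760          # heap at the 7,000,000-row log line
rows_loaded = 7_000_000
bytes_per_row_mem = heap_used / rows_loaded            # ~210 bytes/row in memory

# Projected heap needed to hold the full 11M-row table:
projected_total = bytes_per_row_mem * rows_on_disk

print(f"{bytes_per_row_disk:.0f} B/row on disk, {bytes_per_row_mem:.0f} B/row in memory")
print(f"projected heap for full table: {projected_total / 1e9:.1f} GB")
```

At roughly 2.3 GB projected for the full table against a ~1.8 GB maximum heap, the OutOfMemoryError around the 7-million-row mark is exactly what this arithmetic predicts.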
From your console log, the hash table size/data size has grown to nearly 1.5 GB; the data is too large to be loaded into the memory of the Hive client.

2012-04-24 10:31:02   Processing rows: 7000000   Hashtable size: 6999999   Memory usage: 1,468,378,760   rate: 0.788

Can you enable debug logging and post the console output, to get a better picture of why it consumes this much memory? Start your hive shell as:

hive -hiveconf hive.root.logger=ALL,console

Regards
Bejoy KS

________________________________
From: Ruben de Vries
To: "user@hive.apache.org"
Sent: Tuesday, April 24, 2012 4:58 PM
Subject: FW: When/how to use partitions and buckets usefully?

Here are both tables:

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map
           1            1          247231757 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map
$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats
         442          441         1091837835 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats

The 'work' I'm seeing on the console is the loading of the table into memory? It seems like it's loading the visit_stats table instead?!
I tried doing MAPJOIN(visit_stats) but it fails on a non-existing class (my JSON SerDe).

From: Nitin Pawar [mailto:nitinpawar432@gmail.com]
Sent: Tuesday, April 24, 2012 11:46 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

This operation is erroring out on the Hive client itself, before starting a map, so splitting across mappers is out of the question.
Can you do a dfs count for the members_map table's HDFS location and tell us the result?

On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries wrote:
Hmm, I must be doing something wrong; the members_map table is 300-ish MB.
When I execute the following query:

SELECT /*+ MAPJOIN(members_map) */
  date_int,
  members_map.gender AS gender,
  'generic',
  COUNT( memberId ) AS unique,
  SUM( `generic`['count'] ) AS count,
  SUM( `generic`['seconds'] ) AS seconds
FROM visit_stats
JOIN members_map ON (members_map.member_id = visit_stats.memberId)
GROUP BY date_int, members_map.gender

It results in:

2012-04-24 10:25:59   Starting to launch local task to process map join; maximum memory = 1864171520
2012-04-24 10:26:00   Processing rows: 200000    Hashtable size: 199999    Memory usage: 43501848     rate: 0.023
2012-04-24 10:30:54   Processing rows: 6900000   Hashtable size: 6899999   Memory usage: 1449867552   rate: 0.778
2012-04-24 10:31:02   Processing rows: 7000000   Hashtable size: 6999999   Memory usage: 1468378760   rate: 0.788
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space

I'm running it on my local, single-node dev env; could that be a problem, since it won't split over multiple mappers in this case?

-----Original Message-----
From: Bejoy Ks [mailto:bejoy_ks@yahoo.com]
Sent: Tuesday, April 24, 2012 9:47 AM
To: user@hive.apache.org
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben

The map join hint is provided to Hive using the "MAPJOIN" keyword, as:
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key

To use map-side joins, some Hive configuration properties need to be enabled.

For plain map-side joins:
hive> SET hive.auto.convert.join=true;
Recent versions of Hive do a map join on the smaller table even if no map join hint is provided.

For bucketed map joins:
hive> SET hive.optimize.bucketmapjoin=true;

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

Regards
Bejoy

________________________________
From: Nitin Pawar
To: user@hive.apache.org
Sent: Tuesday, April 24, 2012 12:46 PM
Subject: Re: When/how to use partitions and buckets usefully?
If you are doing a map-side join, make sure the table members_map is small enough to hold in memory.

On 4/24/12, Ruben de Vries wrote:
> Wow, thanks everyone for the nice feedback!
>
> I can force a map-side join by doing /*+ STREAMTABLE(members_map) */ right?
>
> Cheers,
>
> Ruben de Vries
>
> -----Original Message-----
> From: Mark Grover [mailto:mgrover@oanda.com]
> Sent: Tuesday, April 24, 2012 3:17 AM
> To: user@hive.apache.org; bejoy ks
> Cc: Ruben de Vries
> Subject: Re: When/how to use partitions and buckets usefully?
>
> Hi Ruben,
> Like Bejoy pointed out, members_map is small enough to fit in memory,
> so your joins with visit_stats would be much faster with a map-side join.
>
> However, there is still some virtue in bucketing visit_stats.
> Bucketing can optimize joins, group-bys and potentially other queries
> in certain circumstances.
> You probably want to keep consistent bucketing columns across all your
> tables so they can be leveraged in multi-table queries. Most people use
> some power of 2 as their number of buckets. To make the best use of
> the buckets, each of your buckets should be able to load entirely into
> memory on the node.
>
> I use something close to the formula below to calculate the number of buckets:
>
> #buckets = (x * average_partition_size) / JVM_memory_available_to_your_Hadoop_tasknode
>
> I call x (>1) the "factor of conservatism". A higher x means you are
> being more conservative by having a larger number of buckets (and
> bearing the increased overhead); a lower x means the reverse. What x to
> use depends on your use case. This matters because the number of buckets in a table is fixed.
> If you have a large partition, it will distribute its data into
> bulkier buckets, and you would want to make sure these bulkier buckets
> can still fit in memory.
> Moreover, buckets are generated using a
> hashing function; if you have a strong bias towards a particular value
> of the bucketing column in your data, some buckets might be bulkier than
> others. In that case, you'd want to make sure that those bulkier buckets
> can still fit in memory.
>
> To summarize, it depends on:
> * How the actual partition sizes vary from the average partition size (i.e.
> the standard deviation of your partition sizes). A larger standard
> deviation means you should be more conservative in your calculation, and vice versa.
> * The distribution of the data in the bucketing columns. A "wider"
> distribution means you should be more conservative, and vice versa.
>
> Long story short, I would say an x of 2 to 4 should suffice in most
> cases, but feel free to verify that in your case :-) I would love to
> hear what factors others have been using when calculating their number of buckets, BTW!
> Whatever answer you get for #buckets from the above formula, use the
> closest power of 2 as the number of buckets in your table (I am not
> sure if this is a must, though).
>
> Good luck!
>
> Mark
>
> Mark Grover, Business Intelligence Analyst OANDA Corporation
>
> www: oanda.com www: fxtrade.com
> e: mgrover@oanda.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>
> ----- Original Message -----
> From: "Bejoy KS"
> To: "Ruben de Vries" , user@hive.apache.org
> Sent: Monday, April 23, 2012 12:39:17 PM
> Subject: Re: When/how to use partitions and buckets usefully?
>
> If the data is in HDFS, then you can bucket it only after loading it into a
> temp/staging table and then into the final bucketed table. Bucketing
> needs a MapReduce job.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
> From: Ruben de Vries
> Date: Mon, 23 Apr 2012 18:13:20 +0200
> To: user@hive.apache.org; bejoy_ks@yahoo.com
> Subject: RE: When/how to use partitions and buckets usefully?
> Thanks for the help so far guys,
>
> I bucketed the members_map; it's 330 MB in size (11 million records).
>
> Can you manually bucket stuff?
> Since my initial MapReduce job is still outside of Hive, I'm doing a
> LOAD DATA to import stuff into the visit_stats tables; replacing that
> with INSERT OVERWRITE ... SELECT slows it down a lot.
>
> From: Bejoy KS [mailto:bejoy_ks@yahoo.com]
> Sent: Monday, April 23, 2012 6:06 PM
> To: user@hive.apache.org
> Subject: Re: When/how to use partitions and buckets usefully?
>
> For a bucketed map join, both tables should be bucketed and the number
> of buckets of one should be a multiple of the other.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
> From: "Bejoy KS" < bejoy_ks@yahoo.com >
> Date: Mon, 23 Apr 2012 16:03:34 +0000
> To: < user@hive.apache.org >
> ReplyTo: bejoy_ks@yahoo.com
> Subject: Re: When/how to use partitions and buckets usefully?
>
> A bucketed map join would be good, I guess. What is the total size of the
> smaller table, and what is its expected size over the next few years?
> If the size is small enough to be put in the Distributed Cache, then
> map-side joins would offer you a big performance improvement.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
> From: Ruben de Vries < ruben.devries@hyves.nl >
> Date: Mon, 23 Apr 2012 17:38:20 +0200
> To: user@hive.apache.org
> ReplyTo: user@hive.apache.org
> Subject: RE: When/how to use partitions and buckets usefully?
>
> Ok, very clear on the partitions: try to make them match the WHERE
> clauses, not so much the GROUP BY clauses then ;)
>
> The member_map contains 11,636,619 records atm; I think bucketing
> those would be good?
> What's a good number to bucket them by then?
>
> And is there any point in bucketing the visit_stats?
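To answer "what's a good number to bucket them by", Mark's rule-of-thumb formula upthread can be turned into a small helper. The figures below are hypothetical, and the final rounding to a power of two follows his closing suggestion:

```python
import math

def num_buckets(avg_partition_bytes, jvm_heap_bytes, x=2):
    """Mark's rule of thumb: x (>1) is the 'factor of conservatism'.
    Returns the result rounded up to the nearest power of two."""
    raw = (x * avg_partition_bytes) / jvm_heap_bytes
    if raw <= 1:
        return 1
    return 2 ** math.ceil(math.log2(raw))

# Hypothetical numbers: 4 GB average partition, 1 GB task heap, x = 2.
print(num_buckets(4 * 2**30, 1 * 2**30))  # -> 8
```

With skewed data or widely varying partition sizes, a larger x (3 or 4, per Mark's suggestion) simply pushes the result to the next power of two up.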
> From: Tucker, Matt [mailto:Matt.Tucker@disney.com]
> Sent: Monday, April 23, 2012 5:30 PM
> To: user@hive.apache.org
> Subject: RE: When/how to use partitions and buckets usefully?
>
> If you're only interested in a certain window of dates for analysis, a
> date-based partition scheme will be helpful, as it will trim
> partitions that aren't needed by the query before execution.
>
> If the member_map table is small, you might consider testing the
> feasibility of map-side joins, as they reduce the number of
> processing stages. If member_map is large, bucketing on member_id will
> avoid having as many rows from visit_stats compared against each member_id for joins.
>
> Matt Tucker
>
> From: Ruben de Vries [mailto:ruben.devries@hyves.nl]
> Sent: Monday, April 23, 2012 11:19 AM
> To: user@hive.apache.org
> Subject: When/how to use partitions and buckets usefully?
>
> There seems to be enough information to be found on how to set up and
> use partitions and buckets.
> But I'm more interested in how to figure out when and on what columns you
> should be partitioning and bucketing to increase performance!
>
> In my case I've got 2 tables: visit_stats (member_id, date and some MAP
> cols which give me info about the visits) and member_map (member_id,
> gender, age).
>
> Usually I group by date and then one of the other cols, so I assume that
> partitioning on date is a good start?!
>
> It seems the join of the member_map onto the visit_stats makes the
> queries a lot slower; can that be fixed by bucketing both tables? Or just one of them?
>
> Maybe some people have written good blogs on this subject, but I can't
> really seem to find them!?
>
> Any help would be appreciated, thanks in advance :)

--
Nitin Pawar
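Matt's point about date-based partitions trimming work before execution can be illustrated outside of Hive. This toy sketch (hypothetical partition paths, loosely modelled on a date_int-partitioned visit_stats) mimics how a WHERE range touches only the matching partition directories, leaving the rest unread:

```python
from datetime import date

# One directory per daily partition, as a date-partitioned table would lay out.
partitions = {date(2012, 4, d): f"/warehouse/visit_stats/date_int={20120400 + d}"
              for d in range(1, 31)}

# WHERE date BETWEEN 2012-04-10 AND 2012-04-16: pruning keeps only those paths.
lo, hi = date(2012, 4, 10), date(2012, 4, 16)
pruned = [path for d, path in sorted(partitions.items()) if lo <= d <= hi]

print(f"{len(pruned)} of {len(partitions)} partitions scanned")
```

The win is proportional to how selective the date predicate is, which is why partitioning on the column that appears in most WHERE clauses (here, the date) pays off.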