Subject: Re: hive mapjoin decision process
From: Koert Kuipers
To: user@hive.apache.org
Date: Tue, 19 Jul 2011 18:57:53 -0400

thanks.
changing mapred.child.java.opts from -Xmx512m to -Xmx1024m did the trick, allocating more memory to the local hashmap dump.

On Tue, Jul 19, 2011 at 6:49 PM, yongqiang he wrote:
> >> i thought only one table needed to be small?
> Yes.
>
> >> hive.mapjoin.maxsize also apply to big table?
> No.
>
> >> i made sure hive.mapjoin.smalltable.filesize and hive.mapjoin.maxsize
> are set large enough to accommodate the small table. yet hive does not
> attempt to do a mapjoin.
>
> There are physical limitations. If the local machine cannot hold all
> records in memory locally, the local hashmap has to fail. So check
> your machine's memory or the memory allocated for hive.
>
> Thanks
> Yongqiang
>
> On Tue, Jul 19, 2011 at 1:55 PM, Koert Kuipers wrote:
> > thanks!
> > i only see hive create the hashmap dump and perform a mapjoin if both
> > tables are small. i thought only one table needed to be small?
> >
> > i am trying to join a very large table with a small table. i made sure
> > hive.mapjoin.smalltable.filesize and hive.mapjoin.maxsize are set large
> > enough to accommodate the small table. yet hive does not attempt to do a
> > mapjoin. does hive.mapjoin.maxsize also apply to the big table? or do i
> > need to look at other parameters as well?
> >
> > On Tue, Jul 19, 2011 at 4:15 PM, yongqiang he wrote:
> >> in most cases, the mapjoin falls back to a normal join for one of
> >> these three reasons:
> >> 1) the input table size is very big, so there will be no attempt at a
> >> mapjoin.
> >> 2) if one of the input tables is small (say, less than 25MB, which
> >> is configurable), hive will try a local hashmap dump. If that causes an
> >> OOM on the client side, it falls back to a normal join. The cause here
> >> is mostly very good compression on the input data.
> >> 3) the mapjoin actually got started and failed; it will fall back to a
> >> normal join. This is very unlikely to happen.
> >>
> >> Thanks
> >> Yongqiang
> >>
> >> On Tue, Jul 19, 2011 at 11:16 AM, Koert Kuipers wrote:
> >> > note: this is somewhat a repost of something i posted on the CDH3
> >> > user group. apologies if that is not appropriate.
> >> >
> >> > i am exploring map-joins in hive. with hive.auto.convert.join=true
> >> > hive tries to do a map-join and then falls back on a mapreduce-join
> >> > if certain conditions are not met. this sounds great. but when i do
> >> > a query and notice it falls back on a mapreduce-join, how can i see
> >> > which condition triggered the fallback (smalltable.filesize or
> >> > mapjoin.maxsize or something else, perhaps memory related)?
> >> >
> >> > i tried reading the default log that a hive session produces, but it
> >> > seems more like a massive json file than a log to me, so it is very
> >> > hard for me to interpret. i also turned on logging to console with
> >> > debugging, looking for any clues there, but without luck so far. is
> >> > the info there and am i just overlooking it? any ideas?
> >> >
> >> > thanks! koert
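
For reference, the knobs discussed in this thread can be set per session. The sketch below is illustrative only: the values are not recommendations, and the table names big_t/small_t are hypothetical. The MAPJOIN hint at the end is just one way to request a map-side join explicitly rather than relying on hive.auto.convert.join.

```sql
-- Session-level settings discussed in this thread (values illustrative).
SET hive.auto.convert.join=true;                -- let hive attempt a mapjoin automatically
SET hive.mapjoin.smalltable.filesize=25000000;  -- small-table threshold in bytes (the ~25MB mentioned above)
SET hive.mapjoin.maxsize=1000000;               -- cap on rows held in the local hashmap
SET mapred.child.java.opts=-Xmx1024m;           -- the fix from this thread: more heap for the child JVM

-- A mapjoin can also be requested explicitly with a hint (hypothetical tables):
SELECT /*+ MAPJOIN(small_t) */ big_t.id, small_t.name
FROM big_t JOIN small_t ON (big_t.id = small_t.id);
```

Note that the local hashmap dump runs in the client-side JVM, which is why raising the child JVM heap (rather than any table-size threshold) resolved the OOM-driven fallback described above.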