Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 64A25108A5 for ; Fri, 20 Sep 2013 21:56:09 +0000 (UTC) Received: (qmail 40310 invoked by uid 500); 20 Sep 2013 21:56:07 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 40263 invoked by uid 500); 20 Sep 2013 21:56:07 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 40254 invoked by uid 99); 20 Sep 2013 21:56:07 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 20 Sep 2013 21:56:07 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of javadba@gmail.com designates 209.85.160.49 as permitted sender) Received: from [209.85.160.49] (HELO mail-pb0-f49.google.com) (209.85.160.49) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 20 Sep 2013 21:56:00 +0000 Received: by mail-pb0-f49.google.com with SMTP id xb4so884057pbc.22 for ; Fri, 20 Sep 2013 14:55:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=NJntv0H3MvDX+GTdY6g2loMtxDz2mORQ4uFeSL4Dj00=; b=0v6YYbG03t9E9T9kIYNp2Tz+jpjn87J1YA1kItyYqHxj5AJjnj/SYPgqOSEEwanohQ qaPNIyYJ14nYPZlYICbZHRhiefyrstfcfZ3KofPl7/EfjUTsGMSNC7cN1ShDtWXXyyN5 +jpAWMIcuCzvHT97kpVZqbKoE6ExnBOtDLkc+rDl1P9+JAKKTX+93dFdkRnps/DHjRcO 62/0hK4Wt/lwWhm8ZHpdsNutEE9ZLYZtlQwab0jMEzkUvtf9R+AXJ0t4lkbeAwOQTve/ VAvWzt4tsNVcAT37fgAsVhVqLi9LJOERp7LfmKGPE8ZSIQv6jEn1ItBqc7Kt4qEiEZ6B romQ== MIME-Version: 1.0 X-Received: by 10.66.248.198 with SMTP id yo6mr10846019pac.143.1379714138816; Fri, 20 Sep 2013 14:55:38 -0700 (PDT) Received: by 10.68.109.34 with HTTP; Fri, 20 Sep 2013 14:55:38 -0700 (PDT) In-Reply-To: References: Date: Fri, 20 Sep 2013 14:55:38 -0700 Message-ID: Subject: Re: Loading data into partition taking seven times total of (map+reduce) on highly skewed data From: Stephen Boesch To: user@hive.apache.org Content-Type: multipart/alternative; boundary=047d7b15adcd22c62d04e6d7bbd0 X-Virus-Checked: Checked by ClamAV on apache.org --047d7b15adcd22c62d04e6d7bbd0 Content-Type: text/plain; charset=ISO-8859-1 Another detail: ~400 mappers 64 reducers 2013/9/20 Stephen Boesch > > We have a small (3GB /280M rows) table with 435 partitions that is highly > skewed: one partition has nearly 200M, two others have nearly 40M apiece, > then the remaining 432 have all together less than 1% of total table size. > > So .. the skew is something to be addressed. However - even give that - > why would the following occur? > > > Table Structure: > > # Partition Information > # col_name data_type comment > derived_create_dt string None > > # Detailed Table Information > .. > Protect Mode: None > Retention: 0 > .. > Table Type: MANAGED_TABLE > Table Parameters: > SORTBUCKETCOLSPREFIX TRUE > transient_lastDdlTime 1379678551 > > # Storage Information > SerDe Library: org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe > InputFormat: org.apache.hadoop.hive.ql.io.RCFileInputFormat > OutputFormat: org.apache.hadoop.hive.ql.io.RCFileOutputFormat > Compressed: No > Num Buckets: 64 > Bucket Columns: [station_id] > Sort Columns: [Order(col:station_id, order:1)] > Storage Desc Params: > serialization.format 1 > > HIGHLY SKEWED data: although > This particular load: > 300M rows > 4GB > 435 partitions > Over 99% of data in just 3 out of the 435 partitons > 2013-09-18 26733990 > 2013-09-19 191634067 > 2013-09-20 63790065 > > > > Map takes 10 min > Reduce 13 mins > Loading into partitions takes 3 hours 27 minutes > > > --047d7b15adcd22c62d04e6d7bbd0 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
Another detail: =A0 ~400 mappers =A064 reducers


2013/9/20 Stephen = Boesch <javadba@gmail.com>

We have a sm= all (3GB /280M rows) table with 435 partitions that is highly skewed: =A0on= e partition has nearly 200M, two others have nearly 40M apiece, then the re= maining 432 have all together less than 1% of total table size.

So .. the skew is something to be addressed. =A0However - ev= en give that - why would the following occur?

=A0=A0
Table Structure: =A0

=A0 =A0 =A0# Partition Information
# col_name =A0 =A0 =A0 =A0 =A0 =A0<= span style=3D"white-space:pre-wrap"> data_type =A0 =A0 =A0 =A0 =A0 <= span style=3D"white-space:pre-wrap"> comment
derived_create_dt =A0 string =A0 =A0 =A0 =A0 =A0 =A0 = =A0 None

# Detailed Table Informatio= n
..
Protect Mode: =A0 =A0 =A0 None
Retention: =A0 =A0 =A0 =A0 =A0 0
..
Table Type: =A0 =A0 =A0 =A0 MANAGED_TABLE
Table Parameters:
SORTBUCKETCOLSPREFIX TRUE
transient_lastDdlTime 1379678551

# Storage = Information
SerDe Li= brary: =A0 =A0 =A0 org.apache.h= adoop.hive.serde2.columnar.ColumnarSerDe
InputFormat: =A0 =A0 =A0 = =A0 org.apache.hadoop.hive.ql.i= o.RCFileInputFormat
= OutputFormat: =A0 =A0 =A0 org.= apache.hadoop.hive.ql.io.RCFileOutputFormat
Compressed: =A0 =A0 =A0 = =A0 No
Num Buckets: =A0 =A0 =A0 =A0 64
Bucket Columns: =A0 =A0 <= span style=3D"white-space:pre-wrap"> [station_id]
Sort Columns: =A0 =A0 =A0 [Order(col:station_id, order:1)]
Storage Desc Params:
serialization.format 1

= HIGHLY SKEWED data: =A0although
This particular load:
=A0 =A0 300M rows
=A0 =A0 = =A04GB
=A0 =A0 435 partitions
=A0 =A0 =A0 =A0Over 99% o= f data in just 3 out of the 435 partitons=A0
=A0 =A0 =A0 =A0 2013= -09-18 26733990=A0
=A0 =A0 =A0 2013-09-19 191= 634067
=A0 =A0 =A0 2013-09-20 63790065

=A0=A0

= Map takes 10 min
Reduce 13 mins
Loading into partitions takes 3 hours 27 minu= tes

=A0

--047d7b15adcd22c62d04e6d7bbd0--