Return-Path: X-Original-To: apmail-hive-dev-archive@www.apache.org Delivered-To: apmail-hive-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id B96841743C for ; Fri, 31 Oct 2014 03:45:34 +0000 (UTC) Received: (qmail 35770 invoked by uid 500); 31 Oct 2014 03:45:34 -0000 Delivered-To: apmail-hive-dev-archive@hive.apache.org Received: (qmail 35690 invoked by uid 500); 31 Oct 2014 03:45:34 -0000 Mailing-List: contact dev-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hive.apache.org Delivered-To: mailing list dev@hive.apache.org Received: (qmail 35678 invoked by uid 500); 31 Oct 2014 03:45:34 -0000 Delivered-To: apmail-hadoop-hive-dev@hadoop.apache.org Received: (qmail 35675 invoked by uid 99); 31 Oct 2014 03:45:34 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 31 Oct 2014 03:45:34 +0000 Date: Fri, 31 Oct 2014 03:45:34 +0000 (UTC) From: "Mostafa Mokhtar (JIRA)" To: hive-dev@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HIVE-8671) Overflow in estimate row count and data size with fetch column stats MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HIVE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191285#comment-14191285 ] Mostafa Mokhtar commented on HIVE-8671: --------------------------------------- For the code below the query had the following inputs: totalInputFileSize = 9223372036854775341 bytesPerReducer = 100000000 9223372036854775341 + 100000000 -> Overflow. {code} public static int estimateReducers(long totalInputFileSize, long bytesPerReducer, int maxReducers, boolean powersOfTwo) { int reducers = (int) ((totalInputFileSize + bytesPerReducer - 1) / bytesPerReducer); reducers = Math.max(1, reducers); reducers = Math.min(maxReducers, reducers); {code} I recommend changing to {code} int reducers = (int) ((Math.max(totalInputFileSize,bytesPerReducer )) / bytesPerReducer); {code} > Overflow in estimate row count and data size with fetch column stats > -------------------------------------------------------------------- > > Key: HIVE-8671 > URL: https://issues.apache.org/jira/browse/HIVE-8671 > Project: Hive > Issue Type: Bug > Components: Physical Optimizer > Affects Versions: 0.14.0 > Reporter: Mostafa Mokhtar > Assignee: Prasanth J > Priority: Critical > Fix For: 0.14.0 > > > Overflow in row counts and data size for several TPC-DS queries. > Interestingly the operators which have overflow end up running with a small parallelism. > For instance Reducer 2 has an overflow but it only runs with parallelism of 2. > {code} > Reducer 2 > Reduce Operator Tree: > Group By Operator > aggregations: sum(VALUE._col0) > keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float) > mode: mergepartial > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 > Statistics: Num rows: 9223372036854775807 Data size: 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col3 (type: string), _col3 (type: string) > sort order: ++ > Map-reduce partition columns: _col3 (type: string) > Statistics: Num rows: 9223372036854775807 Data size: 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE > value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float), _col5 (type: double) > Execution mode: vectorized > {code} > {code} > VERTEX TOTAL_TASKS DURATION_SECONDS CPU_TIME_MILLIS INPUT_RECORDS OUTPUT_RECORDS > Map 1 62 26.41 1,779,510 211,978,502 60,628,390 > Map 5 1 4.28 6,950 138,098 138,098 > Map 6 1 2.44 3,910 31 31 > Reducer 2 2 22.69 61,320 60,628,390 69,182 > Reducer 3 1 2.63 3,910 69,182 100 > Reducer 4 1 1.01 1,180 100 100 > {code} > Query > {code} > explain > select i_item_desc > ,i_category > ,i_class > ,i_current_price > ,i_item_id > ,sum(ws_ext_sales_price) as itemrevenue > ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over > (partition by i_class) as revenueratio > from > web_sales > ,item > ,date_dim > where > web_sales.ws_item_sk = item.i_item_sk > and item.i_category in ('Jewelry', 'Sports', 'Books') > and web_sales.ws_sold_date_sk = date_dim.d_date_sk > and date_dim.d_date between '2001-01-12' and '2001-02-11' > group by > i_item_id > ,i_item_desc > ,i_category > ,i_class > ,i_current_price > order by > i_category > ,i_class > ,i_item_id > ,i_item_desc > ,revenueratio > limit 100 > {code} > Explain > {code} > STAGE PLANS: > Stage: Stage-1 > Tez > Edges: > Map 1 <- Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE) > Reducer 2 <- Map 1 (SIMPLE_EDGE) > Reducer 3 <- Reducer 2 (SIMPLE_EDGE) > Reducer 4 <- Reducer 3 (SIMPLE_EDGE) > DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: web_sales > filterExpr: ws_item_sk is not null (type: boolean) > Statistics: Num rows: 21594638446 Data size: 2850189889652 Basic stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: ws_item_sk is not null (type: boolean) > Statistics: Num rows: 21594638446 Data size: 172746300152 Basic stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: ws_item_sk (type: int), ws_ext_sales_price (type: float), ws_sold_date_sk (type: int) > outputColumnNames: _col0, _col1, _col2 > Statistics: Num rows: 21594638446 Data size: 172746300152 Basic stats: COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Inner Join 0 to 1 > condition expressions: > 0 {_col0} {_col1} > 1 > keys: > 0 _col2 (type: int) > 1 _col0 (type: int) > outputColumnNames: _col0, _col1 > input vertices: > 1 Map 6 > Statistics: Num rows: 24145061366 Data size: 193160490928 Basic stats: COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Inner Join 0 to 1 > condition expressions: > 0 {_col1} > 1 {_col1} {_col2} {_col3} {_col4} {_col5} > keys: > 0 _col0 (type: int) > 1 _col0 (type: int) > outputColumnNames: _col1, _col6, _col7, _col8, _col9, _col10 > input vertices: > 1 Map 5 > Statistics: Num rows: 25381041158 Data size: 11929089344260 Basic stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: _col6 (type: string), _col7 (type: string), _col10 (type: string), _col9 (type: string), _col8 (type: float), _col1 (type: float) > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 > Statistics: Num rows: 25381041158 Data size: 11929089344260 Basic stats: COMPLETE Column stats: COMPLETE > Group By Operator > aggregations: sum(_col5) > keys: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float) > mode: hash > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 > Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float) > sort order: +++++ > Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float) > Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE > value expressions: _col5 (type: double) > Execution mode: vectorized > Map 5 > Map Operator Tree: > TableScan > alias: item > filterExpr: ((i_category) IN ('Jewelry', 'Sports', 'Books') and i_item_sk is not null) (type: boolean) > Statistics: Num rows: 462000 Data size: 663862160 Basic stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: ((i_category) IN ('Jewelry', 'Sports', 'Books') and i_item_sk is not null) (type: boolean) > Statistics: Num rows: 231000 Data size: 109491664 Basic stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: i_item_sk (type: int), i_item_id (type: string), i_item_desc (type: string), i_current_price (type: float), i_class (type: string), i_category (type: string) > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 > Statistics: Num rows: 231000 Data size: 109491664 Basic stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: int) > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 231000 Data size: 109491664 Basic stats: COMPLETE Column stats: COMPLETE > value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: float), _col4 (type: string), _col5 (type: string) > Execution mode: vectorized > Map 6 > Map Operator Tree: > TableScan > alias: date_dim > filterExpr: (d_date BETWEEN '2001-01-12' AND '2001-02-11' and d_date_sk is not null) (type: boolean) > Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: (d_date BETWEEN '2001-01-12' AND '2001-02-11' and d_date_sk is not null) (type: boolean) > Statistics: Num rows: 36524 Data size: 3579352 Basic stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: d_date_sk (type: int) > outputColumnNames: _col0 > Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: int) > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: _col0 (type: int) > outputColumnNames: _col0 > Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE > Group By Operator > keys: _col0 (type: int) > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE > Dynamic Partitioning Event Operator > Target Input: web_sales > Partition key expr: ws_sold_date_sk > Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE > Target column: ws_sold_date_sk > Target Vertex: Map 1 > Execution mode: vectorized > Reducer 2 > Reduce Operator Tree: > Group By Operator > aggregations: sum(VALUE._col0) > keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float) > mode: mergepartial > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 > Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col3 (type: string), _col3 (type: string) > sort order: ++ > Map-reduce partition columns: _col3 (type: string) > Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE > value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float), _col5 (type: double) > Execution mode: vectorized > Reducer 3 > Reduce Operator Tree: > Extract > Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE > PTF Operator > Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float), _col0 (type: string), _col5 (type: double), ((_col5 * 100.0) / _wcol0) (type: double) > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6 > Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col1 (type: string), _col2 (type: string), _col4 (type: string), _col0 (type: string), _col6 (type: double) > sort order: +++++ > Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE > TopN Hash Memory Usage: 0.04 > value expressions: _col3 (type: float), _col5 (type: double) > Reducer 4 > Reduce Operator Tree: > Select Operator > expressions: KEY.reducesinkkey3 (type: string), KEY.reducesinkkey0 (type: string), KEY.reducesinkkey1 (type: string), VALUE._col0 (type: float), KEY.reducesinkkey2 (type: string), VALUE._col1 (type: double), KEY.reducesinkkey4 (type: double) > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6 > Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE > Limit > Number of rows: 100 > Statistics: Num rows: 100 Data size: 800 Basic stats: COMPLETE Column stats: COMPLETE > File Output Operator > compressed: false > Statistics: Num rows: 100 Data size: 800 Basic stats: COMPLETE Column stats: COMPLETE > table: > input format: org.apache.hadoop.mapred.TextInputFormat > output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Execution mode: vectorized > Stage: Stage-0 > Fetch Operator > limit: 100 > Processor Tree: > ListSink > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)