hive-user mailing list archives

From mpeters...@gmail.com
Subject Re: Trouble with large table joins when using a CSV SerDE
Date Mon, 02 Dec 2013 20:23:50 GMT
I found a solution:

set hive.auto.convert.join.noconditionaltask.size=25000000;


The default is 100MB.  If I drop it to 25MB, the query now works.  I ran this
with my full 11-table join on production data, and it finished successfully
in a reasonable amount of time.

-Michael



On Mon, Dec 2, 2013 at 12:09 PM, <mpeterson2@gmail.com> wrote:

> Hi,
>
> I recently wrote a CSV SerDe for Hive.  It works in most scenarios, but
> I've found one situation where Hive fails with it.  I also tried the
> queries with the open-source CSV SerDe (https://github.com/ogrodnek) and I
> see the same issue.  So either we both wrote our SerDes incorrectly or
> there is an issue with Hive.
>
> I've tested this with Hive-0.11 (on Hortonworks Data Platform 1.3.2) and
> Hive-0.12 (on HDP-2.0.6) and I see the exact same behavior on each.
>
> *Details*
>
> Here are the key details.  The CSV SerDes work fine for:
>
>    - single table queries
>    - queries that do a join (left outer join) of a large table (large
>    number of rows) and a small table (small number of rows)
>
>
> The CSV SerDes break down when I do a join (left outer join) query
> between two large tables (in terms of number of rows).
>
> To make it more concrete, the query below succeeds when I have only 10
> rows in each table, but fails when I have millions of rows in each:
>
> select DISTINCT
>        p.PK1, p.PK2, l.Col3, p.Col4,
>        l.Col5, l.Col6, l.Col7,
>        l.Col8, l.Col9, l.Col10,
>        p.Col3, p.Col4, p.Col5, p.Col6, l.Col7,
>        p.Col8, p.Col9, p.Col15, p.Col17, p.Col18, p.Col19,
>        p.Col20, p.Col21, p.Col22, p.Col23, p.Col24,
>        p.Col25, p.Col26, p.Col27, p.Col28, p.Col29, p.Col30,
>        p.Col31, p.Col32, p.Col33, p.Col36, p.Col37, p.Col38, p.Col39
> from pr p
> left outer join lp l
>   on p.PK1 = l.PK1 and p.PK2 = l.PK2
>
>
> (I've changed the column names.)
>
> At the bottom of this email I give the table definitions for these.
>
> Also, the above query works fine for large datasets if I define the tables
> to use the default Hive SerDe (LazySimpleSerDe) and just tell it to split
> on commas (see the sketch below), so the issue definitely has to do with
> using the CSV SerDe.
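> To make that concrete, here is a sketch of the plain-text variant that
> works.  (The table name pr_text is made up for illustration; the columns
> continue exactly as in the pr definition at the bottom of this email.)
>
> CREATE EXTERNAL TABLE IF NOT EXISTS pr_text (
>   PK1           string,
>   PK2           string,
>   Col3          string,
>   -- ... Col4 through Col38 as in the pr definition below ...
>   Col39         string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> LOCATION '${INPUTDIR}/pr_uniq';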
>
> The error message I get when it fails on the large tables with the CSV
> SerDe is:
>
> Execution log at: /tmp/diuser/.log
> 2013-12-02 11:18:02    Starting to launch local task to process map join;    maximum memory = 1065484288
> 2013-12-02 11:18:08    Processing rows: 200000   Hashtable size: 199999   Memory usage: 187259488   rate: 0.176
> 2013-12-02 11:18:11    Processing rows: 300000   Hashtable size: 299999   Memory usage: 271135856   rate: 0.254
> 2013-12-02 11:18:13    Processing rows: 400000   Hashtable size: 399999   Memory usage: 360671568   rate: 0.339
> 2013-12-02 11:18:16    Processing rows: 500000   Hashtable size: 499999   Memory usage: 453548768   rate: 0.426
> 2013-12-02 11:18:18    Processing rows: 600000   Hashtable size: 599999   Memory usage: 525753560   rate: 0.493
> 2013-12-02 11:18:20    Processing rows: 700000   Hashtable size: 699999   Memory usage: 631831640   rate: 0.593
> Execution failed with exit status: 3
> Obtaining error information
>
> Task failed!
> Task ID:
>   Stage-5
>
> Logs:
>
> /tmp/diuser/hive.log
> FAILED: Execution Error, return code 3 from
> org.apache.hadoop.hive.ql.exec.MapredLocalTask
>
>
> I don't see any additional useful information in the hive.log file it
> directs me to, but I can attach that if it would help.
>
> It looks like an out-of-memory error: it fails while building the
> hashtable for a map join, and it doesn't fail if I force Hive not to do a
> map join (see item 1 below).
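> (A side note, based on my reading of the Hive configuration docs rather
> than anything I have verified: the "rate" column in the log above appears
> to be the fraction of the local task's maximum memory in use, and I
> believe the local task kills itself once that fraction crosses the
> threshold below.  Raising it might delay the failure, at the cost of
> risking a real OOM in the local task.)
>
> set hive.mapjoin.localtask.max.memory.usage=0.9;  -- the documented default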
>
>
> Here are things I've tried:
>
> 1. set hive.auto.convert.join=false
>
> This works in that it makes the error go away, but the real query I want
> to run is an 11-table join (only the first join is shown in the query
> above), and its running time goes from 6 minutes with the optimization
> turned on to 45 minutes with it turned off, so that is not an acceptable
> solution.
>
> 2. set hive.mapjoin.smalltable.filesize=250
>
> This has no effect.  The same error occurs.
>
> 3. set hive.map.aggr.hash.percentmemory=0.25
>
> (From the suggestion here:
> http://dev.bizo.com/2013/02/map-side-aggregations-in-apache-hive.html.)
> This also has no effect - same error.
>
>
> *Questions:*
>
>    - Has anyone seen this problem with SerDes they've built or used?
>    - Is there a known solution?
>    - Is this a defect either in the SerDe or Hive itself?
>
>
> The SerDe I built is not open source, but it is very similar to the
> open-source CSV SerDe above (https://github.com/ogrodnek), in case you
> want to look at that one and see if it has any defects that would cause
> this.
>
> Thank you,
> Michael
>
>
>
> *Table definitions:*
>
> CREATE EXTERNAL TABLE IF NOT EXISTS pr (
>   PK1           string,
>   PK2           string,
>   Col3          string,
>   Col4          string,
>   Col5          string,
>   Col6          string,
>   Col7          string,
>   Col8          string,
>   Col9          string,
>   Col10         string,
>   Col11         string,
>   Col12         string,
>   Col13         string,
>   Col14         string,
>   Col15         string,
>   Col16         string,
>   Col17         string,
>   Col18         string,
>   Col19         string,
>   Col20         string,
>   Col21         string,
>   Col22         string,
>   Col23         string,
>   Col24         string,
>   Col25         string,
>   Col26         string,
>   Col27         string,
>   Col28         string,
>   Col29         string,
>   Col30         string,
>   Col31         string,
>   Col32         string,
>   Col33         string,
>   Col34         string,
>   Col35         string,
>   Col36         string,
>   Col37         string,
>   Col38         string,
>   Col39         string)
> ROW FORMAT SERDE 'my.csv.CsvSerDe'
> LOCATION '${INPUTDIR}/pr_uniq'
> TBLPROPERTIES ('serde.retainOuterQuotes' = 'true', 'serde.trimWhitespace'
> = 'true');
>
>
> CREATE EXTERNAL TABLE IF NOT EXISTS lp (
>   PK1          string,
>   PK2          string,
>   Col3         string,
>   Col4         string,
>   Col5         string,
>   Col6         string,
>   Col7         string,
>   Col8         string,
>   Col9         string,
>   Col10        string,
>   Col11        string,
>   Col12        string,
>   Col13        string,
>   Col14        string,
>   Col15        string,
>   Col16        string,
>   Col17        string,
>   Col18        string,
>   Col19        string)
> ROW FORMAT SERDE 'my.csv.CsvSerDe'
> LOCATION '${INPUTDIR}/lp'
> TBLPROPERTIES ('serde.retainOuterQuotes' = 'true', 'serde.trimWhitespace'
> = 'true');
>
>
