impala-user mailing list archives

From Bin Wang <wbi...@gmail.com>
Subject Re: Memory limit exceed even with very simple count query
Date Thu, 06 Apr 2017 04:05:09 GMT
I converted these Avro files to JSON with avro-tools, and the JSON files are
no larger than 1GB, so Impala should be able to read them. Some of the Avro
files are corrupt.

16M     log.2017-04-05.1491321605834.avro.json
308M    log.2017-04-05.1491323647211.avro.json
103M    log.2017-04-05.1491327241311.avro.json
150M    log.2017-04-05.1491330839609.avro.json
397M    log.2017-04-05.1491334439092.avro.json
297M    log.2017-04-05.1491338038503.avro.json
160M    log.2017-04-05.1491341639694.avro.json
95M     log.2017-04-05.1491345239969.avro.json
360M    log.2017-04-05.1491348843931.avro.json
338M    log.2017-04-05.1491352442955.avro.json
71M     log.2017-04-05.1491359648079.avro.json
161M    log.2017-04-05.1491363247597.avro.json
628M    log.2017-04-05.1491366845827.avro.json
288M    log.2017-04-05.1491370445873.avro.json
162M    log.2017-04-05.1491374045830.avro.json
90M     log.2017-04-05.1491377650935.avro.json
269M    log.2017-04-05.1491381249597.avro.json
620M    log.2017-04-05.1491384846366.avro.json
70M     log.2017-04-05.1491388450262.avro.json
30M     log.2017-04-05.1491392047694.avro.json
114M    log.2017-04-05.1491395648818.avro.json
370M    log.2017-04-05.1491399246407.avro.json
359M    log.2017-04-05.1491402846469.avro.json
218M    log.2017-04-05.1491406180615.avro.json
29M     log.2017-04-05.1491409790105.avro.json
3.9M    log.2017-04-05.1491413385884.avro.json
9.3M    log.2017-04-05.1491416981829.avro.json
8.3M    log.2017-04-05.1491420581588.avro.json
2.3M    log.2017-04-05.1491424180191.avro.json
25M     log.2017-04-05.1491427781339.avro.json
24M     log.2017-04-05.1491431382552.avro.json
5.7M    log.2017-04-05.1491434984679.avro.json
35M     log.2017-04-05.1491438586674.avro.json
5.8M    log.2017-04-05.1491442192541.avro.json
23M     log.2017-04-05.1491445789230.avro.json
4.3M    log.2017-04-05.1491449386630.avro.json

On Thu, Apr 6, 2017 at 11:34 AM, Bin Wang <wbin00@gmail.com> wrote:

> And here is another question: how does Impala estimate the uncompressed file
> size? All the gzipped files are no bigger than 300MB, so I think it should be
> OK to decompress them.
>
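The thread doesn't say how Impala derives its estimate. Gzip itself, however, records the uncompressed length (modulo 2^32) in the last four bytes of each member (the ISIZE field, RFC 1952), which is one way to check a file's expected expansion yourself. Note the modulo: for files that expand past 4GB, as here, the recorded value wraps and is unreliable. A sketch:

```python
import gzip
import struct

def gzip_isize(data: bytes) -> int:
    """Read the ISIZE trailer of a single-member gzip stream:
    the uncompressed length modulo 2**32 (RFC 1952)."""
    return struct.unpack("<I", data[-4:])[0]

payload = b"x" * 1_000_000
compressed = gzip.compress(payload)
print(gzip_isize(compressed))  # 1000000
```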
> On Thu, Apr 6, 2017 at 9:31 AM, Bin Wang <wbin00@gmail.com> wrote:
>
> Is the snappy decompressor for Avro or Parquet streaming?
>
> On Thu, Apr 6, 2017 at 9:27 AM, Alex Behm <alex.behm@cloudera.com> wrote:
>
> I'd say following the best practices with Parquet should work fine. Create
> snappy-compressed Parquet files of roughly 256MB in size.
> If you want to stick with Avro, then yes, you'll just have to create
> smaller files.
>
> On Wed, Apr 5, 2017 at 6:23 PM, Bin Wang <wbin00@gmail.com> wrote:
>
> So the best workaround I can do for now is to split the files
> into smaller ones?
>
> On Thu, Apr 6, 2017 at 9:14 AM, Alex Behm <alex.behm@cloudera.com> wrote:
>
> Parquet makes more sense, particularly for the kind of query you have.
>
> Still, you might want to be careful with creating huge gzipped files.
> Impala's gzip decompressor for Parquet is also not streaming.
>
> On Wed, Apr 5, 2017 at 6:09 PM, Bin Wang <wbin00@gmail.com> wrote:
>
> So as a workaround, does it make sense to convert it to a Parquet table
> with Hive?
>
> And I think it's worth mentioning this in the Avro table documentation,
> because it is unexpected behavior for many users.
>
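A minimal sketch of the Hive conversion discussed above (the target table name and the snappy property are assumptions, not from the thread; a plain CTAS copies the schema but does not preserve the "day" partitioning without an explicit PARTITIONED BY clause):

```sql
-- Hypothetical one-off conversion in Hive; adjust names and schema to taste.
CREATE TABLE adhoc_data_fast.log_parquet
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
AS SELECT * FROM adhoc_data_fast.log;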
> On Thu, Apr 6, 2017 at 02:52, Alex Behm <alex.behm@cloudera.com> wrote:
>
> Gzip supports streaming decompression, but we currently only implement
> that for text tables.
>
> Doing streaming decompression certainly makes sense for Avro as well.
>
> I filed https://issues.apache.org/jira/browse/IMPALA-5170 for this
> improvement.
>
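Streaming decompression, as IMPALA-5170 proposes, keeps peak memory near a fixed chunk size rather than the full uncompressed size. A minimal Python sketch of the idea (not Impala's actual code path):

```python
import gzip
import io

def streamed_size(compressed: bytes, chunk_size: int = 64 * 1024) -> int:
    """Decompress a gzip stream chunk by chunk; peak memory stays
    near chunk_size instead of the full uncompressed size."""
    total = 0
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total

data = b"log line\n" * 200_000            # ~1.8 MB uncompressed
print(streamed_size(gzip.compress(data)))  # 1800000
```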
> On Wed, Apr 5, 2017 at 10:37 AM, Marcel Kornacker <marcel@cloudera.com>
> wrote:
>
> On Wed, Apr 5, 2017 at 10:14 AM, Bin Wang <wbin00@gmail.com> wrote:
> > Will Impala load the whole file into memory? That sounds horrible. And
> > with "show partitions adhoc_data_fast.log", the compressed files are no
> > bigger than 4GB:
>
> The *uncompressed* size of one of your files is 50GB. Gzip needs to
> allocate memory for that.
>
> >
> > | 2017-04-04 | -1 | 46 | 2.69GB | NOT CACHED | NOT CACHED | AVRO | false | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-04 |
> > | 2017-04-05 | -1 | 25 | 3.42GB | NOT CACHED | NOT CACHED | AVRO | false | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-05 |
> >
> >
> > On Thu, Apr 6, 2017 at 12:58 AM, Marcel Kornacker <marcel@cloudera.com> wrote:
> >>
> >> Apparently you have a gzipped file that is >=50GB. You either need to
> >> break up those files, or run on larger machines.
> >>
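For gzipped *text* data, breaking one large file into smaller gzipped pieces can be done in a stream, keeping whole lines together. A sketch (text only; Avro container files cannot be split this way, since records must remain inside valid Avro blocks, so those need a record-aware tool):

```python
import gzip
import io

def split_gzip_text(compressed: bytes, max_bytes: int):
    """Re-split one gzipped text stream into several smaller gzipped
    chunks, keeping whole lines together. Returns a list of gzip blobs."""
    chunks, current, size = [], [], 0
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as f:
        for line in f:
            if current and size + len(line) > max_bytes:
                chunks.append(gzip.compress(b"".join(current)))
                current, size = [], 0
            current.append(line)
            size += len(line)
    if current:
        chunks.append(gzip.compress(b"".join(current)))
    return chunks

big = gzip.compress(b"one line of log data\n" * 1000)
parts = split_gzip_text(big, max_bytes=4096)
print(len(parts))  # 6
```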
> >> On Wed, Apr 5, 2017 at 9:52 AM, Bin Wang <wbin00@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > I've been using Impala in production for a while, but since yesterday
> >> > some queries have been failing with "memory limit exceeded". I then tried
> >> > a very simple count query, and it still exceeds the memory limit.
> >> >
> >> > The query is:
> >> >
> >> > select count(0) from adhoc_data_fast.log where day>='2017-04-04' and
> >> > day<='2017-04-06';
> >> >
> >> > And the response in the Impala shell is:
> >> >
> >> > Query submitted at: 2017-04-06 00:41:00 (Coordinator:
> >> > http://szq7.appadhoc.com:25000)
> >> > Query progress can be monitored at:
> >> >
> >> >
> http://szq7.appadhoc.com:25000/query_plan?query_id=4947a3fecd146df4:734bcc1d00000000
> >> > WARNINGS:
> >> > Memory limit exceeded
> >> > GzipDecompressor failed to allocate 54525952000 bytes.
> >> >
> >> > I have many nodes, and each of them has plenty of memory available
> >> > (~60 GB). The query fails very quickly after I execute it, and the
> >> > nodes show almost no memory usage.
> >> >
> >> > The table "adhoc_data_fast.log" is an Avro table, encoded with gzip
> >> > and partitioned by the field "day". Each partition has no more than
> >> > one billion rows.
> >> >
> >> > My Impala version is:
> >> >
> >> > hdfs@szq7:/home/ubuntu$ impalad --version
> >> > impalad version 2.7.0-cdh5.9.1 RELEASE (build
> >> > 24ad6df788d66e4af9496edb26ac4d1f1d2a1f2c)
> >> > Built on Wed Jan 11 13:39:25 PST 2017
> >> >
> >> > Can anyone help with this? Thanks very much!
> >> >
>
>
>
>
>
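For scale: the failed allocation reported by the GzipDecompressor above is exactly 52,000 MiB, about 50.8 GiB, just under the ~60 GB a node has available. A quick check of the arithmetic:

```python
failed_alloc = 54_525_952_000          # bytes, from the GzipDecompressor error
print(failed_alloc / 2**20)            # 52000.0 (MiB)
print(round(failed_alloc / 2**30, 2))  # 50.78 (GiB)
```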
