impala-user mailing list archives

From Alex Behm <alex.b...@cloudera.com>
Subject Re: Memory limit exceed even with very simple count query
Date Thu, 06 Apr 2017 01:14:12 GMT
Parquet makes more sense, particularly for the kind of query you have.

Still, you might want to be careful with creating huge gzipped files.
Impala's gzip decompressor for Parquet is also not streaming.
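
For example, the conversion could be done with a CTAS from the Impala side. This
is only a sketch: the new table name is made up, and I'm assuming "day" stays
the partition column and that snappy-compressed Parquet is acceptable.

  -- write snappy-compressed Parquet rather than gzip, so decompression stays cheap
  set COMPRESSION_CODEC=snappy;

  -- "log_parquet" is a hypothetical name; the partition column must come last
  -- in the select list, which "select *" already satisfies here
  create table adhoc_data_fast.log_parquet
    partitioned by (day)
    stored as parquet
  as select * from adhoc_data_fast.log;

The equivalent from Hive (create the target table stored as Parquet and load it
with a dynamic-partition insert ... select) should work as well.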

On Wed, Apr 5, 2017 at 6:09 PM, Bin Wang <wbin00@gmail.com> wrote:

> So as a workaround, does it make sense to convert it to a Parquet table
> with Hive?
>
> And I think it would be better to mention this in the Avro table documentation,
> because it is unexpected behavior for many users.
>
> On Thu, Apr 6, 2017 at 02:52, Alex Behm <alex.behm@cloudera.com> wrote:
>
>> Gzip supports streaming decompression, but we currently only implement
>> that for text tables.
>>
>> Doing streaming decompression certainly makes sense for Avro as well.
>>
>> I filed https://issues.apache.org/jira/browse/IMPALA-5170 for this
>> improvement.
>>
>> On Wed, Apr 5, 2017 at 10:37 AM, Marcel Kornacker <marcel@cloudera.com>
>> wrote:
>>
>> On Wed, Apr 5, 2017 at 10:14 AM, Bin Wang <wbin00@gmail.com> wrote:
>> > Will Impala load the whole file into memory? That sounds horrible. And
>> > according to "show partitions adhoc_data_fast.log", the compressed files are
>> > no bigger than 4GB:
>>
>> The *uncompressed* size of one of your files is 50GB. Gzip needs to
>> allocate memory for that.
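>> (That matches the error: the GzipDecompressor tried to allocate 54525952000
>> bytes for one file, which is roughly 50.8 GiB of uncompressed data.)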
>>
>> >
>> > | 2017-04-04 | -1    | 46     | 2.69GB   | NOT CACHED   | NOT CACHED   | AVRO   | false             | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-04 |
>> > | 2017-04-05 | -1    | 25     | 3.42GB   | NOT CACHED   | NOT CACHED   | AVRO   | false             | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-05 |
>> >
>> >
>> > On Thu, Apr 6, 2017 at 12:58 AM, Marcel Kornacker <marcel@cloudera.com> wrote:
>> >>
>> >> Apparently you have a gzipped file that is >=50GB. You either need to
>> >> break up those files, or run on larger machines.
>> >>
>> >> On Wed, Apr 5, 2017 at 9:52 AM, Bin Wang <wbin00@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > I've been using Impala in production for a while, but since yesterday some
>> >> > queries have reported "memory limit exceeded". Then I tried a very simple
>> >> > count query, and it still exceeded the memory limit.
>> >> >
>> >> > The query is:
>> >> >
>> >> > select count(0) from adhoc_data_fast.log where day>='2017-04-04' and day<='2017-04-06';
>> >> >
>> >> > And the response in the Impala shell is:
>> >> >
>> >> > Query submitted at: 2017-04-06 00:41:00 (Coordinator:
>> >> > http://szq7.appadhoc.com:25000)
>> >> > Query progress can be monitored at:
>> >> >
>> >> > http://szq7.appadhoc.com:25000/query_plan?query_id=4947a3fecd146df4:734bcc1d00000000
>> >> > WARNINGS:
>> >> > Memory limit exceeded
>> >> > GzipDecompressor failed to allocate 54525952000 bytes.
>> >> >
>> >> > I have many nodes and each of them has lots of memory available (~60 GB).
>> >> > The query failed very quickly after I executed it, and the nodes showed
>> >> > almost no memory usage.
>> >> >
>> >> > The table "adhoc_data_fast.log" is an Avro table, encoded with gzip and
>> >> > partitioned by the field "day". Each partition has no more than one
>> >> > billion rows.
>> >> >
>> >> > My Impala version is:
>> >> >
>> >> > hdfs@szq7:/home/ubuntu$ impalad --version
>> >> > impalad version 2.7.0-cdh5.9.1 RELEASE (build
>> >> > 24ad6df788d66e4af9496edb26ac4d1f1d2a1f2c)
>> >> > Built on Wed Jan 11 13:39:25 PST 2017
>> >> >
>> >> > Can anyone help with this? Thanks very much!
>> >> >
>>
>>
>>
