impala-user mailing list archives

From Silvius Rus <s...@cloudera.com>
Subject Re: Memory limit exceed even with very simple count query
Date Thu, 06 Apr 2017 22:40:02 GMT
There are some known outstanding issues that reference the word "corrupt":

https://issues.apache.org/jira/issues/?jql=project%20%3D%20impala%20and%20resolution%20is%20empty%20and%20text%20~%20corrupt

Feel free to post a new JIRA if you believe you are running into a new bug.

On Thu, Apr 6, 2017 at 12:41 AM, Bin Wang <wbin00@gmail.com> wrote:

> Hi,
>
> I've dug into it and found two files that trigger this problem. After
> removing them from the partition, I can query it again. But these two
> files are small both before and after compression (smaller than 400 MB).
> The only problem is that they have some corrupt data at the end. Is this
> a bug in Impala?
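>
> In case it is useful to anyone else hitting this, here is a rough,
> untested sketch of how such truncated files could be spotted by
> stream-reading them until decoding fails, assuming the third-party
> fastavro package (not what we actually used):
>
> # Hypothetical check: stream-read each Avro file and report where decoding fails.
> # Assumes the third-party fastavro package; file paths come from the command line.
> import sys
> import fastavro
>
> for path in sys.argv[1:]:
>     count = 0
>     try:
>         with open(path, "rb") as f:
>             for _record in fastavro.reader(f):
>                 count += 1
>     except Exception as exc:  # a truncated or corrupt block typically raises here
>         print("%s: corrupt after %d records (%s)" % (path, count, exc))
>     else:
>         print("%s: OK, %d records" % (path, count))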
>
> Regards,
> Bin Wang
>
> Bin Wang <wbin00@gmail.com> wrote on Thu, Apr 6, 2017 at 12:05 PM:
>
>> I converted these Avro files to JSON with avro-tools, and the JSON files
>> are no larger than 1 GB, so Impala should be able to read them. Some of
>> the Avro files are corrupt.
>>
>> 16M     log.2017-04-05.1491321605834.avro.json
>> 308M    log.2017-04-05.1491323647211.avro.json
>> 103M    log.2017-04-05.1491327241311.avro.json
>> 150M    log.2017-04-05.1491330839609.avro.json
>> 397M    log.2017-04-05.1491334439092.avro.json
>> 297M    log.2017-04-05.1491338038503.avro.json
>> 160M    log.2017-04-05.1491341639694.avro.json
>> 95M     log.2017-04-05.1491345239969.avro.json
>> 360M    log.2017-04-05.1491348843931.avro.json
>> 338M    log.2017-04-05.1491352442955.avro.json
>> 71M     log.2017-04-05.1491359648079.avro.json
>> 161M    log.2017-04-05.1491363247597.avro.json
>> 628M    log.2017-04-05.1491366845827.avro.json
>> 288M    log.2017-04-05.1491370445873.avro.json
>> 162M    log.2017-04-05.1491374045830.avro.json
>> 90M     log.2017-04-05.1491377650935.avro.json
>> 269M    log.2017-04-05.1491381249597.avro.json
>> 620M    log.2017-04-05.1491384846366.avro.json
>> 70M     log.2017-04-05.1491388450262.avro.json
>> 30M     log.2017-04-05.1491392047694.avro.json
>> 114M    log.2017-04-05.1491395648818.avro.json
>> 370M    log.2017-04-05.1491399246407.avro.json
>> 359M    log.2017-04-05.1491402846469.avro.json
>> 218M    log.2017-04-05.1491406180615.avro.json
>> 29M     log.2017-04-05.1491409790105.avro.json
>> 3.9M    log.2017-04-05.1491413385884.avro.json
>> 9.3M    log.2017-04-05.1491416981829.avro.json
>> 8.3M    log.2017-04-05.1491420581588.avro.json
>> 2.3M    log.2017-04-05.1491424180191.avro.json
>> 25M     log.2017-04-05.1491427781339.avro.json
>> 24M     log.2017-04-05.1491431382552.avro.json
>> 5.7M    log.2017-04-05.1491434984679.avro.json
>> 35M     log.2017-04-05.1491438586674.avro.json
>> 5.8M    log.2017-04-05.1491442192541.avro.json
>> 23M     log.2017-04-05.1491445789230.avro.json
>> 4.3M    log.2017-04-05.1491449386630.avro.json
>>
>> Bin Wang <wbin00@gmail.com> wrote on Thu, Apr 6, 2017 at 11:34 AM:
>>
>> And here is another question: how does Impala estimate the unzipped file
>> size? All the gzipped files are no bigger than 300 MB, so I thought it
>> would be fine to unzip them.
>>
>> Bin Wang <wbin00@gmail.com> wrote on Thu, Apr 6, 2017 at 9:31 AM:
>>
>> Is the snappy decompressor for Avro or Parquet streaming?
>>
>> Alex Behm <alex.behm@cloudera.com> wrote on Thu, Apr 6, 2017 at 9:27 AM:
>>
>> I'd say following the best practices with Parquet should work fine.
>> Create snappy-compressed Parquet files of roughly 256MB in size.
>> If you want to stick with Avro, then yes, you'll just have to create
>> smaller files.
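>>
>> As a rough illustration of that direction (not a recommendation of exact
>> tooling), here is one way to rewrite a single Avro file as a
>> snappy-compressed Parquet file, assuming the third-party fastavro and
>> pyarrow packages and a made-up file name; inputs too large for memory
>> would need to be written in batches:
>>
>> # Sketch: read one Avro file and rewrite it as snappy-compressed Parquet.
>> # Assumes fastavro and pyarrow are installed; the file name is hypothetical.
>> import fastavro
>> import pyarrow as pa
>> import pyarrow.parquet as pq
>>
>> with open("log.2017-04-05.avro", "rb") as f:
>>     records = list(fastavro.reader(f))
>>
>> table = pa.Table.from_pylist(records)
>> pq.write_table(table, "log.2017-04-05.parquet", compression="snappy")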
>>
>> On Wed, Apr 5, 2017 at 6:23 PM, Bin Wang <wbin00@gmail.com> wrote:
>>
>> So the best I can do to workaround this for now is splitting the files
>> into small files?
>>
>> Alex Behm <alex.behm@cloudera.com> wrote on Thu, Apr 6, 2017 at 9:14 AM:
>>
>> Parquet makes more sense, particularly for the kind of query you have.
>>
>> Still, you might want to be careful with creating huge gzipped files.
>> Impala's gzip decompressor for Parquet is also not streaming.
>>
>> On Wed, Apr 5, 2017 at 6:09 PM, Bin Wang <wbin00@gmail.com> wrote:
>>
>> So as a workaround, does it make sense to convert it to a Parquet table
>> with Hive?
>>
>> Also, I think it would be better to mention this in the Avro table
>> documentation, because it is unexpected behavior for many users.
>>
>> Alex Behm <alex.behm@cloudera.com> wrote on Thu, Apr 6, 2017 at 02:52:
>>
>> Gzip supports streaming decompression, but we currently only implement
>> that for text tables.
>>
>> Doing streaming decompression certainly makes sense for Avro as well.
>>
>> I filed https://issues.apache.org/jira/browse/IMPALA-5170 for this
>> improvement.
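>>
>> To make the distinction concrete: a streaming decompressor walks the
>> compressed bytes with a fixed-size window, so memory stays bounded by the
>> chunk size regardless of the uncompressed size, instead of allocating one
>> buffer large enough for the entire uncompressed output. A rough Python
>> sketch of the streaming idea (illustration only, not Impala's actual C++
>> code):
>>
>> # Count the uncompressed size of a .gz file without ever holding more than
>> # one chunk of decompressed data in memory.
>> import zlib
>>
>> def stream_uncompressed_size(path, chunk_size=1 << 20):
>>     d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS: expect a gzip header
>>     total = 0
>>     with open(path, "rb") as f:
>>         while True:
>>             chunk = f.read(chunk_size)
>>             if not chunk:
>>                 break
>>             total += len(d.decompress(chunk))
>>     return total + len(d.flush())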
>>
>> On Wed, Apr 5, 2017 at 10:37 AM, Marcel Kornacker <marcel@cloudera.com>
>> wrote:
>>
>> On Wed, Apr 5, 2017 at 10:14 AM, Bin Wang <wbin00@gmail.com> wrote:
>> > Will Impala load the whole file into memory? That sounds horrible. And
>> > according to "show partitions adhoc_data_fast.log", the compressed files
>> > are no bigger than 4 GB:
>>
>> The *uncompressed* size of one of your files is 50GB. Gzip needs to
>> allocate memory for that.
>>
>> >
>> > | 2017-04-04 | -1 | 46 | 2.69GB | NOT CACHED | NOT CACHED | AVRO | false | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-04 |
>> > | 2017-04-05 | -1 | 25 | 3.42GB | NOT CACHED | NOT CACHED | AVRO | false | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-05 |
>> >
>> >
>> > Marcel Kornacker <marcel@cloudera.com> wrote on Thu, Apr 6, 2017 at 12:58 AM:
>> >>
>> >> Apparently you have a gzipped file that is >=50GB. You either need to
>> >> break up those files, or run on larger machines.
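>> >>
>> >> If you go the splitting route, the general shape is just to stream
>> >> records out of the big container and start a new output file every N
>> >> records. A rough, untested sketch using the third-party fastavro
>> >> package; file names and the records-per-file value are made up:
>> >>
>> >> # Sketch: split one large Avro container into smaller ones, reusing its schema.
>> >> import itertools
>> >> import fastavro
>> >>
>> >> RECORDS_PER_FILE = 1000000  # hypothetical; tune so output files stay small
>> >>
>> >> with open("log.big.avro", "rb") as src:
>> >>     reader = fastavro.reader(src)
>> >>     schema = reader.writer_schema
>> >>     for i in itertools.count():
>> >>         batch = list(itertools.islice(reader, RECORDS_PER_FILE))
>> >>         if not batch:
>> >>             break
>> >>         with open("log.part-%04d.avro" % i, "wb") as dst:
>> >>             fastavro.writer(dst, schema, batch, codec="deflate")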
>> >>
>> >> On Wed, Apr 5, 2017 at 9:52 AM, Bin Wang <wbin00@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > I've been using Impala in production for a while, but since yesterday
>> >> > some queries have been reporting "memory limit exceeded". Then I tried
>> >> > a very simple count query, and it still exceeds the memory limit.
>> >> >
>> >> > The query is:
>> >> >
>> >> > select count(0) from adhoc_data_fast.log where day>='2017-04-04' and
>> >> > day<='2017-04-06';
>> >> >
>> >> > And the response in the Impala shell is:
>> >> >
>> >> > Query submitted at: 2017-04-06 00:41:00 (Coordinator:
>> >> > http://szq7.appadhoc.com:25000)
>> >> > Query progress can be monitored at:
>> >> >
>> >> > http://szq7.appadhoc.com:25000/query_plan?query_id=4947a3fecd146df4:734bcc1d00000000
>> >> > WARNINGS:
>> >> > Memory limit exceeded
>> >> > GzipDecompressor failed to allocate 54525952000 bytes.
>> >> >
>> >> > I have many nodes, and each of them has lots of memory available
>> >> > (~60 GB). The query fails very quickly after I execute it, and the
>> >> > nodes show almost no memory usage.
>> >> >
>> >> > The table "adhoc_data_fast.log" is an Avro table, encoded with gzip
>> >> > and partitioned by the field "day". Each partition has no more than
>> >> > one billion rows.
>> >> >
>> >> > My Impala version is:
>> >> >
>> >> > hdfs@szq7:/home/ubuntu$ impalad --version
>> >> > impalad version 2.7.0-cdh5.9.1 RELEASE (build
>> >> > 24ad6df788d66e4af9496edb26ac4d1f1d2a1f2c)
>> >> > Built on Wed Jan 11 13:39:25 PST 2017
>> >> >
>> >> > Can anyone help with this? Thanks very much!
>> >> >
>>
>>
>>
>>
>>
