hive-user mailing list archives

From Avrilia Floratou <avrilia.flora...@gmail.com>
Subject Re: ORC file question
Date Mon, 10 Feb 2014 21:46:25 GMT
Hi Prasanth,
Here are the answers to your questions:

1) Yes, I have set both: set hive.optimize.ppd=true; and set
hive.optimize.index.filter=true;
2) From describe extended:  inputFormat:
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
3) Hive 0.12
4) Select max (I1) from table;

Thanks,
Avrilia


On Mon, Feb 10, 2014 at 1:35 PM, Prasanth Jayachandran <
pjayachandran@hortonworks.com> wrote:

> Hi Avrilia
>
> I have a few more questions:
>
> 1) Have you enabled ORC predicate pushdown by setting
> hive.optimize.index.filter?
> 2) What is the value for hive.input.format?
> 3) Which hive version are you using?
> 4) What query are you using?
>
> Thanks
> Prasanth Jayachandran
>
> On Feb 10, 2014, at 1:26 PM, Avrilia Floratou <avrilia.floratou@gmail.com>
> wrote:
>
> Hi Prasanth,
>
> No, it's not a partitioned table. The table consists of a single file
> (91.7 GB). When I created the table, I loaded the data from a text table
> into the ORC table using only 1 map task, so that one large file was created
> rather than many small files. This is why I'm confused by this
> behavior. It seems that the first 180 map tasks read a total of only 3 MB
> (all together), and then the remaining map tasks do the actual work. Any
> other ideas on why this might be happening?
>
> Thanks,
> Avrilia
>
>
> On Mon, Feb 10, 2014 at 10:55 AM, Prasanth Jayachandran <
> pjayachandran@hortonworks.com> wrote:
>
>> Hi Avrilia
>>
>> Is it a partitioned table? If so, approximately how many partitions and
>> how many files are there? What is the value of hive.input.format?
>>
>> My suspicion is that there are ~180 files and that each file is ~515MB in
>> size. Since you mentioned that you are using the default stripe size, i.e.
>> 256MB, the default HDFS block size for ORC files will be chosen as 512MB.
>> When a query is issued, the input files are split on HDFS block boundaries.
>> So if a file in a partition is 515MB, there will be 2 splits for that file
>> (512MB up to the HDFS block boundary + the remaining 3MB). This happens when
>> the input format is set to HiveInputFormat.
>>
>> Thanks
>> Prasanth Jayachandran
>>
>> On Feb 10, 2014, at 12:49 AM, Avrilia Floratou <
>> avrilia.floratou@gmail.com> wrote:
>>
>> > Hi all,
>> >
>> > I'm running a query that scans a file stored in ORC format and extracts
>> > some columns. My file is about 92 GB, uncompressed. I kept the default
>> > stripe size. The MapReduce job generates 363 map tasks.
>> >
>> > I have noticed that the first 180 map tasks finish in 3 secs (each) and
>> > after they complete the HDFS_BYTES_READ counter is equal to about 3MB.
>> > Then the remaining map tasks are the ones that scan the data and each
>> > one completes in about 20 sec. It seems that each of these map tasks
>> > gets as input 512 MB of the file. I was wondering, what exactly are the
>> > first short map tasks doing?
>> >
>> > Thanks,
>> > Avrilia
>>
>>
>> --
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or entity
>> to
>> which it is addressed and may contain information that is confidential,
>> privileged and exempt from disclosure under applicable law. If the reader
>> of this message is not the intended recipient, you are hereby notified
>> that
>> any printing, copying, dissemination, distribution, disclosure or
>> forwarding of this communication is strictly prohibited. If you have
>> received this communication in error, please contact the sender
>> immediately
>> and delete it from your system. Thank You.
>>
>
>
>
>
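
The split arithmetic in Prasanth's explanation quoted above (a 515MB file cut on 512MB HDFS block boundaries gives one 512MB split plus a 3MB remainder) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Hive's or Hadoop's actual split computation. The helper name split_sizes is hypothetical, and the 91.7 GB figure is converted assuming 1 GB = 1024 MB.

```python
MB = 1024 * 1024

def split_sizes(file_size, block_size):
    """Cut a file of file_size bytes on block_size boundaries.

    Hypothetical helper: it returns the size of each resulting split,
    with a shorter final split if the file does not end on a boundary.
    """
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes

block = 512 * MB  # block size stated in the thread for a 256MB stripe size

# The hypothesized case: one 515MB file yields a 512MB split and a 3MB split.
print([s // MB for s in split_sizes(515 * MB, block)])  # [512, 3]

# ~180 such files would yield two splits each.
print(180 * len(split_sizes(515 * MB, block)))

# The reported case: a single file of about 91.7 GB (assumed 93901 MB).
print(len(split_sizes(93901 * MB, block)))
```

Under these assumptions, about 180 files of 515MB would give 360 splits, half of them 3MB each, while a single 91.7 GB file cut only on 512MB boundaries would give 184 splits. The second figure is far from the 363 map tasks reported, so block-boundary splitting of one file does not by itself account for the short tasks.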
