hive-user mailing list archives

From Demai Ni <nid...@gmail.com>
Subject ORC file across multiple HDFS blocks
Date Fri, 24 Apr 2015 21:45:43 GMT
Hi guys,

I am working on reading ORC files directly from an HDFS cluster, and I hope to
leverage HDFS short-circuit local reads (
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html)
as much as possible.

According to the ORC design, each ORC file usually contains several stripes, and
each stripe has a default size of 250 MB to allow efficient reads from HDFS. With
that, an ORC file can easily reach the GB level and consist of several
HDFS blocks. There is a good chance that:
1) an ORC file is spread across several HDFS data nodes.
2) a stripe spans two HDFS blocks and lands on two different physical
nodes.
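
To make point 2 concrete, here is a rough sketch of the arithmetic in Python. It only works on byte offsets and does not touch HDFS or any ORC library. The function name `stripes_spanning_blocks`, the assumption that stripes are laid out back to back after a 3-byte file header, and the example sizes are all illustrative assumptions, not something taken from a real file.

```python
def stripes_spanning_blocks(stripe_sizes, block_size, header_size=3):
    """Return the indices of stripes whose byte range crosses an HDFS block boundary.

    Assumption: stripes are stored contiguously, starting right after a
    header of `header_size` bytes (3 bytes here, for the "ORC" magic).
    """
    spanning = []
    offset = header_size
    for i, size in enumerate(stripe_sizes):
        first_block = offset // block_size              # block holding the first byte
        last_block = (offset + size - 1) // block_size  # block holding the last byte
        if last_block != first_block:
            spanning.append(i)
        offset += size
    return spanning


MB = 1024 * 1024
# Hypothetical file: four 250 MB stripes stored with 128 MB HDFS blocks.
print(stripes_spanning_blocks([250 * MB] * 4, 128 * MB))
```

Under these assumptions, any stripe larger than the block size necessarily spans at least two blocks, and even a stripe smaller than a block can straddle a boundary depending on where it starts in the file.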

With this in mind, should I design my ORC files to
1) contain only one stripe?
2) ensure (either with a larger HDFS block size or a smaller stripe size) that
each ORC file fits in a single HDFS block?

Does this look reasonable? Thanks.

Demai
