hive-user mailing list archives

From Ashok Kumar <ashok34...@yahoo.com>
Subject Re: Difference between ORC and RC files
Date Mon, 21 Dec 2015 19:18:25 GMT
 Many thanks Sir. Very useful.
Kindly elaborate on why RC files do not have these capabilities. As I see it, they are Row Columnar
files. Am I correct to assume that an ORC file is basically an RC file with more optimisation?
Are RC and ORC files designed for a columnar format, similar to the way a columnar data warehouse
is built?
Regards

    On Monday, 21 December 2015, 18:58, Alan Gates <alanfgates@gmail.com> wrote:
 

 ORC offers a number of features not available in RC files:
* Better encoding of data.  Integer values are run length encoded.  Strings and dates are
stored in a dictionary (and the resulting pointers then run length encoded).
* Internal indexes and statistics on the data.  This allows for more efficient reading of
the data as well as skipping of sections of the data not relevant to a given query.  These
indexes can also be used by the Hive optimizer to help plan query execution.
* Predicate push down for some predicates.  For example, in the query "select * from user
where state = 'ca'", ORC could look at a collection of rows and use the indexes to see that
no rows in that group have that value, and thus skip the group altogether.
* Tight integration with Hive's vectorized execution, which produces much faster processing
of rows.
* Support for new ACID features in Hive (transactional insert, update, and delete).
* It has a much faster read time than RCFile and compresses much more efficiently.
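
To make the encoding and index points above concrete, here is a minimal Python sketch. It is
not ORC's actual on-disk format or reader code: the function names, the tiny group size of 2
and the sample data are all assumptions made for illustration. It shows dictionary encoding
followed by run length encoding of the pointers, and a scan that uses per-group min/max
statistics to skip groups that cannot contain state = 'ca'.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]


def dictionary_encode(strings):
    """Replace each string with a pointer into a dictionary of distinct values."""
    dictionary = sorted(set(strings))  # sorted order is an assumption of this sketch
    index = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [index[s] for s in strings]


def build_row_groups(rows, column, group_size):
    """Split rows into groups and record min/max of one column per group."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        vals = [r[column] for r in chunk]
        groups.append({"min": min(vals), "max": max(vals), "rows": chunk})
    return groups


def scan_equals(groups, column, target):
    """Return matching rows and the number of groups skipped via min/max."""
    matches, skipped = [], 0
    for g in groups:
        if target < g["min"] or target > g["max"]:
            skipped += 1  # statistics prove no row in this group can match
            continue
        matches.extend(r for r in g["rows"] if r[column] == target)
    return matches, skipped


# Dictionary encoding, then run length encoding of the pointers.
states = ["ca", "ca", "ca", "ny", "ny", "tx"]
dictionary, pointers = dictionary_encode(states)  # ['ca', 'ny', 'tx'], [0, 0, 0, 1, 1, 2]
rle = run_length_encode(pointers)                 # [(0, 3), (1, 2), (2, 1)]

# Predicate push down: only the last group is actually read.
rows = [{"name": f"user{i}", "state": s}
        for i, s in enumerate(["ny", "ny", "tx", "tx", "ca", "ca"])]
groups = build_row_groups(rows, "state", group_size=2)
matches, skipped = scan_equals(groups, "state", "ca")
print(len(matches), "matches,", skipped, "groups skipped")  # 2 matches, 2 groups skipped
```

Note that the skipping only pays off when similar values sit near each other in the file. If
every group contained a mix of states, the min/max ranges would overlap the target and nothing
could be skipped.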

Whether ORC is the best format for what you're doing depends on the data you're storing and
how you are querying it.  If you are storing data where you know the schema and you are doing
analytic type queries it's the best choice (in fairness, some would dispute this and choose
Parquet, though much of what I said above about ORC vs RC applies to Parquet as well).  If
you are doing queries that select the whole row each time, columnar formats like ORC won't
be your friend.  Also, if you are storing self-describing data such as JSON or Avro, you may
find text or Avro storage to be a better format.
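
The point about whole-row queries can be sketched in a few lines of Python. This is only a toy
model of a columnar layout, not how ORC or Hive store data: the helper names and the sample
table are assumptions. It counts how many columns a query has to touch.

```python
def to_columnar(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def project(columns, names):
    """Read only the requested columns; return tuples plus columns touched."""
    touched = [columns[n] for n in names]
    return list(zip(*touched)), len(touched)


rows = [
    {"id": 1, "name": "a", "state": "ca", "age": 30},
    {"id": 2, "name": "b", "state": "ny", "age": 41},
]
columns = to_columnar(rows)

# An analytic-style query on one column touches one column's data.
state_only, touched_one = project(columns, ["state"])

# A "select *" has to touch every column and stitch the rows back together.
whole_rows, touched_all = project(columns, list(columns))
print(touched_one, touched_all)  # 1 4
```

With a row-oriented format the cost is the other way round: a whole row is read in one go, but
a single-column query still has to read past every other field.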

Alan.




    Ashok Kumar  December 21, 2015 at 9:45
Hi Gurus,
I am trying to understand the advantages that the ORC file format offers over RC.
I have read the existing documents but I still don't seem to grasp the main differences.
Can someone explain to me, as a user, where ORC scores when compared to RC? What I would like
to know is mainly the performance. I am also aware that ORC does some smart compression as well.
Finally, is the ORC file format the best choice in Hive?
Thank you




  