spark-user mailing list archives

From Sudhir Babu Pothineni <sbpothin...@gmail.com>
Subject Re: ORC v/s Parquet for Spark 2.0
Date Wed, 27 Jul 2016 23:01:57 GMT
It depends on what you are doing. Here is a recent comparison of ORC and Parquet:

https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet

Although it is from the ORC authors, I thought it was a fair comparison. We use ORC as the system
of record on our Cloudera HDFS cluster, and our experience so far is good.

Parquet is backed by Cloudera, which has more Hadoop installations. ORC is backed by Hortonworks,
so the battle of the file formats continues...

Sent from my iPhone

> On Jul 27, 2016, at 4:54 PM, janardhan shetty <janardhanp22@gmail.com> wrote:
> 
> Seems like the parquet format is comparatively better than orc when the dataset is log data
without nested structures? Is this a fair understanding?
> 
>> On Jul 27, 2016 1:30 PM, "Jörn Franke" <jornfranke@gmail.com> wrote:
>> Kudu has, from my impression, been designed to offer something between hbase and
parquet for write-intensive loads - it is not faster for warehouse-type querying compared
to parquet (rather slower, because that is not its use case). I assume this is still its
strategy.
>> 
>> For some scenarios it could make sense together with parquet and Orc. However, I am
not sure what the advantage is over using hbase + parquet and Orc.
>> 
>>> On 27 Jul 2016, at 11:47, "Uwe@Moosheimer.com" <Uwe@Moosheimer.com> wrote:
>>> 
>>> Hi Gourav,
>>> 
>>> Kudu (if you mean Apache Kudu, the Cloudera-originated project) is an in-memory
db with data storage, while Parquet is "only" a columnar storage format.
>>> 
>>> As I understand it, Kudu is a BI db meant to compete with Exasol or Hana (ok ... that's
more a wish :-).
>>> 
>>> Regards,
>>> Uwe
>>> 
>>> Best regards
>>> Kay-Uwe Moosheimer
>>> 
>>>> On 27.07.2016 at 09:15, Gourav Sengupta <gourav.sengupta@gmail.com> wrote:
>>>> 
>>>> Gosh,
>>>> 
>>>> whether ORC came from this or that, it runs queries in Hive with Tez at a
speed that is better than Spark's.
>>>> 
>>>> Has anyone heard of Kudu? It's better than Parquet. But I think that someone
might just start saying that Kudu has a difficult lineage as well. After all, dynastic rules
dictate.
>>>> 
>>>> Personally, I feel that if something stores my data compressed and lets me
access it faster, I do not care where it comes from or how difficult the childbirth was :)
>>>> 
>>>> 
>>>> Regards,
>>>> Gourav
>>>> 
>>>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <sbpothineni@gmail.com>
wrote:
>>>>> Just a correction:
>>>>> 
>>>>> The ORC Java libraries from Hive have been forked into Apache ORC, with vectorization
enabled by default.
>>>>> 
>>>>> I do not know whether Spark is leveraging this new repo yet.
>>>>> 
>>>>> <dependency>
>>>>>   <groupId>org.apache.orc</groupId>
>>>>>   <artifactId>orc</artifactId>
>>>>>   <version>1.1.2</version>
>>>>>   <type>pom</type>
>>>>> </dependency>
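[Editor's note: the "vectorization" mentioned above means the reader hands back column batches that are processed in bulk, rather than pulling values through the engine one row at a time. A toy pure-Python sketch of the idea follows; this is not the Apache ORC API, and all names in it are made up for illustration.]

```python
# Toy illustration of vectorized (column-batch) execution vs. row-at-a-time
# execution. A columnar reader like ORC's returns whole column batches, so
# an aggregate can be computed in one bulk operation per batch.

def sum_row_at_a_time(rows):
    # Row-wise: each value passes through the loop individually.
    total = 0
    for row in rows:
        total += row["amount"]
    return total

def sum_vectorized(amount_batch):
    # Column-wise: the reader hands back one batch (array) per column,
    # and the whole batch is reduced in a single call.
    return sum(amount_batch)

rows = [{"amount": a} for a in range(1, 6)]
amount_batch = [r["amount"] for r in rows]  # what a columnar reader would return

assert sum_row_at_a_time(rows) == sum_vectorized(amount_batch) == 15
```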
>>>>> 
>>>>> Sent from my iPhone
>>>>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <koert@tresata.com>
wrote:
>>>>>> 
>>>>>> parquet was inspired by dremel but written from the ground up as
a library with support for a variety of big data systems (hive, pig, impala, cascading, etc.).
it is also easy to add new support, since it's a proper library.
>>>>>> 
>>>>>> orc has been enhanced while deployed at facebook in hive and at yahoo
in hive. just hive. it didn't really exist by itself; it was part of the big java soup that
is called hive, without an easy way to extract it. hive does not expose proper java apis.
it never cared for that.
>>>>>> 
>>>>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <ovidiu-cristian.marcu@inria.fr>
wrote:
>>>>>>> Interesting opinion, thank you
>>>>>>> 
>>>>>>> Still, per the websites, parquet was basically inspired by Dremel
(Google) [1], and parts of orc have been enhanced while deployed at Facebook and Yahoo [2].
>>>>>>> 
>>>>>>> Other than this presentation [3], do you guys know of any other
benchmarks?
>>>>>>> 
>>>>>>> [1]https://parquet.apache.org/documentation/latest/
>>>>>>> [2]https://orc.apache.org/docs/
>>>>>>> [3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>>>> 
>>>>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <koert@tresata.com>
wrote:
>>>>>>>> 
>>>>>>>> when parquet came out it was developed by a community of
companies, and was designed as a library to be supported by multiple big data projects. nice
>>>>>>>> 
>>>>>>>> orc on the other hand initially only supported hive. it wasn't
even designed as a library that could be re-used. even today it brings in the kitchen sink of
transitive dependencies. yikes
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfranke@gmail.com>
wrote:
>>>>>>>>> I think both are very similar, but with slightly different
goals. While they work transparently for each Hadoop application, you need to enable specific
support in the application for predicate push down.
>>>>>>>>> In the end you have to check which application you are
using and do some tests (with correct predicate push down configuration). Keep in mind that
both formats work best if they are sorted on the filter columns (which is your responsibility)
and if their optimizations are correctly configured (min/max index, bloom filter, compression,
etc.).
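[Editor's note: the min/max index point deserves a quick illustration. Both formats keep per-chunk statistics (ORC stripes, Parquet row groups), so a reader can skip whole chunks whose [min, max] range cannot satisfy the filter; this is also why sorting on the filter columns matters, since it keeps each chunk's range tight. A pure-Python sketch with made-up names, not either format's real API:]

```python
# Sketch of min/max-statistics pruning: each "chunk" stands in for an ORC
# stripe or Parquet row group, carrying min/max stats for one column.

def scan_with_pushdown(chunks, predicate_min):
    """Return values >= predicate_min, skipping chunks via min/max stats."""
    matches, skipped = [], 0
    for chunk in chunks:
        if chunk["max"] < predicate_min:  # stats prove no row can match
            skipped += 1
            continue
        matches.extend(v for v in chunk["values"] if v >= predicate_min)
    return matches, skipped

# Data sorted on the filter column => tight, non-overlapping ranges per chunk.
chunks = [
    {"min": 0,  "max": 9,  "values": list(range(0, 10))},
    {"min": 10, "max": 19, "values": list(range(10, 20))},
    {"min": 20, "max": 29, "values": list(range(20, 30))},
]
matches, skipped = scan_with_pushdown(chunks, predicate_min=20)
assert skipped == 2                    # first two chunks never decoded
assert matches == list(range(20, 30))
```

With unsorted data the per-chunk ranges overlap, the `max < predicate_min` test rarely fires, and every chunk must be decoded - which is the point about sorting being the user's responsibility.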
>>>>>>>>> 
>>>>>>>>> If you need to ingest sensor data, you may want to store
it first in hbase and then batch process it into large files in ORC or Parquet format.
>>>>>>>>> 
>>>>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhanp22@gmail.com>
wrote:
>>>>>>>>>> 
>>>>>>>>>> Just wondering about the advantages and disadvantages of converting
data into ORC or Parquet.
>>>>>>>>>> 
>>>>>>>>>> In the Spark documentation there are numerous
examples of the Parquet format.
>>>>>>>>>> 
>>>>>>>>>> Any strong reasons to choose Parquet over the ORC file
format?
>>>>>>>>>> 
>>>>>>>>>> Also: the current data compression is bzip2.
>>>>>>>>>> 
>>>>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy

>>>>>>>>>> This seems biased.
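[Editor's note on the bzip2 point above: bzip2 compresses well and is splittable on HDFS, but it is slow; columnar formats typically default to lighter codecs such as snappy or zlib internally. A quick stdlib sketch of the round-trip, using made-up sample data:]

```python
# Compare two stdlib codecs on a repetitive log-like sample. The sample
# data is invented; real ratios depend entirely on the actual data.
import bz2
import zlib

data = b"2016-07-26 INFO request handled in 12ms\n" * 2000

bz = bz2.compress(data)
zl = zlib.compress(data)

# Both codecs must round-trip losslessly.
assert bz2.decompress(bz) == data
assert zlib.decompress(zl) == data

# On this highly repetitive sample, both shrink the input dramatically.
assert len(bz) < len(data) and len(zl) < len(data)
```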
