spark-user mailing list archives

From Alexander Pivovarov <apivova...@gmail.com>
Subject Re: ORC v/s Parquet for Spark 2.0
Date Thu, 28 Jul 2016 22:15:00 GMT
Found 0 matching posts for *ORC v/s Parquet for Spark 2.0* in the Apache
Spark User List <http://apache-spark-user-list.1001560.n3.nabble.com/>.

Anyone have a link to this discussion? Want to share it with my colleagues.

On Thu, Jul 28, 2016 at 2:35 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> As far as I know, Spark still lacks the ability to handle updates or
> deletes on ORC transactional tables. As you may know, in Hive an ORC
> transactional table can handle updates and deletes; transactional support
> was added to Hive for ORC tables. There is no transactional support with
> Spark SQL on ORC tables yet, nor locking and concurrency (as used by
> Hive) with a Spark app running a Hive context - I am not convinced this
> actually works. Case in point: you can test it for yourself in Spark and
> see whether locks are applied in the Hive metastore. In my opinion,
> Spark's value is as a query tool for faster query processing (DAG plus
> in-memory capability).
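>
> As a minimal illustration (a hedged sketch - Spark 2.0 API, hypothetical
> table name):
>
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .enableHiveSupport()
>   .getOrCreate()
>
> // Reading the Hive ORC table works through the Hive context
> // (at least once the table has been compacted):
> spark.sql("SELECT count(*) FROM tx_table").show()
>
> // Updating it does not - Spark SQL has no UPDATE/DELETE statement for
> // Hive ACID tables, so the following fails to parse:
> // spark.sql("UPDATE tx_table SET amount = 0 WHERE id = 42")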
>
> HTH
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 July 2016 at 18:46, Ofir Manor <ofir.manor@equalum.io> wrote:
>
>> BTW - this thread has many anecdotes on Apache ORC vs. Apache Parquet (I
>> personally think both are great at this point).
>> But the original question was about Spark 2.0. Does anyone have insights
>> about Parquet-specific optimizations / limitations vs. ORC-specific
>> optimizations / limitations in pre-2.0 vs. 2.0? I've put one in the
>> beginning of the thread regarding Structured Streaming, but there was a
>> general claim that pre-2.0 Spark was missing many ORC optimizations, and
>> that some (all?) were added in 2.0.
>> I saw that a lot of related tickets were closed in 2.0, but it would be
>> great if someone close to the details could explain.
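>>
>> For anyone measuring, one concrete 2.0 knob worth A/B testing (a hedged
>> sketch - flag names as I understand them in Spark 2.0, path
>> hypothetical): the Parquet scan path has a vectorized reader plus
>> whole-stage code generation, which the ORC path does not.
>>
>> import org.apache.spark.sql.SparkSession
>> val spark = SparkSession.builder().getOrCreate()
>>
>> // Both default to true in 2.0; flip them off to isolate their effect.
>> spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
>> spark.conf.set("spark.sql.codegen.wholeStage", "true")
>>
>> spark.read.parquet("/data/sample").filter("id > 100").count()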
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io
>>
>> On Thu, Jul 28, 2016 at 6:49 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Like anything else, your mileage varies.
>>>
>>> ORC with Vectorised query execution
>>> <https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution>
>>> is the nearest one can get to a proper data warehouse like SAP IQ or
>>> Teradata with columnar indexes. To me that is cool. Parquet has been
>>> around and has its use case as well.
>>>
>>> I guess there is no hard and fast rule about which one to use all the
>>> time. Use the one that provides the best fit for the conditions.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 28 July 2016 at 09:18, Jörn Franke <jornfranke@gmail.com> wrote:
>>>
>>>> I see it more as a process of innovation, and thus competition is
>>>> good. Companies just should not follow these religious arguments but
>>>> try for themselves what suits them. There is more to using software
>>>> than the software itself ;)
>>>>
>>>> On 28 Jul 2016, at 01:44, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>>>> wrote:
>>>>
>>>> And frankly, this is becoming some sort of religious argument now.
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 28 July 2016 at 00:01, Sudhir Babu Pothineni <sbpothineni@gmail.com>
>>>> wrote:
>>>>
>>>>> It depends on what you are doing. Here is a recent comparison of ORC
>>>>> and Parquet:
>>>>>
>>>>> https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>>
>>>>> Although it is from the ORC authors, I thought it was a fair
>>>>> comparison. We use ORC as the system of record on our Cloudera HDFS
>>>>> cluster, and our experience so far is good.
>>>>>
>>>>> Parquet is backed by Cloudera, which has more installations of
>>>>> Hadoop; ORC is by Hortonworks. So the battle of file formats
>>>>> continues...
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Jul 27, 2016, at 4:54 PM, janardhan shetty <janardhanp22@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Seems like the Parquet format is comparatively better than ORC when
>>>>> the dataset is log data without nested structures? Is this a fair
>>>>> understanding?
>>>>> On Jul 27, 2016 1:30 PM, "Jörn Franke" <jornfranke@gmail.com> wrote:
>>>>>
>>>>>> Kudu has, from my impression, been designed to offer something
>>>>>> between HBase and Parquet for write-intensive loads - it is not
>>>>>> faster for warehouse-type querying compared to Parquet (merely
>>>>>> slower, because that is not its use case). I assume this is still
>>>>>> its strategy.
>>>>>>
>>>>>> For some scenarios it could make sense together with Parquet and
>>>>>> ORC. However, I am not sure what the advantage is over using HBase
>>>>>> plus Parquet and ORC.
>>>>>>
>>>>>> On 27 Jul 2016, at 11:47, Uwe@Moosheimer.com <Uwe@moosheimer.com> wrote:
>>>>>>
>>>>>> Hi Gourav,
>>>>>>
>>>>>> Kudu (if you mean Apache Kudu, the Cloudera-originated project) is
>>>>>> an in-memory db with data storage, while Parquet is "only" a
>>>>>> columnar storage format.
>>>>>>
>>>>>> As I understand it, Kudu is a BI db meant to compete with Exasol or
>>>>>> Hana (ok ... that's more a wish :-).
>>>>>>
>>>>>> Regards,
>>>>>> Uwe
>>>>>>
>>>>>> With kind regards / best regards
>>>>>> Kay-Uwe Moosheimer
>>>>>>
>>>>>> On 27.07.2016 at 09:15, Gourav Sengupta <gourav.sengupta@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Gosh,
>>>>>>
>>>>>> whether ORC came from this or that, it runs queries in Hive with Tez
>>>>>> at a speed that is better than Spark.
>>>>>>
>>>>>> Has anyone heard of Kudu? It's better than Parquet. But I think that
>>>>>> someone might just start saying that Kudu has a difficult lineage as
>>>>>> well. After all, dynastic rules dictate.
>>>>>>
>>>>>> Personally I feel that if something stores my data compressed and
>>>>>> makes me access it faster, I do not care where it comes from or how
>>>>>> difficult the childbirth was :)
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Gourav
>>>>>>
>>>>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
>>>>>> sbpothineni@gmail.com> wrote:
>>>>>>
>>>>>>> Just a correction:
>>>>>>>
>>>>>>> The ORC Java libraries from Hive have been forked into Apache ORC,
>>>>>>> with vectorization on by default.
>>>>>>>
>>>>>>> I do not know if Spark is leveraging this new repo yet.
>>>>>>>
>>>>>>> <dependency>
>>>>>>>   <groupId>org.apache.orc</groupId>
>>>>>>>   <artifactId>orc</artifactId>
>>>>>>>   <version>1.1.2</version>
>>>>>>>   <type>pom</type>
>>>>>>> </dependency>
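>>>>>>>
>>>>>>> For what it's worth, a small sketch (hedged - orc-core 1.1.x API as
>>>>>>> I read it, path hypothetical) of using the standalone reader
>>>>>>> directly, with no Hive dependency:
>>>>>>>
>>>>>>> import org.apache.hadoop.conf.Configuration
>>>>>>> import org.apache.hadoop.fs.Path
>>>>>>> import org.apache.orc.OrcFile
>>>>>>>
>>>>>>> val reader = OrcFile.createReader(
>>>>>>>   new Path("/warehouse/logs_orc/part-00000.orc"),
>>>>>>>   OrcFile.readerOptions(new Configuration()))
>>>>>>>
>>>>>>> // File-level metadata comes straight from the ORC footer.
>>>>>>> println(s"rows=${reader.getNumberOfRows}, schema=${reader.getSchema}")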
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <koert@tresata.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> parquet was inspired by dremel but written from the ground up as a
>>>>>>> library with support for a variety of big data systems (hive, pig,
>>>>>>> impala, cascading, etc.). it is also easy to add new support, since
>>>>>>> it's a proper library.
>>>>>>>
>>>>>>> orc has been enhanced while deployed at facebook in hive and at
>>>>>>> yahoo in hive. just hive. it didn't really exist by itself. it was
>>>>>>> part of the big java soup that is called hive, without an easy way
>>>>>>> to extract it. hive does not expose proper java apis. it never
>>>>>>> cared for that.
>>>>>>>
>>>>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>>>>>>> ovidiu-cristian.marcu@inria.fr> wrote:
>>>>>>>
>>>>>>>> Interesting opinion, thank you
>>>>>>>>
>>>>>>>> Still, per the websites, Parquet is basically inspired by Dremel
>>>>>>>> (Google) [1], and part of ORC has been enhanced while deployed at
>>>>>>>> Facebook and Yahoo [2].
>>>>>>>>
>>>>>>>> Other than this presentation [3], do you guys know any other
>>>>>>>> benchmark?
>>>>>>>>
>>>>>>>> [1] https://parquet.apache.org/documentation/latest/
>>>>>>>> [2] https://orc.apache.org/docs/
>>>>>>>> [3]
>>>>>>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>>>>>
>>>>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <koert@tresata.com> wrote:
>>>>>>>>
>>>>>>>> when parquet came out it was developed by a community of companies,
>>>>>>>> and was designed as a library to be supported by multiple big data
>>>>>>>> projects. nice
>>>>>>>>
>>>>>>>> orc on the other hand initially only supported hive. it wasn't even
>>>>>>>> designed as a library that can be re-used. even today it brings in
>>>>>>>> the kitchen sink of transitive dependencies. yikes
>>>>>>>>
>>>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfranke@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I think both are very similar, but with slightly different goals.
>>>>>>>>> While they work transparently for each Hadoop application, you
>>>>>>>>> need to enable specific support in the application for predicate
>>>>>>>>> push down.
>>>>>>>>> In the end you have to check which application you are using and
>>>>>>>>> do some tests (with the correct predicate push down configuration).
>>>>>>>>> Keep in mind that both formats work best if they are sorted on the
>>>>>>>>> filter columns (which is your responsibility) and if their
>>>>>>>>> optimizations are correctly configured (min/max index, bloom
>>>>>>>>> filter, compression, etc.).
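>>>>>>>>>
>>>>>>>>> For example, a minimal Spark-side sketch (hedged - flag names as
>>>>>>>>> of Spark 2.0, paths hypothetical) of enabling push down for both
>>>>>>>>> formats and sorting on the filter column before writing:
>>>>>>>>>
>>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>>> val spark = SparkSession.builder().getOrCreate()
>>>>>>>>>
>>>>>>>>> // Parquet push down is on by default; ORC push down is not.
>>>>>>>>> spark.conf.set("spark.sql.parquet.filterPushdown", "true")
>>>>>>>>> spark.conf.set("spark.sql.orc.filterPushdown", "true")
>>>>>>>>>
>>>>>>>>> // Sorting on the filter column keeps min/max statistics
>>>>>>>>> // selective, so whole row groups / stripes can be skipped.
>>>>>>>>> spark.read.parquet("/data/events")
>>>>>>>>>   .sort("event_date")
>>>>>>>>>   .write.parquet("/data/events_sorted")
>>>>>>>>>
>>>>>>>>> spark.read.parquet("/data/events_sorted")
>>>>>>>>>   .filter("event_date >= '2016-07-01'")
>>>>>>>>>   .count()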
>>>>>>>>>
>>>>>>>>> If you need to ingest sensor data, you may want to store it first
>>>>>>>>> in HBase and then batch process it into large files in ORC or
>>>>>>>>> Parquet format.
>>>>>>>>>
>>>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhanp22@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Just wondering about the advantages and disadvantages of
>>>>>>>>> converting data into ORC or Parquet.
>>>>>>>>>
>>>>>>>>> In the Spark documentation there are numerous examples of the
>>>>>>>>> Parquet format.
>>>>>>>>>
>>>>>>>>> Any strong reasons to choose Parquet over the ORC file format?
>>>>>>>>>
>>>>>>>>> Also: the current data compression is bzip2.
>>>>>>>>>
>>>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>>>>>>> This seems biased.
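>>>>>>>>>
>>>>>>>>> For concreteness, the kind of conversion I have in mind - a hedged
>>>>>>>>> sketch (hypothetical paths; assumes a Hive-enabled SparkSession
>>>>>>>>> for the ORC writer):
>>>>>>>>>
>>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>>> val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
>>>>>>>>>
>>>>>>>>> // bzip2 input is splittable and is decompressed transparently.
>>>>>>>>> val logs = spark.read.json("/raw/logs/*.json.bz2")
>>>>>>>>>
>>>>>>>>> logs.write
>>>>>>>>>   .option("compression", "snappy") // common Parquet codec
>>>>>>>>>   .parquet("/warehouse/logs_parquet")
>>>>>>>>>
>>>>>>>>> logs.write
>>>>>>>>>   .option("compression", "zlib")   // common ORC codec
>>>>>>>>>   .orc("/warehouse/logs_orc")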
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>
