hive-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Hive parquet vs Vertica vs Impala
Date Sat, 03 Jan 2015 20:51:01 GMT
Shashi,

Your questions are too broad, and you are asking questions that are
impossible to answer.
Q. "What is faster X or Y?".
A. "This depends on countless variables and cannot be answered."

For example, even databases that are very similar in nature, like MySQL
and PostgreSQL, might execute a query differently depending on their query
planners or even the characteristics of the data.

How can you show that a query is "faster than Vertica" if you do not have
access to Vertica to prove it?

I understand some of what you are trying to determine, but you should
really attempt to install these things and build a prototype to determine
what is the best fit for your application. This will grow your
understanding of the systems, help you ask better questions, and
potentially give you the ability to answer those questions yourself and
make better decisions.

The right way to ask this question might be: "Hello, I have loaded 50
million rows of data into Hive and I am running this query: 'select X from
bla bla'. My Vertica instance runs this query in X seconds and Hive runs it
in Y seconds. Can this be optimized further?"
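
To get the X and Y numbers for a question like that, a small timing harness
is usually enough. The following is only a sketch, not a proper benchmark:
time_query and fake_query are hypothetical names, and fake_query is a
stand-in that you would replace with a call that submits your real query
through whatever client you use for Hive or Vertica.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[], object],
               repeats: int = 5,
               warmups: int = 1) -> Dict[str, float]:
    """Run `run_query` several times and summarize wall-clock seconds."""
    for _ in range(warmups):
        run_query()  # warm-up runs are executed but not measured

    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)

    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def fake_query() -> int:
    # Hypothetical stand-in (an assumption for illustration): replace this
    # with code that sends the real query to the system under test.
    return sum(range(100_000))


if __name__ == "__main__":
    print(time_query(fake_query))
```

Run the same query, on the same data, several times on each system and
report the median along with the row count and the query text. A single run
tells you very little.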

The software license for Impala is included here:
https://github.com/cloudera/Impala/blob/master/LICENSE.txt

Edward


On Sat, Jan 3, 2015 at 3:29 PM, Shashidhar Rao <raoshashidhar123@gmail.com>
wrote:

> Edward,
>
> Thanks for your reply.
> Can you please tell me the query performance of Hive-parquet against
> Vertica? Can Hive-parquet match Vertica's retrieval performance, as I
> have been told Vertica also uses a compressed columnar format and is fast?
> What if I query against some 50 million rows, which one will be
> faster?
>
> And moreover, is Impala open source? In some blogs I have seen Impala
> described as open source, but in others it is described as a Cloudera
> proprietary engine.
>
> Ultimately, I want to use Hive-parquet but need to show that it is better
> than Vertica; a few microseconds here and there would be fine. I don't have
> access to Vertica.
>
> Thanks
> shashi
>
> On Sun, Jan 4, 2015 at 1:07 AM, Edward Capriolo <edlinuxguru@gmail.com>
> wrote:
>
>> Of these, Hive is the only system that can store and query XML directly,
>> with the help of different SerDes or input formats.
>>
>> Impala and Vertica have more standard schema systems that do not support
>> the collections (List, Map, Struct, or nested collections) you might need
>> to store and process a complex XML document.
>>
>> Parquet (a storage format that works with Hive and Impala) can support
>> Lists, Maps, and Structs, but the Impala engine cannot access these at the
>> moment. Last I checked, Impala refuses to read tables that have one of
>> these elements (instead of skipping them).
>>
>> It sounds like you want to do one of a few things:
>> 1) Normalize your XML into a table, and then you can use Vertica, Hive, or
>> Impala.
>> 2) Write your data using Parquet (to handle nested objects) and use
>> Hive to query it. (Hopefully, when Impala adds collection support, you
>> can switch over.)
>>
>> But mostly you need to do more research.
>>
>> Edward
>>
>> On Sat, Jan 3, 2015 at 2:15 PM, Shashidhar Rao <
>> raoshashidhar123@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Can someone help me with insights into a Hive with Parquet vs Vertica
>>> comparison?
>>>
>>> I need to store large XML data in one of these databases, so please
>>> help me with query performance.
>>>
>>> Is Impala open source, and can we use it without a Cloudera license?
>>>
>>> Thanks
>>> Shashi
>>>
>>>
>>>
>>
>
