hive-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Sqoop on Spark
Date Wed, 06 Apr 2016 22:47:56 GMT
Sorry, are you referring to Hive as a relational data warehouse in this
scenario? The assumption here is that the data is coming from a relational
database (Oracle), so IMO the best storage for it in the Big Data world is
another DW adapted to SQL. Spark is a powerful query tool, and together
with Hive as the backbone of storage it provides a powerful framework for
almost anything. The performance is pretty fast indeed, much faster than
the MapReduce that Sqoop uses by default.

Anyway, you are not confined to a table in Hive. You can take that data from
JDBC and do whatever is needed. There is no constraint here.
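To make the JDBC point concrete, here is a small sketch of how a parallel Spark JDBC read could be set up. The table name (SCOTT.EMP), column name (EMP_ID), host, and Hive target are hypothetical placeholders, not anything from this thread; the predicate-building function is plain Python so the idea stands on its own, and the Spark calls are shown in comments.

```python
# Sketch: building per-partition predicates for a parallel Spark JDBC read.
# Table, column, and connection details below are hypothetical examples.

def partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper] on a numeric column into WHERE-clause
    predicates, one per Spark partition (similar to what spark.read.jdbc
    does internally with lowerBound/upperBound/numPartitions)."""
    step = (upper - lower) // num_partitions or 1
    preds, start = [], lower
    for i in range(num_partitions):
        if i == num_partitions - 1:
            preds.append(f"{column} >= {start}")          # last split is open-ended
        else:
            preds.append(f"{column} >= {start} AND {column} < {start + step}")
        start += step
    return preds

preds = partition_predicates("EMP_ID", 1, 4000, 4)

# With a SparkSession, these predicates drive one Oracle query per partition:
# df = spark.read.jdbc(url="jdbc:oracle:thin:@//dbhost:1521/ORCL",
#                      table="SCOTT.EMP", predicates=preds,
#                      properties={"user": "scott", "password": "..."})
# df.write.mode("append").saveAsTable("hive_db.emp")
```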

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 6 April 2016 at 23:29, Jörn Franke <jornfranke@gmail.com> wrote:

> Well I am not sure, but using a database as storage for Spark, such as a
> relational database or certain NoSQL databases (e.g. MongoDB), is generally a
> bad idea: there is no data locality, it cannot handle really big data volumes
> for compute, and you may overload an operational database.
> And if your job fails for whatever reason (e.g. scheduling), then you have
> to pull everything out again. Sqoop and HDFS together with Spark seem to me
> the more elegant solution. These assumptions about parallelism have to be
> made with any solution anyway.
> Of course you can always redo things, but why - what benefit do you
> expect? A real big data platform has to support many different tools anyway,
> otherwise people doing analytics will be limited.
>
> On 06 Apr 2016, at 20:05, Michael Segel <msegel_hadoop@hotmail.com> wrote:
>
> I don’t think it’s necessarily a bad idea.
>
> Sqoop is an ugly tool, and it requires you to make some assumptions as a
> way to gain parallelism. (Not that those assumptions aren’t valid for
> most of the use cases…)
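The parallelism assumptions being discussed can be seen in a typical Sqoop invocation. The connection details, table, and paths below are placeholders; the point is that `--split-by` assumes a column whose values are roughly uniformly distributed between their min and max, and `-m` fixes the number of parallel map tasks.

```shell
# Hypothetical Sqoop import illustrating the parallelism "assumptions":
# Sqoop range-partitions the --split-by column between its MIN and MAX
# (assuming roughly uniform distribution) and runs -m parallel mappers.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott --password-file /user/scott/.pw \
  --table EMP \
  --split-by EMP_ID \
  -m 4 \
  --hive-import --hive-table default.emp
```

A skewed or sparse split column breaks the uniformity assumption and leaves some mappers doing most of the work.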
>
> Depending on what you want to do… your data may not be persisted on HDFS.
> There are use cases where your cluster is used for compute and not storage.
>
> I’d say that spending time re-inventing the wheel can be a good thing.
> It would be a good idea for many to rethink their ingestion process so
> that they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing
> that term from Dean Wampler. ;-)
>
> Just saying. ;-)
>
> -Mike
>
> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfranke@gmail.com> wrote:
>
> I do not think you can be more resource efficient. In the end you have to
> store the data on HDFS anyway. You have a lot of development effort in
> doing something like Sqoop, especially with error handling.
> You may create a ticket with the Sqoop guys to support Spark as an
> execution engine; maybe it is less effort to plug it in there.
> If your cluster is loaded, then you may want to add more machines or
> improve the existing programs.
>
> On 06 Apr 2016, at 07:33, ayan guha <guha.ayan@gmail.com> wrote:
>
> One of the reasons in my mind is to avoid a Map-Reduce application completely
> during ingestion, if possible. Also, I can then use a Spark standalone
> cluster to ingest, even if my Hadoop cluster is heavily loaded. What do you
> guys think?
>
> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfranke@gmail.com> wrote:
>
>> Why do you want to reimplement something which is already there?
>>
>> On 06 Apr 2016, at 06:47, ayan guha <guha.ayan@gmail.com> wrote:
>>
>> Hi
>>
>> Thanks for the reply. My use case is to query ~40 tables from Oracle (using
>> indexes and incremental loads only) and add the data to existing Hive
>> tables. Also, it would be good to have an option to create Hive tables,
>> driven by job-specific configuration.
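>> The config-driven incremental pull described above could be sketched as
>> follows. The table names, columns, and config shape are invented for
>> illustration; the query-building mimics Sqoop's incremental append
>> (--check-column / --last-value), with Spark doing the actual reads.

```python
# Sketch: job-specific configuration driving incremental pulls per table.
# Table/column names and the config shape are hypothetical.

TABLES = [
    {"table": "ORDERS",    "check_column": "ORDER_ID",   "last_value": 105000},
    {"table": "CUSTOMERS", "check_column": "UPDATED_AT", "last_value": "2016-04-01"},
]

def incremental_query(cfg):
    """Build the bounded query to push down to Oracle, fetching only
    rows newer than the recorded high-water mark (last_value)."""
    lv = cfg["last_value"]
    lv = f"'{lv}'" if isinstance(lv, str) else lv
    return f"SELECT * FROM {cfg['table']} WHERE {cfg['check_column']} > {lv}"

queries = [incremental_query(c) for c in TABLES]
# Each query could then be read via spark.read.jdbc and appended to the
# matching Hive table (saveAsTable with mode="append", or insertInto).
```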
>>
>> What do you think?
>>
>> Best
>> Ayan
>>
>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin.m.s@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> It depends on your use case using sqoop.
>>> What's it like?
>>>
>>> // maropu
>>>
>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.ayan@gmail.com> wrote:
>>>
>>>> Hi All
>>>>
>>>> Asking for opinions: is it possible/advisable to use Spark to replace what
>>>> Sqoop does? Any existing projects along similar lines?
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>
