hive-user mailing list archives

From praveenesh kumar <praveen...@gmail.com>
Subject Re: How to load csv data into HIVE
Date Sat, 08 Sep 2012 14:35:45 GMT
Yup, Bejoy is correct :-) Just use Hadoop streaming for what it does best:
cleaning, transformations, and validations, in a few simple steps.

Regards,
Praveenesh
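
[Editor's note: the streaming approach discussed in this thread amounts to a map-only job whose mapper cleans each record. Below is a minimal sketch of such a mapper, named map.py as in Chuck's message; the record format is taken from the sample data quoted later in the thread, but the script itself is an illustration, not code from the thread.]

```python
#!/usr/bin/env python
# map.py -- minimal sketch of a streaming mapper (illustrative, not from the
# thread). Each input record looks like:
#   "174969274","14-mar-2006","3522876","","14-mar-2006","500000308","65","1"|
# The mapper strips the trailing '|', lets the csv module remove the double
# quotes, and emits plain tab-separated text that Hive can load as-is.
import csv
import sys

def clean(line):
    """Turn one quoted, pipe-terminated record into a tab-separated row."""
    record = line.rstrip('\n').rstrip('|')
    fields = next(csv.reader([record]))  # csv handles quoted and empty fields
    return '\t'.join(fields)

if __name__ == '__main__':
    for line in sys.stdin:
        if line.strip():  # skip blank lines
            print(clean(line))
```

Run as the -mapper of a Hadoop streaming job with zero reducers; the job's HDFS output directory can then be loaded into a Hive table created with FIELDS TERMINATED BY '\t'.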

On Sat, Sep 8, 2012 at 6:03 PM, Bejoy KS <bejoy_ks@yahoo.com> wrote:

> Hi Chuck
>
> I believe Praveenesh was adding his thought to the discussion on
> preprocessing the data using mapreduce itself. If you go with Hadoop
> streaming you can use the Python script as the mapper, and that will do
> the preprocessing in parallel on a large volume of data. The preprocessed
> data can then be loaded into the Hive table.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> *From: * "Connell, Chuck" <Chuck.Connell@nuance.com>
> *Date: *Sat, 8 Sep 2012 12:18:33 +0000
> *To: *user@hive.apache.org<user@hive.apache.org>
> *ReplyTo: * user@hive.apache.org
> *Subject: *RE: How to load csv data into HIVE
>
> I would like to hear more about this "hadoop streaming to Hive" idea. I
> have used streaming jobs as mappers, with a python script as map.py. Are
> you saying that such a streaming mapper can load its output into Hive? Can
> you send some example code? Hive wants to load "files" not individual
> lines/records. How would you do this?
>
> Thanks very much,
> Chuck
>
>
>  ------------------------------
> *From:* praveenesh kumar [praveenesh@gmail.com]
> *Sent:* Saturday, September 08, 2012 7:54 AM
> *To:* user@hive.apache.org
> *Subject:* Re: How to load csv data into HIVE
>
>  You can use Hadoop streaming; that would be much faster... Just run your
> cleaning shell script logic in the map phase and it will be done in just a
> few minutes. That will keep the data in HDFS.
>
> Regards,
> Praveenesh
>
> On Fri, Sep 7, 2012 at 8:37 PM, Sandeep Reddy P <
> sandeepreddy.3647@gmail.com> wrote:
>
>> Hi,
>> Thank you all for your help. I'll try both ways and I'll get back to you.
>>
>>
>>> On Fri, Sep 7, 2012 at 11:02 AM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>
>>> I said this assuming that a Hadoop cluster is available since Sandeep is
>>> planning to use Hive. If that is the case then MapReduce would be faster
>>> for such large files.
>>>
>>> Regards,
>>>     Mohammad Tariq
>>>
>>>
>>>
>>> On Fri, Sep 7, 2012 at 8:27 PM, Connell, Chuck <Chuck.Connell@nuance.com
>>> > wrote:
>>>
>>>> I cannot promise which is faster. A lot depends on how clever your
>>>> scripts are.
>>>>
>>>> *From:* Sandeep Reddy P [mailto:sandeepreddy.3647@gmail.com]
>>>> *Sent:* Friday, September 07, 2012 10:42 AM
>>>> *To:* user@hive.apache.org
>>>> *Subject:* Re: How to load csv data into HIVE
>>>>
>>>> Hi,
>>>> I wrote a shell script to get the csv data, but when I run that script
>>>> on a 12GB csv it takes a long time. If I run a Python script, will that
>>>> be faster?
>>>>
>>>> On Fri, Sep 7, 2012 at 10:39 AM, Connell, Chuck <
>>>> Chuck.Connell@nuance.com> wrote:
>>>>
>>>> How about a Python script that changes it into plain tab-separated
>>>> text? So it would look like this…
>>>>
>>>> 174969274<tab>14-mar-2006<tab>3522876<tab>
>>>> <tab>14-mar-2006<tab>500000308<tab>65<tab>1<newline>
>>>> etc…
>>>>
>>>> Tab-separated with newlines is easy to read and works perfectly on
>>>> import.
>>>>
>>>> Chuck Connell
>>>> Nuance R&D Data Team
>>>> Burlington, MA
>>>> 781-565-4611
>>>>
>>>> *From:* Sandeep Reddy P [mailto:sandeepreddy.3647@gmail.com]
>>>> *Subject:* How to load csv data into HIVE
>>>>
>>>> Hi,
>>>> Here is the sample data:
>>>> "174969274","14-mar-2006","3522876","","14-mar-2006","500000308","65","1"|
>>>> "174969275","19-jul-2006","3523154","","19-jul-2006","500000308","65","1"|
>>>> "174969276","31-dec-2005","3530333","","31-dec-2005","500000308","65","1"|
>>>> "174969277","14-apr-2005","3531470","","14-apr-2005","500000308","65","1"|
>>>>
>>>> How do I load this kind of data into Hive?
>>>> I'm using a shell script to get rid of the double quotes and '|', but it
>>>> takes a very long time on each csv, which are 12GB each. What is the
>>>> best way to do this?
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> sandeep
>>>>
>>>
>>>
>>
>>
>>  --
>> Thanks,
>> sandeep
>>
>>
>
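
[Editor's note: on Chuck's question above, a streaming mapper does not push rows into Hive directly. The flow Bejoy and Praveenesh describe is two steps: the map-only job writes cleaned files to an HDFS output directory, and Hive then loads those files. A hedged sketch of a driver for that flow follows; the streaming jar path, HDFS directories, and table name are placeholders, not values from the thread.]

```python
# Hypothetical end-to-end driver for the two-step flow discussed above.
# All paths and the table name are placeholders.
import subprocess

STREAMING_JAR = '/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar'

def run(cmd):
    """Run a command, raising CalledProcessError if it exits non-zero."""
    return subprocess.check_call(cmd)

def main():
    # Step 1: map-only streaming job; map.py (the Python cleaning script
    # mentioned in the thread) runs on each input split in parallel.
    run(['hadoop', 'jar', STREAMING_JAR,
         '-input', '/data/raw_csv',       # placeholder HDFS dir of raw files
         '-output', '/data/clean_tsv',    # job writes cleaned files here
         '-mapper', 'map.py',
         '-file', 'map.py',               # ship the script to the cluster
         '-numReduceTasks', '0'])
    # Step 2: Hive loads *files*, so point LOAD DATA at the job's output
    # directory (table created with FIELDS TERMINATED BY '\t').
    run(['hive', '-e',
         "LOAD DATA INPATH '/data/clean_tsv' INTO TABLE my_table"])

if __name__ == '__main__':
    main()
```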
