From: selva <selvait90@gmail.com>
To: user@hadoop.apache.org
Date: Fri, 3 May 2013 15:11:37 +0530
Subject: Re: Parallel Load Data into Two partitions of a Hive Table

Thanks Yanbo. My doubt is clarified now.

On Fri, May 3, 2013 at 2:38 PM, Yanbo Liang <yanbohappy@gmail.com> wrote:

> Loading data into different partitions in parallel is OK, because it is
> equivalent to writing to different files on HDFS.
>
> 2013/5/3 selva <selvait90@gmail.com>
>
>> Hi All,
>>
>> I need to load a month's worth of processed data into a Hive table. The
>> table has 10 partitions. Each day has many files to load, each file
>> takes a constant two seconds to load, and I have ~3000 files. So it will
>> take days to complete 30 days' worth of data.
>>
>> I plan to load each day's data in parallel into its respective partition
>> so that I can complete the load in a short time.
>>
>> But I need clarification before proceeding.
>>
>> Question:
>>
>> 1. Will loading in parallel into different partitions of the same Hive
>> table cause data loss or corruption?
>>
>> For example, assume I am doing the following:
>>
>> Table     : processedlogs
>> Partition : logdate
>>
>> Running the commands below in parallel:
>> LOAD DATA INPATH '/logs/processed/2013-04-01' OVERWRITE INTO TABLE
>> processedlogs PARTITION(logdate='2013-04-01');
>> LOAD DATA INPATH '/logs/processed/2013-04-02' OVERWRITE INTO TABLE
>> processedlogs PARTITION(logdate='2013-04-02');
>> LOAD DATA INPATH '/logs/processed/2013-04-03' OVERWRITE INTO TABLE
>> processedlogs PARTITION(logdate='2013-04-03');
>> LOAD DATA INPATH '/logs/processed/2013-04-04' OVERWRITE INTO TABLE
>> processedlogs PARTITION(logdate='2013-04-04');
>> .....
>> LOAD DATA INPATH '/logs/processed/2013-04-30' OVERWRITE INTO TABLE
>> processedlogs PARTITION(logdate='2013-04-30');
>>
>> Thanks
>> Selva

--
selva
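For readers who want to try the approach discussed in this thread, here is a minimal sketch of how the per-day LOAD DATA statements from the question could be generated and run concurrently. It is not from the original posters. It assumes the `hive` command-line client is on the PATH and accepts a query via `-e`. The worker count of 4 is an arbitrary choice, and the function names and the `runner` parameter are hypothetical, introduced only so the logic can be exercised without a Hive installation.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def load_statement(day):
    """Build the LOAD DATA statement for one day's partition."""
    d = day.isoformat()  # e.g. '2013-04-01'
    return (
        f"LOAD DATA INPATH '/logs/processed/{d}' OVERWRITE INTO TABLE "
        f"processedlogs PARTITION(logdate='{d}');"
    )


def statements(first, last):
    """Return one statement per day from first to last, inclusive."""
    days = (last - first).days + 1
    return [load_statement(first + timedelta(days=i)) for i in range(days)]


def run_all(stmts, workers=4, runner=None):
    """Run every statement, at most `workers` at a time.

    By default each statement is passed to its own `hive -e` process
    (assumption: the Hive CLI is installed). `runner` is a hypothetical
    hook that replaces that call, for example in a dry run.
    """
    if runner is None:
        def runner(stmt):
            subprocess.run(["hive", "-e", stmt], check=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all jobs and re-raises the first failure, if any.
        list(pool.map(runner, stmts))
```

Calling `run_all(statements(date(2013, 4, 1), date(2013, 4, 30)))` would issue the thirty loads. Passing `runner=print` prints the statements instead of executing them, which is a cheap way to check the paths and partition values first.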