Subject: Re: 1 big file or multiple smaller files for loading data from a database?
From: Sarah Sproehnle
To: hive-user@hadoop.apache.org
Date: Wed, 7 Jul 2010 18:06:56 -0700

Hi Todd,

Are you planning to use Sqoop to do this import? If not, you should. :) It will do a parallel import, using MapReduce, to load the table into Hadoop.
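A minimal sketch of such a parallel import, assuming an Oracle source; the connection string, username, and table name below are placeholders, not anything from this thread:

```shell
# Hypothetical Sqoop import of an Oracle table into HDFS.
# --num-mappers 10 runs 10 parallel map tasks, producing ~10 output files;
# --hive-import also creates the matching Hive table definition.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table MY_TABLE \
  --num-mappers 10 \
  --hive-import
```

Because each map task writes its own output file, the one-big-file vs. many-smaller-files question largely takes care of itself with this approach.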
With the --hive-import option, it will also create the Hive table definition.

Cheers,
Sarah

On Wed, Jul 7, 2010 at 5:51 PM, Todd Lee wrote:
> Hi,
> I am new to Hive and Hadoop in general. I have a table in Oracle that has
> millions of rows, and I'd like to export it into HDFS so that I can run some
> Hive queries. My first question is: is it recommended to export the entire
> table as a single file (possibly 5 GB), or as multiple smaller files (say,
> 10 files of 500 MB each)? Also, does it matter if I put the files under
> different sub-directories before I do the data load in Hive, or does
> everything have to be under the same folder?
> Thanks,
> T
> p.s. I am sorry if this post is submitted twice.

--
Sarah Sproehnle
Educational Services
Cloudera, Inc
http://www.cloudera.com/training