From: Chase Bradford
Subject: Re: hadoop fs -put vs writing text files to hadoop as sequence files
Date: Wed, 16 Feb 2011 18:33:13 -0800
To: "mapreduce-user@hadoop.apache.org"
Cc: "mapreduce-user@hadoop.apache.org"

We use sequence files for storing text data, and you definitely notice the cost of compressing client side while streaming to HDFS. If I remember correctly, it took about 10x as long. That drove us to using writer threads that fed off a single input stream a few thousand lines at a time and wrote to an HDFS directory with the desired name.

On Feb 16, 2011, at 4:24 PM, Mapred Learn wrote:

> Hi,
> I have to upload some terabytes of data that is text files.
>
> What would be good option to do so:
>
> i) using hadoop fs -put to copy text files directly on hdfs.
>
> ii) copying text files as sequence files on hdfs ? What would be extra time in this case as opposed to (i).
>
> Thanks,
> Jimmy
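
For anyone who wants to try the writer-thread approach from the reply, here is a rough sketch of the pattern, not the code we actually ran. It makes several assumptions: gzip part files in a local directory stand in for compressed sequence files in an HDFS directory, and the names parallel_upload, writer, BATCH_LINES and NUM_WRITERS are made up for illustration. One reader pulls a few thousand lines at a time off a single input stream and hands each batch to whichever writer thread is free, so the compression work is spread across several writers.

```python
import gzip
import itertools
import os
import queue
import threading

BATCH_LINES = 2000  # "a few thousand lines at a time"; the exact value is an assumption
NUM_WRITERS = 4     # hypothetical number of writer threads


def writer(part_path, batches):
    """Drain batches from the shared queue into one compressed part file."""
    with gzip.open(part_path, "wt", encoding="utf-8") as out:
        while True:
            batch = batches.get()
            if batch is None:  # sentinel: the reader has no more input
                break
            out.writelines(batch)


def parallel_upload(stream, out_dir, num_writers=NUM_WRITERS, batch_lines=BATCH_LINES):
    """Fan a single line-oriented stream out to several compressing writers."""
    os.makedirs(out_dir, exist_ok=True)
    # Bounded queue, so the reader cannot run far ahead of the writers.
    batches = queue.Queue(maxsize=2 * num_writers)
    paths = [os.path.join(out_dir, f"part-{i:05d}.gz") for i in range(num_writers)]
    threads = [threading.Thread(target=writer, args=(p, batches)) for p in paths]
    for t in threads:
        t.start()
    while True:
        # Read the next batch of lines from the single input stream.
        batch = list(itertools.islice(stream, batch_lines))
        if not batch:
            break
        batches.put(batch)
    for _ in threads:
        batches.put(None)  # one sentinel per writer
    for t in threads:
        t.join()
    return paths
```

Calling parallel_upload(sys.stdin, "some_output_dir") would spread standard input over four part files. Line order is kept within a batch but not across part files, which is usually fine for MapReduce input. The sketch also does no error handling: if a writer thread fails, the reader can block on the full queue.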