From: Hong Tang
To: common-user@hadoop.apache.org
Subject: Re: Efficiently Stream into Sequence Files?
Date: Fri, 12 Mar 2010 11:15:23 -0800

Have you looked at TFile?

On Mar 12, 2010, at 5:22 AM, Scott Whitecross wrote:

> Hi -
>
> I'd like to create a job that pulls small files from a remote server
> (using FTP, SCP, etc.) and stores them directly to sequence files on
> HDFS. Looking at the sequence file API, I don't see an obvious way
> to do this. It looks like what I have to do is pull the remote file
> to disk, then read the file into memory to place in the sequence
> file. Is there a better way?
>
> Looking at the API, am I forced to use the append method?
>
>     FileSystem hdfs = FileSystem.get(context.getConfiguration());
>     FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
>     writer = SequenceFile.createWriter(context.getConfiguration(), outputStream,
>         Text.class, BytesWritable.class, null, null);
>
>     // read in file to remotefilebytes
>
>     writer.append(filekey, remotefilebytes);
>
> The alternative would be to have one job pull the remote files, and
> a secondary job write them into sequence files.
>
> I'm using the latest Cloudera release, which I believe is Hadoop 20.1
>
> Thanks.
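
On the local-disk step in the question: whichever container format is used, the remote file probably does not need to land on disk first. Below is a minimal sketch, assuming the FTP or SCP client can hand back a plain java.io.InputStream. The class RemoteFileBuffer and its readFully method are hypothetical names for illustration, not part of Hadoop, and the ByteArrayInputStream in main only stands in for the remote stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a Hadoop class): drains a stream into memory
// so that a small remote file never has to be written to local disk.
class RemoteFileBuffer {

    // Reads the stream to end-of-file and returns everything that was read.
    // I/O failures are rethrown as UncheckedIOException.
    static byte[] readFully(InputStream in) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Stand-in for the stream an FTP or SCP client would return (assumption).
        byte[] payload = "small remote file".getBytes(StandardCharsets.UTF_8);
        byte[] copy = readFully(new ByteArrayInputStream(payload));
        System.out.println(copy.length + " bytes buffered");
    }
}
```

The returned array could then be wrapped in a BytesWritable and passed to writer.append(filekey, ...) as in the quoted snippet. This only suits files small enough to hold in memory, which matches the "small files" case described above.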