From: "David Parks"
To: user@hadoop.apache.org
Subject: Mapreduce jobs to download job input from across the internet
Date: Wed, 17 Apr 2013 11:26:56 +0700

For a set of jobs to run I need to download about 100GB of data from the internet (~1000 files of varying sizes from ~10 different domains).

Currently I do this in a simple Linux script, since it's easy to script FTP, curl, and the like. But it's a mess to maintain a separate server for that process; I'd rather it run in MapReduce.
Just give it a bill of materials and let it go about downloading it, retrying as necessary to deal with iffy network conditions.

I wrote one such job to crawl images we need to acquire, and it was the royalest of royal pains. I wonder if there are any good approaches to this kind of data-acquisition task in Hadoop. It would certainly be nicer to schedule a data-acquisition job ahead of the processing jobs in Oozie than to try to maintain synchronization between the download processes and the jobs.

Ideas?
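To make the idea concrete, here is a rough sketch of what I'm picturing: a Hadoop Streaming mapper that reads the bill of materials (one URL per line) from stdin, downloads each file with retries and backoff, and emits a status line per URL. This is only a sketch under my own assumptions; the names, retry parameters, and the step that would actually persist the bytes to HDFS are all illustrative.

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming mapper for bulk downloads.
# Reads one URL per line from stdin (the "bill of materials"),
# fetches each with retries, and emits a tab-separated status line.
import sys
import time
import urllib.request


def download_with_retry(url, fetch=None, retries=5, backoff=2.0):
    """Fetch `url`, retrying with exponential backoff on failure.

    Returns the response bytes, or re-raises the last error after
    exhausting `retries`. `fetch` can be injected for testing.
    """
    if fetch is None:
        fetch = lambda u: urllib.request.urlopen(u, timeout=60).read()
    delay = 1.0
    last_err = None
    for _ in range(retries):
        try:
            return fetch(url)
        except Exception as err:  # iffy network: back off and retry
            last_err = err
            time.sleep(delay)
            delay *= backoff
    raise last_err


if __name__ == "__main__":
    for line in sys.stdin:
        url = line.strip()
        if not url:
            continue
        try:
            data = download_with_retry(url)
            # A real job would write `data` to HDFS here rather than drop it.
            print("%s\tOK\t%d" % (url, len(data)))
        except Exception as err:
            print("%s\tFAILED\t%s" % (url, err))
```

If something like this is reasonable, I assume it would be launched with the streaming jar, using NLineInputFormat so the URL list is split across mappers rather than handed to a single task, roughly: `hadoop jar hadoop-streaming.jar -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat -input urls.txt -output dl-status -mapper download.py -file download.py` (paths and options from memory, so treat that invocation as approximate).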
