From: Public Network Services
To: user@hadoop.apache.org
Date: Tue, 5 Mar 2013 15:35:01 -0800
Subject: Execution handover in map/reduce pipeline

Hi...

I have an application that processes large amounts of proprietary binary-encoded text data in the following sequence:

  1. Gets a URL to a file or a directory as input
  2. Reads the list of the binary files found under the input URL
  3. Extracts the text data from each of those files
  4. Saves the text data into new files
  5. Informs the application about newly extracted files
  6. Processes each of the extracted text files
  7. Submits the processing results to a proprietary data repository

This whole processing is highly CPU-intensive and can be partially parallelized, so I am thinking of trying Hadoop to achieve higher performance.

So, assuming that all the above takes place in HDFS (including the input URL being an HDFS one), a MapReduce implementation could use:

  - A lightweight non-Hadoop thread to kick-start the execution flow, i.e. implement step 1
  - A Mapper that would implement steps 2-4
  - A Reducer that would implement step 5 (receive the notifications)
  - A Mapper that would implement step 6
  - A Reducer that would implement step 7

The first mapper (for steps 2-4) will probably need to do its processing in a single, non-parallelized step.

My question is, how is the first reducer going to hand over execution to the second mapper, once done? (I have put a rough sketch of what I have in mind in a P.S. below.)

Or, is there a better way of implementing the above scenario?

Thanks!
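
P.S. To make the handover question concrete, here is a rough, untested sketch of the kind of driver I have in mind: the "handover" is simply the driver waiting for the first job to finish and then pointing the second job at the first job's output directory. The class names (PipelineDriver, ExtractMapper, NotifyReducer, ProcessMapper, SubmitReducer) and the three command-line paths are placeholders I made up for the sketch, and the mapper/reducer bodies are stubs for the steps listed above.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PipelineDriver {

        // Placeholder for steps 2-4: list the binary files, extract the text, save it.
        public static class ExtractMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(new Text("extracted"), line);   // real extraction logic would go here
            }
        }

        // Placeholder for step 5: collect the "new file" notifications.
        public static class NotifyReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> files, Context ctx)
                    throws IOException, InterruptedException {
                for (Text f : files) {
                    ctx.write(key, f);                    // each record names an extracted file
                }
            }
        }

        // Placeholder for step 6: process each extracted text file.
        public static class ProcessMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(new Text("processed"), line);   // real processing logic would go here
            }
        }

        // Placeholder for step 7: submit the results to the repository.
        public static class SubmitReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> results, Context ctx)
                    throws IOException, InterruptedException {
                for (Text r : results) {
                    ctx.write(key, r);                    // submission to the repository would go here
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path(args[0]);       // step 1: the input URL
            Path handover = new Path(args[1]);    // where job 1 leaves its output for job 2
            Path output = new Path(args[2]);

            // Job 1: steps 2-5
            Job extract = new Job(conf, "extract-text");
            extract.setJarByClass(PipelineDriver.class);
            extract.setMapperClass(ExtractMapper.class);
            extract.setReducerClass(NotifyReducer.class);
            extract.setOutputKeyClass(Text.class);
            extract.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(extract, input);
            FileOutputFormat.setOutputPath(extract, handover);

            // The "handover": block until job 1 is done, then feed its output to job 2.
            if (!extract.waitForCompletion(true)) {
                System.exit(1);
            }

            // Job 2: steps 6-7
            Job process = new Job(conf, "process-text");
            process.setJarByClass(PipelineDriver.class);
            process.setMapperClass(ProcessMapper.class);
            process.setReducerClass(SubmitReducer.class);
            process.setOutputKeyClass(Text.class);
            process.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(process, handover);
            FileOutputFormat.setOutputPath(process, output);

            System.exit(process.waitForCompletion(true) ? 0 : 1);
        }
    }

If I understand correctly, the same job-after-job dependency could also be expressed with something like JobControl or an external workflow scheduler instead of a hand-written driver, but the sequential driver above is the simplest form of the handover I am asking about.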