From: Arun C Murthy
Organization: Yahoo! Inc.
Date: Thu, 06 Jul 2006 14:48:58 +0530
Message-ID: <44ACD582.90607@yahoo-inc.com>
To: hadoop-user@lucene.apache.org
Subject: Enhancement to TextInputFormat?

Hi,

Here's a scenario I have faced a couple of times recently:

I have a list of URIs (either http:// URLs or just a dfs file-list) which represents the input to a Map-Reduce task. Each map gets 1 URI, fetches the data from that URI (read either through the dfs APIs or over http, as the case may be) and then manipulates that data.

In essence it's a simple TextInputFormat with each 'line' representing not the actual 'data' to manipulate in the map, but an 'indirection' to the data.

Do you guys think it makes sense to provide this as a part of the MR framework itself? i.e. extend TextInputFormat into (say) URIInputFormat, so that the MR framework 'fetches' the data pointed to by the URI and then provides a 'stream' (as 'key') to the map function. The 'fetcher'/'reader' would be configurable, with reasonable defaults provided in the framework, e.g. for dfs://, http:// etc.

Admittedly it isn't very hard to do as-is today, however it would definitely ease the user's job. All he needs to do is provide a simple text file with a list of URIs, and he then gets a readable stream in his map. This reduces the amount of 'code' he has to write and improves his experience.

Thoughts?

If there is sufficient interest/utility I will go ahead and spec this in more detail and create a JIRA issue.

thanks,
Arun
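
The indirection proposed above (one URI per line, resolved to a stream by a pluggable per-scheme fetcher) might look roughly like the following. This is only a sketch in plain Java and does not use the Hadoop API: the names UriStreamResolver, Fetcher, register, and parseUriList are hypothetical, as is the "mem" scheme used for the demonstration. A dfs:// fetcher is not included and would have to be registered separately; the http and file defaults simply delegate to java.net.URL.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: turns each line of a URI list into a readable stream. */
final class UriStreamResolver {

    /** Hypothetical plug-in point: opens a stream for one URI. */
    interface Fetcher {
        InputStream open(URI uri) throws IOException;
    }

    private final Map<String, Fetcher> fetchers = new HashMap<>();

    UriStreamResolver() {
        // Assumed defaults: let java.net.URL handle http:// and file:// URIs.
        Fetcher viaUrl = uri -> uri.toURL().openStream();
        fetchers.put("http", viaUrl);
        fetchers.put("file", viaUrl);
    }

    /** Registers (or overrides) the fetcher used for a URI scheme. */
    void register(String scheme, Fetcher fetcher) {
        fetchers.put(scheme.toLowerCase(Locale.ROOT), fetcher);
    }

    /** Resolves one input line to a stream, or fails if no fetcher matches. */
    InputStream open(String line) throws IOException {
        URI uri = URI.create(line.trim());
        String scheme = uri.getScheme();
        Fetcher fetcher =
            scheme == null ? null : fetchers.get(scheme.toLowerCase(Locale.ROOT));
        if (fetcher == null) {
            throw new IOException("No fetcher registered for: " + line);
        }
        return fetcher.open(uri);
    }

    /** Splits the text of a URI list into trimmed, non-blank lines. */
    static List<String> parseUriList(String text) {
        List<String> uris = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                uris.add(trimmed);
            }
        }
        return uris;
    }

    public static void main(String[] args) throws IOException {
        UriStreamResolver resolver = new UriStreamResolver();
        // Demo-only scheme: the "data" is the text after "mem:".
        resolver.register("mem", uri -> new ByteArrayInputStream(
            uri.getSchemeSpecificPart().getBytes(StandardCharsets.UTF_8)));

        // Each line plays the role of one map's input record.
        for (String line : parseUriList("mem:first\n\nmem:second\n")) {
            try (InputStream in = resolver.open(line)) {
                String data = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                System.out.println(line + " -> " + data);
            }
        }
    }
}
```

In a real input format the loop in main would be replaced by the framework handing each line to a map, and the stream returned by open would be what the map function receives in place of the raw text line.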