From: Ted Dunning
Date: Mon, 21 Apr 2008 10:31:31 -0700
Subject: Re: datanode files list
Reply-To: core-user@hadoop.apache.org

This is kind of odd that you are doing this. It really sounds like a replication of what hadoop is doing.
Why not just run a map process and have hadoop figure out which blocks are where?

Can you say more about *why* you are doing this, not just what you are trying to do?

On 4/21/08 10:28 AM, "Shimi K" wrote:

> I am using Hadoop HDFS as a distributed file system. On each DFS node I have
> another process which needs to read the local HDFS files.
> Right now I'm calling the NameNode in order to get the list of all the files
> in the cluster. For each file I check if it is a local file (one of the
> locations is the host of the node), if it is I read it.
> Disadvantages:
> * This solution works only if the entire file is not split.
> * It involves the NameNode.
> * Each node needs to iterate on all the files in the cluster.
>
> There must be a better way to do it. The perfect way will be to call the
> DataNode and to get a list of the local files and their blocks.
>
> On Mon, Apr 21, 2008 at 7:18 PM, Ted Dunning wrote:
>
>> Datanodes don't necessarily contain complete files. It is possible to
>> enumerate all files and to find out which datanodes host different blocks
>> from these files.
>>
>> What did you need to do?
>>
>> On 4/21/08 2:11 AM, "Shimi K" wrote:
>>
>>> Is there a way to get the list of files on each datanode?
>>> I need to be able to get all the names of the files on a specific datanode?
>>> is there a way to do it?
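
For readers following the thread: the "is one of the locations this host?" check can be done per block rather than per file, which removes the restriction that a file must not be split. The following is only a rough sketch in plain Java, not Hadoop code. The Block class and the localBlocks method are hypothetical stand-ins for whatever per-block host lists the NameNode reports (later Hadoop releases expose these through FileSystem.getFileBlockLocations; check what your version offers). It still relies on NameNode metadata, so it does not address the other two disadvantages listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LocalBlocks {

    /** Hypothetical stand-in for one block of a file, as reported by the NameNode. */
    static final class Block {
        final String file;        // path of the file the block belongs to
        final long offset;        // byte offset of the block within the file
        final long length;        // block length in bytes
        final List<String> hosts; // datanodes that hold a replica of this block

        Block(String file, long offset, long length, String... hosts) {
            this.file = file;
            this.offset = offset;
            this.length = length;
            this.hosts = Arrays.asList(hosts);
        }
    }

    /** Returns the blocks that have a replica on the given host, in input order. */
    static List<Block> localBlocks(List<Block> blocks, String localHost) {
        List<Block> result = new ArrayList<>();
        for (Block b : blocks) {
            if (b.hosts.contains(localHost)) {
                result.add(b);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data: a.log is split into two blocks on different nodes.
        List<Block> blocks = Arrays.asList(
            new Block("/data/a.log", 0L, 64L, "node1", "node2"),
            new Block("/data/a.log", 64L, 64L, "node2", "node3"),
            new Block("/data/b.log", 0L, 32L, "node1", "node3"));

        // Print the byte ranges that node1 could read from local replicas.
        for (Block b : localBlocks(blocks, "node1")) {
            System.out.println(b.file + " @" + b.offset + " +" + b.length);
        }
    }
}
```

With the made-up data above, node1 would read only the first block of /data/a.log and all of /data/b.log. As suggested in the reply, letting a map job do this placement is likely simpler than reimplementing it.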