Date: Sun, 15 Feb 2009 20:07:01 -0800
Subject: Re: HDFS architecture based on GFS?
From: Amandeep Khurana <amansk@gmail.com>
To: core-user@hadoop.apache.org
Cc: core-dev@hadoop.apache.org

Alright, got it. Now, do the task trackers talk to the NameNode and the
DataNodes directly, or do they go through the JobTracker for it? So, if my
code is such that I need to access more files from HDFS, would the
JobTracker get involved or not?

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
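For concreteness, here is a minimal sketch (not from this thread) of the pattern discussed below: a map task that opens a second HDFS file in its init/configure method, using the old org.apache.hadoop.mapred API that was current at the time. The side-file path and the tab-separated record format are illustrative assumptions. The FileSystem calls go to the NameNode for block locations and then to the DataNodes for the data; the JobTracker is not involved in that path.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class JoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> lookup = new HashMap<String, String>();

  // configure() runs once per task before any map() calls. Opening the
  // side file here uses the ordinary HDFS client: block locations come
  // from the NameNode, the bytes come from DataNodes, and the JobTracker
  // never sees this access.
  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      Path side = new Path("/data/small_table.txt");  // illustrative path
      BufferedReader in =
          new BufferedReader(new InputStreamReader(fs.open(side)));
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t", 2);  // assumed key<TAB>value format
        if (parts.length == 2) {
          lookup.put(parts[0], parts[1]);
        }
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not read side file", e);
    }
  }

  // Join each input record against the small table loaded above.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] parts = value.toString().split("\t", 2);
    String joined = lookup.get(parts[0]);
    if (joined != null) {
      output.collect(new Text(parts[0]), new Text(joined));
    }
  }
}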
On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia wrote:

> Normally, HDFS files are accessed through the namenode. If there was a
> malicious process though, then I imagine it could talk to a datanode
> directly and request a specific block.
>
> On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana wrote:
>
> > Ok. Got it.
> >
> > Now, when my job needs to access another file, does it go to the
> > NameNode to get the block ids? How does the Java process know where
> > the files are and how to access them?
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
> > On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia wrote:
> >
> > > I mentioned this case because even jobs written in Java can use the
> > > HDFS API to talk to the NameNode and access the filesystem. People
> > > often do this because their job needs to read a config file, some
> > > small data table, etc. and use this information in its map or reduce
> > > functions. In this case, you open the second file separately in your
> > > mapper's init function and read whatever you need from it. In
> > > general I wanted to point out that you can't know which files a job
> > > will access unless you look at its source code or monitor the calls
> > > it makes; the input file(s) you provide in the job description are a
> > > hint to the MapReduce framework to place your job on certain nodes,
> > > but it's reasonable for the job to access other files as well.
> > >
> > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana wrote:
> > >
> > > > Another question that I have here - when the jobs run arbitrary
> > > > code and access data from HDFS, do they go to the namenode to get
> > > > the block information?
> > > >
> > > > Amandeep Khurana
> > > > Computer Science Graduate Student
> > > > University of California, Santa Cruz
> > > >
> > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana wrote:
> > > >
> > > > > Assuming that the job is purely in Java and not involving
> > > > > streaming or pipes, wouldn't the resources (files) required by
> > > > > the job as inputs be known beforehand? So, if the map task is
> > > > > accessing a second file, how does that make it different, except
> > > > > that there are multiple files? The JobTracker would know
> > > > > beforehand that multiple files would be accessed. Right?
> > > > >
> > > > > I am slightly confused why you have mentioned this case
> > > > > separately... Can you elaborate on it a little bit?
> > > > >
> > > > > Amandeep
> > > > >
> > > > > Amandeep Khurana
> > > > > Computer Science Graduate Student
> > > > > University of California, Santa Cruz
> > > > >
> > > > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia wrote:
> > > > >
> > > > > > Typically the data flow is like this:
> > > > > > 1) Client submits a job description to the JobTracker.
> > > > > > 2) JobTracker figures out block locations for the input
> > > > > > file(s) by talking to the HDFS NameNode.
> > > > > > 3) JobTracker creates a job description file in HDFS which
> > > > > > will be read by the nodes to copy over the job's code etc.
> > > > > > 4) JobTracker starts map tasks on the slaves (TaskTrackers)
> > > > > > with the appropriate data blocks.
> > > > > > 5) After running, maps create intermediate output files on
> > > > > > those slaves. These are not in HDFS, they're in some temporary
> > > > > > storage used by MapReduce.
> > > > > > 6) JobTracker starts reduces on a series of slaves, which copy
> > > > > > over the appropriate map outputs, apply the reduce function,
> > > > > > and write the outputs to HDFS (one output file per reducer).
> > > > > > 7) Some logs for the job may also be put into HDFS by the
> > > > > > JobTracker.
> > > > > >
> > > > > > However, there is a big caveat, which is that the map and
> > > > > > reduce tasks run arbitrary code. It is not unusual to have a
> > > > > > map that opens a second HDFS file to read some information
> > > > > > (e.g. for doing a join of a small table against a big file).
> > > > > > If you use Hadoop Streaming or Pipes to write a job in Python,
> > > > > > Ruby, C, etc., then you are launching arbitrary processes
> > > > > > which may also access external resources in this manner. Some
> > > > > > people also read/write to DBs (e.g. MySQL) from their tasks. A
> > > > > > comprehensive security solution would ideally deal with these
> > > > > > cases too.
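As a hedged illustration of step 1 in the data flow above (the client shipping a job description to the JobTracker), a minimal driver against the old mapred API might look like the sketch below. The paths, job name, and use of IdentityMapper are placeholders, not details from this thread.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class SubmitJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitJob.class);
    conf.setJobName("data-flow-example");

    // Placeholder job: identity map, default (identity) reduce.
    conf.setMapperClass(IdentityMapper.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    // Illustrative paths; the input paths are only a placement hint,
    // as noted above -- tasks may still open other files at runtime.
    FileInputFormat.setInputPaths(conf, new Path("/data/big_file"));
    FileOutputFormat.setOutputPath(conf, new Path("/out/example"));

    // Step 1: send the job description to the JobTracker, which then
    // drives steps 2-7 (block lookup, task placement, reduces, logs).
    JobClient.runJob(conf);
  }
}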
> > > > > > On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana <amansk@gmail.com> wrote:
> > > > > >
> > > > > > > A quick question here. How does a typical Hadoop job work at
> > > > > > > the system level? What are the various interactions and how
> > > > > > > does the data flow?
> > > > > > >
> > > > > > > Amandeep
> > > > > > >
> > > > > > > Amandeep Khurana
> > > > > > > Computer Science Graduate Student
> > > > > > > University of California, Santa Cruz
> > > > > > >
> > > > > > > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana <amansk@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks Matei. If the basic architecture is similar to the
> > > > > > > > Google stuff, I can safely just work on the project using
> > > > > > > > the information from the papers.
> > > > > > > >
> > > > > > > > I am aware of the 4487 JIRA and the current status of the
> > > > > > > > permissions mechanism. I had a look at them earlier.
> > > > > > > >
> > > > > > > > Cheers
> > > > > > > > Amandeep
> > > > > > > >
> > > > > > > > Amandeep Khurana
> > > > > > > > Computer Science Graduate Student
> > > > > > > > University of California, Santa Cruz
> > > > > > > >
> > > > > > > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia <matei@cloudera.com> wrote:
> > > > > > > >
> > > > > > > > > Forgot to add, this JIRA details the latest security
> > > > > > > > > features that are being worked on in Hadoop trunk:
> > > > > > > > > https://issues.apache.org/jira/browse/HADOOP-4487.
> > > > > > > > > This document describes the current status and
> > > > > > > > > limitations of the permissions mechanism:
> > > > > > > > > http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.
> > > > > > > > >
> > > > > > > > > On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia <matei@cloudera.com> wrote:
> > > > > > > > >
> > > > > > > > > > I think it's safe to assume that Hadoop works like
> > > > > > > > > > MapReduce/GFS at the level described in those papers.
> > > > > > > > > > In particular, in HDFS, there is a master node
> > > > > > > > > > containing metadata and a number of slave nodes
> > > > > > > > > > (datanodes) containing blocks, as in GFS. Clients start
> > > > > > > > > > by talking to the master to list directories, etc. When
> > > > > > > > > > they want to read a region of some file, they tell the
> > > > > > > > > > master the filename and offset, and they receive a list
> > > > > > > > > > of block locations (datanodes). They then contact the
> > > > > > > > > > individual datanodes to read the blocks. When clients
> > > > > > > > > > write a file, they first obtain a new block ID and a
> > > > > > > > > > list of nodes to write it to from the master, then
> > > > > > > > > > contact the datanodes to write it (actually, the
> > > > > > > > > > datanodes pipeline the write as in GFS) and report when
> > > > > > > > > > the write is complete. HDFS actually has some security
> > > > > > > > > > mechanisms built in, authenticating users based on
> > > > > > > > > > their Unix ID and providing Unix-like file permissions.
> > > > > > > > > > I don't know much about how these are implemented, but
> > > > > > > > > > they would be a good place to start looking.
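The client/NameNode exchange described above is visible from the HDFS client API. The sketch below is illustrative only: it assumes HDFS is the default filesystem in the Configuration and that a file path is passed on the command line. It prints the Unix-like owner/group/permissions the NameNode keeps for a file, and the DataNodes that hold each of its blocks (the hosts a read of that region would contact directly).

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);  // metadata calls go to the NameNode

    Path p = new Path(args[0]);
    FileStatus stat = fs.getFileStatus(p);

    // Unix-like ownership and permissions kept by the NameNode.
    System.out.println("owner=" + stat.getOwner()
        + " group=" + stat.getGroup()
        + " perms=" + stat.getPermission());

    // Which DataNodes hold each block of the file; reads of a given
    // region go straight to these hosts.
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset=" + b.getOffset()
          + " length=" + b.getLength()
          + " hosts=" + Arrays.toString(b.getHosts()));
    }
  }
}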
> > > > > > > > > > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana <amansk@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks Matei.
> > > > > > > > > > >
> > > > > > > > > > > I had gone through the architecture document online.
> > > > > > > > > > > I am currently working on a project towards security
> > > > > > > > > > > in Hadoop. I do know how the data moves around in GFS
> > > > > > > > > > > but wasn't sure how much of that HDFS follows and how
> > > > > > > > > > > different it is from GFS. Can you throw some light on
> > > > > > > > > > > that?
> > > > > > > > > > >
> > > > > > > > > > > Security would also involve the MapReduce jobs
> > > > > > > > > > > following the same protocols. That's why the question
> > > > > > > > > > > about how the Hadoop framework integrates with HDFS,
> > > > > > > > > > > and how different it is from MapReduce and GFS. The
> > > > > > > > > > > GFS and MapReduce papers give good information on how
> > > > > > > > > > > those systems are designed, but there is nothing that
> > > > > > > > > > > concrete for Hadoop that I have been able to find.
> > > > > > > > > > >
> > > > > > > > > > > Amandeep
> > > > > > > > > > >
> > > > > > > > > > > Amandeep Khurana
> > > > > > > > > > > Computer Science Graduate Student
> > > > > > > > > > > University of California, Santa Cruz
> > > > > > > > > > >
> > > > > > > > > > > On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia <matei@cloudera.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Amandeep,
> > > > > > > > > > > >
> > > > > > > > > > > > Hadoop is definitely inspired by MapReduce/GFS and
> > > > > > > > > > > > aims to provide those capabilities as an
> > > > > > > > > > > > open-source project. HDFS is similar to GFS (large
> > > > > > > > > > > > blocks, replication, etc.); some notable things
> > > > > > > > > > > > missing are read-write support in the middle of a
> > > > > > > > > > > > file (unlikely to be provided because few Hadoop
> > > > > > > > > > > > applications require it) and multiple appenders
> > > > > > > > > > > > (the record append operation). You can read about
> > > > > > > > > > > > the HDFS architecture at
> > > > > > > > > > > > http://hadoop.apache.org/core/docs/current/hdfs_design.html.
> > > > > > > > > > > > The MapReduce part of Hadoop interacts with HDFS in
> > > > > > > > > > > > the same way that Google's MapReduce interacts with
> > > > > > > > > > > > GFS (shipping computation to the data), although
> > > > > > > > > > > > Hadoop MapReduce also supports running over other
> > > > > > > > > > > > distributed filesystems.
> > > > > > > > > > > >
> > > > > > > > > > > > Matei
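A small sketch of the client-side write path and of the write-once restriction mentioned above (no rewriting the middle of a file, no multi-writer record append as in GFS). The path and contents are illustrative assumptions, not details from the thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnce {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/write-once-example.txt");  // illustrative path

    // Write path: the client obtains new blocks from the NameNode and
    // the DataNodes pipeline the data, as described in the thread above.
    FSDataOutputStream out = fs.create(p);
    out.writeBytes("hello, hdfs\n");
    out.close();

    // Random reads are supported: seek() anywhere and read.
    FSDataInputStream in = fs.open(p);
    in.seek(7);
    System.out.println((char) in.read());
    in.close();

    // There is no analogous way to seek into an existing file and
    // overwrite bytes in place, and no multi-writer record append as in
    // GFS; fs.create(p, true) would simply replace the whole file.
  }
}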
> > > > > > > > > > > > On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana <amansk@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi
> > > > > > > > > > > > >
> > > > > > > > > > > > > Is the HDFS architecture completely based on the
> > > > > > > > > > > > > Google Filesystem? If it isn't, what are the
> > > > > > > > > > > > > differences between the two?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Secondly, is the coupling between Hadoop and HDFS
> > > > > > > > > > > > > the same as how it is between Google's version of
> > > > > > > > > > > > > MapReduce and GFS?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Amandeep
> > > > > > > > > > > > >
> > > > > > > > > > > > > Amandeep Khurana
> > > > > > > > > > > > > Computer Science Graduate Student
> > > > > > > > > > > > > University of California, Santa Cruz