Subject: Re: HDFS architecture based on GFS?
From: Matei Zaharia <matei@cloudera.com>
To: core-user@hadoop.apache.org
Cc: core-dev@hadoop.apache.org
Date: Sun, 15 Feb 2009 19:20:39 -0800

Normally, HDFS files are accessed through the namenode. If there were a
malicious process, though, I imagine it could talk to a datanode directly
and request a specific block.

On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana wrote:

Ok. Got it.

Now, when my job needs to access another file, does it go to the Namenode
to get the block IDs? How does the Java process know where the files are
and how to access them?

On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia wrote:

I mentioned this case because even jobs written in Java can use the HDFS
API to talk to the NameNode and access the filesystem. People often do
this because their job needs to read a config file, some small data table,
etc., and use this information in its map or reduce functions. In this
case, you open the second file separately in your mapper's init function
and read whatever you need from it.
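For concreteness, here is a minimal sketch of that pattern using the
classic org.apache.hadoop.mapred API of that era; the side-file path and
class names are invented for illustration, and error handling is kept to a
minimum:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class JoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private final Map<String, String> lookup = new HashMap<String, String>();

      // "Init function": load a small side table from HDFS before any map() call.
      public void configure(JobConf job) {
        try {
          FileSystem fs = FileSystem.get(job);
          // Hypothetical path; this access never appears in the job
          // description -- the task simply opens it through the HDFS API.
          Path side = new Path("/user/amandeep/lookup.txt");
          BufferedReader in =
              new BufferedReader(new InputStreamReader(fs.open(side)));
          String line;
          while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
              lookup.put(parts[0], parts[1]);
            }
          }
          in.close();
        } catch (IOException e) {
          throw new RuntimeException("Could not read side file", e);
        }
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        // Join each input record against the side table loaded above.
        String[] parts = value.toString().split("\t", 2);
        String joined = lookup.get(parts[0]);
        if (joined != null) {
          out.collect(new Text(parts[0]), new Text(joined));
        }
      }
    }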
In general I wanted to point out that you can't know which files a job
will access unless you look at its source code or monitor the calls it
makes; the input file(s) you provide in the job description are a hint to
the MapReduce framework to place your job on certain nodes, but it's
reasonable for the job to access other files as well.

On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana wrote:

Another question that I have here - when the jobs run arbitrary code and
access data from HDFS, do they go to the namenode to get the block
information?

On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana wrote:

Assuming that the job is purely in Java and doesn't involve streaming or
pipes, wouldn't the resources (files) required by the job as inputs be
known beforehand? So if the map task is accessing a second file, how is
that different, except that there are multiple files? The JobTracker would
know beforehand that multiple files would be accessed. Right?

I am slightly confused why you have mentioned this case separately... Can
you elaborate on it a little bit?

Amandeep

On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia wrote:

Typically the data flow is like this:

1) The client submits a job description to the JobTracker.
2) The JobTracker figures out block locations for the input file(s) by
   talking to the HDFS NameNode.
3) The JobTracker creates a job description file in HDFS, which the nodes
   will read to copy over the job's code etc.
4) The JobTracker starts map tasks on the slaves (TaskTrackers) with the
   appropriate data blocks.
5) After running, maps create intermediate output files on those slaves.
   These are not in HDFS; they're in some temporary storage used by
   MapReduce.
6) The JobTracker starts reduces on a series of slaves, which copy over
   the appropriate map outputs, apply the reduce function, and write the
   outputs to HDFS (one output file per reducer).
7) Some logs for the job may also be put into HDFS by the JobTracker.

However, there is a big caveat, which is that the map and reduce tasks run
arbitrary code. It is not unusual to have a map that opens a second HDFS
file to read some information (e.g. for doing a join of a small table
against a big file). If you use Hadoop Streaming or Pipes to write a job
in Python, Ruby, C, etc., then you are launching arbitrary processes which
may also access external resources in this manner. Some people also
read/write to DBs (e.g. MySQL) from their tasks. A comprehensive security
solution would ideally deal with these cases too.
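To make step 1 of that flow concrete, here is roughly what the client side
of a job submission looked like with the classic org.apache.hadoop.mapred
API (an identity pass-through job; the input and output paths are
illustrative, not taken from the thread):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class PassThroughDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PassThroughDriver.class);
        conf.setJobName("pass-through");

        // The declared input paths are only a scheduling hint: the
        // JobTracker asks the NameNode where these blocks live and starts
        // map tasks near them (steps 2-4 above). Nothing prevents a task
        // from opening other HDFS files at runtime.
        FileInputFormat.setInputPaths(conf, new Path("/data/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/data/output"));

        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);  // keys from the default TextInputFormat
        conf.setOutputValueClass(Text.class);

        // Step 1: submit the job description to the JobTracker and block
        // until the job completes (step 6 writes one output file per reducer).
        JobClient.runJob(conf);
      }
    }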
On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana <amansk@gmail.com> wrote:

A quick question here. How does a typical Hadoop job work at the system
level? What are the various interactions and how does the data flow?

Amandeep

On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana <amansk@gmail.com> wrote:

Thanks Matei. If the basic architecture is similar to the Google stuff, I
can safely just work on the project using the information from the papers.

I am aware of the HADOOP-4487 JIRA and the current status of the
permissions mechanism. I had a look at them earlier.

Cheers
Amandeep

On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia <matei@cloudera.com> wrote:

Forgot to add, this JIRA details the latest security features that are
being worked on in Hadoop trunk:
https://issues.apache.org/jira/browse/HADOOP-4487
This document describes the current status and limitations of the
permissions mechanism:
http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html

On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia <matei@cloudera.com> wrote:

I think it's safe to assume that Hadoop works like MapReduce/GFS at the
level described in those papers. In particular, in HDFS, there is a master
node containing metadata and a number of slave nodes (datanodes)
containing blocks, as in GFS. Clients start by talking to the master to
list directories, etc. When they want to read a region of some file, they
tell the master the filename and offset, and they receive a list of block
locations (datanodes). They then contact the individual datanodes to read
the blocks. When clients write a file, they first obtain a new block ID
and a list of nodes to write it to from the master, then contact the
datanodes to write it (actually, the datanodes pipeline the write as in
GFS) and report when the write is complete. HDFS actually has some
security mechanisms built in, authenticating users based on their Unix ID
and providing Unix-like file permissions. I don't know much about how
these are implemented, but they would be a good place to start looking.
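A small sketch of how that looks from the client side of the
org.apache.hadoop.fs API: the FileSystem client hides the NameNode and
datanode conversation described above, and the permission calls expose the
Unix-like model mentioned at the end. The path, owner, and mode below are
invented for illustration.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class HdfsClientSketch {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();      // picks up fs.default.name
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/user/amandeep/data.txt");  // illustrative path

        // Read: behind this call the client asks the NameNode for the
        // block locations of the region being read, then streams the
        // blocks directly from the datanodes.
        FSDataInputStream in = fs.open(p);
        byte[] buf = new byte[4096];
        int n = in.read(0L, buf, 0, buf.length);       // positioned read at offset 0
        in.close();
        System.out.println("Read " + n + " bytes");

        // Unix-like metadata: owner, group, and rwx permission bits are
        // stored by the NameNode.
        FileStatus st = fs.getFileStatus(p);
        System.out.println(st.getOwner() + ":" + st.getGroup() + " "
            + st.getPermission());

        // Tighten the mode bits (rw-r-----), analogous to chmod 640.
        fs.setPermission(p, new FsPermission((short) 0640));
      }
    }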
On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana <amansk@gmail.com> wrote:

Thanks Matei

I had gone through the architecture document online. I am currently
working on a project towards security in Hadoop. I do know how the data
moves around in GFS but wasn't sure how much of that HDFS follows and how
different it is from GFS. Can you throw some light on that?

Security would also involve the MapReduce jobs following the same
protocols. That's why the question about how the Hadoop framework
integrates with HDFS, and how different that is from MapReduce and GFS.
The GFS and MapReduce papers give good information on how those systems
are designed, but I haven't been able to find anything that concrete for
Hadoop.

Amandeep

On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia <matei@cloudera.com> wrote:

Hi Amandeep,

Hadoop is definitely inspired by MapReduce/GFS and aims to provide those
capabilities as an open-source project. HDFS is similar to GFS (large
blocks, replication, etc.); some notable things missing are read-write
support in the middle of a file (unlikely to be provided because few
Hadoop applications require it) and multiple appenders (the record append
operation). You can read about the HDFS architecture at
http://hadoop.apache.org/core/docs/current/hdfs_design.html. The MapReduce
part of Hadoop interacts with HDFS in the same way that Google's MapReduce
interacts with GFS (shipping computation to the data), although Hadoop
MapReduce also supports running over other distributed filesystems.

Matei

On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana <amansk@gmail.com> wrote:

Hi

Is the HDFS architecture completely based on the Google File System? If it
isn't, what are the differences between the two?

Secondly, is the coupling between Hadoop and HDFS the same as it is
between Google's version of MapReduce and GFS?

Amandeep

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz