hadoop-common-dev mailing list archives

From Amandeep Khurana <ama...@gmail.com>
Subject Re: HDFS architecture based on GFS?
Date Mon, 16 Feb 2009 04:07:01 GMT
Alright.. Got it.

Now, do the TaskTrackers talk to the NameNode and the DataNodes directly, or
do they go through the JobTracker for it? So, if my code needs to access more
files from HDFS, would the JobTracker get involved or not?




Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia <matei@cloudera.com> wrote:

> Normally, HDFS files are accessed through the namenode. If there was a
> malicious process though, then I imagine it could talk to a datanode
> directly and request a specific block.
>
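For context, the block-to-DataNode mapping mentioned above is what a client normally obtains from the NameNode before it ever contacts a DataNode. A minimal sketch using the public FileSystem API, assuming a standard HDFS client configuration; the path is a placeholder, not something from this thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);            // HDFS client; talks to the NameNode
        Path file = new Path("/user/example/data.txt");  // placeholder path

        // Ask the NameNode which DataNodes hold each block of the file.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block :
             fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println("offset " + block.getOffset() +
                             " length " + block.getLength() +
                             " hosts " + java.util.Arrays.toString(block.getHosts()));
        }
      }
    }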
> On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana <amansk@gmail.com>
> wrote:
>
> > Ok. Got it.
> >
> > Now, when my job needs to access another file, does it go to the Namenode
> > to get the block ids? How does the java process know where the files are
> > and how to access them?
> >
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
> >
> > On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia <matei@cloudera.com>
> > wrote:
> >
> > > I mentioned this case because even jobs written in Java can use the HDFS
> > > API to talk to the NameNode and access the filesystem. People often do
> > > this because their job needs to read a config file, some small data
> > > table, etc and use this information in its map or reduce functions. In
> > > this case, you open the second file separately in your mapper's init
> > > function and read whatever you need from it. In general I wanted to
> > > point out that you can't know which files a job will access unless you
> > > look at its source code or monitor the calls it makes; the input file(s)
> > > you provide in the job description are a hint to the MapReduce framework
> > > to place your job on certain nodes, but it's reasonable for the job to
> > > access other files as well.
> > >
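To make the pattern above concrete, here is a minimal sketch of a mapper that loads a second HDFS file in its init (configure) method, using the old org.apache.hadoop.mapred API of that era. The side-file path, the tab-separated format, and the join logic are illustrative assumptions, not details from this thread:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SideFileMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private final Map<String, String> lookup = new HashMap<String, String>();

      @Override
      public void configure(JobConf job) {
        // This read goes through the normal HDFS client API: the NameNode is
        // asked for block locations, then the blocks are fetched from DataNodes.
        try {
          FileSystem fs = FileSystem.get(job);
          BufferedReader in = new BufferedReader(new InputStreamReader(
              fs.open(new Path("/lookup/small_table.txt"))));  // placeholder path
          String line;
          while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
              lookup.put(parts[0], parts[1]);
            }
          }
          in.close();
        } catch (IOException e) {
          throw new RuntimeException("Could not load side file", e);
        }
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        // Join each input record against the in-memory table loaded above.
        String[] fields = value.toString().split("\t", 2);
        String joined = lookup.get(fields[0]);
        if (joined != null) {
          out.collect(new Text(fields[0]), new Text(joined));
        }
      }
    }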
> > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana <amansk@gmail.com>
> > > wrote:
> > >
> > > > Another question that I have here - When the jobs run arbitrary code
> > > > and access data from the HDFS, do they go to the namenode to get the
> > > > block information?
> > > >
> > > >
> > > > Amandeep Khurana
> > > > Computer Science Graduate Student
> > > > University of California, Santa Cruz
> > > >
> > > >
> > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana <amansk@gmail.com>
> > > > wrote:
> > > >
> > > > > Assuming that the job is purely in Java and not involving streaming
> > > > > or pipes, wouldn't the resources (files) required by the job as
> > > > > inputs be known beforehand? So, if the map task is accessing a second
> > > > > file, how does that make it different, except that there are multiple
> > > > > files? The JobTracker would know beforehand that multiple files would
> > > > > be accessed. Right?
> > > > >
> > > > > I am slightly confused why you have mentioned this case separately...
> > > > > Can you elaborate on it a little bit?
> > > > >
> > > > > Amandeep
> > > > >
> > > > >
> > > > > Amandeep Khurana
> > > > > Computer Science Graduate Student
> > > > > University of California, Santa Cruz
> > > > >
> > > > >
> > > > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia <matei@cloudera.com>
> > > > > wrote:
> > > > >
> > > > >> Typically the data flow is like this:
> > > > >> 1) Client submits a job description to the JobTracker.
> > > > >> 2) JobTracker figures out block locations for the input file(s) by
> > > > >> talking to HDFS NameNode.
> > > > >> 3) JobTracker creates a job description file in HDFS which will be
> > > > >> read by the nodes to copy over the job's code etc.
> > > > >> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the
> > > > >> appropriate data blocks.
> > > > >> 5) After running, maps create intermediate output files on those
> > > > >> slaves. These are not in HDFS, they're in some temporary storage
> > > > >> used by MapReduce.
> > > > >> 6) JobTracker starts reduces on a series of slaves, which copy over
> > > > >> the appropriate map outputs, apply the reduce function, and write
> > > > >> the outputs to HDFS (one output file per reducer).
> > > > >> 7) Some logs for the job may also be put into HDFS by the JobTracker.
> > > > >>
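As an illustration of step 1 above, the client's job description is a JobConf handed to the JobTracker through JobClient. A minimal sketch with the old org.apache.hadoop.mapred API; the class name, job name, and paths are placeholders:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class SubmitJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitJob.class);
        conf.setJobName("example-pass-through");      // placeholder job name

        conf.setMapperClass(IdentityMapper.class);    // pass records straight through
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);   // TextInputFormat yields (offset, line)
        conf.setOutputValueClass(Text.class);

        // These input paths are what the JobTracker uses in step 2: it asks
        // the NameNode for their block locations and schedules map tasks
        // near the data.
        FileInputFormat.setInputPaths(conf, new Path("/data/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/data/output"));

        // Step 1: submit the job description to the JobTracker and wait.
        JobClient.runJob(conf);
      }
    }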
> > > > >> However, there is a big caveat, which is that the map and reduce
> > > > >> tasks run arbitrary code. It is not unusual to have a map that opens
> > > > >> a second HDFS file to read some information (e.g. for doing a join
> > > > >> of a small table against a big file). If you use Hadoop Streaming or
> > > > >> Pipes to write a job in Python, Ruby, C, etc, then you are launching
> > > > >> arbitrary processes which may also access external resources in this
> > > > >> manner. Some people also read/write to DBs (e.g. MySQL) from their
> > > > >> tasks. A comprehensive security solution would ideally deal with
> > > > >> these cases too.
> > > > >>
> > > > >> On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana <amansk@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > A quick question here. How does a typical hadoop job work at the
> > > > >> > system level? What are the various interactions and how does the
> > > > >> > data flow?
> > > > >> >
> > > > >> > Amandeep
> > > > >> >
> > > > >> >
> > > > >> > Amandeep Khurana
> > > > >> > Computer Science Graduate Student
> > > > >> > University of California, Santa Cruz
> > > > >> >
> > > > >> >
> > > > >> > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana <amansk@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Thanks Matei. If the basic architecture is similar to the Google
> > > > >> > > stuff, I can safely just work on the project using the
> > > > >> > > information from the papers.
> > > > >> > >
> > > > >> > > I am aware of the 4487 jira and the current status of the
> > > > >> > > permissions mechanism. I had a look at them earlier.
> > > > >> > >
> > > > >> > > Cheers
> > > > >> > > Amandeep
> > > > >> > >
> > > > >> > >
> > > > >> > > Amandeep Khurana
> > > > >> > > Computer Science Graduate Student
> > > > >> > > University of California, Santa Cruz
> > > > >> > >
> > > > >> > >
> > > > >> > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia <matei@cloudera.com>
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > >> Forgot to add, this JIRA details the latest security features
> > > > >> > >> that are being worked on in Hadoop trunk:
> > > > >> > >> https://issues.apache.org/jira/browse/HADOOP-4487.
> > > > >> > >> This document describes the current status and limitations of
> > > > >> > >> the permissions mechanism:
> > > > >> > >> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.
> > > > >> > >>
> > > > >> > >> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia <matei@cloudera.com>
> > > > >> > >> wrote:
> > > > >> > >>
> > > > >> > >> > I think it's safe to assume that Hadoop works like
> > > > >> > >> > MapReduce/GFS at the level described in those papers. In
> > > > >> > >> > particular, in HDFS, there is a master node containing
> > > > >> > >> > metadata and a number of slave nodes (datanodes) containing
> > > > >> > >> > blocks, as in GFS. Clients start by talking to the master to
> > > > >> > >> > list directories, etc. When they want to read a region of
> > > > >> > >> > some file, they tell the master the filename and offset, and
> > > > >> > >> > they receive a list of block locations (datanodes). They then
> > > > >> > >> > contact the individual datanodes to read the blocks. When
> > > > >> > >> > clients write a file, they first obtain a new block ID and
> > > > >> > >> > list of nodes to write it to from the master, then contact
> > > > >> > >> > the datanodes to write it (actually, the datanodes pipeline
> > > > >> > >> > the write as in GFS) and report when the write is complete.
> > > > >> > >> > HDFS actually has some security mechanisms built in,
> > > > >> > >> > authenticating users based on their Unix ID and providing
> > > > >> > >> > Unix-like file permissions. I don't know much about how these
> > > > >> > >> > are implemented, but they would be a good place to start
> > > > >> > >> > looking.
> > > > >> > >> >
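The read and write paths described above are hidden behind the HDFS client library: FSDataInputStream and FSDataOutputStream do the NameNode lookups and DataNode transfers internally. A minimal sketch; the path and contents are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadWriteExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Write path: under the hood the client gets a new block ID and a
        // pipeline of DataNodes from the NameNode, then streams bytes to them.
        Path file = new Path("/tmp/example.txt");  // placeholder path
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("hello hdfs");
        out.close();

        // Read path: the client asks the NameNode for block locations for the
        // requested range, then fetches the data from the DataNodes directly.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();
      }
    }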
> > > > >> > >> > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana <amansk@gmail.com>
> > > > >> > >> > wrote:
> > > > >> > >> >
> > > > >> > >> >> Thanks Matei
> > > > >> > >> >>
> > > > >> > >> >> I had gone through the architecture document online. I am
> > > > >> > >> >> currently working on a project towards security in Hadoop.
> > > > >> > >> >> I do know how the data moves around in GFS, but I wasn't
> > > > >> > >> >> sure how much of that HDFS follows and how different it is
> > > > >> > >> >> from GFS. Can you throw some light on that?
> > > > >> > >> >>
> > > > >> > >> >> Security would also involve the MapReduce jobs following the
> > > > >> > >> >> same protocols. That's why the question about how the Hadoop
> > > > >> > >> >> framework integrates with HDFS, and how different it is from
> > > > >> > >> >> MapReduce and GFS. The GFS and MapReduce papers give good
> > > > >> > >> >> information on how those systems are designed, but there is
> > > > >> > >> >> nothing that concrete for Hadoop that I have been able to
> > > > >> > >> >> find.
> > > > >> > >> >>
> > > > >> > >> >> Amandeep
> > > > >> > >> >>
> > > > >> > >> >>
> > > > >> > >> >> Amandeep Khurana
> > > > >> > >> >> Computer Science Graduate Student
> > > > >> > >> >> University of California, Santa Cruz
> > > > >> > >> >>
> > > > >> > >> >>
> > > > >> > >> >> On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia <matei@cloudera.com>
> > > > >> > >> >> wrote:
> > > > >> > >> >>
> > > > >> > >> >> > Hi Amandeep,
> > > > >> > >> >> > Hadoop is definitely inspired by MapReduce/GFS and aims to
> > > > >> > >> >> > provide those capabilities as an open-source project. HDFS
> > > > >> > >> >> > is similar to GFS (large blocks, replication, etc); some
> > > > >> > >> >> > notable things missing are read-write support in the
> > > > >> > >> >> > middle of a file (unlikely to be provided because few
> > > > >> > >> >> > Hadoop applications require it) and multiple appenders
> > > > >> > >> >> > (the record append operation). You can read about HDFS
> > > > >> > >> >> > architecture at
> > > > >> > >> >> > http://hadoop.apache.org/core/docs/current/hdfs_design.html.
> > > > >> > >> >> > The MapReduce part of Hadoop interacts with HDFS in the
> > > > >> > >> >> > same way that Google's MapReduce interacts with GFS
> > > > >> > >> >> > (shipping computation to the data), although Hadoop
> > > > >> > >> >> > MapReduce also supports running over other distributed
> > > > >> > >> >> > filesystems.
> > > > >> > >> >> >
> > > > >> > >> >> > Matei
> > > > >> > >> >> >
> > > > >> > >> >> > On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana <amansk@gmail.com>
> > > > >> > >> >> > wrote:
> > > > >> > >> >> >
> > > > >> > >> >> > > Hi
> > > > >> > >> >> > >
> > > > >> > >> >> > > Is the HDFS architecture completely based on the Google
> > > > >> > >> >> > > Filesystem? If it isn't, what are the differences
> > > > >> > >> >> > > between the two?
> > > > >> > >> >> > >
> > > > >> > >> >> > > Secondly, is the coupling between Hadoop and HDFS the
> > > > >> > >> >> > > same as it is between Google's version of MapReduce and
> > > > >> > >> >> > > GFS?
> > > > >> > >> >> > >
> > > > >> > >> >> > > Amandeep
> > > > >> > >> >> > >
> > > > >> > >> >> > >
> > > > >> > >> >> > > Amandeep Khurana
> > > > >> > >> >> > > Computer Science Graduate Student
> > > > >> > >> >> > > University of California, Santa Cruz
> > > > >> > >> >> > >
> > > > >> > >> >> >
> > > > >> > >> >>
> > > > >> > >> >
> > > > >> > >> >
> > > > >> > >>
> > > > >> > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
