hadoop-common-dev mailing list archives

From: Matei Zaharia <ma...@cloudera.com>
Subject: Re: HDFS architecture based on GFS?
Date: Mon, 16 Feb 2009 05:37:24 GMT
In general, yeah, the scripts can access any resource they want (within the
permissions of the user that the task runs as). It's also possible to access
HDFS from scripts because HDFS provides a FUSE interface that can make it
look like a regular file system on the machine. (The FUSE module in turn
talks to the namenode as a regular HDFS client.)
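
To illustrate: once HDFS is mounted through FUSE, reading it is just
ordinary file I/O. A minimal sketch in Java (the /mnt/hdfs mount point and
the path are hypothetical; any language's file API would look the same):

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class FuseReadExample {
      public static void main(String[] args) throws Exception {
        // Assumes HDFS has been mounted at /mnt/hdfs via the FUSE module
        // (hypothetical mount point). This is plain local-file I/O; the
        // FUSE layer forwards it to the namenode/datanodes underneath.
        BufferedReader in = new BufferedReader(
            new FileReader("/mnt/hdfs/user/amandeep/data.txt"));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line);
        }
        in.close();
      }
    }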

On Sun, Feb 15, 2009 at 8:43 PM, Amandeep Khurana <amansk@gmail.com> wrote:

> I don't know much about Hadoop Streaming and have a quick question here.
>
> The snippets of code/programs that you attach to the MapReduce job might
> want to access outside resources (like you mentioned). Now these might not
> need to go to the namenode, right? For example, a Python script. How would
> it access the data? Would it ask the parent Java process (in the
> tasktracker) to get the data, or would it go and do stuff on its own?
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Sun, Feb 15, 2009 at 8:23 PM, Matei Zaharia <matei@cloudera.com> wrote:
>
> > Nope, typically the JobTracker just starts the process, and the
> > tasktracker talks directly to the namenode to get a pointer to the
> > datanode, and then directly to the datanode.
> >
> > On Sun, Feb 15, 2009 at 8:07 PM, Amandeep Khurana <amansk@gmail.com>
> > wrote:
> >
> > > Alright.. Got it.
> > >
> > > Now, do the task trackers talk to the namenode and the datanode
> > > directly, or do they go through the job tracker for it? So, if my code
> > > is such that I need to access more files from HDFS, would the job
> > > tracker get involved or not?
> > >
> > >
> > >
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> > >
> > > On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia <matei@cloudera.com>
> > > wrote:
> > >
> > > > Normally, HDFS files are accessed through the namenode. If there were
> > > > a malicious process, though, then I imagine it could talk to a
> > > > datanode directly and request a specific block.
> > > >
> > > > On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana <amansk@gmail.com>
> > > > wrote:
> > > >
> > > > > Ok. Got it.
> > > > >
> > > > > Now, when my job needs to access another file, does it go to the
> > > > > NameNode to get the block IDs? How does the Java process know where
> > > > > the files are and how to access them?
> > > > >
> > > > >
> > > > > Amandeep Khurana
> > > > > Computer Science Graduate Student
> > > > > University of California, Santa Cruz
> > > > >
> > > > >
> > > > > On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia <matei@cloudera.com>
> > > > > wrote:
> > > > >
> > > > > > I mentioned this case because even jobs written in Java can use
> > > > > > the HDFS API to talk to the NameNode and access the filesystem.
> > > > > > People often do this because their job needs to read a config
> > > > > > file, some small data table, etc., and use this information in its
> > > > > > map or reduce functions. In this case, you open the second file
> > > > > > separately in your mapper's init function and read whatever you
> > > > > > need from it. In general I wanted to point out that you can't know
> > > > > > which files a job will access unless you look at its source code
> > > > > > or monitor the calls it makes; the input file(s) you provide in
> > > > > > the job description are a hint to the MapReduce framework to place
> > > > > > your job on certain nodes, but it's reasonable for the job to
> > > > > > access other files as well.
> > > > > >
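
As a concrete illustration of the side-file pattern Matei describes above,
here is a minimal sketch against the old org.apache.hadoop.mapred API (the
lookup-file path, class name, and tab-separated format are hypothetical):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class JoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private final Map<String, String> lookup =
          new HashMap<String, String>();

      // The "init function": called once per task, before any map() calls.
      public void configure(JobConf conf) {
        try {
          FileSystem fs = FileSystem.get(conf); // talks to the NameNode
          Path side = new Path("/data/lookup.txt"); // hypothetical side file
          BufferedReader in =
              new BufferedReader(new InputStreamReader(fs.open(side)));
          String line;
          while ((line = in.readLine()) != null) {
            String[] kv = line.split("\t", 2);
            if (kv.length == 2) lookup.put(kv[0], kv[1]);
          }
          in.close();
        } catch (IOException e) {
          throw new RuntimeException("could not read side file", e);
        }
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        String[] kv = value.toString().split("\t", 2);
        String joined = lookup.get(kv[0]); // join against the small table
        if (joined != null) out.collect(new Text(kv[0]), new Text(joined));
      }
    }
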
> > > > > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana <amansk@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Another question that I have here: when the jobs run arbitrary
> > > > > > > code and access data from HDFS, do they go to the namenode to
> > > > > > > get the block information?
> > > > > > >
> > > > > > >
> > > > > > > Amandeep Khurana
> > > > > > > Computer Science Graduate Student
> > > > > > > University of California, Santa Cruz
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana <amansk@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Assuming that the job is purely in Java and not involving
> > > > > > > > streaming or pipes, wouldn't the resources (files) required by
> > > > > > > > the job as inputs be known beforehand? So, if the map task is
> > > > > > > > accessing a second file, how does that make it different,
> > > > > > > > except that there are multiple files? The JobTracker would
> > > > > > > > know beforehand that multiple files would be accessed. Right?
> > > > > > > >
> > > > > > > > I am slightly confused why you have mentioned this case
> > > > > > > > separately... Can you elaborate on it a little bit?
> > > > > > > >
> > > > > > > > Amandeep
> > > > > > > >
> > > > > > > >
> > > > > > > > Amandeep Khurana
> > > > > > > > Computer Science Graduate Student
> > > > > > > > University of California, Santa Cruz
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia <matei@cloudera.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Typically the data flow is like this:
> > > > > > > >> 1) Client submits a job description to the JobTracker.
> > > > > > > >> 2) JobTracker figures out block locations for the input
> > > > > > > >> file(s) by talking to the HDFS NameNode.
> > > > > > > >> 3) JobTracker creates a job description file in HDFS which
> > > > > > > >> will be read by the nodes to copy over the job's code etc.
> > > > > > > >> 4) JobTracker starts map tasks on the slaves (TaskTrackers)
> > > > > > > >> with the appropriate data blocks.
> > > > > > > >> 5) After running, maps create intermediate output files on
> > > > > > > >> those slaves. These are not in HDFS; they're in some temporary
> > > > > > > >> storage used by MapReduce.
> > > > > > > >> 6) JobTracker starts reduces on a series of slaves, which copy
> > > > > > > >> over the appropriate map outputs, apply the reduce function,
> > > > > > > >> and write the outputs to HDFS (one output file per reducer).
> > > > > > > >> 7) Some logs for the job may also be put into HDFS by the
> > > > > > > >> JobTracker.
> > > > > > > >>
> > > > > > > >> However, there is a big caveat, which is that the map and
> > > > > > > >> reduce tasks run arbitrary code. It is not unusual to have a
> > > > > > > >> map that opens a second HDFS file to read some information
> > > > > > > >> (e.g. for doing a join of a small table against a big file).
> > > > > > > >> If you use Hadoop Streaming or Pipes to write a job in Python,
> > > > > > > >> Ruby, C, etc., then you are launching arbitrary processes
> > > > > > > >> which may also access external resources in this manner. Some
> > > > > > > >> people also read/write to DBs (e.g. MySQL) from their tasks. A
> > > > > > > >> comprehensive security solution would ideally deal with these
> > > > > > > >> cases too.
> > > > > > > >>
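
To illustrate that last point: a task is an ordinary process, so nothing in
the framework stops code like the following hypothetical sketch, which
reaches a MySQL server directly over JDBC (the URL, table, and credentials
are made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // A map or reduce task runs in a normal JVM, so it can open arbitrary
    // network connections, e.g. to a database, bypassing HDFS entirely.
    public class DbTouchingTask {
      public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver"); // register the MySQL driver
        Connection db = DriverManager.getConnection(
            "jdbc:mysql://db.example.com/stats", "user", "secret");
        Statement st = db.createStatement();
        ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM events");
        if (rs.next()) System.out.println(rs.getInt(1));
        db.close();
      }
    }
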
> > > > > > > >> On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana <amansk@gmail.com>
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >> > A quick question here. How does a typical Hadoop job work
> > > > > > > >> > at the system level? What are the various interactions, and
> > > > > > > >> > how does the data flow?
> > > > > > > >> >
> > > > > > > >> > Amandeep
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > Amandeep Khurana
> > > > > > > >> > Computer Science Graduate Student
> > > > > > > >> > University of California, Santa Cruz
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana <amansk@gmail.com>
> > > > > > > >> > wrote:
> > > > > > > >> >
> > > > > > > >> > > Thanks Matei. If the basic architecture is similar to the
> > > > > > > >> > > Google stuff, I can safely just work on the project using
> > > > > > > >> > > the information from the papers.
> > > > > > > >> > >
> > > > > > > >> > > I am aware of the HADOOP-4487 JIRA and the current status
> > > > > > > >> > > of the permissions mechanism. I had a look at them
> > > > > > > >> > > earlier.
> > > > > > > >> > >
> > > > > > > >> > > Cheers
> > > > > > > >> > > Amandeep
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > Amandeep Khurana
> > > > > > > >> > > Computer Science Graduate Student
> > > > > > > >> > > University of California, Santa Cruz
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia <matei@cloudera.com>
> > > > > > > >> > > wrote:
> > > > > > > >> > >
> > > > > > > >> > >> Forgot to add, this JIRA details the latest security
> > > > > > > >> > >> features that are being worked on in Hadoop trunk:
> > > > > > > >> > >> https://issues.apache.org/jira/browse/HADOOP-4487
> > > > > > > >> > >> This document describes the current status and
> > > > > > > >> > >> limitations of the permissions mechanism:
> > > > > > > >> > >> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html
> > > > > > > >> > >>
> > > > > > > >> > >> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia <matei@cloudera.com>
> > > > > > > >> > >> wrote:
> > > > > > > >> > >>
> > > > > > > >> > >> > I think it's safe to assume that Hadoop works like
> > > > > > > >> > >> > MapReduce/GFS at the level described in those papers.
> > > > > > > >> > >> > In particular, in HDFS, there is a master node
> > > > > > > >> > >> > containing metadata and a number of slave nodes
> > > > > > > >> > >> > (datanodes) containing blocks, as in GFS. Clients start
> > > > > > > >> > >> > by talking to the master to list directories, etc. When
> > > > > > > >> > >> > they want to read a region of some file, they tell the
> > > > > > > >> > >> > master the filename and offset, and they receive a list
> > > > > > > >> > >> > of block locations (datanodes). They then contact the
> > > > > > > >> > >> > individual datanodes to read the blocks. When clients
> > > > > > > >> > >> > write a file, they first obtain a new block ID and a
> > > > > > > >> > >> > list of nodes to write it to from the master, then
> > > > > > > >> > >> > contact the datanodes to write it (actually, the
> > > > > > > >> > >> > datanodes pipeline the write as in GFS) and report when
> > > > > > > >> > >> > the write is complete. HDFS actually has some security
> > > > > > > >> > >> > mechanisms built in, authenticating users based on
> > > > > > > >> > >> > their Unix ID and providing Unix-like file permissions.
> > > > > > > >> > >> > I don't know much about how these are implemented, but
> > > > > > > >> > >> > they would be a good place to start looking.
> > > > > > > >> > >> >
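
For illustration, here is a minimal sketch of that read path from a
client's perspective; the FileSystem API hides the master/datanode
conversation entirely (the path is hypothetical):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up cluster config
        FileSystem fs = FileSystem.get(conf);     // client handle to HDFS
        // open() asks the NameNode for block locations; the returned stream
        // then reads the blocks directly from the DataNodes, as described
        // above.
        FSDataInputStream in =
            fs.open(new Path("/user/amandeep/input.txt")); // hypothetical
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        System.out.println(reader.readLine());
        reader.close();
      }
    }
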
> > > > > > > >> > >> > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana <amansk@gmail.com>
> > > > > > > >> > >> > wrote:
> > > > > > > >> > >> >
> > > > > > > >> > >> >> Thanks Matei
> > > > > > > >> > >> >>
> > > > > > > >> > >> >> I had gone through the architecture document online. I
> > > > > > > >> > >> >> am currently working on a project towards security in
> > > > > > > >> > >> >> Hadoop. I do know how the data moves around in GFS but
> > > > > > > >> > >> >> wasn't sure how much of that HDFS follows and how
> > > > > > > >> > >> >> different it is from GFS. Can you throw some light on
> > > > > > > >> > >> >> that?
> > > > > > > >> > >> >>
> > > > > > > >> > >> >> Security would also involve the MapReduce jobs
> > > > > > > >> > >> >> following the same protocols. That's why the question
> > > > > > > >> > >> >> about how the Hadoop framework integrates with HDFS,
> > > > > > > >> > >> >> and how different it is from MapReduce and GFS. The
> > > > > > > >> > >> >> GFS and MapReduce papers give good information on how
> > > > > > > >> > >> >> those systems are designed, but there is nothing that
> > > > > > > >> > >> >> concrete for Hadoop that I have been able to find.
> > > > > > > >> > >> >>
> > > > > > > >> > >> >> Amandeep
> > > > > > > >> > >> >>
> > > > > > > >> > >> >>
> > > > > > > >> > >> >> Amandeep Khurana
> > > > > > > >> > >> >> Computer Science Graduate Student
> > > > > > > >> > >> >> University of California, Santa Cruz
> > > > > > > >> > >> >>
> > > > > > > >> > >> >>
> > > > > > > >> > >> >> On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia <matei@cloudera.com>
> > > > > > > >> > >> >> wrote:
> > > > > > > >> > >> >>
> > > > > > > >> > >> >> > Hi Amandeep,
> > > > > > > >> > >> >> > Hadoop is definitely inspired by MapReduce/GFS and
> > > > > > > >> > >> >> > aims to provide those capabilities as an open-source
> > > > > > > >> > >> >> > project. HDFS is similar to GFS (large blocks,
> > > > > > > >> > >> >> > replication, etc.); some notable things missing are
> > > > > > > >> > >> >> > read-write support in the middle of a file (unlikely
> > > > > > > >> > >> >> > to be provided because few Hadoop applications
> > > > > > > >> > >> >> > require it) and multiple appenders (the record
> > > > > > > >> > >> >> > append operation). You can read about the HDFS
> > > > > > > >> > >> >> > architecture at
> > > > > > > >> > >> >> > http://hadoop.apache.org/core/docs/current/hdfs_design.html
> > > > > > > >> > >> >> > The MapReduce part of Hadoop interacts with HDFS in
> > > > > > > >> > >> >> > the same way that Google's MapReduce interacts with
> > > > > > > >> > >> >> > GFS (shipping computation to the data), although
> > > > > > > >> > >> >> > Hadoop MapReduce also supports running over other
> > > > > > > >> > >> >> > distributed filesystems.
> > > > > > > >> > >> >> >
> > > > > > > >> > >> >> > Matei
> > > > > > > >> > >> >> >
> > > > > > > >> > >> >> > On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana <amansk@gmail.com>
> > > > > > > >> > >> >> > wrote:
> > > > > > > >> > >> >> >
> > > > > > > >> > >> >> > > Hi
> > > > > > > >> > >> >> > >
> > > > > > > >> > >> >> > > Is the HDFS architecture completely based on the
> > > > > > > >> > >> >> > > Google File System? If it isn't, what are the
> > > > > > > >> > >> >> > > differences between the two?
> > > > > > > >> > >> >> > >
> > > > > > > >> > >> >> > > Secondly, is the coupling between Hadoop and HDFS
> > > > > > > >> > >> >> > > the same as between Google's version of MapReduce
> > > > > > > >> > >> >> > > and GFS?
> > > > > > > >> > >> >> > >
> > > > > > > >> > >> >> > > Amandeep
> > > > > > > >> > >> >> > >
> > > > > > > >> > >> >> > >
> > > > > > > >> > >> >> > > Amandeep Khurana
> > > > > > > >> > >> >> > > Computer Science Graduate Student
> > > > > > > >> > >> >> > > University of California, Santa Cruz
> > > > > > > >> > >> >> > >
> > > > > > > >> > >> >> >
> > > > > > > >> > >> >>
> > > > > > > >> > >> >
> > > > > > > >> > >> >
> > > > > > > >> > >>
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
