hadoop-common-dev mailing list archives

From Matei Zaharia <ma...@cloudera.com>
Subject Re: HDFS architecture based on GFS?
Date Mon, 16 Feb 2009 03:20:39 GMT
Normally, HDFS files are accessed through the namenode. If there were a
malicious process, though, I imagine it could talk to a datanode directly
and request a specific block.
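That metadata lookup is also visible through the public FileSystem API; below is a minimal sketch of asking the namenode which datanodes hold a file's blocks (the command-line path and the output format are just for illustration):

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads fs.default.name etc.
        FileSystem fs = FileSystem.get(conf);       // client handle to HDFS
        Path file = new Path(args[0]);              // e.g. /user/amandeep/input.txt

        // The namenode answers this with the block layout of the file and
        // the datanodes holding each block.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          System.out.println(b.getOffset() + "+" + b.getLength()
              + " on " + Arrays.toString(b.getHosts()));
        }
      }
    }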

On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana <amansk@gmail.com> wrote:

> Ok. Got it.
>
> Now, when my job needs to access another file, does it go to the NameNode
> to get the block ids? How does the Java process know where the files are
> and how to access them?
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia <matei@cloudera.com> wrote:
>
> > I mentioned this case because even jobs written in Java can use the HDFS
> > API to talk to the NameNode and access the filesystem. People often do
> > this because their job needs to read a config file, some small data
> > table, etc, and use this information in its map or reduce functions. In
> > this case, you open the second file separately in your mapper's init
> > function and read whatever you need from it. In general I wanted to
> > point out that you can't know which files a job will access unless you
> > look at its source code or monitor the calls it makes; the input file(s)
> > you provide in the job description are a hint to the MapReduce framework
> > to place your job on certain nodes, but it's reasonable for the job to
> > access other files as well.
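A minimal sketch of that "open a second file in the mapper's init function" pattern, using the classic org.apache.hadoop.mapred API; the side-file path and its tab-separated layout are assumptions made for illustration:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class JoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private final Map<String, String> lookup = new HashMap<String, String>();

      @Override
      public void configure(JobConf job) {
        // Open a second HDFS file (not part of the job's declared input)
        // and load it into memory before any map() calls run.
        try {
          FileSystem fs = FileSystem.get(job);
          Path side = new Path("/data/small_table.txt");   // hypothetical side file
          BufferedReader in =
              new BufferedReader(new InputStreamReader(fs.open(side)));
          String line;
          while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2);          // assumed key<TAB>value rows
            if (parts.length == 2) {
              lookup.put(parts[0], parts[1]);
            }
          }
          in.close();
        } catch (IOException e) {
          throw new RuntimeException("could not read side file", e);
        }
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        // Join each input record against the in-memory side table.
        String[] fields = value.toString().split("\t", 2);
        String joined = lookup.get(fields[0]);
        if (joined != null) {
          out.collect(new Text(fields[0]), new Text(joined));
        }
      }
    }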
> >
> > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana <amansk@gmail.com>
> > wrote:
> >
> > > Another question that I have here: when the jobs run arbitrary code
> > > and access data from HDFS, do they go to the namenode to get the block
> > > information?
> > >
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> > >
> > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana <amansk@gmail.com>
> > > wrote:
> > >
> > > > Assuming that the job is purely in Java and not involving streaming or
> > > > pipes, wouldn't the resources (files) required by the job as inputs be
> > > > known beforehand? So, if the map task is accessing a second file, how
> > > > is that different, other than there being multiple files? The
> > > > JobTracker would know beforehand that multiple files would be
> > > > accessed. Right?
> > > >
> > > > I am slightly confused why you have mentioned this case separately...
> > > > Can you elaborate on it a little bit?
> > > >
> > > > Amandeep
> > > >
> > > >
> > > > Amandeep Khurana
> > > > Computer Science Graduate Student
> > > > University of California, Santa Cruz
> > > >
> > > >
> > > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia <matei@cloudera.com>
> > > > wrote:
> > > >
> > > >> Typically the data flow is like this:
> > > >> 1) Client submits a job description to the JobTracker.
> > > >> 2) JobTracker figures out block locations for the input file(s) by
> > > >> talking to HDFS NameNode.
> > > >> 3) JobTracker creates a job description file in HDFS which will be
> > > >> read by the nodes to copy over the job's code etc.
> > > >> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the
> > > >> appropriate data blocks.
> > > >> 5) After running, maps create intermediate output files on those
> > > >> slaves. These are not in HDFS, they're in some temporary storage used
> > > >> by MapReduce.
> > > >> 6) JobTracker starts reduces on a series of slaves, which copy over
> > > >> the appropriate map outputs, apply the reduce function, and write the
> > > >> outputs to HDFS (one output file per reducer).
> > > >> 7) Some logs for the job may also be put into HDFS by the JobTracker.
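A client-side sketch of step 1, reusing the JoinMapper sketched earlier and assuming the classic org.apache.hadoop.mapred API; the job name and command-line paths are placeholders:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class SubmitJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitJob.class);
        conf.setJobName("join-example");

        // The declared input is the placement hint used in steps 2 and 4;
        // the mapper may still open other HDFS files on its own.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(JoinMapper.class);        // sketched earlier
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // Step 1: ship the job description to the JobTracker and wait.
        JobClient.runJob(conf);
      }
    }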
> > > >>
> > > >> However, there is a big caveat, which is that the map and reduce
> > > >> tasks run arbitrary code. It is not unusual to have a map that opens
> > > >> a second HDFS file to read some information (e.g. for doing a join of
> > > >> a small table against a big file). If you use Hadoop Streaming or
> > > >> Pipes to write a job in Python, Ruby, C, etc, then you are launching
> > > >> arbitrary processes which may also access external resources in this
> > > >> manner. Some people also read/write to DBs (e.g. MySQL) from their
> > > >> tasks. A comprehensive security solution would ideally deal with
> > > >> these cases too.
> > > >>
> > > >> On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana <amansk@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > A quick question here. How does a typical Hadoop job work at the
> > > >> > system level? What are the various interactions and how does the
> > > >> > data flow?
> > > >> >
> > > >> > Amandeep
> > > >> >
> > > >> >
> > > >> > Amandeep Khurana
> > > >> > Computer Science Graduate Student
> > > >> > University of California, Santa Cruz
> > > >> >
> > > >> >
> > > >> > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana <amansk@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> > > Thanks Matei. If the basic architecture is similar to the Google
> > > >> > > stuff, I can safely just work on the project using the
> > > >> > > information from the papers.
> > > >> > >
> > > >> > > I am aware of the 4487 jira and the current status of the
> > > >> > > permissions mechanism. I had a look at them earlier.
> > > >> > >
> > > >> > > Cheers
> > > >> > > Amandeep
> > > >> > >
> > > >> > >
> > > >> > > Amandeep Khurana
> > > >> > > Computer Science Graduate Student
> > > >> > > University of California, Santa Cruz
> > > >> > >
> > > >> > >
> > > >> > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia <matei@cloudera.com>
> > > >> > > wrote:
> > > >> > >
> > > >> > >> Forgot to add, this JIRA details the latest security features
> > > >> > >> that are being worked on in Hadoop trunk:
> > > >> > >> https://issues.apache.org/jira/browse/HADOOP-4487.
> > > >> > >> This document describes the current status and limitations of
> > > >> > >> the permissions mechanism:
> > > >> > >> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.
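As a small illustration of the permissions model that guide describes, a file's owner, group, and mode can be inspected and changed through the FileSystem API; the path, user, and group below are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class PermissionsDemo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/user/amandeep/report.txt");    // hypothetical path

        // Print the owner, group, and mode the namenode has recorded.
        FileStatus st = fs.getFileStatus(p);
        System.out.println(st.getOwner() + ":" + st.getGroup()
            + " " + st.getPermission());

        // Tighten the mode and change ownership (needs sufficient rights).
        fs.setPermission(p, new FsPermission((short) 0640));
        fs.setOwner(p, "amandeep", "students");            // hypothetical user/group
      }
    }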
> > > >> > >>
> > > >> > >> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia <matei@cloudera.com>
> > > >> > >> wrote:
> > > >> > >>
> > > >> > >> > I think it's safe to assume that Hadoop works like MapReduce/GFS
> > > >> > >> > at the level described in those papers. In particular, in HDFS,
> > > >> > >> > there is a master node containing metadata and a number of slave
> > > >> > >> > nodes (datanodes) containing blocks, as in GFS. Clients start by
> > > >> > >> > talking to the master to list directories, etc. When they want
> > > >> > >> > to read a region of some file, they tell the master the filename
> > > >> > >> > and offset, and they receive a list of block locations
> > > >> > >> > (datanodes). They then contact the individual datanodes to read
> > > >> > >> > the blocks. When clients write a file, they first obtain a new
> > > >> > >> > block ID and list of nodes to write it to from the master, then
> > > >> > >> > contact the datanodes to write it (actually, the datanodes
> > > >> > >> > pipeline the write as in GFS) and report when the write is
> > > >> > >> > complete. HDFS actually has some security mechanisms built in,
> > > >> > >> > authenticating users based on their Unix ID and providing
> > > >> > >> > Unix-like file permissions. I don't know much about how these
> > > >> > >> > are implemented, but they would be a good place to start
> > > >> > >> > looking.
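From the client's point of view, that whole read/write exchange is hidden behind the FileSystem API; a minimal sketch under that assumption (the path is made up, and the block-level traffic with the namenode and datanodes happens inside these calls):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadWriteDemo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/tmp/demo.txt");   // hypothetical path

        // Write: the client gets new blocks and a datanode pipeline from
        // the master, then streams the bytes to the datanodes.
        FSDataOutputStream out = fs.create(p);
        out.writeBytes("hello hdfs\n");
        out.close();                          // completes the write

        // Read: the client gets block locations from the master, then
        // reads the block contents directly from the datanodes.
        FSDataInputStream in = fs.open(p);
        byte[] buf = new byte[1024];
        int n = in.read(buf);
        System.out.print(new String(buf, 0, n));
        in.close();
      }
    }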
> > > >> > >> >
> > > >> > >> > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana <amansk@gmail.com>
> > > >> > >> > wrote:
> > > >> > >> >
> > > >> > >> >> Thanks Matei
> > > >> > >> >>
> > > >> > >> >> I had gone through the architecture document online. I am
> > > >> > >> >> currently working on a project towards Security in Hadoop. I
> > > >> > >> >> do know how the data moves around in GFS but wasn't sure how
> > > >> > >> >> much of that HDFS follows and how different it is from GFS.
> > > >> > >> >> Can you throw some light on that?
> > > >> > >> >>
> > > >> > >> >> Security would also involve the MapReduce jobs following the
> > > >> > >> >> same protocols. That's why I asked how the Hadoop framework
> > > >> > >> >> integrates with HDFS, and how different it is from MapReduce
> > > >> > >> >> and GFS. The GFS and MapReduce papers give good information
> > > >> > >> >> on how those systems are designed, but there is nothing that
> > > >> > >> >> concrete for Hadoop that I have been able to find.
> > > >> > >> >>
> > > >> > >> >> Amandeep
> > > >> > >> >>
> > > >> > >> >>
> > > >> > >> >> Amandeep Khurana
> > > >> > >> >> Computer Science Graduate Student
> > > >> > >> >> University of California, Santa Cruz
> > > >> > >> >>
> > > >> > >> >>
> > > >> > >> >> On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia <matei@cloudera.com>
> > > >> > >> >> wrote:
> > > >> > >> >>
> > > >> > >> >> > Hi Amandeep,
> > > >> > >> >> > Hadoop is definitely inspired by MapReduce/GFS and aims to
> > > >> > >> >> > provide those capabilities as an open-source project. HDFS
> > > >> > >> >> > is similar to GFS (large blocks, replication, etc); some
> > > >> > >> >> > notable things missing are read-write support in the middle
> > > >> > >> >> > of a file (unlikely to be provided because few Hadoop
> > > >> > >> >> > applications require it) and multiple appenders (the record
> > > >> > >> >> > append operation). You can read about HDFS architecture at
> > > >> > >> >> > http://hadoop.apache.org/core/docs/current/hdfs_design.html.
> > > >> > >> >> > The MapReduce part of Hadoop interacts with HDFS in the
> > > >> > >> >> > same way that Google's MapReduce interacts with GFS
> > > >> > >> >> > (shipping computation to the data), although Hadoop
> > > >> > >> >> > MapReduce also supports running over other distributed
> > > >> > >> >> > filesystems.
> > > >> > >> >> >
> > > >> > >> >> > Matei
> > > >> > >> >> >
> > > >> > >> >> > On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana <amansk@gmail.com>
> > > >> > >> >> > wrote:
> > > >> > >> >> >
> > > >> > >> >> > > Hi
> > > >> > >> >> > >
> > > >> > >> >> > > Is the HDFS architecture completely based on the Google
> > > >> > >> >> > > Filesystem? If it isn't, what are the differences between
> > > >> > >> >> > > the two?
> > > >> > >> >> > >
> > > >> > >> >> > > Secondly, is the coupling between Hadoop and HDFS the
> > > >> > >> >> > > same as between Google's version of MapReduce and GFS?
> > > >> > >> >> > >
> > > >> > >> >> > > Amandeep
> > > >> > >> >> > >
> > > >> > >> >> > >
> > > >> > >> >> > > Amandeep Khurana
> > > >> > >> >> > > Computer Science Graduate Student
> > > >> > >> >> > > University of California, Santa Cruz
> > > >> > >> >> > >
> > > >> > >> >> >
> > > >> > >> >>
> > > >> > >> >
> > > >> > >> >
> > > >> > >>
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>
