hadoop-common-dev mailing list archives

From Amandeep Khurana <ama...@gmail.com>
Subject Re: HDFS architecture based on GFS?
Date Mon, 16 Feb 2009 03:15:42 GMT
Ok. Got it.

Now, when my job needs to access another file, does it go to the NameNode to
get the block IDs? How does the Java process know where the files are and
how to access them?


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia <matei@cloudera.com> wrote:

> I mentioned this case because even jobs written in Java can use the HDFS
> API to talk to the NameNode and access the filesystem. People often do
> this because their job needs to read a config file, some small data
> table, etc., and use this information in its map or reduce functions. In
> this case, you open the second file separately in your mapper's init
> function and read whatever you need from it. In general, I wanted to
> point out that you can't know which files a job will access unless you
> look at its source code or monitor the calls it makes; the input file(s)
> you provide in the job description are a hint to the MapReduce framework
> to place your job on certain nodes, but it's reasonable for the job to
> access other files as well.
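>
> For illustration, here's a minimal sketch of that pattern with the
> classic org.apache.hadoop.mapred API (the side-file path and the
> tab-separated format of its records are made up):
>
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStreamReader;
> import java.util.HashMap;
> import java.util.Map;
>
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.MapReduceBase;
> import org.apache.hadoop.mapred.Mapper;
> import org.apache.hadoop.mapred.OutputCollector;
> import org.apache.hadoop.mapred.Reporter;
>
> public class SideFileJoinMapper extends MapReduceBase
>     implements Mapper<LongWritable, Text, Text, Text> {
>
>   private final Map<String, String> lookup = new HashMap<String, String>();
>
>   // configure() is the mapper's init hook in the old API; we read the
>   // whole side table from HDFS into memory before any map() calls.
>   public void configure(JobConf conf) {
>     try {
>       FileSystem fs = FileSystem.get(conf);
>       BufferedReader in = new BufferedReader(new InputStreamReader(
>           fs.open(new Path("/data/lookup.txt")))); // hypothetical path
>       String line;
>       while ((line = in.readLine()) != null) {
>         String[] parts = line.split("\t", 2);
>         if (parts.length == 2) {
>           lookup.put(parts[0], parts[1]);
>         }
>       }
>       in.close();
>     } catch (IOException e) {
>       throw new RuntimeException("could not read side file", e);
>     }
>   }
>
>   public void map(LongWritable key, Text value,
>       OutputCollector<Text, Text> output, Reporter reporter)
>       throws IOException {
>     String id = value.toString().split("\t", 2)[0];
>     String extra = lookup.get(id);  // join against the in-memory table
>     if (extra != null) {
>       output.collect(new Text(id), new Text(extra));
>     }
>   }
> }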
>
> On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana <amansk@gmail.com>
> wrote:
>
> > Another question that I have here - When the jobs run arbitrary code
> > and access data from HDFS, do they go to the NameNode to get the block
> > information?
> >
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
> >
> > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana <amansk@gmail.com>
> > wrote:
> >
> > > Assuming that the job is purely in Java and not involving streaming or
> > > pipes, wouldn't the resources (files) required by the job as inputs be
> > > known beforehand? So, if the map task is accessing a second file, how
> > > does that make things different, except that there are multiple files?
> > > The JobTracker would know beforehand that multiple files would be
> > > accessed. Right?
> > >
> > > I am slightly confused about why you have mentioned this case
> > > separately... Can you elaborate on it a little bit?
> > >
> > > Amandeep
> > >
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> > >
> > > On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia <matei@cloudera.com>
> > wrote:
> > >
> > >> Typically the data flow is like this:
> > >> 1) Client submits a job description to the JobTracker.
> > >> 2) JobTracker figures out block locations for the input file(s) by
> > >> talking to the HDFS NameNode.
> > >> 3) JobTracker creates a job description file in HDFS which will be
> > >> read by the nodes to copy over the job's code etc.
> > >> 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the
> > >> appropriate data blocks.
> > >> 5) After running, maps create intermediate output files on those
> > >> slaves. These are not in HDFS; they're in some temporary storage used
> > >> by MapReduce.
> > >> 6) JobTracker starts reduces on a series of slaves, which copy over
> > >> the appropriate map outputs, apply the reduce function, and write the
> > >> outputs to HDFS (one output file per reducer).
> > >> 7) Some logs for the job may also be put into HDFS by the JobTracker.
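> > >>
> > >> As a rough sketch of what step 1 looks like from the client side,
> > >> assuming the classic org.apache.hadoop.mapred API (the job name and
> > >> paths are made up, and no mapper/reducer is set, so the identity
> > >> classes are used):
> > >>
> > >> import org.apache.hadoop.fs.Path;
> > >> import org.apache.hadoop.io.LongWritable;
> > >> import org.apache.hadoop.io.Text;
> > >> import org.apache.hadoop.mapred.FileInputFormat;
> > >> import org.apache.hadoop.mapred.FileOutputFormat;
> > >> import org.apache.hadoop.mapred.JobClient;
> > >> import org.apache.hadoop.mapred.JobConf;
> > >>
> > >> public class SubmitSketch {
> > >>   public static void main(String[] args) throws Exception {
> > >>     JobConf conf = new JobConf(SubmitSketch.class);
> > >>     conf.setJobName("flow-sketch");
> > >>     // The input paths are part of the job description; the
> > >>     // JobTracker uses them in step 2 to ask the NameNode where the
> > >>     // blocks live and place map tasks near the data (step 4).
> > >>     FileInputFormat.setInputPaths(conf, new Path("/user/foo/in"));
> > >>     FileOutputFormat.setOutputPath(conf, new Path("/user/foo/out"));
> > >>     // TextInputFormat is the default, so map input is
> > >>     // LongWritable/Text and the identity mapper/reducer pass it on.
> > >>     conf.setOutputKeyClass(LongWritable.class);
> > >>     conf.setOutputValueClass(Text.class);
> > >>     // runJob() ships the job's code and description into HDFS
> > >>     // (step 3) and polls the JobTracker until the job finishes.
> > >>     JobClient.runJob(conf);
> > >>   }
> > >> }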
> > >>
> > >> However, there is a big caveat, which is that the map and reduce
> > >> tasks run arbitrary code. It is not unusual to have a map that opens
> > >> a second HDFS file to read some information (e.g. for doing a join of
> > >> a small table against a big file). If you use Hadoop Streaming or
> > >> Pipes to write a job in Python, Ruby, C, etc., then you are launching
> > >> arbitrary processes which may also access external resources in this
> > >> manner. Some people also read/write to DBs (e.g. MySQL) from their
> > >> tasks. A comprehensive security solution would ideally deal with
> > >> these cases too.
> > >>
> > >> On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana <amansk@gmail.com>
> > >> wrote:
> > >>
> > >> > A quick question here. How does a typical Hadoop job work at the
> > >> > system level? What are the various interactions and how does the
> > >> > data flow?
> > >> >
> > >> > Amandeep
> > >> >
> > >> >
> > >> > Amandeep Khurana
> > >> > Computer Science Graduate Student
> > >> > University of California, Santa Cruz
> > >> >
> > >> >
> > >> > On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana <amansk@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Thanks Matei. If the basic architecture is similar to the Google
> > >> > > stuff, I can safely just work on the project using the
> > >> > > information from the papers.
> > >> > >
> > >> > > I am aware of the HADOOP-4487 JIRA and the current status of the
> > >> > > permissions mechanism. I had a look at them earlier.
> > >> > >
> > >> > > Cheers
> > >> > > Amandeep
> > >> > >
> > >> > >
> > >> > > Amandeep Khurana
> > >> > > Computer Science Graduate Student
> > >> > > University of California, Santa Cruz
> > >> > >
> > >> > >
> > >> > > On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia <matei@cloudera.com>
> > >> > > wrote:
> > >> > >
> > >> > >> Forgot to add, this JIRA details the latest security features
> > >> > >> that are being worked on in Hadoop trunk:
> > >> > >> https://issues.apache.org/jira/browse/HADOOP-4487.
> > >> > >> This document describes the current status and limitations of
> > >> > >> the permissions mechanism:
> > >> > >> http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.
> > >> > >>
> > >> > >> On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia <matei@cloudera.com>
> > >> > >> wrote:
> > >> > >>
> > >> > >> > I think it's safe to assume that Hadoop works like
> > >> > >> > MapReduce/GFS at the level described in those papers. In
> > >> > >> > particular, in HDFS, there is a master node containing
> > >> > >> > metadata and a number of slave nodes (datanodes) containing
> > >> > >> > blocks, as in GFS. Clients start by talking to the master to
> > >> > >> > list directories, etc. When they want to read a region of
> > >> > >> > some file, they tell the master the filename and offset, and
> > >> > >> > they receive a list of block locations (datanodes). They then
> > >> > >> > contact the individual datanodes to read the blocks. When
> > >> > >> > clients write a file, they first obtain a new block ID and a
> > >> > >> > list of nodes to write it to from the master, then contact
> > >> > >> > the datanodes to write it (actually, the datanodes pipeline
> > >> > >> > the write as in GFS) and report when the write is complete.
> > >> > >> > HDFS actually has some security mechanisms built in,
> > >> > >> > authenticating users based on their Unix ID and providing
> > >> > >> > Unix-like file permissions. I don't know much about how these
> > >> > >> > are implemented, but they would be a good place to start
> > >> > >> > looking.
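> > >> > >> >
> > >> > >> > From the client's point of view, that whole conversation is
> > >> > >> > hidden behind the FileSystem API; a minimal read sketch (the
> > >> > >> > path and offset are made up):
> > >> > >> >
> > >> > >> > import org.apache.hadoop.conf.Configuration;
> > >> > >> > import org.apache.hadoop.fs.FSDataInputStream;
> > >> > >> > import org.apache.hadoop.fs.FileSystem;
> > >> > >> > import org.apache.hadoop.fs.Path;
> > >> > >> >
> > >> > >> > public class ReadSketch {
> > >> > >> >   public static void main(String[] args) throws Exception {
> > >> > >> >     Configuration conf = new Configuration();
> > >> > >> >     // The client asks the NameNode for block locations when
> > >> > >> >     // we open and seek; the bytes come from the datanodes.
> > >> > >> >     FileSystem fs = FileSystem.get(conf);
> > >> > >> >     FSDataInputStream in = fs.open(new Path("/data/big.log"));
> > >> > >> >     in.seek(1024L); // read a region at some offset
> > >> > >> >     byte[] buf = new byte[4096];
> > >> > >> >     int n = in.read(buf);
> > >> > >> >     System.out.println("read " + n + " bytes");
> > >> > >> >     in.close();
> > >> > >> >   }
> > >> > >> > }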
> > >> > >> >
> > >> > >> > On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana <amansk@gmail.com>
> > >> > >> > wrote:
> > >> > >> >
> > >> > >> >> Thanks Matei
> > >> > >> >>
> > >> > >> >> I had gone through the architecture document online. I am
> > >> > >> >> currently working on a project towards Security in Hadoop. I
> > >> > >> >> do know how the data moves around in GFS but wasn't sure how
> > >> > >> >> much of that HDFS follows and how different it is from GFS.
> > >> > >> >> Can you throw some light on that?
> > >> > >> >>
> > >> > >> >> Security would also involve the MapReduce jobs following the
> > >> > >> >> same protocols. That's why the question about how the Hadoop
> > >> > >> >> framework integrates with HDFS, and how different it is from
> > >> > >> >> MapReduce and GFS. The GFS and MapReduce papers give good
> > >> > >> >> information on how those systems are designed, but there is
> > >> > >> >> nothing as concrete for Hadoop that I have been able to find.
> > >> > >> >>
> > >> > >> >> Amandeep
> > >> > >> >>
> > >> > >> >>
> > >> > >> >> Amandeep Khurana
> > >> > >> >> Computer Science Graduate Student
> > >> > >> >> University of California, Santa Cruz
> > >> > >> >>
> > >> > >> >>
> > >> > >> >> On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia <matei@cloudera.com>
> > >> > >> >> wrote:
> > >> > >> >>
> > >> > >> >> > Hi Amandeep,
> > >> > >> >> > Hadoop is definitely inspired by MapReduce/GFS and aims to
> > >> > >> >> > provide those capabilities as an open-source project. HDFS
> > >> > >> >> > is similar to GFS (large blocks, replication, etc.); some
> > >> > >> >> > notable things missing are read-write support in the middle
> > >> > >> >> > of a file (unlikely to be provided because few Hadoop
> > >> > >> >> > applications require it) and multiple appenders (the record
> > >> > >> >> > append operation). You can read about HDFS architecture at
> > >> > >> >> > http://hadoop.apache.org/core/docs/current/hdfs_design.html.
> > >> > >> >> > The MapReduce part of Hadoop interacts with HDFS in the
> > >> > >> >> > same way that Google's MapReduce interacts with GFS
> > >> > >> >> > (shipping computation to the data), although Hadoop
> > >> > >> >> > MapReduce also supports running over other distributed
> > >> > >> >> > filesystems.
> > >> > >> >> >
> > >> > >> >> > Matei
> > >> > >> >> >
> > >> > >> >> > On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana <amansk@gmail.com>
> > >> > >> >> > wrote:
> > >> > >> >> >
> > >> > >> >> > > Hi
> > >> > >> >> > >
> > >> > >> >> > > Is the HDFS architecture completely based on the Google
> > >> > >> >> > > Filesystem? If it isn't, what are the differences between
> > >> > >> >> > > the two?
> > >> > >> >> > >
> > >> > >> >> > > Secondly, is the coupling between Hadoop and HDFS the
> > >> > >> >> > > same as it is between Google's version of MapReduce and
> > >> > >> >> > > GFS?
> > >> > >> >> > >
> > >> > >> >> > > Amandeep
> > >> > >> >> > >
> > >> > >> >> > >
> > >> > >> >> > > Amandeep Khurana
> > >> > >> >> > > Computer Science Graduate Student
> > >> > >> >> > > University of California, Santa Cruz
> > >> > >> >> > >
> > >> > >> >> >
> > >> > >> >>
> > >> > >> >
> > >> > >> >
> > >> > >>
> > >> > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
