hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Per Steffensen <st...@designware.dk>
Subject Re: From a newbie: Questions and will MapReduce fit our needs
Date Mon, 29 Aug 2011 09:04:43 GMT
Can you point me to at good place to read about Sqoop. I only find 
http://incubator.apache.org/projects/sqoop.html and 
https://cwiki.apache.org/confluence/display/SQOOP. There is really not 
much to find, about what Sqoop can do, how to use it etc.

Regards, Per Steffensen

Peyman Mohajerian skrev:
> Hi,
> You should definitely take a  look at Apache Sqoop as previously 
> mentioned, if your file is large enough and you have several map jobs 
> running and hitting your database concurrently, you will experience 
> issues at the db level. 
> In terms of speculative jobs (redundant jobs) running to deal with 
> slow jobs, you have control over that in Hadoop. You can turn off 
> speculative jobs or make sure when one job is finished the other one 
> for the same input file is shutdown.
> Good Luck,
> On Fri, Aug 26, 2011 at 7:43 AM, MONTMORY Alain 
> <alain.montmory@thalesgroup.com 
> <mailto:alain.montmory@thalesgroup.com>> wrote:
>     Hi,
>     I am going to try to response to your response in the text. I am
>     not an hadoop expert but we are facing the same kind of problem
>     (dealing with file which are external to HDFS) in our project and
>     we use hadoop.
>     -----Message d'origine-----
>     De : Per Steffensen [mailto:steff@designware.dk]
>     Envoyé : vendredi 26 août 2011 13:13
>     À : mapreduce-user@hadoop.apache.org
>     <mailto:mapreduce-user@hadoop.apache.org>
>     Objet : From a newbie: Questions and will MapReduce fit our needs
>     Hi
>     We are considering to use MapReduce for a project. I am
>     participating in
>     an "investigation"-phase where we try to reveal if we would
>     benefit from
>     using the MapReduce framework.
>     A little bit about the project:
>     We will be receiving data from the "outside world" in files via
>     FTP. It
>     will be a mix of very small files (50 records/lines) and very big
>     files
>     (5mio+ records/lines). The FTP server will be running in a DMZ
>     where we
>     have no plans of using any Hadoop technology. For every file arriving
>     over FTP we will add a message (just pointing to that file) to a
>     MQ also
>     running in DMZ - how we do that is not relevant for my questions
>     here.
>     In the secure zone of our system we plan to run many machines
>     (shards if
>     you like) a.o. being consumers on the MQ in DMZ. Their job will be
>     a.o.
>     to "load" (storing i db, indexing etc.) the files pointed to by the
>     messages they receive from the MQ. For resonably small files they
>     will
>     probably just do the "loading" of the entire file themselves. For
>     very
>     big files we would like to have more machines/shards, than the single
>     machine/shard that happens to receive the corresponding message,
>     participating in "loading" that particular file.
>     Questions:
>     - In general, do you think MapReduce will be beneficial for us to
>     use?
>     Please remember that the files to be "loaded" does not live on a
>     HDFS.
>     Any descriptions on why you would suggest that we use MapReduce
>     will be
>     very velcome.
>     Response : Yes because you could treat the "big file" in parallel
>     and the parallesisation done by hadoop is very effective. To treat
>     your file you need to have an InputFormat class which is able to
>     read it. Here, two solutions :
>        1. you copy your file inside the HDFS file system and you use
>           "FileInputFormat" (for text based file some are already
>           produced by hadoop). inconvenient the copy may be long…(in
>           our case it is unacceptable) and this copy is an extra cost
>           in the whole treatment
>        2. You make your "BigFile" accessible by NFS or other Shared FS
>           from Hadoop cluster Node. The first job in your treatment
>           pipeline read the file and split it by record offset
>           *reference* (Output1 : record from 0 to N , Ouput2 : N to M
>           and so on…)
>        3. On each OuputX a Map task is launch in // which will treat
>           file (still accessible through sharedFS) from reord N to M
>           according to OutputX info
>     - Reading about MapReduce it sounds to be a general framework able to
>     split a "big job" into many smaller "sub-jobs", and have those
>     "sub-jobs" executed concurrently (potentially on other different
>     machines), all-in-all to complete the "big job". This could be
>     used for
>     many other things than "working with files", but then again
>     examples and
>     some of the descriptions makes it sound like it is all only about
>     "jobs
>     working with files". Is MapReduce only usefull/concerned with "jobs"
>     related to "working with files" or is it more general-purpose so
>     that it
>     is usefull for any
>     split-big-job-into-many-smaller-jobs-and-have-those-executed-in-parallel-problem?
>     Response : Hadoop are not only specialised with (while i think it
>     is 99% of its utilisation…). As a say before your input are
>     accessible through InputFormat interface.
>     - I believe we will end up having a HDFS over the disks on the
>     machines/shards in secure zone. Is HDFS a "must have" for
>     MapReduce to
>     work at all? E.g. HDFS might be the way sub-jobs are distributed
>     and/or
>     persisted (so that they will not be forgotten i case of a shard
>     breakdown or something).
>     Response : Hadoop can work on other FS (Amazon S3 for example), or
>     with other style of input (like NoSql Cassandra table), but i
>     think there is a need for either a small HDFS to store the working
>     space of running jobs. I think that most of usage rely on HDFS
>     which take care of data localisation. The JobTracker launch the
>     job on the node which hold the data in its local disk to avoid
>     netwok exchange…
>     - I think it sounds like an overhead to copy the big file (it will
>     have
>     to be deleted after succesful "loading") from the FTP server disk
>     in DMZ
>     to the HDFS in secure zone, just to be able to use MapReduce to
>     distribute the work of "loading" it. We might want to do it in way so
>     that each "sub-job" (of a "big job" about loading e.g. a big file
>     big.txt) just points to big.txt together with from- and to-
>     indexes into
>     the file. Each "sub-job" will then have to only read the part of
>     big.txt
>     from from-index to to-index and "load" that. Will we be able to do
>     something like that using MapReduce or is it all kind of "based on
>     operating on files on the HDFS"?
>     Response : I don't clearly understand all what you said but it
>     sounds like to me not far from the solution we use and that i
>     proposed to you in previous response.
>     - Depending on the answer to the above question, we might want to be
>     able to make the disk on the FTP server "join" the HDFS, in a way so
>     that it is visible, but in a way so that data on it will not get
>     copied
>     in several copies (for redundancy matters) thoughout the disks on the
>     shards (the "real" part of the HDFS) - remember the file will have
>     to be
>     deleted as soon as it has been "loaded". Is there such a
>     concept/possibility of making "external" disk visible from HDFS, to
>     enable MapReduce to work on files on such disks, without the files on
>     such disks automatically will be copied to several different other
>     disks
>     (on the shards)?
>     Response : Hadoop jobs are (generally) Java jobs so it is still
>     possible to open file external to HDFS provides they could be
>     accessed (through NFS or Other shared FS (Glouster FS, GPFS, etc))..
>     - As it understand it, each "sub-job" (the result of the
>     split-operation) will be run on new dedicated JVM. It sounds like
>     a big
>     overhead to start a new JVM just to run a "small" job. Is it correct
>     that each "sub-job" will run on its own new JVM that has to be
>     started
>     for that purpose only? If yes, it seems to me like the overhead is
>     only
>     "worth it" for fairly large "sub-jobs". Do you agree?
>     Response : due to Hadoop overhead to launch a task on a task
>     tracker, it is not recommended to have jobs running less than a
>     minute. In the proposed solution we could adjust the time by the
>     number of record treated in one OutputX split…
>     remenber that the jobs are launch on different computers. With
>     modern java JVM the overhead of launching a JVM is not so eavy.
>     Hadoop try (since 0.19) to reuse JVM which are already exist to
>     launch similar jobs see : mapred.job.reuse.jvm.num.tasks property 
>     If yes, I find the "WordCount" example on
>     http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
>     kinda
>     stupid, because it seems like each "sub-job" is only about
>     handling one
>     single line, and that seems to me to be way too small "sub-jobs"
>     to make
>     it "worth the effort" to move it to a remote machine and start a
>     new JVM
>     to handle it. Do you agree that it is stupid (yes, it is just an
>     example, I know), or what did I miss?
>     Response : 99% of the example deal with word count… it is a big
>     problem where i have to face when i begin with hadoop…and Yes one
>     job to treat one line is not efficient (seen response above…)
>     - Finally with respect to side effects. When handling the files we
>     plan
>     to load the records in the files into some kind of database (maybe
>     several instances of a database). It is important that each record
>     will
>     only get inserted into one database once. As I understand it,
>     MapReduce
>     will make every "sub-job" run in several instances concurrently on
>     several different machines, in order to make sure that it is finished
>     quickly even if one of the attempts to handle the particular
>     "sub-job"
>     fails. It that true?
>     If yes, isnt that a big problem with respect to "sub-jobs" with side
>     effects (like inserting into a database)? Or are there some kind of
>     build-in assumption that all side effects are done on HDFS and
>     that HDFS
>     supports some kind of transaction-handling so that it is easy for
>     MapReduce to rollback the side effects of one of the "identical"
>     sub-jobs if two should both succeed?
>     In general, is it a build-in thing that each sub-job is running in
>     one
>     single transaction, so that it is not possible that a sub-job will
>     "partly" succeed and "partly" fail (e.g. if it has to load 10000
>     records
>     into a database, and succeeds with 9999 of those it might be
>     stupud to
>     roll it all back in order to try it all all-over again)
>     Response : Have a look to Apache sqoop may it could help you
>     import/export data into a database. Otherwise your could set a
>     reduce phase in your treatment and in the reduce the input key are
>     sorted for the whole data set and then you could deal with "will
>     only get inserted into one database once"
>     I know my english is not perfect, but I hope you at least get the
>     essence of my questions. I hope you will try to answer all the
>     questions, even though some of them might seem stupid to you.
>     Remember
>     that I am a newbie :-) I have been running thourgh the FAQ, but didnt
>     find any answers to my questions (maybe because they are stupid
>     :-) ). I
>     wasnt able to search the archives of the mailing-list, so I
>     quickly gave
>     up finding my answers in "old threads". Can someone point me to a
>     way of
>     searching in the archives?
>     Response : My english is not perfect too!
>     extra advice : use a 0.20.xxx version (we use a 0.20.2 cloudera
>     distrbution) and old api (the 0.21 version and New API (mapreduce
>     package) are not yet complete and stable, see Todd Lipcon
>     advice..). Don't be afraid by multiple depreaceated class using
>     old API…they are not so depreaceated. I spend a lot of time at the
>     begining trying to use New API..
>     Hadoop framework is not so simple to handle, if your file contains
>     text information consider use of high level tool like pig or hive.
>     If your file contains binary information consider use of cascading
>     (_www.cascading.org_ <http://www.cascading.org>) library. For us
>     it dramasticly simplify the writting (but we have complex query to
>     do on the binary data hold in Hadoop), depends on the kind of
>     treatment you have to perform…
>     hope my response could help you..
>     Regards Alain Montmory
>     (Thales company)
>     Regards, Per Steffensen

View raw message