hadoop-mapreduce-user mailing list archives

From "arvind@cloudera.com" <arv...@cloudera.com>
Subject Re: From a newbie: Questions and will MapReduce fit our needs
Date Mon, 29 Aug 2011 15:02:55 GMT
On Mon, Aug 29, 2011 at 2:04 AM, Per Steffensen <steff@designware.dk> wrote:
> Can you point me to a good place to read about Sqoop? I only find
> http://incubator.apache.org/projects/sqoop.html and
> https://cwiki.apache.org/confluence/display/SQOOP. There is really not much
> to find about what Sqoop can do, how to use it, etc.

Please see the Sqoop user guide:


> Regards, Per Steffensen
> Peyman Mohajerian skrev:
> Hi,
> You should definitely take a look at Apache Sqoop, as previously mentioned.
> If your file is large enough and you have several map tasks running and
> hitting your database concurrently, you will experience issues at the
> database level.
> As for speculative execution (redundant task attempts launched to cope with
> slow tasks), you have control over that in Hadoop: you can turn speculative
> execution off, or make sure that when one attempt for a given input
> finishes, the other attempts are shut down.
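For readers looking for the concrete knobs: in the 0.20-era configuration, speculative execution is governed by two job properties. A minimal fragment (for mapred-site.xml, or set on a per-job JobConf) to disable it might look like this:

```xml
<!-- Disable speculative (redundant) task attempts for map and reduce tasks.
     Useful when tasks have side effects such as database inserts. -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```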
> Good Luck,
> On Fri, Aug 26, 2011 at 7:43 AM, MONTMORY Alain
> <alain.montmory@thalesgroup.com> wrote:
>> Hi,
>> I am going to try to respond to your questions inline. I am not a
>> Hadoop expert, but we are facing the same kind of problem (dealing with
>> files that are external to HDFS) in our project, and we use Hadoop.
>> -----Original Message-----
>> From: Per Steffensen [mailto:steff@designware.dk]
>> Sent: Friday, August 26, 2011 13:13
>> To: mapreduce-user@hadoop.apache.org
>> Subject: From a newbie: Questions and will MapReduce fit our needs
>> Hi
>> We are considering using MapReduce for a project. I am participating in
>> an "investigation" phase where we try to determine whether we would
>> benefit from using the MapReduce framework.
>> A little bit about the project:
>> We will be receiving data from the "outside world" in files via FTP. It
>> will be a mix of very small files (50 records/lines) and very big files
>> (5 million+ records/lines). The FTP server will be running in a DMZ where
>> we have no plans of using any Hadoop technology. For every file arriving
>> over FTP we will add a message (just pointing to that file) to an MQ also
>> running in the DMZ - how we do that is not relevant to my questions here.
>> In the secure zone of our system we plan to run many machines (shards, if
>> you like) that will, among other things, be consumers of the MQ in the
>> DMZ. Their job will include "loading" (storing in a DB, indexing, etc.)
>> the files pointed to by the messages they receive from the MQ. For
>> reasonably small files they will probably just do the "loading" of the
>> entire file themselves. For very big files we would like to have more
>> machines/shards participating in "loading" that particular file than just
>> the single machine/shard that happens to receive the corresponding message.
>> Questions:
>> - In general, do you think MapReduce will be beneficial for us to use?
>> Please remember that the files to be "loaded" do not live on HDFS.
>> Any description of why you would suggest that we use MapReduce will be
>> very welcome.
>> Response: Yes, because you could process the "big file" in parallel, and
>> the parallelisation done by Hadoop is very effective. To process your file
>> you need an InputFormat class that is able to read it. Here there are two
>> solutions:
>> 1. You copy your file into the HDFS file system and use a "FileInputFormat"
>> (for text-based files, several are already provided by Hadoop). The
>> inconvenience is that the copy may be long (in our case it is unacceptable),
>> and the copy is an extra cost in the whole treatment.
>> 2. You make your "big file" accessible via NFS or another shared FS from
>> the Hadoop cluster nodes. The first job in your treatment pipeline reads
>> the file and splits it by record offset (Output1: records 0 to N, Output2:
>> records N to M, and so on).
>> For each OutputX, a map task is launched in parallel; it processes the file
>> (still accessible through the shared FS) from record N to M according to
>> the OutputX info.
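To illustrate the offset-split idea, here is a plain-Java sketch (not the Hadoop API; the file contents and split count are invented). It uses the same boundary convention as Hadoop's own line reader: every split except the first skips the record it starts in the middle of, and a split may read one record past its end boundary, so each record is handled exactly once:

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Sketch: split a newline-delimited file into byte ranges and let each
// "map task" read only the records owned by its own range (start, end].
public class OffsetSplitDemo {

    // Read the records owned by the byte range (start, end].
    public static List<String> readSplit(File f, long start, long end)
            throws IOException {
        List<String> records = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek(start);
            // The record we land in the middle of belongs to the previous split.
            if (start != 0) raf.readLine();
            while (raf.getFilePointer() <= end) {
                String line = raf.readLine();
                if (line == null) break;    // end of file
                records.add(line);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("big", ".txt");
        f.deleteOnExit();
        try (PrintWriter w = new PrintWriter(f)) {
            for (int i = 0; i < 1000; i++) w.println("record-" + i);
        }
        long len = f.length();
        long splitSize = (len + 3) / 4;     // simulate 4 "map tasks"
        List<String> all = new ArrayList<>();
        for (long s = 0; s < len; s += splitSize) {
            all.addAll(readSplit(f, s, Math.min(s + splitSize, len)));
        }
        System.out.println(all.size());     // prints 1000: no record lost or duplicated
    }
}
```

In a real Hadoop job the (start, end) pairs would be the OutputX records of the first pipeline stage, and each map task would open the file over the shared FS mount.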
>> - Reading about MapReduce, it sounds like a general framework able to
>> split a "big job" into many smaller "sub-jobs" and have those
>> "sub-jobs" executed concurrently (potentially on different
>> machines), all in all to complete the "big job". This could be used for
>> many other things than "working with files", but then again the examples
>> and some of the descriptions make it sound like it is all only about "jobs
>> working with files". Is MapReduce only useful for/concerned with "jobs"
>> related to "working with files", or is it more general-purpose, so that it
>> is useful for any
>> split-big-job-into-many-smaller-jobs-and-have-those-executed-in-parallel problem?
>> Response: Hadoop is not specialised only in files (though I think that is
>> 99% of its usage). As I said before, your input is accessible through the
>> InputFormat interface.
>> - I believe we will end up having HDFS over the disks on the
>> machines/shards in the secure zone. Is HDFS a "must have" for MapReduce to
>> work at all? E.g. HDFS might be the way sub-jobs are distributed and/or
>> persisted (so that they will not be forgotten in case of a shard
>> breakdown or something).
>> Response: Hadoop can work on other file systems (Amazon S3, for example),
>> or with other styles of input (like a NoSQL Cassandra table), but I think
>> you still need at least a small HDFS to store the working space of running
>> jobs. I think most usage relies on HDFS, which takes care of data
>> locality: the JobTracker launches each task on a node that holds the data
>> on its local disk, to avoid network exchange.
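To make the non-HDFS case concrete: in the 0.20-era configuration, the default file system is selected by the fs.default.name property in core-site.xml. A sketch pointing Hadoop at the local (or shared-mount) file system instead of HDFS:

```xml
<!-- core-site.xml: use the local/shared file system rather than HDFS.
     Note that this forfeits replication and data locality. -->
<property>
  <name>fs.default.name</name>
  <value>file:///</value>
</property>
```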
>> - I think it sounds like an overhead to copy the big file (it will have
>> to be deleted after successful "loading") from the FTP server disk in the DMZ
>> to the HDFS in secure zone, just to be able to use MapReduce to
>> distribute the work of "loading" it. We might want to do it in way so
>> that each "sub-job" (of a "big job" about loading e.g. a big file
>> big.txt) just points to big.txt together with from- and to- indexes into
>> the file. Each "sub-job" will then have to only read the part of big.txt
>> from from-index to to-index and "load" that. Will we be able to do
>> something like that using MapReduce or is it all kind of "based on
>> operating on files on the HDFS"?
>> Response: I don't clearly understand everything you said, but it sounds
>> to me close to the solution that we use and that I proposed to you in my
>> previous response.
>> - Depending on the answer to the above question, we might want to be
>> able to make the disk on the FTP server "join" the HDFS, in a way so
>> that it is visible, but in a way so that data on it will not get copied
>> in several copies (for redundancy reasons) throughout the disks on the
>> shards (the "real" part of the HDFS) - remember the file will have to be
>> deleted as soon as it has been "loaded". Is there such a
>> concept/possibility of making an "external" disk visible from HDFS, to
>> enable MapReduce to work on files on such disks, without the files on
>> such disks automatically being copied to several different other disks
>> (on the shards)?
>> Response: Hadoop jobs are (generally) Java jobs, so it is still possible
>> to open files external to HDFS, provided they can be accessed (through NFS
>> or another shared FS (GlusterFS, GPFS, etc.)).
>> - As I understand it, each "sub-job" (the result of the
>> split operation) will be run on a new dedicated JVM. It sounds like a big
>> overhead to start a new JVM just to run a "small" job. Is it correct
>> that each "sub-job" will run on its own new JVM that has to be started
>> for that purpose only? If yes, it seems to me like the overhead is only
>> "worth it" for fairly large "sub-jobs". Do you agree?
>> Response: Due to Hadoop's overhead in launching a task on a TaskTracker,
>> it is not recommended to have tasks that run for less than a minute. In
>> the proposed solution we could adjust the duration via the number of
>> records treated in one OutputX split.
>> Remember that the tasks are launched on different computers. With modern
>> Java VMs the overhead of launching a JVM is not so heavy, and Hadoop tries
>> (since 0.19) to reuse JVMs that already exist to launch similar tasks; see
>> the mapred.job.reuse.jvm.num.tasks property.
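The property mentioned above can be set per job or cluster-wide. A sketch of the mapred-site.xml entry (the value 10 here is only an example; -1 reuses a JVM for an unlimited number of tasks of the same job, and 1, the default, starts a fresh JVM per task):

```xml
<!-- Let each spawned JVM run up to 10 tasks of the same job
     instead of exiting after a single task. -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>10</value>
</property>
```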
>> If yes, I find the "WordCount" example on
>> http://hadoop.apache.org/common/docs/current/mapred_tutorial.html kinda
>> stupid, because it seems like each "sub-job" is only about handling one
>> single line, and that seems to me to be way too small a "sub-job" to make
>> it "worth the effort" to move it to a remote machine and start a new JVM
>> to handle it. Do you agree that it is stupid (yes, it is just an
>> example, I know), or what did I miss?
>> Response: 99% of the examples deal with word count... it was a big
>> problem I had to face when I began with Hadoop. And yes, one task to
>> process one line is not efficient (see the response above).
>> - Finally, with respect to side effects. When handling the files, we plan
>> to load the records in the files into some kind of database (maybe
>> several instances of a database). It is important that each record will
>> only get inserted into a database once. As I understand it, MapReduce
>> will make every "sub-job" run in several instances concurrently on
>> several different machines, in order to make sure that it is finished
>> quickly even if one of the attempts to handle the particular "sub-job"
>> fails. Is that true?
>> If yes, isn't that a big problem with respect to "sub-jobs" with side
>> effects (like inserting into a database)? Or is there some kind of
>> built-in assumption that all side effects are done on HDFS, and that HDFS
>> supports some kind of transaction handling, so that it is easy for
>> MapReduce to roll back the side effects of one of the "identical"
>> sub-jobs if two should both succeed?
>> In general, is it a built-in thing that each sub-job runs in one
>> single transaction, so that it is not possible for a sub-job to
>> "partly" succeed and "partly" fail (e.g. if it has to load 10000 records
>> into a database, and succeeds with 9999 of those, it might be stupid to
>> roll it all back in order to try it all over again)?
>> Response: Have a look at Apache Sqoop; maybe it could help you
>> import/export data into a database. Otherwise, you could add a reduce
>> phase to your treatment; in the reduce, the input keys are sorted over the
>> whole data set, and then you can handle the "will only get inserted into
>> one database once" requirement.
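To make the reduce-phase suggestion concrete, here is a self-contained sketch (plain Java, not the Hadoop API; all names are invented). It shows why grouping by key in a reduce step collapses duplicate map outputs, so each record can be inserted into the database exactly once even if speculative map attempts emitted it twice:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: MapReduce groups all values for a key into one reduce() call,
// so the reducer sees each record key exactly once and can perform the
// database insert there, regardless of duplicated intermediate pairs.
public class DedupReduceDemo {

    public static List<String> reducePhase(List<String[]> mapOutput) {
        // Shuffle/sort stand-in: group values by key, keys sorted (as in Hadoop).
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : mapOutput) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        // Reduce stand-in: one call per key -> one insert per record.
        List<String> inserts = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            inserts.add("INSERT key=" + e.getKey()); // duplicates collapsed here
        }
        return inserts;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = Arrays.asList(
            new String[]{"rec1", "a"},
            new String[]{"rec2", "b"},
            new String[]{"rec1", "a"});  // duplicate from a speculative attempt
        System.out.println(reducePhase(mapOutput));
        // prints [INSERT key=rec1, INSERT key=rec2]
    }
}
```

In a real job the "insert" would be a JDBC call in the reducer (or an export handled by Sqoop); the point is only that the grouping guarantees one reduce invocation per key.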
>> I know my English is not perfect, but I hope you at least get the
>> essence of my questions. I hope you will try to answer all the
>> questions, even though some of them might seem stupid to you. Remember
>> that I am a newbie :-) I have been running through the FAQ, but didn't
>> find any answers to my questions (maybe because they are stupid :-) ). I
>> wasn't able to search the archives of the mailing list, so I quickly gave
>> up finding my answers in "old threads". Can someone point me to a way of
>> searching the archives?
>> Response: My English is not perfect either!
>> Extra advice: use a 0.20.xxx version (we use a 0.20.2 Cloudera
>> distribution) and the old API (the 0.21 version and the new API (the
>> mapreduce package) are not yet complete and stable; see Todd Lipcon's
>> advice). Don't be afraid of the many deprecated classes when using the
>> old API - they are not so deprecated. I spent a lot of time at the
>> beginning trying to use the new API.
>> The Hadoop framework is not so simple to handle. If your files contain
>> text information, consider using a high-level tool like Pig or Hive. If
>> your files contain binary information, consider using the Cascading
>> (www.cascading.org) library. For us it dramatically simplified the
>> writing (but we have complex queries to run on the binary data held in
>> Hadoop); it depends on the kind of treatment you have to perform.
>> I hope my response can help you.
>> Regards Alain Montmory
>> (Thales company)
>> Regards, Per Steffensen
