hadoop-common-user mailing list archives

From Colin Yates <colin.ya...@gmail.com>
Subject Re: Can I use hadoop/map-reduce for this usecase?
Date Sun, 18 Apr 2010 17:29:49 GMT
Many thanks for an excellent and thorough answer - Cloudera certainly looks
very interesting as a developer resource.

I have a few more questions if that is OK.

So I have a directory filled with .CSV files representing each dump of data,
or alternatively a single SequenceFile.  How do I ensure a job is
executed on as many nodes (i.e. all) as possible?  If I set an HDFS
replication factor of N, does that restrict me to N parallel tasks?

I want to get as many tasks running in parallel as possible.

The most common map/reduce job will be the equivalent of:

"select the aggregation of some columns from this table where columnX=this
and columnY=that group by columnP"

where columnX, columnY and columnP change from job to job :)
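
For illustration (table and column names invented), I imagine the Hive
equivalent would look roughly like:

    SELECT columnP, SUM(some_value)
    FROM readings
    WHERE columnX = 'this'
      AND columnY = 'that'
    GROUP BY columnP;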

I am absolutely trying to get the most out of divide and conquer - tens, or
even hundreds of little virtual machines (I take your point about
virtualisation) throwing their CPUs at this job.

My second question is about the number of tasks per node.  If I run this on a
(non-virtual) machine with 2 quad-core CPUs (each hyper-threaded), so 16
logical cores, that (simplistically) means there could be at least 16 map
and/or reduce tasks running in parallel on each machine.  How do I ensure
this, or will Hadoop do this automatically?  My only motivation for using VMs
was to achieve the best utilisation of CPU.  (I realise life isn't as simple
as this, and I take the point about I/O wait vs. CPU wait, etc.)
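
Based on what I have read so far, I am guessing this is controlled per
TaskTracker in mapred-site.xml with something like the following (please
correct me if I have the property names or the split between map and reduce
slots wrong):

    mapred.tasktracker.map.tasks.maximum    = 12
    mapred.tasktracker.reduce.tasks.maximum = 4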

Many thanks,

Col


On 16 April 2010 22:54, Eric Sammer <esammer@cloudera.com> wrote:

> Colin:
>
> This sounds like a reasonable case for Hive or Pig on Hadoop (assuming
> you're right about your data growth being "big"). Hive will let you
> query a table using something that resembles SQL, similar to an RDBMS,
> but it bakes down to MapReduce. Tables are made up of one or more files
> in a directory. Files can be text or something like Hadoop's
> SequenceFile format.
>
> Generally, you want to build a system that batches up updates into a
> file and then moves that file into a directory so it can be queried by
> Hive (again, or Pig). Given that both tools will operate on the entire
> directory, the data in the files will be seen as one large dataset
> (very common in Hadoop MR jobs).
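
So, if I follow correctly, the Hive side of this would be roughly an external
table over a directory, with each new batch file simply moved into that
directory - something like the sketch below (table, column and path names all
invented):

    CREATE EXTERNAL TABLE readings (
      columnX     STRING,
      columnY     STRING,
      columnP     STRING,
      some_value  DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/readings';

    -- each new batch then just lands in /data/readings, e.g. via
    -- 'hadoop fs -put', or:
    LOAD DATA INPATH '/incoming/batch-2010-04-18.csv' INTO TABLE readings;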
>
> I'll answer some other questions inline.
>
> On Fri, Apr 16, 2010 at 2:26 PM, Colin Yates <colin.yates@gmail.com>
> wrote:
>
> > The problem is that a user wants to generate reports off this table.  To
> > phrase it in terms of mapreduce (map-reduce or map/reduce or mapreduce? :)
> >  The reduce is usually the same (a simple aggregation) but the map phase
> > will change with each query.  This isn't a high concurrency requirement -
> > it is likely to be one or two users doing it once a month.
>
> This is more than fine and can be handled by Hadoop.
>
> > I realise the Hadoop and MapReduce architecture isn't designed for
> > real-time analytics, but an execution time of minutes would be sufficient.
>
> It will depend on the dataset size, but with this amount of data,
> minutes should be easy enough to achieve.
>
> > The new rows will come in batches (maybe once a minute, maybe once a day
> > depending on the environment), and absolutely live data isn't essential.
>
> As I mentioned above, batch updates into larger files. Rather than one a
> minute, go for at least a few hundred megs or even gigs, or once a day,
> whichever comes first. Something like that gives you a reasonable batch
> size.
>
> > My plan was to have lots and lots of (virtual) nodes with a small memory
> > footprint (<1GB) so that the parallelisation of map/reduce can be utilised
> > as much as possible.
>
> That's going to be hard. Hadoop is pure Java (mostly) and can get
> memory-hungry. Having a machine with < 1GB is going to be tough to
> pull off. Expect to "spend" about 4GB per node in a cluster with a few
> cores. See
> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
>
> Virtualization hurts Hadoop in almost all cases.
>
> > I haven't quite thought out how to get the data into HDFS, whether to use
> > HBase or a single, ever-growing CSV.
>
> HBase is generally more useful when you need random access to individual
> records. It doesn't sound like you want / need that. Go with a directory of
> files with delimiter-terminated fields and \n-terminated records and make
> your life easier.
>
> > I don't quite get how HDFS figures out which bit to give to each map
> > job.  I realise the file is chunked into (by default) 64MB blocks but it
> > can't be as simple as every map process gets the full 64MB block?  Do I
> > need to split the file into chunks myself?  If so, that is fine.
>
> The short answer is that files are split into blocks purely on size
> and spread around the cluster. When an MR job processes a bunch of
> files, Hadoop calculates the number of splits (roughly) based on the
> number of HDFS blocks that make up the files. Records of the file are
> reconstructed from these splits. Records that span blocks are
> reconstructed for you, the way you'd expect. If you use one of the
> existing formats (like TextInputFormat for text or
> SequenceFileInputFormat for SequenceFiles) this splitting of data is
> transparent to you and you'll naturally get what you want - parallel
> processing of N files which are split into M chunks with each line
> processed once and only once.
>
> > I realise this data is quite easily managed by existing RDBMSs but the
> > data will grow very very quickly, and there are other reasons I don't
> > want to go down that route.
>
> Honestly, 4M rows per year is tiny. The reason to use something like
> Hadoop is if the data is tough to squeeze into an RDBMS or - and it
> sounds like this might be your case - your queries are generally "full
> table scan" type operations over a moderate amount of data: queries that
> can easily be performed in parallel but that would drown the db. Again, I
> think Hive is what you're after (simple summation, variable filtering
> logic).
>
> > So, am I barking up the wrong tree with this?  Is there a better solution?
> > I have also evaluated MongoDB and CouchDB (both excellent for their
> > use-cases).
>
> Document-oriented databases are another beast. Definitely useful, but I
> don't know if you need those access patterns here.
>
> --
> Eric Sammer
> phone: +1-917-287-2675
> twitter: esammer
> data: www.cloudera.com
>
