hadoop-common-user mailing list archives

From "GOEKE, MATTHEW (AG/1000)" <matthew.go...@monsanto.com>
Subject RE: Queue support from HDFS
Date Fri, 24 Jun 2011 19:15:32 GMT

Two questions come to mind that could help you narrow down a solution:

1) How quickly do the downstream processes need the transformed data?
	Reason: If you can delay processing long enough to batch the data into a blob that is
a multiple of your block size, then you will be playing to the strengths of vanilla MR.

2) What else will be running on the cluster?
	Reason: If the cluster is set up primarily for this use case, then how often the job runs
and what resources it consumes only need to be optimized if it can't process the data fast enough.
If it is not, then you could always set up a separate pool for this in the fair scheduler and
allow it to use a certain amount of overhead on the cluster when these events are being

Apart from the fact that you would have a lot of small files on the cluster (which can be
resolved by running a nightly job to merge them into larger blobs and then delete the
originals), I would not be too concerned about at least trying out this method. It would be
helpful to know the size and type of data coming in, as well as what type of operation you are
looking to do, if you would like a more concrete suggestion. Log data is a prime example of
this type of workflow, and there are many suggestions out there as well as projects that
attempt to address this (e.g. Chukwa).
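
A minimal sketch of the batching idea from point 1, and of the nightly merge of small files, assuming the incoming records are available as byte strings. The function name `batch_records`, the `BLOCK_SIZE` value, and the choice to close a batch just before it would exceed the target are all illustrative assumptions, not anything prescribed by Hadoop; writing each batch to HDFS is left out.

```python
from typing import Iterable, Iterator, List

# Assumed block size for illustration only; use your cluster's configured value.
BLOCK_SIZE = 64 * 1024 * 1024


def batch_records(records: Iterable[bytes], target_bytes: int) -> Iterator[List[bytes]]:
    """Group records into batches of at most target_bytes each.

    A single record larger than target_bytes is emitted as its own batch.
    """
    batch: List[bytes] = []
    size = 0
    for rec in records:
        # Close the current batch if adding this record would overshoot the target.
        if batch and size + len(rec) > target_bytes:
            yield batch
            batch, size = [], 0
        batch.append(rec)
        size += len(rec)
    if batch:
        # Trailing partial batch; a real pipeline might hold it back until a timeout.
        yield batch
```

Calling `batch_records(stream, 2 * BLOCK_SIZE)` would yield groups of records that each fit in about two blocks; each group could then be written as one file and handed to a single map-only job instead of launching a job per small chunk.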


-----Original Message-----
From: saumitra.shahapure@gmail.com [mailto:saumitra.shahapure@gmail.com] On Behalf Of Saumitra
Sent: Friday, June 24, 2011 12:12 PM
To: common-user@hadoop.apache.org
Subject: Queue support from HDFS


Is a queue-like structure supported by HDFS, where a stream of data is
processed as it is generated?
Specifically, I will have a stream of data coming in, and a data-independent
operation needs to be applied to it (so only a Map function, reducer is
I wish to distribute the data among nodes using HDFS and start processing it as
it arrives, preferably in a single MR job.

I agree that it can be done by starting a new MR job for each batch of data,
but is frequently starting many MR jobs for small data chunks a good idea?
(Consider that a new batch arrives every few seconds and processing one batch
takes a few minutes.)

Saumitra S. Shahapure
