hadoop-common-user mailing list archives

From Inder Pall <inder.p...@gmail.com>
Subject Re: Running a job continuously
Date Sun, 11 Dec 2011 11:13:56 GMT
Have you looked at Kafka? It provides a streaming view of the incoming data.
Flume, at the moment, is being rewritten as Flume NG.
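For illustration, a minimal consumer loop. This is only a sketch against the
org.apache.kafka.clients consumer API; the broker address and topic name are
assumptions, not anything from this thread:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LineConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "wordcount");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer =
                new KafkaConsumer<String, String>(props);
        consumer.subscribe(Collections.singletonList("lines")); // assumed topic
        while (true) { // runs continuously; there is no terminal waitForCompletion()
            ConsumerRecords<String, String> records = consumer.poll(500);
            for (ConsumerRecord<String, String> rec : records) {
                System.out.println(rec.value()); // process each record as it arrives
            }
        }
    }
}

The point of the sketch is the shape of the program: an endless poll loop over
a stream, rather than a batch job that exits.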
On Dec 6, 2011 4:28 PM, "Praveen Sripati" <praveensripati@gmail.com> wrote:

> If the requirement is real-time data processing, using Flume alone
> will not suffice, as there is a time lag between the collection of files
> by Flume and the processing done by Hadoop. Consider frameworks like S4,
> Storm (from Twitter), HStreaming, etc., which suit real-time processing.
>
> Regards,
> Praveen
>
> On Tue, Dec 6, 2011 at 10:39 AM, Ravi teja ch n v
> <raviteja.chnv@huawei.com> wrote:
>
> > Hi Burak,
> >
> > >Bejoy Ks, I have a continuous inflow of data, but I think I need a near
> > >real-time system.
> >
> > Just to add to Bejoy's point:
> > with Oozie, you can specify a data dependency for running your job.
> > When a specific amount of data has arrived, you can configure Oozie to
> > trigger your job (a sketch follows below).
> > I think this will satisfy your requirement.
> >
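> > A minimal coordinator sketch; the frequency, dates, and HDFS paths below
> > are placeholders, not a tested configuration. The coordinator holds the
> > workflow back until the dataset instance it depends on exists:
> >
> > <coordinator-app name="wordcount-coord" frequency="${coord:minutes(15)}"
> >                  start="2011-12-06T00:00Z" end="2012-12-06T00:00Z"
> >                  timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
> >   <datasets>
> >     <dataset name="input" frequency="${coord:minutes(15)}"
> >              initial-instance="2011-12-06T00:00Z" timezone="UTC">
> >       <uri-template>hdfs:///staging/${YEAR}${MONTH}${DAY}${HOUR}${MINUTE}</uri-template>
> >     </dataset>
> >   </datasets>
> >   <input-events>
> >     <data-in name="rawInput" dataset="input">
> >       <instance>${coord:current(0)}</instance>
> >     </data-in>
> >   </input-events>
> >   <action>
> >     <workflow>
> >       <app-path>hdfs:///apps/wordcount-wf</app-path>
> >     </workflow>
> >   </action>
> > </coordinator-app>
> >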
> > Regards,
> > Ravi Teja
> >
> > ________________________________________
> > From: burakkk [burak.isikli@gmail.com]
> > Sent: 06 December 2011 04:03:59
> > To: mapreduce-user@hadoop.apache.org
> > Cc: common-user@hadoop.apache.org
> > Subject: Re: Running a job continuously
> >
> > Athanasios Papaoikonomou, a cron job isn't useful for me, because I want
> > to execute the MR job with the same algorithm while different files
> > arrive at different velocities.
> >
> > Both Storm and Facebook's Hadoop are designed for that, but I want to
> > use the Apache distribution.
> >
> > Bejoy Ks, I have a continuous inflow of data, but I think I need a near
> > real-time system.
> >
> > Mike Spreitzer, both the output and the input are continuous. The output
> > isn't related to the input. All I want is for every incoming file to be
> > processed by the same job and the same algorithm.
> > For example, think about the wordcount problem. When you want to run
> > wordcount, you implement this:
> > http://wiki.apache.org/hadoop/WordCount
> >
> > But when the program reaches the line "job.waitForCompletion(true);",
> > the job eventually ends. If you want to make it run continuously, what
> > do you do in Hadoop without other tools?
> > One more thing: assume that the input files are named
> > filename_timestamp (e.g. filename_20111206_0030).
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.io.IntWritable;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.mapreduce.Job;
> > import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> > import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> > import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
> >
> > public static void main(String[] args) throws Exception {
> >     Configuration conf = new Configuration();
> >     Job job = new Job(conf, "wordcount");
> >     job.setOutputKeyClass(Text.class);
> >     job.setOutputValueClass(IntWritable.class);
> >     job.setMapperClass(Map.class);     // Map and Reduce as in the wiki WordCount
> >     job.setReducerClass(Reduce.class);
> >     job.setInputFormatClass(TextInputFormat.class);
> >     job.setOutputFormatClass(TextOutputFormat.class);
> >     FileInputFormat.addInputPath(job, new Path(args[0]));
> >     FileOutputFormat.setOutputPath(job, new Path(args[1]));
> >     job.waitForCompletion(true); // blocks until the job ends, then main() exits
> > }
> >
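> > One plain-Hadoop possibility, as a rough sketch only: poll HDFS for new
> > timestamped files and resubmit the same job for each. runWordCount is a
> > hypothetical helper wrapping the driver above; it is not a real API:
> >
> > import java.util.HashSet;
> > import java.util.Set;
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileStatus;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> >
> > public class ContinuousDriver {
> >     public static void main(String[] args) throws Exception {
> >         FileSystem fs = FileSystem.get(new Configuration());
> >         Set<String> done = new HashSet<String>();
> >         while (true) { // never exits, unlike a single waitForCompletion() run
> >             for (FileStatus f : fs.listStatus(new Path(args[0]))) {
> >                 String name = f.getPath().getName(); // e.g. filename_20111206_0030
> >                 if (done.add(name)) {
> >                     // hypothetical helper: the driver above with explicit paths
> >                     runWordCount(f.getPath(), new Path(args[1], name));
> >                 }
> >             }
> >             Thread.sleep(60000L); // look for newly arrived files once a minute
> >         }
> >     }
> > }
> >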
> > On Mon, Dec 5, 2011 at 11:19 PM, Bejoy Ks <bejoy.hadoop@gmail.com> wrote:
> >
> > > Burak
> > >        If you have a continuous inflow of data, you can use Flume to
> > > aggregate the files into larger sequence files if they are small, and
> > > push that data onto HDFS once you have a substantial chunk (equal to
> > > the HDFS block size). Based on your SLAs, you need to schedule your
> > > jobs using Oozie or a simple shell script. In very simple terms (a
> > > sketch follows the steps below):
> > > - push input data (could be from a Flume collector) into a staging hdfs dir
> > > - before triggering the job (hadoop jar), copy the input from staging
> > >   to the main input dir
> > > - execute the job
> > > - archive the input and output into archive dirs (or any other dirs)
> > >        - the output archive dir could be the source of output data
> > > - delete the output dir and empty the input dir
> > >
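> > > A rough sketch of these steps with the HDFS FileSystem API; the
> > > directory layout is an assumption and the job submission itself is
> > > elided:
> > >
> > > import org.apache.hadoop.conf.Configuration;
> > > import org.apache.hadoop.fs.FileStatus;
> > > import org.apache.hadoop.fs.FileSystem;
> > > import org.apache.hadoop.fs.Path;
> > >
> > > public class StagedRun {
> > >     public static void main(String[] args) throws Exception {
> > >         FileSystem fs = FileSystem.get(new Configuration());
> > >         Path staging = new Path("/flume/staging");   // assumed layout
> > >         Path input   = new Path("/wordcount/input");
> > >         Path output  = new Path("/wordcount/output");
> > >         Path archive = new Path("/archive/" + System.currentTimeMillis());
> > >
> > >         // move collected files from staging into the main input dir
> > >         fs.mkdirs(input);
> > >         for (FileStatus f : fs.listStatus(staging)) {
> > >             fs.rename(f.getPath(), new Path(input, f.getPath().getName()));
> > >         }
> > >
> > >         // execute the job here, e.g. the wordcount driver quoted above
> > >
> > >         // archive input and output; the output archive dir becomes the
> > >         // place downstream consumers read results from
> > >         fs.mkdirs(archive);
> > >         fs.rename(input, new Path(archive, "input"));
> > >         fs.rename(output, new Path(archive, "output"));
> > >         fs.mkdirs(input); // leave an empty input dir for the next cycle
> > >     }
> > > }
> > >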
> > > Hope it helps!...
> > >
> > > Regards
> > > Bejoy.K.S
> > >
> > > On Tue, Dec 6, 2011 at 2:19 AM, burakkk <burak.isikli@gmail.com> wrote:
> > >
> > >> Hi everyone,
> > >> I want to run an MR job continuously, because I have streaming data
> > >> and I try to analyze it all the time with my own algorithm. For
> > >> example, say you want to solve the wordcount problem. It's the
> > >> simplest one :) If you have multiple files and new files keep coming,
> > >> how do you handle it?
> > >> You could execute an MR job per file, but you would have to do it
> > >> repeatedly. So what do you think?
> > >>
> > >> Thanks
> > >> Best regards...
> > >>
> > >> --
> > >>
> > >> *BURAK ISIKLI* | *http://burakisikli.wordpress.com*
> > >>
> > >
> > >
> >
> >
> > --
> >
> > *BURAK ISIKLI* | *http://burakisikli.wordpress.com*
> >
>
