hadoop-common-user mailing list archives

From Mohit Anchlia <mohitanch...@gmail.com>
Subject Re: Writing click stream data to hadoop
Date Wed, 30 May 2012 14:56:35 GMT
On Fri, May 25, 2012 at 9:30 AM, Harsh J <harsh@cloudera.com> wrote:

> Mohit,
> Not if you call sync (or hflush/hsync in 2.0) periodically to persist
> your changes to the file. SequenceFile doesn't currently have a
> sync-API inbuilt in it (in 1.0 at least), but you can call sync on the
> underlying output stream instead at the moment. This is possible to do
> in 1.0 (just own the output stream).
> Your use case also sounds like you may want to simply use Apache Flume
> (Incubating) [http://incubator.apache.org/flume/] that already does
> provide these features and the WAL-kinda reliability you seek.

Thanks, Harsh. Does Flume also provide an API on top? I am getting this data
via HTTP calls; how would I go about using Flume with HTTP calls?

> On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia <mohitanchlia@gmail.com>
> wrote:
> > We get click data through API calls. I now need to send this data to our
> > hadoop environment. I am wondering if I could open one sequence file and
> > write to it until it's of certain size. Once it's over the specified size I
> > can close that file and open a new one. Is this a good approach?
> >
> > Only thing I worry about is what happens if the server crashes before I am
> > able to cleanly close the file. Would I lose all previous data?
> --
> Harsh J
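
For reference, the approach discussed in this thread (write to one file until
it reaches a size limit, sync periodically so a crash loses at most the
records since the last sync, then roll to a new file) can be sketched roughly
as below. This is only an illustration on the local filesystem, not HDFS code:
os.fsync stands in for the sync (or hflush/hsync) call Harsh mentions, and the
class name RollingLogWriter, the clicks-NNNNNN.log file naming, and the
max_bytes and sync_every parameters are all made up for the example.

```python
import os


class RollingLogWriter:
    """Append records to a file, sync periodically, roll at a size limit.

    Local-filesystem stand-in for the HDFS pattern discussed above; all names
    and defaults here are hypothetical.
    """

    def __init__(self, directory, max_bytes=64 * 1024 * 1024, sync_every=100):
        self.directory = directory
        self.max_bytes = max_bytes      # roll once the file reaches this size
        self.sync_every = sync_every    # records allowed between syncs
        self._index = 0
        self._unsynced = 0
        self._file = None
        os.makedirs(directory, exist_ok=True)
        self._open_next()

    def _open_next(self):
        # Open the next numbered file in append mode.
        path = os.path.join(self.directory, "clicks-%06d.log" % self._index)
        self._index += 1
        self._file = open(path, "ab")
        self._unsynced = 0

    def sync(self):
        # Push buffered bytes to the OS, then ask the OS to persist them.
        self._file.flush()
        os.fsync(self._file.fileno())
        self._unsynced = 0

    def write(self, record):
        # record is a bytes object; one record per line.
        self._file.write(record + b"\n")
        self._unsynced += 1
        if self._unsynced >= self.sync_every:
            self.sync()
        if self._file.tell() >= self.max_bytes:
            self.roll()

    def roll(self):
        # Persist and close the current file, then start a new one.
        self.sync()
        self._file.close()
        self._open_next()

    def close(self):
        self.sync()
        self._file.close()
```

With this shape, a crash before close() costs at most the last sync_every - 1
records of the open file; everything written before the most recent sync is
already on disk. Smaller sync_every values narrow that window at the cost of
more frequent syncs.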
