hadoop-hdfs-user mailing list archives

From Joey Echeverria <j...@cloudera.com>
Subject Re: Combining AVRO files efficiently within HDFS
Date Fri, 06 Jan 2012 18:05:02 GMT
I would do it by staging the machine data into a temporary directory
and then renaming the directory when it's been verified. So, data
would be written into directories like this:


After verification, you'd rename the 2012-01/02/00/stage directory to
2012-01/02/00/done. Since renaming a directory in HDFS is an atomic
operation, you get the guarantee you're looking for without having
to do extra IO. There shouldn't be a benefit to merging the individual
files unless they're too small.
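
A minimal sketch of the stage, verify, rename flow described above. It
runs against the local filesystem as a stand-in so it can be tried
without a cluster; on HDFS the rename step would be something like
`hadoop fs -mv` or `FileSystem.rename` from the Java API. The machine
file names come from the example in the quoted message. The helper names
(`verify`, `promote`, `range_is_complete`) and the "file exists and is
non-empty" completeness check are assumptions made for illustration; a
real check would probably inspect the Avro files themselves.

```python
import os
import tempfile

# File names taken from the example in the thread; adjust to your machines.
EXPECTED = ("machine1.log.avro", "machine2.log.avro", "machine3.log.avro")


def verify(stage_dir, expected=EXPECTED):
    """Assumed check: every expected file exists in stage_dir and is non-empty."""
    for name in expected:
        path = os.path.join(stage_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            return False
    return True


def promote(hour_dir, expected=EXPECTED):
    """Rename <hour_dir>/stage to <hour_dir>/done if verification passes.

    Returns True when the rename happened, False when the data is incomplete.
    """
    stage = os.path.join(hour_dir, "stage")
    done = os.path.join(hour_dir, "done")
    if not verify(stage, expected):
        return False
    # Single rename of the whole directory; no file contents are copied.
    os.rename(stage, done)
    return True


def range_is_complete(root, hours):
    """True if every hour directory under root has a 'done' subdirectory."""
    return all(os.path.isdir(os.path.join(root, hour, "done")) for hour in hours)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        hour = os.path.join("2012-01", "02", "00")
        stage = os.path.join(root, hour, "stage")
        os.makedirs(stage)
        for name in EXPECTED:
            with open(os.path.join(stage, name), "wb") as f:
                f.write(b"placeholder")  # stand-in for real Avro data
        print("promoted:", promote(os.path.join(root, hour)))
        print("range complete:", range_is_complete(root, [hour]))
```

With this layout the query layer only needs the `range_is_complete`
style check: it looks for `done` directories per hour and never has to
know which machines wrote the files inside them.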


On Fri, Jan 6, 2012 at 9:21 AM, Frank Grimes <frankgrimes97@gmail.com> wrote:
> Hi Bobby,
> Actually, the problem we're trying to solve is one of completeness.
> Say we have 3 machines generating log events and putting them to HDFS on an
> hourly basis.
> e.g.
> 2012-01/01/00/machine1.log.avro
> 2012-01/01/00/machine2.log.avro
> 2012-01/01/00/machine3.log.avro
> Sometime after the hour, we would have a scheduled job verify that all the
> expected machines' log files are present and complete in HDFS.
> Before launching MapReduce jobs for a given date range, we want to verify
> that the job will run over complete data.
> If not, the query would error out.
> We want our query/MapReduce layer to not need to be aware of logs at the
> machine level, only the presence or not of an hour's worth of logs.
> We were thinking that after verifying all the individual log files for an
> hour, they could be combined into 2012-01/01/00/log.avro.
> The presence of 2012-01/01/00/log.avro would be all that needs to be
> verified.
> However, we're new to both Avro and Hadoop so not sure of the most efficient
> (and reliable) way to accomplish this.
> Thanks,
> Frank Grimes
> On 2012-01-06, at 11:46 AM, Robert Evans wrote:
> Frank,
> That depends on what you mean by combining. It sounds like you are trying to
> aggregate data from several days, which may involve doing a join so I would
> say a MapReduce job is your best bet.  If you are not going to do any
> processing at all then why are you trying to combine them?  Is there
> something that requires them all to be part of a single file?  MapReduce
> processing should be able to handle reading in multiple files just as well
> as reading in a single file.
> --Bobby Evans
> On 1/6/12 9:55 AM, "Frank Grimes" <frankgrimes97@gmail.com> wrote:
> Hi All,
> I was wondering if there was an easy way to combine multiple .avro files
> efficiently.
> e.g. combining multiple hours of logs into a daily aggregate
> Note that our Avro schema might evolve to have new (nullable) fields added
> but no fields will be removed.
> I'd like to avoid needing to pull the data down for combining and subsequent
> "hadoop dfs -put".
> Would https://issues.apache.org/jira/browse/HDFS-222 be able to handle that
> automatically?
> FYI, the following seems to indicate that Avro files might be easily
> combinable: https://issues.apache.org/jira/browse/AVRO-127
> Or is an M/R job the best way to go for this?
> Thanks,
> Frank Grimes

Joseph Echeverria
Cloudera, Inc.
