hadoop-hdfs-user mailing list archives

From Frank Grimes <frankgrime...@gmail.com>
Subject Re: Combining AVRO files efficiently within HDFS
Date Wed, 11 Jan 2012 21:29:15 GMT
Ok, so I wrote a MapReduce job to merge the files and it appears to be working with a limited
input set.
Thanks again, BTW.

However, if I increase the amount of input data I start getting the following types of errors:

org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out/file.out

or

org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/map_0.out

Are there any logs I should be looking at to determine the exact cause of these errors?
Are there any settings I could/should be increasing?

Note that in order to avoid unnecessary sorting overhead, I made each key a constant (1L)
so that the logs are combined but ordering isn't necessarily preserved.
i.e.

	public static class AvroReachMapper extends AvroMapper<DeliveryLogEvent, Pair<Long, DeliveryLogEvent>> {

		@Override
		public void map(DeliveryLogEvent levent, AvroCollector<Pair<Long, DeliveryLogEvent>> collector,
				Reporter reporter) throws IOException {

			// Constant key (1L): every event lands in the same reduce group, so the
			// logs get concatenated without any meaningful sort order.
			collector.collect(new Pair<Long, DeliveryLogEvent>(1L, levent));
		}
	}

	public static class Reduce extends AvroReducer<Long, DeliveryLogEvent, DeliveryLogEvent> {

		@Override
		public void reduce(Long key, Iterable<DeliveryLogEvent> values,
				AvroCollector<DeliveryLogEvent> collector, Reporter reporter)
				throws IOException {

			// Identity reduce: just pass every event through to the merged output file.
			for (DeliveryLogEvent event : values) {
				collector.collect(event);
			}
		}
	}
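
In case it helps, the driver is wired up with Avro's mapred API along these lines (a sketch from memory; the job name and paths here are illustrative, and DeliveryLogEvent is our Avro-generated class, hence SCHEMA$):

	// imports: org.apache.avro.Schema, org.apache.avro.mapred.AvroJob,
	// org.apache.avro.mapred.Pair, org.apache.hadoop.fs.Path,
	// org.apache.hadoop.mapred.{JobConf, JobClient, FileInputFormat, FileOutputFormat}

	JobConf conf = new JobConf();
	conf.setJobName("combine-delivery-logs"); // illustrative

	FileInputFormat.setInputPaths(conf, new Path("2012-01/01/00"));      // illustrative
	FileOutputFormat.setOutputPath(conf, new Path("2012-01/01/00-out")); // illustrative

	// Map output is Pair<Long, DeliveryLogEvent>; final output is plain DeliveryLogEvent.
	AvroJob.setInputSchema(conf, DeliveryLogEvent.SCHEMA$);
	AvroJob.setMapOutputSchema(conf, Pair.getPairSchema(
			Schema.create(Schema.Type.LONG), DeliveryLogEvent.SCHEMA$));
	AvroJob.setOutputSchema(conf, DeliveryLogEvent.SCHEMA$);

	AvroJob.setMapperClass(conf, AvroReachMapper.class);
	AvroJob.setReducerClass(conf, Reduce.class);

	JobClient.runJob(conf);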

I've also noticed that /tmp/mapred seems to fill up and doesn't automatically get cleaned out.
Is Hadoop itself supposed to clean up those old temporary work files, or do we need a cron job for that?

Thanks,

Frank Grimes




On 2012-01-06, at 3:56 PM, Joey Echeverria wrote:

> I would use a MapReduce job to merge them.
> 
> -Joey
> 
> On Fri, Jan 6, 2012 at 11:55 AM, Frank Grimes <frankgrimes97@gmail.com> wrote:
>> Hi Joey,
>> 
>> That's a very good suggestion and might suit us just fine.
>> 
>> However, many of the files will be much smaller than the HDFS block size.
>> That could affect the performance of the MapReduce jobs, correct?
>> Also, from my understanding it would put more burden on the name node (memory usage) than is necessary.
>> 
>> Assuming we did want to combine the actual files... how would you suggest we might go about it?
>> 
>> Thanks,
>> 
>> Frank Grimes
>> 
>> 
>> On 2012-01-06, at 1:05 PM, Joey Echeverria wrote:
>> 
>>> I would do it by staging the machine data into a temporary directory
>>> and then renaming the directory when it's been verified. So, data
>>> would be written into directories like this:
>>> 
>>> 2012-01/02/00/stage/machine1.log.avro
>>> 2012-01/02/00/stage/machine2.log.avro
>>> 2012-01/02/00/stage/machine3.log.avro
>>> 
>>> After verification, you'd rename the 2012-01/02/00/stage directory to
>>> 2012-01/02/00/done. Since renaming a directory in HDFS is an atomic
>>> operation, you get the guarantee you're looking for without having
>>> to do extra IO. There shouldn't be a benefit to merging the individual
>>> files unless they're too small.
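>>>
>>> For example, the rename is just a FileSystem call (a sketch; error
>>> handling and paths are illustrative):
>>>
>>>     FileSystem fs = FileSystem.get(new Configuration());
>>>     boolean renamed = fs.rename(new Path("2012-01/02/00/stage"),
>>>                                 new Path("2012-01/02/00/done"));
>>>     // rename() returns false on failure, e.g. if the destination already exists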
>>> 
>>> -Joey
>>> 
>>> On Fri, Jan 6, 2012 at 9:21 AM, Frank Grimes <frankgrimes97@gmail.com> wrote:
>>>> Hi Bobby,
>>>> 
>>>> Actually, the problem we're trying to solve is one of completeness.
>>>> 
>>>> Say we have 3 machines generating log events and putting them into HDFS on an
>>>> hourly basis.
>>>> e.g.
>>>> 2012-01/01/00/machine1.log.avro
>>>> 2012-01/01/00/machine2.log.avro
>>>> 2012-01/01/00/machine3.log.avro
>>>> 
>>>> Sometime after the hour, we would have a scheduled job verify that all the
>>>> expected machines' log files are present and complete in HDFS.
>>>> 
>>>> Before launching MapReduce jobs for a given date range, we want to verify
>>>> that the job will run over complete data.
>>>> If not, the query would error out.
>>>> 
>>>> We want our query/MapReduce layer to not need to be aware of logs at the
>>>> machine level, only the presence or not of an hour's worth of logs.
>>>> 
>>>> We were thinking that after verifying all the individual log files for an
>>>> hour, they could be combined into 2012-01/01/00/log.avro.
>>>> The presence of 2012-01/01/00/log.avro would be all that needs to be
>>>> verified.
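>>>>
>>>> E.g. the downstream check could then be as simple as (a sketch; the exact
>>>> path is illustrative):
>>>>
>>>>     FileSystem fs = FileSystem.get(new Configuration());
>>>>     boolean hourComplete = fs.exists(new Path("2012-01/01/00/log.avro"));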
>>>> 
>>>> However, we're new to both Avro and Hadoop, so we're not sure of the most efficient
>>>> (and reliable) way to accomplish this.
>>>> 
>>>> Thanks,
>>>> 
>>>> Frank Grimes
>>>> 
>>>> 
>>>> On 2012-01-06, at 11:46 AM, Robert Evans wrote:
>>>> 
>>>> Frank,
>>>> 
>>>> That depends on what you mean by combining. It sounds like you are trying to
>>>> aggregate data from several days, which may involve doing a join, so I would
>>>> say a MapReduce job is your best bet.  If you are not going to do any
>>>> processing at all then why are you trying to combine them?  Is there
>>>> something that requires them all to be part of a single file?  MapReduce
>>>> processing should be able to handle reading in multiple files just as well
>>>> as reading in a single file.
>>>> 
>>>> --Bobby Evans
>>>> 
>>>> On 1/6/12 9:55 AM, "Frank Grimes" <frankgrimes97@gmail.com> wrote:
>>>> 
>>>> Hi All,
>>>> 
>>>> I was wondering if there was an easy way to combine multiple .avro files
>>>> efficiently.
>>>> e.g. combining multiple hours of logs into a daily aggregate
>>>> 
>>>> Note that our Avro schema might evolve to have new (nullable) fields added
>>>> but no fields will be removed.
>>>> 
>>>> I'd like to avoid needing to pull the data down for combining and subsequent
>>>> "hadoop dfs -put".
>>>> 
>>>> Would https://issues.apache.org/jira/browse/HDFS-222 be able to handle that
>>>> automatically?
>>>> FYI, the following seems to indicate that Avro files might be easily
>>>> combinable: https://issues.apache.org/jira/browse/AVRO-127
>>>> 
>>>> Or is an M/R job the best way to go for this?
>>>> 
>>>> Thanks,
>>>> 
>>>> Frank Grimes
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Joseph Echeverria
>>> Cloudera, Inc.
>>> 443.305.9434
>> 
> 
> 
> 
> -- 
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434

