hadoop-hdfs-user mailing list archives

From: Frank Grimes <frankgrime...@gmail.com>
Subject: Re: Combining AVRO files efficiently within HDFS
Date: Thu, 12 Jan 2012 15:42:33 GMT
As it turns out, this is due to our /tmp partition being too small.
We'll either need to increase it or put hadoop.tmp.dir on a bigger partition.
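
For anyone hitting the same DiskChecker errors, a minimal sketch of moving the scratch space onto a larger mount (the /data/1/hadoop/tmp path is only a placeholder); mapred.local.dir defaults to ${hadoop.tmp.dir}/mapred/local, so relocating hadoop.tmp.dir also moves the MapReduce spill files off the small /tmp partition:

	<!-- core-site.xml sketch: base directory for Hadoop's local temporary files;
	     mapred.local.dir defaults underneath it unless set explicitly -->
	<property>
	  <name>hadoop.tmp.dir</name>
	  <value>/data/1/hadoop/tmp</value>
	</property>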


On 2012-01-11, at 4:29 PM, Frank Grimes wrote:

> Ok, so I wrote a MapReduce job to merge the files and it appears to be working with a limited input set.
> Thanks again, BTW.
> 
> However, if I increase the amount of input data I start getting the following types of errors:
> 
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out/file.out
> or
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/map_0.out
> 
> Are there any logs I should be looking at to determine the exact cause of these errors?
> Are there any settings I could/should be increasing?
> 
> Note that in order to avoid unnecessary sorting overhead, I made each key a constant (1L) so that the logs are combined but ordering isn't necessarily preserved.
> i.e.
> 
> 	public static class AvroReachMapper extends AvroMapper<DeliveryLogEvent, Pair<Long, DeliveryLogEvent>> {
> 		public void map(DeliveryLogEvent levent, AvroCollector<Pair<Long, DeliveryLogEvent>> collector, Reporter reporter)
> 			throws IOException {
> 
> 			collector.collect(new Pair<Long, DeliveryLogEvent>(1L, levent));
> 		}
> 	}
> 	
> 	public static class Reduce extends AvroReducer<Long, DeliveryLogEvent, DeliveryLogEvent> {
> 
> 		@Override
> 		public void reduce(Long key, Iterable<DeliveryLogEvent> values,
> 				AvroCollector<DeliveryLogEvent> collector, Reporter reporter)
> 				throws IOException {
> 
> 			for (DeliveryLogEvent event : values) {
> 				collector.collect(event);
> 			}
> 		}
> 
> 	}
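
A minimal driver sketch for wiring the mapper/reducer above together with the old org.apache.avro.mapred API, in case it helps anyone reproduce this. DeliveryLogEvent.getClassSchema(), the class name, and the paths are assumptions, and it presumes AvroReachMapper and Reduce are visible from this class; none of it is taken from the original job:

	import java.io.IOException;

	import org.apache.avro.Schema;
	import org.apache.avro.mapred.AvroJob;
	import org.apache.avro.mapred.Pair;
	import org.apache.hadoop.fs.Path;
	import org.apache.hadoop.mapred.FileInputFormat;
	import org.apache.hadoop.mapred.FileOutputFormat;
	import org.apache.hadoop.mapred.JobClient;
	import org.apache.hadoop.mapred.JobConf;

	public class MergeDeliveryLogs {
		public static void main(String[] args) throws IOException {
			JobConf conf = new JobConf(MergeDeliveryLogs.class);
			conf.setJobName("merge-delivery-logs");

			// Schemas: DeliveryLogEvent.getClassSchema() assumes an Avro-generated specific class.
			Schema eventSchema = DeliveryLogEvent.getClassSchema();
			AvroJob.setInputSchema(conf, eventSchema);
			AvroJob.setMapOutputSchema(conf,
					Pair.getPairSchema(Schema.create(Schema.Type.LONG), eventSchema));
			AvroJob.setOutputSchema(conf, eventSchema);

			AvroJob.setMapperClass(conf, AvroReachMapper.class);
			AvroJob.setReducerClass(conf, Reduce.class);

			// One reducer so the merged logs land in a single output file.
			conf.setNumReduceTasks(1);

			// args[0] = directory of hourly .avro files, args[1] = output directory
			FileInputFormat.setInputPaths(conf, new Path(args[0]));
			FileOutputFormat.setOutputPath(conf, new Path(args[1]));

			JobClient.runJob(conf);
		}
	}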
> 
> I've also noticed that /tmp/mapred seems to fill up and doesn't automatically get cleaned out.
> Is Hadoop itself supposed to clean up those old temporary work files, or do we need a cron job for that?
> 
> Thanks,
> 
> Frank Grimes
> 
> 
> 
> 
> On 2012-01-06, at 3:56 PM, Joey Echeverria wrote:
> 
>> I would use a MapReduce job to merge them.
>> 
>> -Joey
>> 
>> On Fri, Jan 6, 2012 at 11:55 AM, Frank Grimes <frankgrimes97@gmail.com> wrote:
>>> Hi Joey,
>>> 
>>> That's a very good suggestion and might suit us just fine.
>>> 
>>> However, many of the files will be much smaller than the HDFS block size.
>>> That could affect the performance of the MapReduce jobs, correct?
>>> Also, from my understanding it would put more burden on the name node (memory usage) than is necessary.
>>> 
>>> Assuming we did want to combine the actual files... how would you suggest we might go about it?
>>> 
>>> Thanks,
>>> 
>>> Frank Grimes
>>> 
>>> 
>>> On 2012-01-06, at 1:05 PM, Joey Echeverria wrote:
>>> 
>>>> I would do it by staging the machine data into a temporary directory
>>>> and then renaming the directory when it's been verified. So, data
>>>> would be written into directories like this:
>>>> 
>>>> 2012-01/02/00/stage/machine1.log.avro
>>>> 2012-01/02/00/stage/machine2.log.avro
>>>> 2012-01/02/00/stage/machine3.log.avro
>>>> 
>>>> After verification, you'd rename the 2012-01/02/00/stage directory to
>>>> 2012-01/02/00/done. Since renaming a directory in HDFS is an atomic
>>>> operation, you get the guarantee that you're looking for without having
>>>> to do extra IO. There shouldn't be a benefit to merging the individual
>>>> files unless they're too small.
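
A minimal sketch of the promote step Joey describes, using the HDFS FileSystem API; the paths and class name are just placeholders:

	import java.io.IOException;

	import org.apache.hadoop.conf.Configuration;
	import org.apache.hadoop.fs.FileSystem;
	import org.apache.hadoop.fs.Path;

	public class PromoteStagedHour {
		public static void main(String[] args) throws IOException {
			FileSystem fs = FileSystem.get(new Configuration());

			Path stage = new Path("2012-01/02/00/stage");
			Path done = new Path("2012-01/02/00/done");

			// rename() is atomic in HDFS, so readers see either the whole hour or none of it.
			if (!fs.rename(stage, done)) {
				throw new IOException("Failed to rename " + stage + " to " + done);
			}
		}
	}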
>>>> 
>>>> -Joey
>>>> 
>>>> On Fri, Jan 6, 2012 at 9:21 AM, Frank Grimes <frankgrimes97@gmail.com> wrote:
>>>>> Hi Bobby,
>>>>> 
>>>>> Actually, the problem we're trying to solve is one of completeness.
>>>>> 
>>>>> Say we have 3 machines generating log events and putting them into HDFS on an hourly basis.
>>>>> e.g.
>>>>> 2012-01/01/00/machine1.log.avro
>>>>> 2012-01/01/00/machine2.log.avro
>>>>> 2012-01/01/00/machine3.log.avro
>>>>> 
>>>>> Sometime after the hour, we would have a scheduled job verify that all the expected machines' log files are present and complete in HDFS.
>>>>> 
>>>>> Before launching MapReduce jobs for a given date range, we want to verify
>>>>> that the job will run over complete data.
>>>>> If not, the query would error out.
>>>>> 
>>>>> We want our query/MapReduce layer to not need to be aware of logs at the machine level, only the presence or absence of an hour's worth of logs.
>>>>> 
>>>>> We were thinking that after verifying all the individual log files for an hour, they could be combined into 2012-01/01/00/log.avro.
>>>>> The presence of 2012-01-01-00.log.avro would be all that needs to be
>>>>> verified.
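
A minimal sketch of that presence check against HDFS; the merged file name and class are placeholders:

	import java.io.IOException;

	import org.apache.hadoop.conf.Configuration;
	import org.apache.hadoop.fs.FileSystem;
	import org.apache.hadoop.fs.Path;

	public class HourlyLogCheck {
		// The merged file's existence is the signal that the hour is complete and queryable.
		static boolean hourIsComplete(FileSystem fs, String mergedFile) throws IOException {
			return fs.exists(new Path(mergedFile));
		}

		public static void main(String[] args) throws IOException {
			FileSystem fs = FileSystem.get(new Configuration());
			System.out.println(hourIsComplete(fs, "2012-01/01/00/log.avro"));
		}
	}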
>>>>> 
>>>>> However, we're new to both Avro and Hadoop so not sure of the most efficient
>>>>> (and reliable) way to accomplish this.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Frank Grimes
>>>>> 
>>>>> 
>>>>> On 2012-01-06, at 11:46 AM, Robert Evans wrote:
>>>>> 
>>>>> Frank,
>>>>> 
>>>>> That depends on what you mean by combining. It sounds like you are trying to aggregate data from several days, which may involve doing a join, so I would say a MapReduce job is your best bet.
>>>>> If you are not going to do any processing at all, then why are you trying to combine them? Is there something that requires them all to be part of a single file?
>>>>> MapReduce processing should be able to handle reading in multiple files just as well as reading in a single file.
>>>>> 
>>>>> --Bobby Evans
>>>>> 
>>>>> On 1/6/12 9:55 AM, "Frank Grimes" <frankgrimes97@gmail.com> wrote:
>>>>> 
>>>>> Hi All,
>>>>> 
>>>>> I was wondering if there was an easy way to combine multiple .avro files efficiently.
>>>>> e.g. combining multiple hours of logs into a daily aggregate
>>>>> 
>>>>> Note that our Avro schema might evolve to have new (nullable) fields added but no fields will be removed.
>>>>> 
>>>>> I'd like to avoid needing to pull the data down for combining and subsequent
>>>>> "hadoop dfs -put".
>>>>> 
>>>>> Would https://issues.apache.org/jira/browse/HDFS-222 be able to handle that automatically?
>>>>> FYI, the following seems to indicate that Avro files might be easily
>>>>> combinable: https://issues.apache.org/jira/browse/AVRO-127
>>>>> 
>>>>> Or is an M/R job the best way to go for this?
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Frank Grimes
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Joseph Echeverria
>>>> Cloudera, Inc.
>>>> 443.305.9434
>>> 
>> 
>> 
>> 
>> -- 
>> Joseph Echeverria
>> Cloudera, Inc.
>> 443.305.9434
> 

