hadoop-user mailing list archives

From Hemanth Yamijala <yhema...@thoughtworks.com>
Subject Re: Detect when file is not being written by another process
Date Wed, 26 Sep 2012 08:52:48 GMT
I agree with Bejoy. The problem you describe sounds like building a
workflow, which is what Oozie is designed to do.

Thanks
hemanth

On Wed, Sep 26, 2012 at 12:22 AM, Bejoy Ks <bejoy.hadoop@gmail.com> wrote:

> Hi Peter
>
> AFAIK Oozie has a mechanism to achieve this. You can trigger your jobs as
> soon as the files are written to a certain HDFS directory.
>
>
> On Tue, Sep 25, 2012 at 10:23 PM, Peter Sheridan <
> psheridan@millennialmedia.com> wrote:
>
>>  These are log files being deposited by other processes, which we may
>> not have control over.
>>
>>  We don't want multiple processes to write to the same files — we just
>> don't want to start our jobs until they have been completely written.
>>
>>  Sorry for lack of clarity & thanks for the response.
>>
>>
>>  --Pete
>>
>>   From: Bertrand Dechoux <dechouxb@gmail.com>
>> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Date: Tuesday, September 25, 2012 12:33 PM
>> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Subject: Re: Detect when file is not being written by another process
>>
>>  Hi,
>>
>> Multiple files and aggregation or something like hbase?
>>
>> Could you tell us more about your context? What are the volumes? Why do
>> you want multiple processes to write to the same file?
>>
>> Regards
>>
>> Bertrand
>>
>> On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan <
>> psheridan@millennialmedia.com> wrote:
>>
>>>  Hi all.
>>>
>>>  We're using Hadoop 1.0.3.  We need to pick up a set of large (4+GB)
>>> files when they've finished being written to HDFS by a different process.
>>>  There doesn't appear to be an API specifically for this.  We had
>>> discovered through experimentation that the FileSystem.append() method can
>>> be used for this purpose — it will fail if another process is writing to
>>> the file.
>>>
>>>  However: when running this on a multi-node cluster, using that API
>>> actually corrupts the file.  Perhaps this is a known issue?  Looking at the
>>> bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a
>>> bunch of similar-sounding things.
>>>
>>>  What's the right way to solve this problem?  Thanks.
>>>
>>>
>>>  --Pete
>>>
>>>
>>
>>
>> --
>> Bertrand Dechoux
>>
>
>
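
For illustration only, here is a minimal sketch of the "done flag" idea behind the Oozie suggestion above: the reader treats a directory as complete only once a marker file appears in it. This assumes the writing process can be made to drop such a marker after its last file is closed, which may not hold when the writers are outside your control. The class name, the marker name `_DONE`, and the polling parameters are all hypothetical, and the sketch uses the local filesystem via java.nio as a stand-in; on HDFS the analogous check would go through FileSystem.exists(Path).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DoneFlagCheck {
    // Hypothetical marker name; writer and reader must agree on it.
    static final String DONE_FLAG = "_DONE";

    /** Returns true once the marker file exists inside the given directory. */
    static boolean isReady(Path dir) {
        return Files.isRegularFile(dir.resolve(DONE_FLAG));
    }

    /** Polls for the marker up to maxAttempts times, sleeping between checks. */
    static boolean waitUntilReady(Path dir, int maxAttempts, long sleepMillis)
            throws InterruptedException {
        for (int i = 0; i < maxAttempts; i++) {
            if (isReady(dir)) {
                return true;
            }
            Thread.sleep(sleepMillis);
        }
        // One final check after the last sleep.
        return isReady(dir);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Local temp directory standing in for an HDFS input directory.
        Path dir = Files.createTempDirectory("incoming");
        Files.write(dir.resolve("part-0.log"), "data".getBytes());

        // Data is present but the writer has not signalled completion yet.
        System.out.println("ready before flag: " + isReady(dir));

        // The writer drops the marker as its very last step.
        Files.createFile(dir.resolve(DONE_FLAG));
        System.out.println("ready after flag: " + waitUntilReady(dir, 3, 10L));
    }
}
```

The reader never opens the data files to probe them, so it avoids the append()-based check that was reported to corrupt files. The cost is that the writer has to cooperate by creating the marker last.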
