hadoop-user mailing list archives

From Alejandro Abdelnur <t...@cloudera.com>
Subject Re: Detect when file is not being written by another process
Date Thu, 27 Sep 2012 22:03:15 GMT
AFAIK there is no way to determine if a file has been fully written or not.

Oozie uses a feature of Hadoop which writes a _SUCCESS flag file in
the output directory of a job. This _SUCCESS file is written at job
completion time, thus ensuring all the output of the job is ready.
This means that when Oozie is configured to look for a directory FOO/,
in practice it looks for the existence of FOO/_SUCCESS file.
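
For example, a client could do the same check itself with the FileSystem
API (a minimal sketch; the path and class name are just placeholders):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SuccessFlagCheck {
    /** True once the job that produces outputDir has committed its output. */
    public static boolean isOutputReady(FileSystem fs, Path outputDir)
            throws IOException {
        // FileOutputCommitter drops _SUCCESS into the output directory at
        // job completion, so its presence means all output files are final.
        return fs.exists(new Path(outputDir, "_SUCCESS"));
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        System.out.println(isOutputReady(fs, new Path("/data/FOO")));
    }
}
```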

You can configure Oozie to look for the existence of FOO/ instead, but
this means you'll have to write the data to a temp dir, e.g. FOO_TMP/,
and do a rename to FOO/ once you've finished writing the data.
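
A minimal sketch of that pattern (paths are placeholders):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PublishByRename {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path tmp  = new Path("/data/FOO_TMP");   // staging dir, written first
        Path done = new Path("/data/FOO");       // the dir Oozie watches

        // ... write all output files under tmp ...

        // A single rename publishes the directory; HDFS renames are atomic,
        // so readers either see no FOO/ at all or a fully written one.
        if (!fs.rename(tmp, done)) {
            throw new IOException("rename " + tmp + " -> " + done + " failed");
        }
    }
}
```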

Thx

On Wed, Sep 26, 2012 at 1:52 AM, Hemanth Yamijala
<yhemanth@thoughtworks.com> wrote:
> Agree with Bejoy. The problem you've mentioned sounds like building a
> workflow, which is what Oozie is designed for.
>
> Thanks
> hemanth
>
>
> On Wed, Sep 26, 2012 at 12:22 AM, Bejoy Ks <bejoy.hadoop@gmail.com> wrote:
>>
>> Hi Peter
>>
>> AFAIK Oozie has a mechanism to achieve this. You can trigger your jobs as
>> soon as the files are written to a certain HDFS directory.
>>
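For reference, a minimal Oozie coordinator sketch of the trigger Bejoy
describes (app name, dates, and paths are illustrative):

```xml
<coordinator-app name="log-processing" frequency="${coord:days(1)}"
                 start="2012-09-25T00:00Z" end="2013-09-25T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <dataset name="logs" frequency="${coord:days(1)}"
             initial-instance="2012-09-25T00:00Z" timezone="UTC">
      <uri-template>hdfs:///data/logs/${YEAR}${MONTH}${DAY}</uri-template>
      <!-- done-flag defaults to _SUCCESS; an empty element would make the
           existence of the directory itself the trigger. -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="logs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/log-processing-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```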
>>
>> On Tue, Sep 25, 2012 at 10:23 PM, Peter Sheridan
>> <psheridan@millennialmedia.com> wrote:
>>>
>>> These are log files being deposited by other processes, which we may not
>>> have control over.
>>>
>>> We don't want multiple processes to write to the same files — we just
>>> don't want to start our jobs until they have been completely written.
>>>
>>> Sorry for lack of clarity & thanks for the response.
>>>
>>>
>>> --Pete
>>>
>>> From: Bertrand Dechoux <dechouxb@gmail.com>
>>> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>>> Date: Tuesday, September 25, 2012 12:33 PM
>>> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>>> Subject: Re: Detect when file is not being written by another process
>>>
>>> Hi,
>>>
>>> Multiple files and aggregation, or something like HBase?
>>>
>>> Could you tell us more about your context? What are the volumes? Why do
>>> you want multiple processes to write to the same file?
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>> On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan
>>> <psheridan@millennialmedia.com> wrote:
>>>>
>>>> Hi all.
>>>>
>>>> We're using Hadoop 1.0.3.  We need to pick up a set of large (4+GB)
>>>> files when they've finished being written to HDFS by a different process.
>>>> There doesn't appear to be an API specifically for this. We discovered
>>>> through experimentation that the FileSystem.append() method can be used for
>>>> this purpose: it will fail if another process is writing to the file.
>>>>
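A minimal sketch of the probe Peter describes (class and path names are
illustrative); note the corruption caveat in the paragraph that follows:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendProbe {
    /** True if no other client currently holds a write lease on the file. */
    public static boolean looksFullyWritten(FileSystem fs, Path file) {
        try {
            // Reopening for append fails while another writer holds the
            // lease (e.g. with AlreadyBeingCreatedException).
            FSDataOutputStream out = fs.append(file);
            out.close(); // release the lease we just acquired
            return true;
        } catch (IOException e) {
            return false; // lease conflict: file is still being written
        }
    }
}
```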
>>>> However: when running this on a multi-node cluster, using that API
>>>> actually corrupts the file.  Perhaps this is a known issue?  Looking at the
>>>> bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a bunch
>>>> of similar-sounding things.
>>>>
>>>> What's the right way to solve this problem?  Thanks.
>>>>
>>>>
>>>> --Pete
>>>>
>>>
>>>
>>>
>>> --
>>> Bertrand Dechoux
>>
>>
>



-- 
Alejandro
