hadoop-hdfs-user mailing list archives

From Shahab Yunus <shahab.yu...@gmail.com>
Subject Re: File Reloading
Date Fri, 31 May 2013 15:59:45 GMT
I do not see Raj's response, but first: yes, you can overwrite a file at the
same location in HDFS/Hadoop as many times as you want. Secondly, you say
that the file is small and that you do want to read it as a whole. So, as I
said, making sure that the reader task gets the latest version becomes a
generic problem rather than one specific to Hadoop or HDFS. Basically, you
would adopt the same approach to resolve this on any file system. As far as
I understand, there is nothing special that you need to do for Hadoop/HDFS.
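
To make the "check whether it changed, then re-read" idea concrete, here is a
minimal sketch. It is only an illustration under some assumptions: it uses the
plain java.nio.file API on a local path rather than HDFS, and the class name
ChangeDetectingReader and the method readIfChanged are made up for this
example. It treats a SHA-256 digest of the content as the file's "signature".
On HDFS the same pattern should apply with the content read through
FileSystem.open(path), and the modification time from
FileSystem.getFileStatus(path).getModificationTime() could serve as a cheaper
change signal.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Optional;

// Hypothetical helper (not part of Hadoop): remembers a digest of the content
// it last returned and hands back the file only when the digest differs.
class ChangeDetectingReader {
    private final Path file;
    private byte[] lastDigest; // null until the first successful read

    ChangeDetectingReader(Path file) {
        this.file = file;
    }

    // Returns the whole file content if it differs from what the previous
    // call saw, or Optional.empty() if nothing has changed since then.
    Optional<String> readIfChanged() throws IOException {
        byte[] content = Files.readAllBytes(file); // the file is small, so read it whole
        byte[] digest = sha256(content);
        if (Arrays.equals(digest, lastDigest)) {
            return Optional.empty();
        }
        lastDigest = digest;
        return Optional.of(new String(content, StandardCharsets.UTF_8));
    }

    private static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every conforming Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

The never-ending job would call readIfChanged() at whatever interval suits it
and swap in the new values only when the result is present. The writer can
keep overwriting the file on its own schedule.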


On Fri, May 31, 2013 at 11:51 AM, Adamantios Corais <
adamantios.corais@gmail.com> wrote:

> @Raj: So, updating the data and storing it at the same destination
> would work?
> @Shahab: The file is very small, and therefore I expect to read it all at
> once. What would you suggest?
> On Fri, May 31, 2013 at 5:30 PM, Shahab Yunus <shahab.yunus@gmail.com> wrote:
>> I might not have understood your use case properly, so I apologize for
>> that.
>> But what I think you need here is something outside of Hadoop/HDFS. I am
>> presuming that you need to read the whole updated file when you are going
>> to process it with your never-ending job, right? You don't expect to read
>> it piecemeal or in chunks. If that is indeed the case, then your
>> never-ending job can use generic techniques to check whether the file's
>> signature or any other property has changed since the last time, and only
>> process it if it has changed. Your file writing/updating process can
>> update the file independently of the reading/processing one.
>> Regards,
>> Shahab
>> On Fri, May 31, 2013 at 11:23 AM, Adamantios Corais <
>> adamantios.corais@gmail.com> wrote:
>>> I am new to Hadoop, so I apologize beforehand for my very fundamental
>>> question.
>>> Let's assume that I have a file stored in Hadoop that gets updated
>>> once a day. Also assume that there is a task running at the back end of
>>> Hadoop that never stops. How could I reload this file so that Hadoop starts
>>> considering the updated values rather than the old ones?
