hadoop-hive-dev mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Warehouse 'symlinks'
Date Mon, 20 Apr 2009 14:59:01 GMT
Actually, I am working to get the files moved into the warehouse by
default :), but I still think there might be a general need for this.

External tables will work in some cases but not in others. For example,
suppose there is a directory inside Hadoop:
/user/edward/weblogs/{web1.log,web2.log,web3.log}. I can use EXTERNAL
to point to the parent directory. This works until some process
creates another file in that directory with a different format that
holds different data, say web_logsummary.csv. (This is my case.)
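
To make the difference concrete, here is a minimal Python sketch. It is not
Hive code: the function names and the file listing are hypothetical, and it
only models which files each approach would treat as table input. The
directory-based version stands in for an external table, the explicit-list
version for the 'symlink' idea.

```python
import posixpath

# Hypothetical listing of the directory after another process has
# dropped a summary file with a different format into it.
DIRECTORY_LISTING = [
    "/user/edward/weblogs/web1.log",
    "/user/edward/weblogs/web2.log",
    "/user/edward/weblogs/web3.log",
    "/user/edward/weblogs/web_logsummary.csv",
]


def external_table_inputs(location, listing):
    """Directory-based table: every file directly under `location` is input."""
    location = location.rstrip("/")
    return [path for path in listing if posixpath.dirname(path) == location]


def symlink_table_inputs(links, listing):
    """Hypothetical 'symlink' table: only explicitly named, existing files are input."""
    existing = set(listing)
    return [path for path in links if path in existing]


external = external_table_inputs("/user/edward/weblogs", DIRECTORY_LISTING)
linked = symlink_table_inputs(
    [
        "/user/edward/weblogs/web1.log",
        "/user/edward/weblogs/web2.log",
        "/user/edward/weblogs/web3.log",
    ],
    DIRECTORY_LISTING,
)

print("external:", external)  # includes web_logsummary.csv
print("symlink: ", linked)    # only the three web logs
```

The directory-based table picks up web_logsummary.csv along with the logs,
while the explicit list does not.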

Being able to drop in a 'symlink' where a file would go could be used
like an SQL VIEW, or could be used to create structures from already
existing data. Imagine a user who has a large Hadoop deployment and
wishes to migrate to, or start using, Hive. They would need to recode
application paths, because an external table is nice but not very
flexible. With a 'symlink' concept, anyone could start using Hive
without reorganizing or copying data.

In the end, managing the 'symlinks' could get cumbersome, but I think
it's a powerful concept. Right now Hive has a lot of facilities to deal
with all input formats, such as specifying delimiters, and that is
super helpful, but forcing the data either into a warehouse or into an
external table is limiting.

On Mon, Apr 20, 2009 at 5:29 AM, Jeff Hammerbacher <hammer@cloudera.com> wrote:
> Hey Edward,
> Can you just treat the files as external tables?
> Later,
> Jeff
> On Sun, Apr 19, 2009 at 8:24 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>> On Sun, Apr 19, 2009 at 3:19 AM, Dhruba Borthakur <dhruba@gmail.com>
>> wrote:
>> > HADOOP-4044 is scheduled to finally make it to 0.21 release. And 0.21 is
>> > still a while away.
>> >
>> > That said, if one imports a data-set (set of files, or directory) into a
>> > warehouse, isn't it safer to move that dataset into the warehouse itself
>> > rather than letting it sit outside. For one thing, the target of the
>> > symlink might not be accessible to all hadoop slave nodes.
>> >
>> > -dhruba
>> >
>> >
>> > On Sat, Apr 18, 2009 at 7:41 PM, Edward Capriolo <edlinuxguru@gmail.com>
>> > wrote:
>> >
>> >> I was looking at HADOOP-4044. It would be nice to be able to work on
>> >> files without moving them into the warehouse. Could a SerDe handle a
>> >> similar task?
>> >>
>> >
>> Yes, it would be safer to move it inside.
>> The reason I would like to do this is that, in our deployment, map reduce
>> programs are creating files outside of the warehouse. I do not want to
>> move them into the warehouse and I do not want to copy them. Being
>> able to 'symlink' would allow me to assemble virtual tables without
>> moving data or changing the flow of an already existing process.
>> So I am only looking to symlink to other files in the same filesystem.
>> At the extreme end, a symlink to an external resource could be very
>> useful too, but that is not what I was thinking of.
