hadoop-hive-dev mailing list archives

From Dhruba Borthakur <dhr...@gmail.com>
Subject Re: Warehouse 'symlinks'
Date Mon, 20 Apr 2009 18:01:44 GMT
Hi Edward,

Nice explanation. Can you please describe your use case in the comments for
HADOOP-4044? It will help make the case for getting this JIRA into trunk
sooner rather than later.

thanks,
dhruba


On Mon, Apr 20, 2009 at 7:59 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:

> Actually, I am working to get the files moved into the warehouse by
> default :), but I still think there might be a general need for this.
>
> External tables will work in some cases but not in others. For example,
> suppose there is a directory inside Hadoop:
> /user/edward/weblogs/{web1.log,web2.log,web3.log}. I can use EXTERNAL
> to point to the parent directory. This will work unless a process
> creates another file in this directory with a different format that
> holds different data, say web_logsummary.csv. (This is my case.)
>
> Being able to drop in a 'symlink' where a file would go could be used
> like an SQL VIEW, or could be used to create structures from already
> existing data. Imagine a user who has a large Hadoop deployment and
> wishes to migrate to or start using Hive. They would need to recode
> application paths, because an external table is nice but not very
> flexible. With a 'symlink' concept, anyone could start using Hive
> without reorganizing or copying data.
>
> In the end, managing the 'symlinks' could get cumbersome, but I think
> it's a powerful concept. Right now Hive has a lot of facilities to deal
> with all input formats, such as specifying delimiters, which is
> super helpful, but forcing the data either into a warehouse or into an
> external table is limiting.
>
> On Mon, Apr 20, 2009 at 5:29 AM, Jeff Hammerbacher <hammer@cloudera.com>
> wrote:
> > Hey Edward,
> >
> > Can you just treat the files as external tables?
> >
> > Later,
> > Jeff
> >
> > On Sun, Apr 19, 2009 at 8:24 AM, Edward Capriolo <edlinuxguru@gmail.com>
> > wrote:
> >
> >> On Sun, Apr 19, 2009 at 3:19 AM, Dhruba Borthakur <dhruba@gmail.com>
> >> wrote:
> >> > HADOOP-4044 is scheduled to finally make it into the 0.21 release, and
> >> > 0.21 is still a while away.
> >> >
> >> > That said, if one imports a data-set (set of files, or directory) into a
> >> > warehouse, isn't it safer to move that dataset into the warehouse itself
> >> > rather than letting it sit outside? For one thing, the target of the
> >> > symlink might not be accessible to all hadoop slave nodes.
> >> >
> >> > -dhruba
> >> >
> >> >
> >> > On Sat, Apr 18, 2009 at 7:41 PM, Edward Capriolo <edlinuxguru@gmail.com>
> >> > wrote:
> >> >
> >> >> I was looking at HADOOP-4044. It would be nice to be able to work on
> >> >> files without moving them into the warehouse. Could a SerDe handle a
> >> >> similar task?
> >> >>
> >> >
> >>
> >> Yes, it would be safer to move it inside.
> >>
> >> The reason I would like to do this is that in our deployment, MapReduce
> >> programs are creating files outside of the warehouse. I do not want to
> >> move them into the warehouse and I do not want to copy them. Being
> >> able to 'symlink' would allow me to assemble virtual tables without
> >> moving data or changing the flow of an already existing process.
> >>
> >> So I am only looking to symlink to other files in the same filesystem.
> >> At the extreme end, a symlink to an external resource could be very
> >> useful too, but that is not what I was thinking of.
> >>
> >
>
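
For readers skimming the thread, the external-table limitation Edward
describes in the quoted message can be sketched roughly as below. This is
only an illustration in plain Python, not Hive or Hadoop code: the function
names external_table_inputs and symlink_table_inputs and the links list are
hypothetical, and the paths are the ones from his example.

```python
# Hypothetical illustration only; nothing here is a Hive or Hadoop API.

# Directory contents from the quoted example, including the stray summary
# file written by another process.
directory = [
    "/user/edward/weblogs/web1.log",
    "/user/edward/weblogs/web2.log",
    "/user/edward/weblogs/web3.log",
    "/user/edward/weblogs/web_logsummary.csv",
]


def external_table_inputs(listing, location):
    """Directory-based table: every file under `location` becomes input."""
    prefix = location.rstrip("/") + "/"
    return [path for path in listing if path.startswith(prefix)]


def symlink_table_inputs(listing, links):
    """'Symlink'-style table: only explicitly linked, existing files are input."""
    present = set(listing)
    return [path for path in links if path in present]


# Assumed set of links: just the three web logs.
links = [path for path in directory if path.endswith(".log")]

print(external_table_inputs(directory, "/user/edward/weblogs"))
print(symlink_table_inputs(directory, links))
```

The first call returns all four files, so the CSV with a different format
ends up in the table's input. The second returns only the three linked
logs, which is the per-file selection the 'symlink' idea is after.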
