hive-user mailing list archives

From Qing Yan <qing...@gmail.com>
Subject Re: How to simplify our development flow under the means of using Hive?
Date Mon, 23 Feb 2009 05:33:39 GMT
Theoretically, the streaming facility is as powerful as raw Java M/R
programs: less efficient, but easier to use (arguably). It does support secondary
sort (DISTRIBUTE BY ... SORT BY ...). Though I do agree with you that not
everything can be easily expressed in SQL, that's
why the TRANSFORM/streaming facility is included in Hive.
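For example, a secondary-sort style TRANSFORM job might look roughly like the script below. This is only a sketch: the `weblogs` table, its columns, and the `sessionize.py` name are made up for illustration, and the script relies on Hive distributing rows by user and sorting them by (user, timestamp) before streaming them in.

```python
#!/usr/bin/env python
# Sketch of a streaming reducer for use with Hive's TRANSFORM, assuming a
# hypothetical query like:
#
#   FROM (SELECT user_id, ts, url FROM weblogs
#         DISTRIBUTE BY user_id SORT BY user_id, ts) t
#   SELECT TRANSFORM(t.user_id, t.ts, t.url)
#   USING 'python sessionize.py' AS (user_id, first_ts, last_ts, hits)
#
# so rows arrive grouped by user_id and time-ordered within each user
# (i.e. the secondary sort has already happened before the script runs).
import sys

def reduce_rows(lines):
    """Collapse time-ordered tab-separated rows per user into one summary row."""
    out = []
    cur_user, first_ts, last_ts, hits = None, None, None, 0
    for line in lines:
        user, ts, _url = line.rstrip("\n").split("\t")
        if user != cur_user:
            if cur_user is not None:
                out.append((cur_user, first_ts, last_ts, hits))
            cur_user, first_ts, hits = user, ts, 0
        last_ts = ts
        hits += 1
    if cur_user is not None:
        out.append((cur_user, first_ts, last_ts, hits))
    return out

if __name__ == "__main__":
    # Hive feeds rows on stdin and reads result rows from stdout.
    for row in reduce_rows(sys.stdin):
        print("\t".join(str(f) for f in row))
```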

Regarding the scheduling part, my guess is that it is beyond the scope of Hive.
Can you just use a shell script?
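Something as simple as the following could chain the two kinds of jobs, stopping if an earlier step fails. The commands here are echo placeholders; the real `hadoop jar` and `hive -e` invocations (job names, table names, etc. are hypothetical) would go in their place.

```shell
#!/bin/sh
# Run the raw MapReduce step, then the Hive step; set -e aborts the
# script as soon as any step exits non-zero.
set -e

run_pipeline() {
  # Step 1: raw MapReduce preprocessing (placeholder; a real script would
  # run something like: hadoop jar our-etl.jar com.example.Preprocess ...)
  echo "preprocess step done"

  # Step 2: Hive aggregation over the preprocessed data (placeholder for
  # something like: hive -e 'INSERT OVERWRITE TABLE daily_stats SELECT ...')
  echo "hive step done"
}

run_pipeline && echo "pipeline finished"
```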

My 2 cents,

Qing

On Mon, Feb 23, 2009 at 11:59 AM, Min Zhou <coderplay@gmail.com> wrote:

> Hi Prasad ,
>
> This is just streaming, a sort of technique to complement the abilities of
> Hive SQL. Sometimes even this trick falls short. For example, if I want to
> do jobs like secondary sort, can that way work?
> My main intention is to know how to schedule those two things, Hive and raw
> MapReduce.
>
> On Mon, Feb 23, 2009 at 11:47 AM, Prasad Chakka <prasad@facebook.com> wrote:
>
>> You can use custom mapper and reducer scripts via the TRANSFORM/MAP/REDUCE
>> facilities. Check the wiki on how to use them. Or do you want something
>> different?
>>
>>
>>
>>
>> ------------------------------
>> *From: *Min Zhou <coderplay@gmail.com>
>> *Reply-To: *<hive-user@hadoop.apache.org>
>> *Date: *Sun, 22 Feb 2009 19:42:50 -0800
>> *To: *<hive-user@hadoop.apache.org>
>> *Subject: *How to simplify our development flow under the means of using
>> Hive?
>>
>>
>> Hi list,
>>
>>     I'm going to take Hive into production to analyze our web logs, which
>> are hundreds of gigabytes per day. Previously, we did this job with Apache
>> Hadoop, running our raw MapReduce code. It worked, but it also directly
>> decreased our productivity. We were suffering from writing code with
>> similar logic again and again. It could be worse when the format of our
>> logs changed. For example, when we wanted to insert one more field in each
>> line of the log, the previous work became useless and we had to redo it.
>> Hence we are thinking about using Hive as a persistence layer, to store
>> and retrieve the schemas of the data easily. But we found that sometimes
>> Hive could not do certain kinds of complex analysis, because of the
>> limited expressive power of SQL. We have to write our own UDFs, and even
>> so, there are some difficulties Hive still cannot overcome. Thus we also
>> need to write raw MapReduce code, which brings us up against another
>> problem: since one is a set of SQL scripts and the other is pieces of Java
>> or hybrid code, how do we coordinate Hive and raw MapReduce code, and how
>> do we schedule them? How does Facebook use Hive? And what is your solution
>> when you come across similar problems?
>>
>>     In the end, we are considering using Hive as our data warehouse.
>> Any suggestions?
>>
>> Thanks in advance!
>> Min
>>
>> --
>> My research interests are distributed systems, parallel computing and
>> bytecode based virtual machine.
>>
>> http://coderplay.javaeye.com
>>
>>
>
> Regards,
>
> Min
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> http://coderplay.javaeye.com
>
