hive-user mailing list archives

From Ashutosh Chauhan <hashutosh@apache.org>
Subject Re: Reading and Writing with Hive 0.13 from a Yarn application
Date Wed, 03 Sep 2014 18:25:40 GMT
This API is designed precisely for use cases like yours, so I would say the
API is failing if it cannot support what you are trying to do with it. I
would encourage you to keep using this API and to treat the current
shortcoming as a missing feature.
Feel free to file a JIRA requesting the addition of these methods to
ReaderContext. Patches are welcome too :)
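
To make that concrete, the additions might look something like this
(method names are illustrative only, not a committed API):

    public interface ReaderContext {
      int numSplits();                        // all that 0.13 exposes today

      // hypothetical additions, mirroring InputSplit's accessors:
      String[] getLocations(int splitIndex);  // hosts holding the split's data
      long getLength(int splitIndex);         // split size in bytes
    }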

Hope it helps,
Ashutosh


On Wed, Sep 3, 2014 at 11:12 AM, Nathan Bamford <nathan.bamford@redpoint.net> wrote:

>  Hi Ashutosh,
>
>   Thanks for the reply!
>
>  Well, we are a YARN app that essentially does the same things
> MapReduce does. For regular files in Hadoop, we get the block locations
> and sizes and perform some internal sorting and load balancing on the
> master, which then creates the slave YARN apps on individual nodes for
> reading. We strive for data locality as much as possible.
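>
>   Concretely, for plain HDFS files that's just the standard FileSystem
> call; a rough sketch of what our master does (conf and path being
> whatever we already have in hand):
>
>     // uses org.apache.hadoop.fs.{FileSystem, FileStatus, BlockLocation}
>     FileSystem fs = FileSystem.get(conf);
>     FileStatus stat = fs.getFileStatus(path);
>     BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
>     for (BlockLocation block : blocks) {
>       long size = block.getLength();      // bytes in this block
>       String[] hosts = block.getHosts();  // nodes holding replicas
>       // feed size/hosts into our sorting and load-balancing step
>     }
>
> What we'd like from HCatalog is the same per-split size and host
> information.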
>
>   To interface with Hive, the HCatalog API seemed like the appropriate
> choice. It does a lot of what we want via ReadEntity, allowing us
> to query and read Hive tables at a high level.
>
>   I used the readerwriter example (from Hive 0.12) to get things
> running, but I used HCatSplits just like our internal split classes: I
> retrieved them from the ReaderContext, ran them through the same sorting
> algorithms, then serialized them and sent them to the individual YARN
> apps, and so on.
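>
>   Roughly, the master side looked like this (from memory, so treat the
> package names and the ReaderContext accessors as approximate):
>
>     // master side, Hive 0.12
>     ReadEntity entity = new ReadEntity.Builder().withTable("mytable").build();
>     HCatReader reader = DataTransferFactory.getHCatReader(entity, config);
>     ReaderContext cntxt = reader.prepareRead();
>     for (InputSplit split : cntxt.getSplits()) {  // no longer possible in 0.13
>       long size = split.getLength();
>       String[] hosts = split.getLocations();
>       // same sorting/load-balancing we use for our internal splits;
>       // assignments are then serialized out to the slave YARN apps
>     }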
>
>   I understand the rationale for the smaller API, which is why I wondered
> whether there's another avenue I should be pursuing as a YARN app (the
> metadata APIs vs. HCatalog, for instance).
>
>   All that being said :), the ability to get the block locations (and
> sizes, if possible) would certainly solve my problems.
>
>
>  Thanks,
>
>
>  Nathan
>
>
>  ------------------------------
> *From:* Ashutosh Chauhan <hashutosh@apache.org>
> *Sent:* Wednesday, September 3, 2014 9:16 AM
> *To:* user@hive.apache.org
> *Subject:* Re: Reading and Writing with Hive 0.13 from a Yarn application
>
>   Hi Nathan,
>
>  This was done in https://issues.apache.org/jira/browse/HIVE-6248.
> The reasoning was to minimize the API surface area exposed to users, so
> that they are insulated from incompatible changes in internal classes;
> that makes the API easier to consume without worrying about version
> upgrades. It seems some of the functionality went away in the process.
> Which info are you looking for, exactly? Is it a String[]
> getBlockLocations(), the equivalent of InputSplit's getLocations()? If
> so, we can consider adding that to ReaderContext, since it need not
> expose any Hadoop or Hive classes.
>
>  Thanks,
> Ashutosh
>
>
> On Tue, Sep 2, 2014 at 5:26 PM, Nathan Bamford <nathan.bamford@redpoint.net> wrote:
>
>>  Hi,
>>
>>   My company has been working on a YARN application for a couple of
>> years; we essentially take the place of MapReduce and handle our own
>> data splitting and processing.
>>
>>   One of the things we've been working to support is Hive access, and
>> the HCatalog interfaces and API seemed perfect. Using
>> https://hive.apache.org/javadocs/hcat-r0.5.0/readerwriter.html and
>> TestReaderWriter.java from the source code, I was able to create and use
>> HCatSplits to allow balanced, data-local parallel reading (using the size
>> and locations methods available from each HCatSplit).
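>>
>>   (The slave-side read, per that readerwriter page, goes along these
>> lines, where cntxt is the ReaderContext shipped over from the master and
>> slaveNum picks out this node's portion:
>>
>>     HCatReader reader = DataTransferFactory.getHCatReader(cntxt, slaveNum);
>>     Iterator<HCatRecord> itr = reader.read();
>>     while (itr.hasNext()) {
>>       HCatRecord record = itr.next();
>>       // hand each record to our processing pipeline
>>     }
>>
>> The master-side planning is where I relied on the HCatSplits.)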
>>
>>   Much to my dismay, 0.13 removes a lot of that functionality. The
>> ReaderContext class is now an interface that only exposes numSplits,
>> while all of the other methods live in the inaccessible
>> (package-private) ReaderContextImpl class.
>>
>>   Since I no longer have access to the actual HCatSplits from the
>> ReaderContext, I am unable to process them and send them to our YARN
>> apps on the data-local nodes. My only choice seems to be to partition
>> the splits out to slave nodes more or less at random.
>>
>>   Does anyone know if, as of 0.13, this is the intended way to
>> interface with Hive from non-MapReduce YARN applications? Is the
>> underlying HCatSplit now intended for internal use only?
>>
>>
>>  Thanks,
>>
>>
>>  Nathan Bamford
>>
>
>
