hive-user mailing list archives

From: Ashutosh Chauhan <>
Subject: Re: Reading and Writing with Hive 0.13 from a Yarn application
Date: Wed, 03 Sep 2014 18:25:40 GMT
This API is designed exactly for use cases like yours, so I would say the API
is failing if it cannot service what you are trying to do with it. I encourage
you to keep using this API and to treat the current shortcoming as a missing
feature.
Feel free to file a JIRA requesting the addition of these methods to
ReaderContext. Patches are welcome too :)
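For illustration, here is roughly the shape such an addition could take. This
is only a sketch; the method names, the split-index parameter, and the
signatures below are hypothetical, not a committed design:

    // Hypothetical additions to ReaderContext
    // (org.apache.hive.hcatalog.data.transfer in 0.13).
    // Names and signatures are illustrative only.
    public interface ReaderContext {

      // Already exposed in 0.13.
      int numSplits();

      // Hosts holding the blocks of split i, mirroring
      // InputSplit.getLocations() without exposing any Hadoop classes.
      String[] getLocations(int splitId);

      // Size in bytes of split i, mirroring InputSplit.getLength().
      long getSize(int splitId);
    }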

Hope it helps,

On Wed, Sep 3, 2014 at 11:12 AM, Nathan Bamford <> wrote:

>  Hi Ashutosh,
>   Thanks for the reply!
>   Well, we are a YARN app that is essentially doing the same things
> MapReduce does. For regular files in Hadoop, we get the block locations and
> sizes and perform some internal sorting and load balancing on the master,
> which then creates the slave YARN apps on individual nodes for reading. We
> strive for data locality as much as possible.
>   To interface with Hive, the HCatalog API seemed like the appropriate
> choice. It does a lot of what we want via ReadEntity, allowing us to query
> and read Hive tables at a high level.
>   I used the readerwriter example (from Hive 0.12) to get things running,
> treating HCatSplit just like our internal split classes: I retrieved the
> splits from the ReaderContext, ran them through the same sorting
> algorithms, then serialized them and sent them to the individual YARN
> apps, roughly as in the sketch below.
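>
> For reference, a from-memory sketch of our master-side flow (package names
> and exact signatures are as I recall them from the 0.12 data transfer API
> and may not be exact):
>
>     import java.util.Map;
>     import org.apache.hadoop.mapreduce.InputSplit;
>     import org.apache.hcatalog.data.transfer.DataTransferFactory;
>     import org.apache.hcatalog.data.transfer.HCatReader;
>     import org.apache.hcatalog.data.transfer.ReadEntity;
>     import org.apache.hcatalog.data.transfer.ReaderContext;
>
>     public class MasterPlanner {
>       // "mydb" and "mytable" are example names.
>       public static void plan(Map<String, String> config) throws Exception {
>         ReadEntity entity = new ReadEntity.Builder()
>             .withDatabase("mydb")
>             .withTable("mytable")
>             .build();
>         HCatReader master = DataTransferFactory.getHCatReader(entity, config);
>         ReaderContext cntxt = master.prepareRead();
>
>         // In 0.12 the splits were reachable from the ReaderContext, so we
>         // could sort and load-balance them by size and location ourselves.
>         for (InputSplit split : cntxt.getSplits()) {
>           long size = split.getLength();
>           String[] hosts = split.getLocations();
>           // ... feed size/hosts into our balancing, then serialize the
>           // split and ship it to the chosen slave node ...
>         }
>       }
>     }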
>   I understand the rationale for the smaller API, which is why I wondered
> whether there's another avenue I should be pursuing as a YARN app (the
> metadata APIs vs. HCatalog, for instance).
>   All that being said :), the ability to get the block locations (and
> sizes, if possible) would certainly solve my problems.
>  Thanks,
>  Nathan
>  ------------------------------
> *From:* Ashutosh Chauhan <>
> *Sent:* Wednesday, September 3, 2014 9:16 AM
> *To:*
> *Subject:* Re: Reading and Writing with Hive 0.13 from a Yarn application
>   Hi Nathan,
>  This was done in
> The reasoning was to minimize the API surface area exposed to users, so
> that they are immune to incompatible changes in internal classes; this
> makes the API easier to consume and removes worries about version
> upgrades. It seems that in the process some of the functionality went
> away.
> Which info are you looking for exactly? Is it the equivalent of
> InputSplit's String[] getBlockLocations()? If so, we can consider adding
> that to ReaderContext, since it need not expose any Hadoop or Hive
> classes.
>  Thanks,
> Ashutosh
> On Tue, Sep 2, 2014 at 5:26 PM, Nathan Bamford <> wrote:
>>  Hi,
>>   My company has been working on a YARN application for a couple of
>> years; we essentially take the place of MapReduce and split our data and
>> processing ourselves.
>>   One of the things we've been working to support is Hive access, and
>> the HCatalog interfaces and API seemed perfect. Using this information:
>> <>
>> and the source code, I was able to create and use HCatSplits to allow
>> balanced, data-local parallel reading (using the size and location
>> methods available from each HCatSplit).
>>   Much to my dismay, 0.13 removes a lot of that functionality. The
>> ReaderContext class is now an interface that exposes only numSplits,
>> whereas all of the other methods are in the inaccessible (package-only)
>> ReaderContextImpl class.
>>   Since I no longer have access to the actual HCatSplits from the
>> ReaderContext, I am unable to process them and send them to our YARN app
>> on the data-local nodes. My only choice seems to be to partition out the
>> splits to slave nodes more or less at random, as in the sketch below.
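>>
>>   For illustration, the flow we're left with in 0.13 looks roughly like
>> this (a sketch; I'm assuming the org.apache.hive.hcatalog.data.transfer
>> package and the slave-number variant of getHCatReader, so details may be
>> off):
>>
>>     import java.util.Iterator;
>>     import java.util.Map;
>>     import org.apache.hive.hcatalog.data.HCatRecord;
>>     import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
>>     import org.apache.hive.hcatalog.data.transfer.HCatReader;
>>     import org.apache.hive.hcatalog.data.transfer.ReadEntity;
>>     import org.apache.hive.hcatalog.data.transfer.ReaderContext;
>>
>>     public class BlindPartitioning {
>>       // Master: numSplits() is now the only planning information.
>>       static ReaderContext prepare(Map<String, String> config) throws Exception {
>>         ReadEntity entity = new ReadEntity.Builder().withTable("mytable").build();
>>         ReaderContext cntxt =
>>             DataTransferFactory.getHCatReader(entity, config).prepareRead();
>>         int numSplits = cntxt.numSplits();
>>         // numSplits is all we get: no sizes or locations, so split i is
>>         // dealt to slave (i % numSlaves) more or less at random, then
>>         // cntxt is serialized and shipped to the slaves.
>>         return cntxt;
>>       }
>>
>>       // Slave: read the split it was assigned, wherever it happens to run.
>>       static void readSplit(ReaderContext cntxt, int splitId) throws Exception {
>>         HCatReader reader = DataTransferFactory.getHCatReader(cntxt, splitId);
>>         Iterator<HCatRecord> itr = reader.read();
>>         while (itr.hasNext()) {
>>           HCatRecord rec = itr.next();
>>           // ... hand the record to our processing pipeline ...
>>         }
>>       }
>>     }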
>>   Does anyone know if, as of 0.13, this is the intended way to interface
>> with Hive from non-Hadoop YARN applications? Is the underlying HCatSplit
>> intended only for internal use now?
>>  Thanks,
>>  Nathan Bamford
