hive-user mailing list archives

From vivek thakre <>
Subject Re: Lookup objects in distributed cache
Date Fri, 05 Apr 2013 03:52:02 GMT
Thanks, Jan, for your reply. This is helpful.


On Thu, Apr 4, 2013 at 12:11 AM, Jan Dolinár <> wrote:

> Hello Vivek,
> GenericUDTF has an initialize() method, which is called only once per task.
> So if you read your files in this method and keep the structures in memory,
> the overhead is relatively small (reading 15 MB per mapper is negligible
> compared to several GB of processed data).
> Best regards,
> Jan
> On Wed, Apr 3, 2013 at 10:35 PM, vivek thakre <> wrote:
>> Hello,
>> I want to implement some functionality as a UDTF. It involves reading
>> 7 different text files and creating lookup structures (Map, Set, List,
>> Map of String to List, etc.) to be used in the logic.
>> The files are small, averaging 15 MB each.
>> I can add these files to the distributed cache, access them in the UDTF,
>> read them, and create the necessary lookup data structures, but this would
>> mean that the files are opened, read, and closed every time the UDTF is
>> invoked.
>> Is there a way to read the files just once, create the data structures
>> needed, put them in the distributed cache, and access them from the UDTF?
>> I don't think creating Hive tables from these files and doing a map-side
>> join is possible, as the functionality I want to implement is fairly
>> complex, and I am not sure it can be done with just a Hive query and join,
>> without a UDTF.
>> Thanks in advance.
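
Jan's suggestion (read the files once in initialize() and keep the structures in memory) could look roughly like the sketch below. It is only an illustration under stated assumptions, not code from this thread: LookupTables is a hypothetical helper class, the tab-separated "key<TAB>value" file format is assumed, and the file name lookup.tsv in the comments is made up. The helper uses only the Java standard library, and the comments show where the call would sit in a GenericUDTF subclass.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Hive): builds an in-memory
// Map<String, List<String>> from one small side file.
class LookupTables {
    private final Map<String, List<String>> valuesByKey = new HashMap<>();

    // Reads the whole file once. Assumed format: one "key<TAB>value" per line.
    static LookupTables load(String path) throws IOException {
        LookupTables tables = new LookupTables();
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length < 2) {
                    continue; // skip lines without a tab separator
                }
                tables.valuesByKey
                      .computeIfAbsent(parts[0], k -> new ArrayList<>())
                      .add(parts[1]);
            }
        }
        return tables;
    }

    // Pure in-memory lookup; no file I/O happens here.
    List<String> lookup(String key) {
        return valuesByKey.getOrDefault(key, Collections.emptyList());
    }
}

// In a GenericUDTF subclass the load would go in initialize(), for example:
//
//   private LookupTables tables;
//
//   public StructObjectInspector initialize(ObjectInspector[] argOIs)
//           throws UDFArgumentException {
//       try {
//           // "lookup.tsv" is a made-up name; assumes the file was shipped
//           // to the task and is readable under this path.
//           tables = LookupTables.load("lookup.tsv");
//       } catch (IOException e) {
//           throw new UDFArgumentException("Cannot read lookup file: " + e);
//       }
//       // ... build and return the output StructObjectInspector ...
//   }
//
// process(Object[] args) would then only call tables.lookup(...).
```

With this split, the file is parsed once per task and every row handled by process() only touches the in-memory map. The same pattern would be repeated for each of the 7 files, with one field per lookup structure.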
