hadoop-mapreduce-user mailing list archives

From Drake 민영근 <drake....@nexr.com>
Subject Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce
Date Wed, 21 Jan 2015 02:00:40 GMT
Hi,

How about this? Keep the large model data in HDFS, but with many
replicas, and have the MapReduce program read the model from HDFS. Ideally
the replication factor of the model data equals the number of datanodes,
so with the short-circuit local read feature of the HDFS datanode, the map
or reduce tasks read the model data from their own local disks.

This way may use a lot of HDFS storage, but the annoying partition
problem will be gone.
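A minimal sketch of that setup (the path /knn/model.csv and the replica
count are made-up values; the two hdfs-site.xml keys are the standard
short-circuit read settings):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Raise the model file's replication so (almost) every datanode holds a
// local copy that map/reduce tasks can read from their own disk.
public class RaiseModelReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    fs.setReplication(new Path("/knn/model.csv"), (short) 10);
    // Short-circuit local reads are enabled cluster-side in hdfs-site.xml:
    //   dfs.client.read.shortcircuit = true
    //   dfs.domain.socket.path = /var/lib/hadoop-hdfs/dn_socket
  }
}

The same change from the shell: hdfs dfs -setrep -w 10 /knn/model.csv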

Thanks

Drake 민영근 Ph.D

On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni <unmeshabiju@gmail.com>
wrote:

> Is there any way?
> Waiting for a reply. I have posted the question everywhere, but no one is
> responding.
> I feel like this is the right place to ask doubts, as some of you may have
> come across the same issue and gotten stuck.
>
> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <unmeshabiju@gmail.com>
> wrote:
>
>> Yes, one of my friends is implementing the same. I know global sharing of
>> data is not possible across Hadoop MapReduce, but I need to check whether
>> it can be done somehow in Hadoop MapReduce as well, because I found some
>> papers on KNN in Hadoop too.
>> And I am trying to compare the performance as well.
>>
>> Hope some pointers can help me.
>>
>>
>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <ted.dunning@gmail.com>
>> wrote:
>>
>>>
>>> Have you considered implementing it using something like Spark? That
>>> could be much easier than raw MapReduce.
>>>
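A rough sketch of that route (the HDFS paths and the classify() helper are
hypothetical). Broadcasting pulls the model to the driver once and ships one
read-only copy per executor rather than per task, which is workable for a
1-2 GB model if the executors have the heap for it:

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class SparkKnnSketch {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("knn-sketch"));

    // Collect the model to the driver, then broadcast one copy per executor.
    List<String> model = sc.textFile("hdfs:///knn/model.csv").collect();
    Broadcast<List<String>> bcast = sc.broadcast(model);

    JavaRDD<String> predictions = sc.textFile("hdfs:///knn/test.csv")
        .map(record -> classify(record, bcast.value()));  // hypothetical helper
    predictions.saveAsTextFile("hdfs:///knn/predictions");
    sc.stop();
  }

  // Placeholder: scan the model, return the label of the nearest neighbour.
  private static String classify(String record, List<String> model) {
    // ... distance computation elided ...
    return "label";
  }
}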
>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <
>>> unmeshabiju@gmail.com> wrote:
>>>
>>>> In a KNN-like algorithm we need to load the model data into a cache
>>>> for predicting the records.
>>>>
>>>> Here is an example for KNN.
>>>>
>>>>
>>>> [inline image: KNN example]
>>>>
>>>> So if the model is a large file, say 1 or 2 GB, we will not be able to
>>>> load it into the distributed cache.
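For reference, the distributed cache mechanism in question is driven from
job setup, one call per file; a minimal sketch, where the HDFS path and the
"#model.part" symlink name are made up:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Attach a model file to the distributed cache; each task can then open it
// as a local file under the "model.part" symlink.
public class CacheSetupSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "knn");
    job.addCacheFile(new URI("hdfs:///knn/model/part-0000#model.part"));
  }
}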
>>>>
>>>> One way is to split/partition the model into several files, perform the
>>>> distance calculation for all records against each file, and then take
>>>> the minimum distance and the most frequent class label to predict the
>>>> outcome.
>>>>
>>>> How can we partition the file and perform the operation on these
>>>> partitions?
>>>>
>>>> i.e.  1st record <distance> partition1, partition2, ...
>>>>       2nd record <distance> partition1, partition2, ...
>>>>
>>>> This is what came to my thought; a rough sketch follows below.
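A minimal sketch of that scheme, with everything not stated in the mail
assumed for illustration: model records shaped "f1,...,fn,label", unlabeled
test records as the job input, k = 1, one job run per model partition (that
run's partition cached under the symlink "model.part"), and a second job
whose reducer merges the per-partition results per record:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PartitionedKnnSketch {

  /** Job 1 (run once per partition): best match within one model partition. */
  public static class PartitionMapper
      extends Mapper<LongWritable, Text, Text, Text> {

    private final List<double[]> features = new ArrayList<>();
    private final List<String> labels = new ArrayList<>();

    @Override
    protected void setup(Context ctx) throws IOException {
      // A single partition is small enough to hold in memory; "model.part"
      // is the cache symlink this particular job run was started with.
      try (BufferedReader in = new BufferedReader(new FileReader("model.part"))) {
        for (String line; (line = in.readLine()) != null; ) {
          String[] f = line.split(",");
          double[] v = new double[f.length - 1];
          for (int i = 0; i < v.length; i++) v[i] = Double.parseDouble(f[i]);
          features.add(v);
          labels.add(f[f.length - 1]);
        }
      }
    }

    @Override
    protected void map(LongWritable key, Text record, Context ctx)
        throws IOException, InterruptedException {
      String[] f = record.toString().split(",");
      double[] q = new double[f.length];
      for (int i = 0; i < q.length; i++) q[i] = Double.parseDouble(f[i]);

      double best = Double.MAX_VALUE;
      String bestLabel = "";
      for (int i = 0; i < features.size(); i++) {
        double d = squaredDistance(q, features.get(i));
        if (d < best) { best = d; bestLabel = labels.get(i); }
      }
      // record -> "distance,label": one candidate per model partition
      ctx.write(record, new Text(best + "," + bestLabel));
    }

    private static double squaredDistance(double[] a, double[] b) {
      int n = Math.min(a.length, b.length);
      double s = 0;
      for (int i = 0; i < n; i++) { double d = a[i] - b[i]; s += d * d; }
      return s;
    }
  }

  /** Job 2: merge the per-partition candidates, keeping the global minimum.
   *  (Its map side can be an identity mapper over Job 1's text output.) */
  public static class MergeReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text record, Iterable<Text> candidates, Context ctx)
        throws IOException, InterruptedException {
      double best = Double.MAX_VALUE;
      String bestLabel = "";
      for (Text c : candidates) {
        String[] p = c.toString().split(",", 2);
        double d = Double.parseDouble(p[0]);
        if (d < best) { best = d; bestLabel = p[1]; }
      }
      ctx.write(record, new Text(bestLabel));  // predicted class label
    }
  }
}

For k > 1, the mapper would emit its k best candidates per record, and the
reducer would merge them with a small priority queue before taking the
majority label.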
>>>>
>>>> Is there any further way?
>>>>
>>>> Any pointers would help me.
>>>>
>>>> --
>>>> *Thanks & Regards *
>>>>
>>>>
>>>> *Unmesha Sreeveni U.B*
>>>> *Hadoop, Bigdata Developer*
>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>> http://www.unmeshasreeveni.blogspot.in/
>>>>
>>>>
>>>>
