hadoop-common-user mailing list archives

From unmesha sreeveni <unmeshab...@gmail.com>
Subject Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce
Date Wed, 21 Jan 2015 04:49:15 GMT
But still, if the model is very large, how can we load it into the
distributed cache or something like that?
Here is one source: http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf
But it is confusing me.

On Wed, Jan 21, 2015 at 7:30 AM, Drake민영근 <drake.min@nexr.com> wrote:

> Hi,
>
> How about this? The large model data stays in HDFS, but with many
> replicas, and the MapReduce program reads the model from HDFS. In theory,
> the replication factor of the model data equals the number of data nodes,
> and with the short-circuit local reads feature of the HDFS datanode, the
> map or reduce tasks read the model data from their own disks.
>
> This way may use a lot of HDFS capacity, but the annoying partition
> problem will be gone.
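As an illustrative fragment (not from the thread): the replication-plus-short-circuit-reads idea boils down to one CLI command and two standard HDFS client settings. The file path and the replication factor of 10 are placeholders; `dfs.client.read.shortcircuit` and `dfs.domain.socket.path` are the usual property names for short-circuit local reads.

```shell
# Raise the replication factor of the model file so that, ideally,
# every datanode holds a local copy. "10" is a stand-in for the
# number of datanodes in the cluster; -w waits until replication
# completes.
hdfs dfs -setrep -w 10 /models/knn-model.csv

# hdfs-site.xml: enable short-circuit local reads so map/reduce tasks
# read local replicas directly from disk instead of streaming them
# through the datanode's TCP path.
#
#   <property>
#     <name>dfs.client.read.shortcircuit</name>
#     <value>true</value>
#   </property>
#   <property>
#     <name>dfs.domain.socket.path</name>
#     <value>/var/lib/hadoop-hdfs/dn_socket</value>
#   </property>
```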
>
> Thanks
>
> Drake 민영근 Ph.D
>
> On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni <unmeshabiju@gmail.com>
> wrote:
>
>> Is there any way?
>> Waiting for a reply. I have posted the question everywhere, but no one is
>> responding.
>> I feel like this is the right place to ask doubts, as some of you may have
>> come across the same issue and gotten stuck.
>>
>> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <unmeshabiju@gmail.com
>> > wrote:
>>
>>> Yes, one of my friends is implementing the same. I know global sharing of
>>> data is not possible across Hadoop MapReduce, but I need to check whether
>>> it can be done somehow in Hadoop MapReduce as well, because I found some
>>> papers on KNN on Hadoop too.
>>> And I am trying to compare the performance as well.
>>>
>>> Hope some pointers can help me.
>>>
>>>
>>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <ted.dunning@gmail.com>
>>> wrote:
>>>
>>>>
>>>> Have you considered implementing this using something like Spark? That
>>>> could be much easier than raw MapReduce.
>>>>
>>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <
>>>> unmeshabiju@gmail.com> wrote:
>>>>
>>>>> In a KNN-like algorithm, we need to load the model data into a cache
>>>>> for predicting the records.
>>>>>
>>>>> Here is the example for KNN.
>>>>>
>>>>>
>>>>> [image: Inline image 1]
>>>>>
>>>>> So if the model is a large file, say 1 or 2 GB, will we be able to
>>>>> load it into the distributed cache?
>>>>>
>>>>> One way is to split/partition the model result into some files, perform
>>>>> the distance calculation for all records in each file, and then find
>>>>> the min distance and the most frequent class label to predict the
>>>>> outcome.
>>>>>
>>>>> How can we partition the file and perform the operation on these
>>>>> partitions?
>>>>>
>>>>> i.e.  1st record <Distance> partition1, partition2, ....
>>>>>       2nd record <Distance> partition1, partition2, ...
>>>>>
>>>>> This is what came to my mind.
>>>>>
>>>>> Is there any other way?
>>>>>
>>>>> Any pointers would help me.
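The split-and-merge scheme described above can be sketched in plain Java, without any Hadoop dependencies: each partition produces its local k nearest candidates (the "map" side), the candidates are merged, and the majority label among the global k nearest wins (the "reduce" side). Class, method, and field names here are made up for illustration; this is a sketch of the idea, not the poster's code.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KnnPartition {

    // One labeled model record: feature vector plus class label.
    static class Record {
        final double[] x;
        final String label;
        Record(String label, double... x) { this.label = label; this.x = x; }
    }

    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    // "Map" step: the k nearest (distance, label) pairs within one partition.
    static List<Map.Entry<Double, String>> localTopK(List<Record> partition,
                                                     double[] query, int k) {
        List<Map.Entry<Double, String>> candidates = new ArrayList<>();
        for (Record r : partition) {
            candidates.add(new AbstractMap.SimpleEntry<>(distance(r.x, query), r.label));
        }
        candidates.sort(Map.Entry.comparingByKey());
        return candidates.subList(0, Math.min(k, candidates.size()));
    }

    // "Reduce" step: merge the per-partition candidates, keep the global
    // k nearest, and take a majority vote on the class label.
    static String predict(List<List<Record>> partitions, double[] query, int k) {
        List<Map.Entry<Double, String>> merged = new ArrayList<>();
        for (List<Record> p : partitions) {
            merged.addAll(localTopK(p, query, k));
        }
        merged.sort(Map.Entry.comparingByKey());
        Map<String, Integer> votes = new HashMap<>();
        for (Map.Entry<Double, String> c : merged.subList(0, Math.min(k, merged.size()))) {
            votes.merge(c.getValue(), 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // Two toy partitions standing in for the split model files.
        List<Record> part1 = Arrays.asList(new Record("A", 1, 1), new Record("A", 1, 2));
        List<Record> part2 = Arrays.asList(new Record("B", 9, 9), new Record("A", 2, 1));
        // The three nearest neighbors of (1.5, 1.5) are all labeled "A".
        System.out.println(predict(Arrays.asList(part1, part2), new double[]{1.5, 1.5}, 3));
    }
}
```

The key property is that keeping only the local top k from each partition cannot lose a global top-k neighbor, so the partitions can be processed independently (one map task per model file) and combined in a single small reduce.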
>>>>>
>>>>> --
>>>>> *Thanks & Regards *
>>>>>
>>>>>
>>>>> *Unmesha Sreeveni U.B*
>>>>> *Hadoop, Bigdata Developer*
>>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>>> http://www.unmeshasreeveni.blogspot.in/
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> *Thanks & Regards *
>>>
>>>
>>> *Unmesha Sreeveni U.B*
>>> *Hadoop, Bigdata Developer*
>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>> http://www.unmeshasreeveni.blogspot.in/
>>>
>>>
>>>
>>
>>
>> --
>> *Thanks & Regards *
>>
>>
>> *Unmesha Sreeveni U.B*
>> *Hadoop, Bigdata Developer*
>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>> http://www.unmeshasreeveni.blogspot.in/
>>
>>
>>
>


-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/
