hadoop-user mailing list archives

From unmesha sreeveni <unmeshab...@gmail.com>
Subject Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce
Date Wed, 21 Jan 2015 05:12:24 GMT
Yes, I tried the same, Drake.

I don't know if I understood your answer correctly.

Instead of loading the model data in setup() through the distributed cache,
I read it directly from HDFS in the map phase, and for each incoming record
I computed the distance to every record in HDFS.
i.e., if R and S are my datasets, R is the model data stored in HDFS,
and as each record of S is taken for processing:
S1-R (finding the distance to the whole R set)
S2-R

But it is taking a long time, as it needs to compute the distance to all of R
for every record of S.
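
For reference, the per-record computation described above can be sketched in a few lines. This is a minimal plain-Python sketch, not Hadoop API code; the names `euclidean` and `knn_predict`, the choice of Euclidean distance, and the toy R and S data are all assumptions made for illustration. It makes the cost visible: every record of S is compared against all of R, so the work grows as |S| × |R|.

```python
import heapq
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, model, k):
    """Predict a label for `query` by scanning every (features, label) in `model`."""
    # One distance computation per model record: this full scan is the slow part.
    nearest = heapq.nsmallest(
        k, ((euclidean(query, features), label) for features, label in model)
    )
    # Majority vote among the k nearest labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: R is the model set, S holds the records to classify.
R = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((5.0, 5.0), "b"), ((5.5, 4.5), "b")]
S = [(0.9, 1.1), (5.2, 5.1)]

predictions = [knn_predict(s, R, k=3) for s in S]
print(predictions)
```

Reading R from HDFS instead of the distributed cache changes where the bytes come from, but not the number of distance computations, which may be why the run time stays high.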

On Wed, Jan 21, 2015 at 10:31 AM, Drake민영근 <drake.min@nexr.com> wrote:

> In my suggestion, map or reduce tasks do not use the distributed cache. They
> read the file directly from HDFS with short-circuit local reads. It works like
> a shared storage method, but with a high replication factor almost every node
> holds the data locally.
>
> Drake 민영근 Ph.D
>
> On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni <unmeshabiju@gmail.com>
> wrote:
>
>> But still, if the model is very large, how can we load it into the
>> distributed cache or something like that?
>> Here is one source:
>> http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf
>> But it is confusing me.
>>
>> On Wed, Jan 21, 2015 at 7:30 AM, Drake민영근 <drake.min@nexr.com> wrote:
>>
>>> Hi,
>>>
>>> How about this? The large model data stays in HDFS, but with many
>>> replicas, and the MapReduce program reads the model from HDFS. In theory, if
>>> the replication factor of the model data equals the number of data nodes,
>>> then with the Short Circuit Local Reads function of the HDFS datanode the
>>> map or reduce tasks read the model data from their own disks.
>>>
>>> This way may use too much HDFS storage, but the annoying
>>> partition problem will be gone.
>>>
>>> Thanks
>>>
>>> Drake 민영근 Ph.D
>>>
>>> On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni <unmeshabiju@gmail.com>
>>> wrote:
>>>
>>>> Is there any way?
>>>> I am waiting for a reply. I have posted the question everywhere, but no one
>>>> is responding.
>>>> I feel this is the right place to ask, as some of you may have
>>>> come across the same issue and got stuck.
>>>>
>>>> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <
>>>> unmeshabiju@gmail.com> wrote:
>>>>
>>>>> Yes, one of my friends is implementing the same. I know global sharing
>>>>> of data is not possible across Hadoop MapReduce, but I need to check if
>>>>> that can be done somehow in Hadoop MapReduce as well, because I found
>>>>> some papers on KNN in Hadoop too.
>>>>> And I am trying to compare the performance too.
>>>>>
>>>>> Hope some pointers can help me.
>>>>>
>>>>>
>>>>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <ted.dunning@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Have you considered implementing this using something like Spark? That
>>>>>> could be much easier than raw MapReduce.
>>>>>>
>>>>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <
>>>>>> unmeshabiju@gmail.com> wrote:
>>>>>>
>>>>>>> In a KNN-like algorithm we need to load the model data into the cache
>>>>>>> for predicting the records.
>>>>>>>
>>>>>>> Here is an example of KNN.
>>>>>>>
>>>>>>>
>>>>>>> [image: Inline image 1]
>>>>>>>
>>>>>>> So if the model is a large file, say 1 or 2 GB, we will not be able to
>>>>>>> load it into the distributed cache.
>>>>>>>
>>>>>>> One way is to split/partition the model result into several files,
>>>>>>> perform the distance calculation for all records in each file, and
>>>>>>> then find the minimum distance and the most frequent class label to
>>>>>>> predict the outcome.
>>>>>>>
>>>>>>> How can we partition the file and perform the operation on these
>>>>>>> partitions?
>>>>>>>
>>>>>>> i.e. 1st record <Distance> partition1, partition2, ...
>>>>>>>      2nd record <Distance> partition1, partition2, ...
>>>>>>>
>>>>>>> This is what came to my mind.
>>>>>>>
>>>>>>> Is there any other way?
>>>>>>>
>>>>>>> Any pointers would help me.
>>>>>>>
>>>>>>> --
>>>>>>> *Thanks & Regards *
>>>>>>>
>>>>>>>
>>>>>>> *Unmesha Sreeveni U.B*
>>>>>>> *Hadoop, Bigdata Developer*
>>>>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>>>>> http://www.unmeshasreeveni.blogspot.in/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
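
The partitioning idea from the original question (quoted above) can be sketched as follows. This is a minimal plain-Python sketch under assumptions, not Hadoop API code: the names `local_top_k` and `merge_and_vote`, the Euclidean distance, and the two toy partitions are hypothetical. Each partition of the model contributes only its own k nearest candidates for a record, and the candidates are then merged before the class-label vote.

```python
import heapq
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_top_k(query, partition, k):
    """Return up to k (distance, label) pairs nearest to `query` in one partition."""
    return heapq.nsmallest(
        k, ((euclidean(query, features), label) for features, label in partition)
    )


def merge_and_vote(candidate_lists, k):
    """Merge per-partition candidates, keep the global k nearest, then vote."""
    merged = (pair for candidates in candidate_lists for pair in candidates)
    best = heapq.nsmallest(k, merged)
    votes = Counter(label for _, label in best)
    return votes.most_common(1)[0][0]


# Hypothetical model data split into two partitions (e.g. two smaller files).
partitions = [
    [((1.0, 1.0), "a"), ((5.0, 5.0), "b")],
    [((1.2, 0.8), "a"), ((5.5, 4.5), "b")],
]

query = (0.9, 1.1)
candidates = [local_top_k(query, part, k=3) for part in partitions]
prediction = merge_and_vote(candidates, k=3)
print(prediction)
```

Because the global k nearest neighbours must each be among the k nearest of their own partition, merging the local lists gives the same neighbours as a scan over the whole model. One possible mapping onto MapReduce (an assumption, not something confirmed in this thread) is to compute the local candidates per partition in the map phase, keyed by record, and to do the merge and vote in the reduce phase.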


-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/
