Subject: Re: Project ideas
From: Juan Suero <juan.suero@gmail.com>
To: user@hadoop.apache.org, Sai Sai
Date: Tue, 21 May 2013 23:35:28 -0400

I'm a newbie, but maybe this will also add some value...

It is my understanding that MapReduce is like a distributed "group by" statement.

When you run a statement like that against petabytes of data it can take a long time, first and foremost because before you can apply the group-by logic you have to read the data off disk. If your disk reads at 100 MB/s you can do the math: a single disk needs about 10^7 seconds, roughly 116 days, just to scan one petabyte, so the query will take at least that long to complete. That's a problem if you need the answer fast, like in the next hour to support, I dunno, personalization features on an e-commerce site, or a month-end report that needs to be complete in 2 hours.
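To make the "distributed group by" idea concrete, here is a minimal sketch of a Hadoop MapReduce job doing roughly what SELECT key, SUM(value) ... GROUP BY key would do. The class names and the tab-separated key<TAB>value input format are my own assumptions, not something from this thread:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A distributed GROUP BY: the map side emits (groupKey, value), the shuffle
// brings every value for a key to one reducer, and the reducer aggregates.
public class GroupBySum {

  public static class GroupMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Assumed input: "key<TAB>value" per line (my assumption).
      String[] fields = line.toString().split("\t");
      if (fields.length == 2) {
        ctx.write(new Text(fields[0]),
                  new LongWritable(Long.parseLong(fields[1])));
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();  // SUM(value) for this group
      }
      ctx.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "group-by-sum");
    job.setJarByClass(GroupBySum.class);
    job.setMapperClass(GroupMapper.class);
    job.setCombinerClass(SumReducer.class);  // pre-aggregate on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The point for this discussion: each mapper reads only its own block of the input file, so the disk reads happen in parallel across the cluster, which is exactly the effect described next.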
Then it would be nice to put equal parts of your data on hundreds of disks and run the same algorithm in parallel (I'll sketch the numbers below). But that only helps if your bottleneck is disk. What if your dataset is relatively small, but the calculation done on each incoming element is large? Then your bottleneck is CPU power.

There are a lot of bottlenecks you could run into:
- number of threads
- memory
- latency of remote APIs or a remote database you hit as you analyze the data

There's a book called Programming Collective Intelligence from O'Reilly that should help you out too:
http://shop.oreilly.com/product/9780596529321.do
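Here is the back-of-envelope sketch promised above; just my own worked example using the 100 MB/s figure from earlier:

// Back-of-envelope: time to scan a 1 PB dataset at 100 MB/s per disk,
// with the data spread evenly over N disks read in parallel.
public class ScanTime {
  public static void main(String[] args) {
    double bytes = 1e15;         // 1 petabyte
    double ratePerDisk = 100e6;  // 100 MB/s per disk
    for (int disks : new int[] {1, 100, 1000}) {
      double hours = bytes / (ratePerDisk * disks) / 3600.0;
      System.out.printf("%4d disks: %8.1f hours%n", disks, hours);
    }
  }
}

That prints roughly 2778 hours (about 116 days) on one disk, about 28 hours on 100 disks, and about 2.8 hours on 1000; to land that month-end report inside a 2-hour window you'd want on the order of 1400 disks reading in parallel.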
On Tue, May 21, 2013 at 11:02 PM, Sai Sai <saigraph@yahoo.in> wrote:

> Excellent Sanjay, really excellent input. Many thanks for this input.
> I have always been thinking about some ideas but never knew what to
> proceed with.
> Thanks again.
> Sai
>
> ------------------------------
> *From:* Sanjay Subramanian <Sanjay.Subramanian@wizecommerce.com>
> *To:* "user@hadoop.apache.org" <user@hadoop.apache.org>
> *Sent:* Tuesday, 21 May 2013 11:51 PM
> *Subject:* Re: Project ideas
>
> +1
>
> My $0.02: look around and see problems u can solve… It's better to
> get a list of problems and see if u can model a solution using the
> map-reduce framework.
>
> An example is as follows.
>
> PROBLEM
> Build a car pricing model based on advertisements on Craigslist.
>
> OBJECTIVE
> Recommend a price to the Craigslist car seller when the user gives info
> about make, model, color, miles.
>
> DATA REQUIRED
> Collect RSS feeds daily from Craigslist (don't pound their website, else
> they will lock u down).
>
> DESIGN COMPONENTS
> - Daily RSS Collector - pulls data and puts it into HDFS
> - Data Loader - structures the columns u need to analyze and puts them
>   into HDFS
> - Hive Aggregator and Analyzer - studies and queries the data and brings
>   out recommendation models for car pricing
> - REST Web Service to return query results in XML/JSON
> - iPhone App that talks to the web service and gets info
>
> There u go… this should keep a couple of students busy for 3 months.
>
> I find this kind of problem statement and solution simpler to
> understand because it's all there in the real world!
>
> An example of my way of thinking led me to found a non-profit
> called www.medicalsidefx.org that gives users valuable metrics regarding
> medical side fx.
> It uses Hadoop to aggregate and Lucene to search… This year I am
> redesigning the core to use Hive :-)
>
> Good luck
>
> Sanjay
>
>
> From: Michael Segel <michael_segel@hotmail.com>
> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Date: Tuesday, May 21, 2013 6:46 AM
> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: Re: Project ideas
>
> Drink heavily?
>
> Sorry.
>
> Let me rephrase.
>
> Part of the exercise is for you, the student, to come up with the idea,
> not to solicit someone else for a suggestion. This is how you learn.
>
> The exercise is to get you to think about the following:
>
> 1) What is Hadoop?
> 2) How does it work?
> 3) Why would you want to use it?
>
> You need to understand #1 and #2 to be able to answer #3.
>
> But at the same time... you need to incorporate your own view of
> the world.
> What are your hobbies? What do you like to do?
> What scares you the most? What excites you the most?
> Why are you here?
> And most importantly, what do you think you can do within the time period?
> (What data can you easily capture and work with...)
>
> Have you ever seen 'Eden of the East'? ;-)
>
> HTH
>
>
> On May 21, 2013, at 8:35 AM, Anshuman Mathur <ansmat@gmail.com> wrote:
>
> Hello fellow users,
>
> We are a group of students studying at the National University of
> Singapore. As part of our course curriculum we need to develop an
> application using Hadoop and map-reduce. Can you please suggest some
> innovative ideas for our project?
>
> Thanks in advance.
> Anshuman
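As a footnote to Sanjay's design components above: here is a minimal sketch of the Daily RSS Collector, streaming a feed straight into a date-partitioned HDFS directory. The feed URL argument, the /data/craigslist/raw/dt=... layout, and the class name are my own assumptions, not from the thread:

import java.io.InputStream;
import java.net.URL;
import java.time.LocalDate;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Daily RSS collector: fetch one feed over HTTP and archive the raw XML
// into a date-partitioned HDFS directory for the downstream Data Loader
// and Hive steps to pick up. Paths below are illustrative assumptions.
public class RssCollector {
  public static void main(String[] args) throws Exception {
    String feedUrl = args[0];  // the RSS URL to pull, passed in by a scheduler
    Path out = new Path(String.format(
        "/data/craigslist/raw/dt=%s/feed.xml", LocalDate.now()));

    Configuration conf = new Configuration();  // reads fs.defaultFS from core-site.xml
    FileSystem fs = FileSystem.get(conf);

    try (InputStream in = new URL(feedUrl).openStream();
         FSDataOutputStream hdfsOut = fs.create(out, true /* overwrite */)) {
      // Stream the feed body into HDFS, 4 KB at a time.
      IOUtils.copyBytes(in, hdfsOut, 4096, false);
    }
  }
}

The Hive aggregator step could then point an external table, partitioned on dt, at that directory and compute, say, average asking price by make, model and mileage bucket. Pulling each feed once per day also respects Sanjay's warning about not pounding the site.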