From: Sujit Dhamale
To: user@hadoop.apache.org
Reply-To: user@hadoop.apache.org
Date: Sat, 8 Dec 2012 10:38:43 +0530
Subject: Re: I need some raw big data

Hi,

You can use the National Climatic Data Center (NCDC) data, which is a good candidate for Hadoop.
Below are the steps to download the data.

1. Create a folder on your local drive.
   I created "/home/sujit/Desktop/Data/".

2. Create the script below and run it:

for i in {1901..2012}
do
cd /home/sujit/Desktop/Data/
wget -r --no-parent --reject "index.html*" http://ftp3.ncdc.noaa.gov/pub/data/noaa/$i/
done

Kind Regards
Sujit Dhamale
(+91 9970086652)

On Sat, Dec 8, 2012 at 4:05 AM, Mohammad Tariq wrote:

> Hello Yin,
>
> You may find this interesting:
> https://github.com/unitedstates
>
> Regards,
> Mohammad Tariq
>
> On Sat, Dec 8, 2012 at 3:25 AM, Chris Nauroth wrote:
>
>> Another suggestion is Google Books Ngrams:
>>
>> http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>>
>> On Fri, Dec 7, 2012 at 7:57 AM, Phillip Rhodes wrote:
>>
>>> On Fri, Dec 7, 2012 at 10:48 AM, Harsh J wrote:
>>> >
>>> > On Fri, Dec 7, 2012 at 8:31 PM, Yin Steve wrote:
>>> >> Hello, I'm Steve who need some raw big data for studying mapreduce
>>> >> programming. Where can i find them? especially those about weblog, traffic
>>> >> info etc. My English is not so well, if you can give me a URL which directly
>>> >> help me download the big file, That'll be great.
>>> >> Waiting for your reply......
>>>
>>> Try some of the links off of this Quora thread:
>>>
>>> http://www.quora.com/Data/Where-can-I-find-large-datasets-for-modeling-confidence-during-the-financial-crisis-which-is-open-to-the-public
>>>
>>> You might also try googling "Enron corpus". Or check out CommonCrawl.org.
>>>
>>> Phil
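The download loop in Sujit's reply can also be written as a small parameterized script. The following is a minimal sketch, assuming a POSIX shell and wget are installed. The names DATA_DIR, BASE_URL, DRY_RUN, START_YEAR, END_YEAR and the three functions are illustrative and do not come from the thread. The base URL is the one quoted in the message, and it may no longer be reachable. By default the script only prints the URLs it would fetch.

```shell
#!/bin/sh
# Sketch only: all variable and function names are illustrative assumptions.
# BASE_URL is the address quoted in the message; it may no longer be reachable.

DATA_DIR="${DATA_DIR:-$HOME/ncdc-data}"   # hypothetical default target folder
BASE_URL="${BASE_URL:-http://ftp3.ncdc.noaa.gov/pub/data/noaa}"
DRY_RUN="${DRY_RUN:-1}"                   # 1 = print URLs only, 0 = download

# Build the directory URL for one year.
year_url() {
  printf '%s/%s/\n' "$BASE_URL" "$1"
}

# Fetch one year's directory, or just print what would be fetched.
fetch_year() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would fetch $(year_url "$1")"
  else
    mkdir -p "$DATA_DIR" &&
      wget -r --no-parent --reject "index.html*" -P "$DATA_DIR" "$(year_url "$1")"
  fi
}

# Walk an inclusive range of years; report a failed year and keep going.
fetch_range() {
  year=$1
  while [ "$year" -le "$2" ]; do
    fetch_year "$year" || echo "download failed for $year" >&2
    year=$((year + 1))
  done
}

fetch_range "${START_YEAR:-1901}" "${END_YEAR:-2012}"
```

Run as-is, this lists the per-year URLs without touching the network. Setting DRY_RUN=0 together with a narrow START_YEAR and END_YEAR would download only those years into DATA_DIR, which is a cheap way to check the server and the available disk space before pulling the full 1901 to 2012 range.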