Date: Fri, 10 Dec 2010 16:29:07 +0900
Subject: Re: Question from a Desperate Java Newbie
From: edward choi
To: common-user@hadoop.apache.org

I would, but I am trying to integrate the crawler with Hadoop, so I wanted
to write it in Java :-)

2010/12/10 Santosh Borse

> You can use open source wget as well.
>
> -----Original Message-----
> From: Hemanth Yamijala [mailto:yhemanth@gmail.com]
> Sent: Friday, December 10, 2010 8:04 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Question from a Desperate Java Newbie
>
> Not exactly what you may want - but could you try using a HTTP client
> in Java? Some of them have the ability to automatically follow
> redirects, manage cookies etc.
>
> Thanks
> hemanth
>
> On Thu, Dec 9, 2010 at 4:35 PM, edward choi wrote:
> > Excuse me for asking a general Java question here.
> > I tried to find a Java mailing list through Google but none of them
> > were active.
> >
> > There is a problem that's been driving me crazy for a while.
> >
> > I am trying to download webpages from New York Times.
> > With Java URL.openStream(), I can't get past the login requirement.
> > But with C++ socket programming (using read() and write()), I can
> > download any webpage just fine.
> >
> > Interesting thing is that with C++, I get redirected like 10 times.
> > Below is the content of the header of the first redirected webpage
> > when I try to download
> > "http://www.nytimes.com/glogin?URI=http://www.nytimes.com:80/2010/12/09/world/asia/09military.html&OQ=_rQ3D1Q26hp&OP=47c049a1Q2FVQ3EY6VQ5Dks9akk5Q27VQ27Q2AFQ2AVFQ27VQ2AtVQ3EkahQ5DVQ3F9Q5CQ3FVQ2AtbQ5ChQ5C5Q3Fa!Q2BN5bh"
> >
> > HTTP/1.1 302 Moved Temporarily
> > Server: Sun-ONE-Web-Server/6.1
> > Date: Thu, 09 Dec 2010 08:42:35 GMT
> > Content-type: text/html
> > Set-cookie: RMID=0b5d4aea392d4d00967bfaf1; expires=Friday, 09-Dec-2011 08:42:35 GMT; path=/; domain=.nytimes.com
> > Set-cookie: NYT_GR=4d009b2b-yJ4V047ooAmPtGcvASTmng; path=/; domain=.nytimes.com
> > Set-cookie: NYT-S=0Mzh9PJwQ663rDXrmvxADeHJOGvJvXmRaJdeFz9JchiAJK89nlVaR7bsV.Ynx4rkFI; expires=Saturday, 08-Jan-2011 08:42:35 GMT; path=/; domain=.nytimes.com
> > Set-cookie: NYT-Pref=hppznw|^creator|NYTD.Cookies; path=/; domain=.nytimes.com
> > Location: http://www.nytimes.com:80/2010/12/09/world/asia/09military.html?_r=1&hp
> > Expires: Thu, 01 Dec 1994 16:00:00 GMT
> > Cache-control: no-cache
> > Pragma: no-cache
> > Connection: close
> >
> > But with Java, I get redirected only once to a https:// webpage and
> > it's a dead end. Below is the result of
> > java.net.URLConnection.getHeaderFields()
> >
> > HTTP/1.1 301 Moved Permanently,
> > Date: Thu, 09 Dec 2010 10:50:53 GMT,
> > Content-type: text/html,
> > Content-length: 0,
> > Location: https://myaccount.nytimes.com/auth/login?URI=/2010/12/09/world/asia/09military.html&OQ=_rQ3D5Q26hp&REFUSE_COOKIE_ERROR=SHOW_ERROR,
> > Server: Sun-ONE-Web-Server/6.1,
> >
> > There is a clear difference between the two. I don't know why and
> > it's been driving me crazy.
> > My guess is that the C++ write() function can create some kind of
> > cookie by itself, but Java URL.openStream() can't.
> >
> > Am I right? Or can anyone explain this for me?
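
A note on the question quoted above, offered as a likely reading rather than a confirmed diagnosis. A socket write() does not create cookies by itself; a client only "has" cookies if it copies the Set-cookie values from one response into the Cookie header of its next request. The two header dumps fit that picture: the 302 sets four cookies and adds _r=1 to the Location, while the Java result carries _r=5 (encoded as OQ=_rQ3D5Q26hp) together with REFUSE_COOKIE_ERROR. That suggests Java did follow several redirects silently without ever sending the cookies back, until the server gave up and pointed at the https:// login page. The connection behind URL.openStream() follows redirects by default, but it keeps no cookies unless a CookieHandler is installed, and it does not follow a redirect that switches from http to https, which would explain the "dead end".

Hemanth's suggestion (a client that follows redirects and manages cookies) can be sketched with the JDK alone. Everything named below (the class CookieAwareFetch, the methods installCookieSupport and fetch, the ACCEPT_ALL policy, the UTF-8 decoding) is an assumption made for illustration, not code from this thread, and it assumes Java 9 or later for readAllBytes().

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not code from this thread.
class CookieAwareFetch {

    // Installs a JVM-wide in-memory cookie store. Once set, HttpURLConnection
    // stores Set-Cookie values and sends them back on later requests.
    // ACCEPT_ALL is an assumption chosen to keep the sketch simple.
    static CookieManager installCookieSupport() {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    // Fetches a page, following redirects by hand so that a hop from
    // http:// to https:// is followed instead of being returned as-is.
    static String fetch(String address, int maxRedirects) throws IOException {
        URI uri = URI.create(address);
        for (int hop = 0; hop <= maxRedirects; hop++) {
            HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // we handle 3xx ourselves
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            if (status >= 300 && status < 400 && location != null) {
                uri = uri.resolve(location.trim()); // Location may be relative
                conn.disconnect();
                continue;
            }
            try (InputStream in = conn.getInputStream()) {
                // Assumes the page is UTF-8; a real crawler should check Content-Type.
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
        throw new IOException("More than " + maxRedirects + " redirects for " + address);
    }

    public static void main(String[] args) throws IOException {
        installCookieSupport();                 // call before opening any connection
        System.out.println(fetch(args[0], 20)); // args[0]: the page URL
    }
}
```

Following redirects by hand is a design choice here: it makes each hop visible and lets the http to https switch go through, at the cost of a few more lines than relying on the default redirect handling. Whether this is enough to get past the login requirement on a given site is not something the sketch can promise.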