From: Karl Wright
To: user@manifoldcf.apache.org
Date: Wed, 9 Jan 2013 09:12:31 -0500
Subject: Re: Http status code 302

Wire debugging with MCF 1.0.1 requires different logging.ini parameters,
because it uses commons-httpclient instead. That's described here:

http://hc.apache.org/httpclient-3.x/logging.html

I will need a working comparison to diagnose what is happening, so
please either get a log from curl, or better yet from MCF 1.0.1.

Thanks!
Karl

On Wed, Jan 9, 2013 at 9:04 AM, Shinichiro Abe wrote:
> Hi,
>
> I did wire debugging:
> curl yielded a 200 while ManifoldCF trunk got a 302, ManifoldCF 1.0.1 got a 200.
>
> The manifoldcf.log of trunk showed logs[1] but the one from 1.0.1 showed no logs.
>
> [1]
> DEBUG 2013-01-09 22:07:26,494 (Thread-474) - Sending request: GET /?eid=39 HTTP/1.1
> DEBUG 2013-01-09 22:07:26,495 (Thread-474) - >> "GET /?eid=39 HTTP/1.1[\r][\n]"
> DEBUG 2013-01-09 22:07:26,496 (Thread-474) - >> "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.abe.1@gmail.com)[\r][\n]"
> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "From: shinichiro.abe.1@gmail.com[\r][\n]"
> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "Host: lucene.jugem.jp:80[\r][\n]"
> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "Connection: Keep-Alive[\r][\n]"
> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "[\r][\n]"
> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> GET /?eid=39 HTTP/1.1
> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.abe.1@gmail.com)
> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> From: shinichiro.abe.1@gmail.com
> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Host: lucene.jugem.jp:80
> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Connection: Keep-Alive
> DEBUG 2013-01-09 22:07:26,556 (Thread-474) - << "HTTP/1.1 302 Found[\r][\n]"
> DEBUG 2013-01-09 22:07:26,561 (Thread-474) - << "Date: Wed, 09 Jan 2013 13:06:39 GMT[\r][\n]"
> DEBUG 2013-01-09 22:07:26,561 (Thread-474) - << "Server: Apache/2.0.59 (Unix)[\r][\n]"
> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Location: http://error.jugem.jp/[\r][\n]"
> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Content-Length: 285[\r][\n]"
> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Connection: close[\r][\n]"
> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Content-Type: text/html; charset=iso-8859-1[\r][\n]"
> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "[\r][\n]"
> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - Receiving response: HTTP/1.1 302 Found
> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << HTTP/1.1 302 Found
> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Date: Wed, 09 Jan 2013 13:06:39 GMT
> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Server: Apache/2.0.59 (Unix)
> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Location: http://error.jugem.jp/
> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Content-Length: 285
> DEBUG 2013-01-09 22:07:26,564 (Thread-474) - << Connection: close
> DEBUG 2013-01-09 22:07:26,564 (Thread-474) - << Content-Type: text/html; charset=iso-8859-1
> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "[\n]"
> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "[\n]"
> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "302 Found[\n]"
> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "[\n]"
> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "Found[\n]"
> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "The document has moved here.[\n]"
> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "[\n]"
> DEBUG 2013-01-09 22:07:26,576 (Thread-474) - << "Apache/2.0.59 (Unix) Server at lucene.jugem.jp Port 80[\n]"
> DEBUG 2013-01-09 22:07:26,576 (Thread-474) - << "[\n]"
> DEBUG 2013-01-09 22:07:26,618 (Thread-474) - Connection 0.0.0.0:56784<->210.172.160.170:80 closed
>
> Hmm.. It looks like it is redirecting to the error location anyway.
>
> Thanks,
> Shinichiro Abe
>
>
> On 2013/01/09, at 21:08, Karl Wright wrote:
>
>> Odd that curl would yield a 200 while ManifoldCF gets a 302. Maybe
>> Koji's blog site does not like one of the headers, crawler-agent
>> perhaps?
>>
>> I am behind a firewall now but I will explore this later today. In
>> the meantime, if you want to research the problem, could you turn on
>> wire debugging? You do this in the logging.ini file following these
>> instructions:
>>
>> http://hc.apache.org/httpcomponents-client-ga/logging.html
>>
>> You should see everything happening in the log then, and you can then
>> compare against curl using -vvv. Please let me know what you find.
>>
>> Thanks!
>> Karl
>>
>> On Wed, Jan 9, 2013 at 4:29 AM, Shinichiro Abe
>> wrote:
>>> I'm using the web connector.
>>>
>>>> Are you trying to crawl through a proxy?
>>> No. I just set that URL as a seed, without a proxy.
>>> (Also, I didn't obey robots.txt.)
>>>
>>> Using curl, I get the same result as you.
>>>
>>> Could you reproduce that?
>>>
>>> Shinichiro
>>>
>>> On 2013/01/09, at 17:49, Karl Wright wrote:
>>>
>>>> When I try the URL you gave using curl and no special arguments, I get this:
>>>>
>>>>
>>>> C:\Users\Karl>curl -vvv "http://lucene.jugem.jp/?eid=39"
>>>> * About to connect() to lucene.jugem.jp port 80 (#0)
>>>> * Trying 210.172.160.170... connected
>>>> * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
>>>>> GET /?eid=39 HTTP/1.1
>>>>> User-Agent: curl/7.21.7 (i386-pc-win32) libcurl/7.21.7 OpenSSL/1.0.0c zlib/1.2.5 librtmp/2.3
>>>>> Host: lucene.jugem.jp
>>>>> Accept: */*
>>>>>
>>>> < HTTP/1.1 200 OK
>>>> < Date: Wed, 09 Jan 2013 08:47:52 GMT
>>>> < Server: Apache/2.0.59 (Unix)
>>>> < Vary: User-Agent,Host,Accept-Encoding
>>>> < Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
>>>> < Accept-Ranges: bytes
>>>> < Content-Length: 22594
>>>> < Cache-Control: private
>>>> < Pragma: no-cache
>>>> < Connection: close
>>>> < Content-Type: text/html
>>>>
>>>> There's no 302 from here.
>>>>
>>>> Are you trying to crawl through a proxy? If so, that might be where
>>>> the problem lies.
>>>>
>>>> Karl
>>>>
>>>> On Wed, Jan 9, 2013 at 3:40 AM, Karl Wright wrote:
>>>>> It sounds like the httpclient upgrade definitely broke something. We
>>>>> should open a ticket.
>>>>>
>>>>> But first, can you confirm what connector this is? Is it the web
>>>>> connector? If so, I am puzzled because the web connector has always
>>>>> logged any 302 return, but then queued a second document which it
>>>>> subsequently fetches.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Wed, Jan 9, 2013 at 2:10 AM, Shinichiro Abe
>>>>> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm using trunk code and crawling a web site with seeds that include http://lucene.jugem.jp/?eid=39 (Koji's blog; I don't obey robots.txt).
>>>>>> When I look at Simple History, it shows a 302 result code for the fetch activity, and the document is not ingested.
>>>>>>
>>>>>> When I used MCF 1.0.1 in the same situation, Simple History showed a 200 result code and MCF could ingest documents.
>>>>>>
>>>>>> Why does trunk show a 302 status? Is it related to the httpclient upgrade?
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Shinichiro Abe
>>>
>
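
The comparison of the trunk wire log against the curl trace suggested in this thread can be sketched as a small header diff. This is only an illustration: the header values are copied from the two traces above, while `diff_headers` and `fetch_status` are hypothetical helper names, not part of ManifoldCF or curl. Nothing here establishes which header the server objects to; it only lists the candidates.

```python
import http.client

# Request headers copied from the two traces quoted in this thread.
MCF_TRUNK = {
    "User-Agent": "Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.abe.1@gmail.com)",
    "From": "shinichiro.abe.1@gmail.com",
    "Host": "lucene.jugem.jp:80",
    "Connection": "Keep-Alive",
}
CURL = {
    "User-Agent": "curl/7.21.7 (i386-pc-win32) libcurl/7.21.7 "
                  "OpenSSL/1.0.0c zlib/1.2.5 librtmp/2.3",
    "Host": "lucene.jugem.jp",
    "Accept": "*/*",
}


def diff_headers(a, b):
    """Return {name: (value_in_a, value_in_b)} for headers that differ.

    Header names are compared case-insensitively; None means "not sent".
    """
    la = {k.lower(): v for k, v in a.items()}
    lb = {k.lower(): v for k, v in b.items()}
    names = sorted(set(la) | set(lb))
    return {n: (la.get(n), lb.get(n)) for n in names if la.get(n) != lb.get(n)}


def fetch_status(host, path, headers, port=80, timeout=10):
    """Send one GET with exactly the given headers (hypothetical helper).

    Returns (status_code, Location header or None). http.client does not
    follow redirects, so a 302 is reported as-is.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        # skip_host / skip_accept_encoding stop http.client from adding
        # its own Host and Accept-Encoding headers.
        conn.putrequest("GET", path, skip_host=True, skip_accept_encoding=True)
        for name, value in headers.items():
            conn.putheader(name, value)
        conn.endheaders()
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


# List every header that differs between the two clients.
for name, (trunk, curl) in diff_headers(MCF_TRUNK, CURL).items():
    print(f"{name}: trunk={trunk!r} curl={curl!r}")
```

To narrow things down, one could start from the curl header set and swap in one trunk value at a time, calling `fetch_status("lucene.jugem.jp", "/?eid=39", headers)` for each variant and noting which one flips the 200 to a 302. That call is not made here because it depends on the live site.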