nutch-user mailing list archives

From Christopher Gross <cogr...@gmail.com>
Subject Re: Success Error?
Date Thu, 15 Dec 2011 18:26:29 GMT
I added the -dumpText and this is what I got:

[user@eval bin]$ ./nutch parsechecker -dumpText "http://url/Home.aspx"
fetching: http://url/Home.aspx
parsing: http://url/Home.aspx
contentType: text/html
---------
Url
---------------
http://url/Home.aspx
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: ERROR: The requested URL could not be retrieved
Outlinks: 0
Content Metadata: Connection=close Content-Type=text/html
Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
---------
ParseText
---------
ERROR: The requested URL could not be retrieved ERROR The requested URL could
not be retrieved The following error was encountered: Invalid Request:
DNS lookup failed Some aspect of the HTTP Request is invalid.[

I've cut and pasted the URL, so I know it works -- does anyone have an
idea which Nutch setting I should change to make this work?

Thanks!

-- Chris
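
The output above combines a "success" parse status with a title that begins with "ERROR". A minimal sketch of how that combination could be flagged automatically, assuming the tool output has been captured as text with one "Status:" and one "Title:" line as shown in this thread. The function name check_parse_output is hypothetical and not part of Nutch.

```python
import re


def check_parse_output(output: str) -> dict:
    """Pull the Status and Title lines out of parsechecker-style text.

    Flags the case seen in this thread: the parse status reports success
    while the page title itself starts with "ERROR".
    """
    status = re.search(r"^Status:\s*(.+)$", output, re.MULTILINE)
    title = re.search(r"^Title:\s*(.+)$", output, re.MULTILINE)
    status_text = status.group(1).strip() if status else ""
    title_text = title.group(1).strip() if title else ""
    suspicious = (
        status_text.startswith("success")
        and title_text.upper().startswith("ERROR")
    )
    return {"status": status_text, "title": title_text, "suspicious": suspicious}


# Sample lines copied from the output pasted in this thread.
sample = """Version: 5
Status: success(1,0)
Title: ERROR: The requested URL could not be retrieved
Outlinks: 0
"""
print(check_parse_output(sample))
```

A flagged result only says that the fetched document looks like an error page; it does not say where that page came from.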



On Thu, Dec 15, 2011 at 12:58 PM, Christopher Gross <cogross@gmail.com> wrote:
> Markus - do you know of any other nutch tools/commands that I can use
> to debug the problem further?  My guess is that I don't have something
> configured correctly.
>
> I've seen other posts with people saying that they've connected nutch
> to SharePoint, have you run into a problem like this?  Is someone
> willing to share some of their config settings so that I can try to
> run on your working setup?
>
> Thanks!
>
> -- Chris
>
>
>
> On Thu, Dec 15, 2011 at 10:23 AM, Christopher Gross <cogross@gmail.com> wrote:
>> Here's a snippet from the curl response from the webserver:
>>
>> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
>> Strict//EN""http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html
>> xmlns:o="urn:schemas-microsoft-com:office:office" lang="en-us"
>> dir="ltr"><head><meta http-equiv="X-UA-Compatible" content="IE=8"
>> /><meta name="GENERATOR" content="Microsoft SharePoint" /><meta
>> name="progid" content="SharePoint.WebPartPage.Document" /><meta
>> http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta
>> http-equiv="Expires" content="0" /><title>                Search Team
>> - Home
>> </title><link rel="stylesheet" type="text/css"
>> href="/_layouts/1033/styles/Themable/search.css?rev=Uoc0fsLIo87aYwT%2FGX5UPw%3D%3D"/><link
>> rel="stylesheet" type="text/css"
>> href="/_layouts/1033/styles/Themable/wiki.css?rev=9pXM9jgtUVYAHk21JOAbIw%3D%3D"/><link
>> rel="stylesheet" type="text/css"
>> href="/_layouts/1033/styles/Themable/corev4.css?rev=iIikGkMuXBs8CWzKDAyjsQ%3D%3D"/>
>>         <script type="text/javascript">        var _fV4UI = true;
>>   </script>        <script type="text/javascript"> ......
>> Do you have any idea what I should be doing to make this work?
>>
>> -- Chris
>>
>>
>>
>> On Thu, Dec 15, 2011 at 10:24 AM, Markus Jelsma
>> <markus.jelsma@openindex.io> wrote:
>>> You can curl it from the same machine you run Nutch on? It is not a Nutch
>>> error, the error is embedded in the title by your webserver.
>>>
>>> On Thursday 15 December 2011 16:07:11 Christopher Gross wrote:
>>>> Any idea as to why?  I took the URL for the page directly from a
>>>> working browser.  I can curl the url and that works. Could part of the
>>>> problem stem from it thinking the encoding is windows-1252, when it is
>>>> actually UTF-8?
>>>>
>>>> -- Chris
>>>>
>>>>
>>>>
>>>> On Thu, Dec 15, 2011 at 9:59 AM, Markus Jelsma
>>>> <markus.jelsma@openindex.io> wrote:
>>>> > The page was successfully fetched and parsed but the title just contains:
>>>> > "ERROR: The requested URL could not be retrieved" as it seems.
>>>> >
>>>> > On Thursday 15 December 2011 15:36:40 Christopher Gross wrote:
>>>> >> I'm getting a success status AND an error message when trying to do a
>>>> >> parse check.  It is a SharePoint site, but this part allows for
>>>> >> anonymous access -- I can curl the page just fine without having to do
>>>> >> anything funky.  I have a robots.txt in place that allows everyone
>>>> >> through (it is an internal test site, url has been redacted).  Here's
>>>> >> what I run:
>>>> >>
>>>> >> [user@eval bin]$ ./nutch parsechecker "http://sharepointurl/Home.aspx"
>>>> >> fetching: http://sharepointurl/Home.aspx
>>>> >> parsing: http://sharepointurl/Home.aspx
>>>> >> contentType: text/html
>>>> >> ---------
>>>> >> Url
>>>> >> ---------------
>>>> >> http://http://sharepointurl/Home.aspx---------
>>>> >> ParseData
>>>> >> ---------
>>>> >> Version: 5
>>>> >> Status: success(1,0)
>>>> >> Title: ERROR: The requested URL could not be retrieved
>>>> >> Outlinks: 0
>>>> >> Content Metadata: Connection=close Content-Type=text/html
>>>> >> Parse Metadata: CharEncodingForConversion=windows-1252
>>>> >> OriginalCharEncoding=windows-1252
>>>> >>
>>>> >> Google searches have been fruitless.  Can anyone help me make sense of
>>>> >> what is going on here?  I can provide some snippets of config files if
>>>> >> need be.
>>>> >>
>>>> >> Nutch 1.4, SharePoint 2010, Java 1.6.0_06-b02.
>>>> >>
>>>> >> Thanks!
>>>> >>
>>>> >> -- Chris
>>>> >
>>>> > --
>>>> > Markus Jelsma - CTO - Openindex
>>>
>>> --
>>> Markus Jelsma - CTO - Openindex
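
On the encoding question raised in this thread: the curl response declares charset=utf-8 in a meta tag, while the parse metadata reports windows-1252. A minimal sketch for reading the declared charset out of an HTML snippet so the two values can be compared, assuming the page was saved as text. The function name declared_charset is hypothetical, and the regular expression is a simplification, not a full HTML parser.

```python
import re

# Matches a <meta http-equiv="Content-Type" ... charset=...> tag.
# Simplified on purpose: it assumes http-equiv appears before content.
_META_CHARSET = re.compile(
    r"""<meta[^>]+http-equiv=["']Content-Type["'][^>]*?charset=([\w-]+)""",
    re.IGNORECASE,
)


def declared_charset(html: str):
    """Return the charset named in the Content-Type meta tag, or None."""
    match = _META_CHARSET.search(html)
    return match.group(1).lower() if match else None


# Fragment copied from the curl response quoted in this thread.
snippet = (
    '<meta http-equiv="X-UA-Compatible" content="IE=8" />'
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
)
reported_by_parser = "windows-1252"  # value from the Parse Metadata line

declared = declared_charset(snippet)
print(declared, declared == reported_by_parser)
```

If the two values differ, one possible reading is that the document the parser saw is not the same document curl retrieved, which would fit the observation that the error text sits in the page title.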
