nutch-user mailing list archives

From Christopher Gross <cogr...@gmail.com>
Subject Re: Success Error?
Date Thu, 15 Dec 2011 17:58:44 GMT
Markus - do you know of any other nutch tools/commands that I can use
to debug the problem further?  My guess is that I don't have something
configured correctly.

I've seen other posts from people saying that they've connected nutch
to SharePoint.  Have you run into a problem like this?  Is anyone
willing to share some of their config settings so that I can try
running with a known working setup?

Thanks!

-- Chris



On Thu, Dec 15, 2011 at 10:23 AM, Christopher Gross <cogross@gmail.com> wrote:
> Here's a snippet from the curl response from the webserver:
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
> Strict//EN""http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html
> xmlns:o="urn:schemas-microsoft-com:office:office" lang="en-us"
> dir="ltr"><head><meta http-equiv="X-UA-Compatible" content="IE=8"
> /><meta name="GENERATOR" content="Microsoft SharePoint" /><meta
> name="progid" content="SharePoint.WebPartPage.Document" /><meta
> http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta
> http-equiv="Expires" content="0" /><title>                Search Team
> - Home
> </title><link rel="stylesheet" type="text/css"
> href="/_layouts/1033/styles/Themable/search.css?rev=Uoc0fsLIo87aYwT%2FGX5UPw%3D%3D"/><link
> rel="stylesheet" type="text/css"
> href="/_layouts/1033/styles/Themable/wiki.css?rev=9pXM9jgtUVYAHk21JOAbIw%3D%3D"/><link
> rel="stylesheet" type="text/css"
> href="/_layouts/1033/styles/Themable/corev4.css?rev=iIikGkMuXBs8CWzKDAyjsQ%3D%3D"/>
>         <script type="text/javascript">        var _fV4UI = true;
>   </script>        <script type="text/javascript"> ......
> Do you have any idea what I should be doing to make this work?
>
> -- Chris
>
>
>
> On Thu, Dec 15, 2011 at 10:24 AM, Markus Jelsma
> <markus.jelsma@openindex.io> wrote:
>> You can curl it from the same machine you run Nutch on? It is not a Nutch
>> error, the error is embedded in the title by your webserver.
>>
>> On Thursday 15 December 2011 16:07:11 Christopher Gross wrote:
>>> Any idea as to why?  I took the URL for the page directly from a
>>> working browser.  I can curl the url and that works. Could part of the
>>> problem stem from it thinking the encoding is windows-1252, when it is
>>> actually UTF-8?
>>>
>>> -- Chris
>>>
>>>
>>>
>>> On Thu, Dec 15, 2011 at 9:59 AM, Markus Jelsma
>>> <markus.jelsma@openindex.io> wrote:
>>> > The page was successfully fetched and parsed, but the title just seems
>>> > to contain: "ERROR: The requested URL could not be retrieved".
>>> >
>>> > On Thursday 15 December 2011 15:36:40 Christopher Gross wrote:
>>> >> I'm getting a success status AND an error message when trying to do a
>>> >> parse check.  It is a SharePoint site, but this part allows for
>>> >> anonymous access -- I can curl the page just fine without having to do
>>> >> anything funky.  I have a robots.txt in place that allows everyone
>>> >> through (it is an internal test site, url has been redacted).  Here's
>>> >> what I run:
>>> >>
>>> >> [user@eval bin]$ ./nutch parsechecker "http://sharepointurl/Home.aspx"
>>> >> fetching: http://sharepointurl/Home.aspx
>>> >> parsing: http://sharepointurl/Home.aspx
>>> >> contentType: text/html
>>> >> ---------
>>> >> Url
>>> >> ---------------
>>> >> http://http://sharepointurl/Home.aspx---------
>>> >> ParseData
>>> >> ---------
>>> >> Version: 5
>>> >> Status: success(1,0)
>>> >> Title: ERROR: The requested URL could not be retrieved
>>> >> Outlinks: 0
>>> >> Content Metadata: Connection=close Content-Type=text/html
>>> >> Parse Metadata: CharEncodingForConversion=windows-1252
>>> >> OriginalCharEncoding=windows-1252
>>> >>
>>> >> Google searches have been fruitless.  Can anyone help me make sense of
>>> >> what is going on here?  I can provide some snippets of config files if
>>> >> need be.
>>> >>
>>> >> Nutch 1.4, SharePoint 2010, Java 1.6.0_06-b02.
>>> >>
>>> >> Thanks!
>>> >>
>>> >> -- Chris
>>> >
>>> > --
>>> > Markus Jelsma - CTO - Openindex
>>
>> --
>> Markus Jelsma - CTO - Openindex
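
One way to make the comparison discussed above concrete is to pull the title and the declared charset out of whatever bytes a client actually received, and compare the result for the curl response with what parsechecker reports. The following is only a minimal sketch in Python: the function name inspect_html and the sample bytes are hypothetical, it uses simple regular expressions rather than a real HTML parser, and it is not part of Nutch.

```python
import re


def inspect_html(body: bytes) -> dict:
    """Return the <title> text and the first declared charset, if any.

    Hypothetical helper for illustration; not a Nutch API.
    """
    # latin-1 maps every byte to a character, so decoding never fails.
    # Non-ASCII characters may be garbled, which is acceptable for
    # comparing ASCII titles and charset names.
    text = body.decode("latin-1")
    title = re.search(r"<title[^>]*>(.*?)</title>", text,
                      re.IGNORECASE | re.DOTALL)
    charset = re.search(r"charset=[\"']?([\w-]+)", text, re.IGNORECASE)
    return {
        # Collapse runs of whitespace and newlines inside the title.
        "title": " ".join(title.group(1).split()) if title else None,
        "charset": charset.group(1).lower() if charset else None,
    }


# Sample modelled on the curl snippet quoted in this thread.
sample = (b'<head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=utf-8" />'
          b'<title>        Search Team\n - Home\n</title></head>')
print(inspect_html(sample))
```

If the bytes saved from curl yield the real page title and utf-8, while the fetch made by Nutch yields the "ERROR: The requested URL could not be retrieved" title and no declared charset, then the two clients are being served different responses, and the windows-1252 value in the parse metadata would describe the error page rather than the SharePoint page.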
