nutch-user mailing list archives

From remi tassing <tassingr...@gmail.com>
Subject Re: Nutch and Sharepoint authentication
Date Thu, 01 Dec 2011 01:21:20 GMT
Hello Alexander,

I'm considering trying your suggestion.

I have one question, though. After Webdav does the crawling and saves the
files locally, does it keep the links intact?

Remi

On Fri, Nov 25, 2011 at 1:17 AM, Alexander Aristov <
alexander.aristov@gmail.com> wrote:

> hi
>
> one available solution is to set up webdav and crawl the resources as
> files, e.g. file://, but it won't avoid authentication.
>
>
> Alexander
>
> On 24/11/2011, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com> wrote:
> > Hi Arkadi,
> >
> > Are you saying that this has been solved and that you are successfully
> > able to crawl the server?
> >
> > Thanks
> >
> > On Thu, Nov 24, 2011 at 12:48 AM, <Arkadi.Kosmynin@csiro.au> wrote:
> >
> >> Hi,
> >>
> >> I am crawling a SharePoint server, no major problems. I do have to use
> >> protocol-httpclient for this. Here is an extract from my
> >> httpclient-auth.xml file, if it helps:
> >>
> >> <auth-configuration>
> >>  <credentials username="myusername" password="mypassword">
> >>    <default realm="myrealm" />
> >>  </credentials>
> >> </auth-configuration>
> >>
> >> Regards,
> >>
> >> Arkadi
> >>
> >> > -----Original Message-----
> >> > From: Lewis John Mcgibbney [mailto:lewis.mcgibbney@gmail.com]
> >> > Sent: Tuesday, 22 November 2011 9:43 PM
> >> > To: user@nutch.apache.org
> >> > Subject: Re: Nutch and Sharepoint authentication
> >> >
> >> > Hi,
> >> >
> >> > From what I have read on the Nutch user@ archives [1], it is possible to
> >> > crawl an MS Sharepoint server, which includes setting up NTLM
> >> > authentication for your crawler. It is becoming a pretty major problem
> >> > now that the protocol-httpclient plugin is unstable; there are Jira
> >> > issues open for this.
> >> >
> >> > Unfortunately, as ManifoldCF is in incubation status, it can only be
> >> > expected that they might not have completed all documentation yet.
> >> > However, I advise you to try there as well: ask them about the
> >> > Sharepoint configuration/documentation if it is not possible for you
> >> > to work with Nutch protocol-httpclient.
> >> >
> >> > hth
> >> >
> >> > [1] http://www.mail-archive.com/search?q=sharepoint&l=user%40nutch.apache.org
> >> >
> >> > On Tue, Nov 22, 2011 at 5:27 AM, remi tassing <tassingremi@gmail.com>
> >> > wrote:
> >> >
> >> > > Hello guys,
> >> > >
> >> > > I read the wiki page on "HttpAuthenticationSchemes"
> >> > > <http://wiki.apache.org/nutch/HttpAuthenticationSchemes>.
> >> > > I previously managed to make Nutch crawl local folders and websites
> >> > > (with SSL authentication). However, I'm trying to crawl some sites in
> >> > > a corporate intranet environment running under MS Sharepoint. I have
> >> > > been unsuccessful so far, and I believe it's because of authentication.
> >> > >
> >> > >
> >> > >   - Is Nutch able to crawl Sharepoint? If yes, do you have a
> >> > >   link/mail tutorial on this?
> >> > >
> >> > >
> >> > > I recently became aware of the ManifoldCF initiative, and it seems to
> >> > > be a possible solution to my problem. But it's currently poorly
> >> > > documented (as far as the Sharepoint connector is concerned).
> >> > >
> >> > >   - Do you have any recommendations in this regard?
> >> > >
> >> > >
> >> > > Thanks in advance for your help, I'll really appreciate it!
> >> > >
> >> > > --
> >> > > Remi Tassing
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > *Lewis*
> >>
> >
> >
> >
> > --
> > *Lewis*
> >
>
>
> --
> Best Regards
> Alexander Aristov
>



-- 
Remi Tassing
