nutch-user mailing list archives

From BELLINI ADAM <mbel...@msn.com>
Subject RE: Content of redirected urls empty
Date Tue, 09 Mar 2010 16:59:05 GMT


hi,

I don't know if you found a few minutes to look at my problem :)

but I want to explain it again, maybe it wasn't clear:


I have HTTP pages that are redirected to HTTPS (otherwise the URL is the same):

HTTP://page1.com is redirected to HTTPS://page1.com

The content of the HTTP page is empty.
The content of the HTTPS page is not empty.

In my segment I found both URLs (HTTP and HTTPS), and the content of the HTTPS page is not
empty.

But in my index I found the HTTP one, with the empty content.

Is there a way to tell Nutch to index the URL with the non-empty content? Or why doesn't
Nutch index the target URL rather than the empty (origin) one?
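
To make the question concrete, here is a rough sketch of the behavior I am hoping for. It is plain Java for illustration only, not Nutch code: the class RedirectPick, the method pickUrlToIndex, and the two maps are hypothetical names, and Nutch may well handle redirects differently internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: plain Java, not Nutch API. All names are hypothetical.
class RedirectPick {

    // Follow the redirect chain starting at 'url' and return the first URL
    // whose fetched content is non-empty. If none is found, fall back to
    // the original URL.
    static String pickUrlToIndex(String url,
                                 Map<String, String> redirects,
                                 Map<String, String> contentByUrl) {
        String current = url;
        // Bound the number of hops so a redirect loop cannot spin forever.
        for (int hops = 0; hops <= redirects.size(); hops++) {
            String content = contentByUrl.get(current);
            if (content != null && !content.trim().isEmpty()) {
                return current;
            }
            String next = redirects.get(current);
            if (next == null) {
                break; // end of the chain, nothing non-empty found
            }
            current = next;
        }
        return url;
    }

    public static void main(String[] args) {
        // origin URL -> redirect target, as seen in my segment dump
        Map<String, String> redirects = new LinkedHashMap<>();
        redirects.put("http://page1.com", "https://page1.com");

        // fetched content per URL (placeholder text, not real data)
        Map<String, String> contentByUrl = new LinkedHashMap<>();
        contentByUrl.put("http://page1.com", "");
        contentByUrl.put("https://page1.com", "real page text");

        // prints https://page1.com
        System.out.println(pickUrlToIndex("http://page1.com", redirects, contentByUrl));
    }
}
```

In other words, when the origin URL has no content, I would expect the redirect target to be the document that ends up in the index.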

thx a lot





> From: mbellil@msn.com
> To: nutch-user@lucene.apache.org
> Subject: RE: Content of redirected urls empty
> Date: Mon, 8 Mar 2010 17:08:06 +0000
> 
> 
> I'm sorry... I just checked twice, and in my index I have the original URL, which is
> the HTTP one with the empty content, but it doesn't index the HTTPS one. And I am using
> the Solr index.
> thx
> 
> 
> 
> > From: mbellil@msn.com
> > To: nutch-user@lucene.apache.org
> > Subject: RE: Content of redirected urls empty
> > Date: Mon, 8 Mar 2010 17:01:34 +0000
> > 
> > 
> > 
> > 
> > Hi, I've just dumped my segments and found that I have both URLs: the original
> > one (HTTP) with empty content, and the REDIRECTED-TO or DESTINATION URL (HTTPS) with
> > NON-EMPTY content!
> > 
> > But in my search I found only the HTTPS URL, with empty content!! Logically, the
> > content of the HTTPS URL is not empty!
> > It's just mixing the HTTPS URL with the content of the HTTP one.
> > 
> > 
> > Our redirect is done by Java code, response.sendRedirect(…), so it seems to be an
> > HTTP redirect, right??
> > 
> > thx for helping me :)
> > 
> > 
> > > Date: Mon, 8 Mar 2010 15:51:34 +0100
> > > From: ab@getopt.org
> > > To: nutch-user@lucene.apache.org
> > > Subject: Re: Content of redirected urls empty
> > > 
> > > On 2010-03-08 14:55, BELLINI ADAM wrote:
> > > >
> > > >
> > > > Any ideas, guys??
> > > >
> > > >
> > > >> From: mbellil@msn.com
> > > >> To: nutch-user@lucene.apache.org
> > > >> Subject: Content of redirected urls empty
> > > >> Date: Fri, 5 Mar 2010 22:01:05 +0000
> > > >>
> > > >>
> > > >>
> > > >> hi,
> > > >> the content of my redirected URLs is empty... but they still have the other
> > > >> metadata...
> > > >> I have an HTTP URL that is redirected to HTTPS.
> > > >> In my index I find the HTTP URL, but with empty content...
> > > >> Could you explain it, please?
> > > 
> > > There are two ways to redirect - one is with protocol, and the other is 
> > > with content (either meta refresh, or javascript).
> > > 
> > > When you dump the segment, is there really no content for the redirected 
> > > url?
> > > 
> > > 
> > > -- 
> > > Best regards,
> > > Andrzej Bialecki     <><
> > >   ___. ___ ___ ___ _ _   __________________________________
> > > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > > http://www.sigram.com  Contact: info at sigram dot com
> > > 