nutch-user mailing list archives

From BELLINI ADAM <mbel...@msn.com>
Subject RE: Content of redirected urls empty
Date Thu, 18 Mar 2010 15:21:01 GMT


:( I really don't know what to do now! How did everyone before me resolve this problem?





> From: mbellil@msn.com
> To: nutch-user@lucene.apache.org
> Subject: RE: Content of redirected urls empty
> Date: Mon, 15 Mar 2010 19:43:51 +0000
> 
> 
> Hi, 
> 
> I finally learned how to display only the indexed URLs in the Solr index.
> 
> The URL is http://localhost:8080/solr/select/?q=*:*&fl=url,content
> 
> q=*:* matches all entries in the index.
> &fl=url,content displays only the URLs and their content.
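
As a rough sketch of the query described above, the same request URL can be assembled in Python using only the standard library. The helper name `build_select_url` is hypothetical; the base URL and the `url` and `content` field names are taken from the message, and nothing is sent over the network here.

```python
from urllib.parse import urlencode

# Base URL as given in the message; adjust host and port for your own Solr.
SOLR_SELECT = "http://localhost:8080/solr/select/"

def build_select_url(query="*:*", fields=("url", "content"), base=SOLR_SELECT):
    """Return a Solr select URL matching `query` and returning only `fields`."""
    params = {"q": query, "fl": ",".join(fields)}
    # safe="*:," leaves those characters unescaped so the URL stays readable.
    return base + "?" + urlencode(params, safe="*:,")

print(build_select_url())
```

Passing a narrower query string instead of `*:*` would restrict the listing to particular documents, which is one way to look for a specific URL in the index.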
> 
> 
> Now I'm 100% sure that I don't have the source HTTP URLs in my index; I have only the
> target ones (HTTPS) with an empty content.
> 
> 
> 
> I don't know if someone could explain why Nutch is missing the content of redirected
> URLs when indexing!
> 
> 
> 
> > Date: Mon, 15 Mar 2010 16:28:03 +0000
> > Subject: Re: Content of redirected urls empty
> > From: lists.digitalpebble@gmail.com
> > To: nutch-user@lucene.apache.org
> > 
> > > my index I have the HTTPS URL with the empty content (...it's exactly
> > > what you said: it's just mixing the HTTPS URL with
> > > the content of the HTTP one), and I expected the other way round: the
> > > HTTPS content *with* the HTTP URL.
> > >
> > 
> > strange
> > 
> > 
> > >
> > > I don't know if I have the HTTP URL in my index; I don't know how to see all
> > > the indexed URLs in Solr.
> > >
> > 
> > Well, you could query on the hostname or the whole URL, I suppose.
> > 
> > You could also index with Lucene and use Luke to debug the content of the
> > index
> > 
> > 
> > > but I'm sure that when I perform a search using RMS I obtain only the HTTPS
> > > URL with an empty content (I guess it's the empty content of the HTTP one).
> > > But again, in the segment the content of the HTTPS one is not empty.
> > >
> > 
> > _repr_  : representative -> see class ReprUrlFixer
> > 
> > 
> > 
> > 
> > 
> > 
> > >
> > >
> > >
> > > > Date: Mon, 15 Mar 2010 13:44:33 +0000
> > > > Subject: Re: Content of redirected urls empty
> > > > From: lists.digitalpebble@gmail.com
> > > > To: nutch-user@lucene.apache.org
> > > >
> > > > >
> > > > > > and as I said the other day, in my segment the https has an empty content.
> > > >
> > > >
> > > > hmm it's not what you said in your previous message + I can see it has a
> > > > signature in the crawlDB so it must have a content.
> > > >
> > > > I expect that the content would be indexed under the http:// URL thanks to
> > > > *_repr_: http://myDNS/index.html*
> > > >
> > > > See BasicIndexingFilter for details.
> > > >
> > > > > it's just mixing the HTTPS url with the content of the HTTP one.
> > > >
> > > >
> > > > it should be the other way round : the HTTPS content *with* the HTTP URL.
> > > > Actually the http:// document is not sent to the index at all (see around
> > > > line 86 in IndexerMapReduce), so what you are seeing in the index must be
> > > > the https doc with _repr_ used as a URL.
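
To make the behaviour described above concrete, here is a toy Python model. It is not Nutch code: the `CrawlEntry` type and the `build_index_docs` function are hypothetical, and the logic only mirrors what this message says, namely that redirect entries are not sent to the index and that a fetched entry is indexed under its `_repr_` URL when it has one.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CrawlEntry:
    url: str
    status: str                     # e.g. "db_redir_temp" or "db_fetched"
    content: str = ""
    repr_url: Optional[str] = None  # value of the _repr_ metadata, if any

def build_index_docs(entries):
    """Return the documents a simplified indexer would emit."""
    docs = []
    for entry in entries:
        if entry.status != "db_fetched":
            continue  # assumption: redirect entries never reach the index
        # Assumption: the representative URL, when present, replaces the real one.
        docs.append({"url": entry.repr_url or entry.url, "content": entry.content})
    return docs

entries = [
    CrawlEntry("http://myDNS/index.html", "db_redir_temp"),
    CrawlEntry("https://myDNS/index.html", "db_fetched",
               content="page body", repr_url="http://myDNS/index.html"),
]
print(build_index_docs(entries))
```

Under these assumptions the index would hold a single document: the https content stored under the http:// URL.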
> > > >
> > > > can you please confirm that :
> > > > 1/ the segment has a content for the https:// doc
> > > > 2/ you can find the http:// URL in the index and it has no content
> > > >
> > > > HTH
> > > >
> > > > Julien
> > > >
> > > > --
> > > > DigitalPebble Ltd
> > > > http://www.digitalpebble.com
> > > > On 15 March 2010 13:00, BELLINI ADAM <mbellil@msn.com> wrote:
> > > >
> > > > >
> > > > > Hi
> > > > > thx for your help,
> > > > >
> > > > > > this is a fresh crawl from today:
> > > > >
> > > > >
> > > > > 1- HTTP:
> > > > > bin/nutch readdb crawl_portal/crawldb/ -url http://myDNS/index.html
> > > > >
> > > > > URL: http://myDNS/index.html
> > > > > Version: 7
> > > > > Status: 4 (db_redir_temp)
> > > > > Fetch time: Mon Mar 15 12:15:52 EDT 2010
> > > > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > > > Retries since fetch: 0
> > > > > Retry interval: 36000 seconds (0 days)
> > > > > Score: 0.018119827
> > > > > Signature: null
> > > > > > Metadata: _pst_: temp_moved(13), lastModified=0: https://myDNS/index.html
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > 2- HTTPS:
> > > > > bin/nutch readdb crawl_portal/crawldb/ -url https://myDNS/index.html
> > > > >
> > > > > URL: https://myDNS/index.html
> > > > > Version: 7
> > > > > Status: 2 (db_fetched)
> > > > > Fetch time: Mon Mar 15 12:32:34 EDT 2010
> > > > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > > > Retries since fetch: 0
> > > > > Retry interval: 36000 seconds (0 days)
> > > > > Score: 0.00511379
> > > > > Signature: 5f84dcec905c24e3e2af902ad9ad7398
> > > > > > Metadata: _pst_: success(1), lastModified=0_repr_: http://myDNS/index.html
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > and as I said the other day, in my segment the https has an empty content.
> > > > >
> > > > > thx
> > > > >
> > > > >
> > > > > > Date: Mon, 15 Mar 2010 11:39:46 +0000
> > > > > > Subject: Re: Content of redirected urls empty
> > > > > > From: lists.digitalpebble@gmail.com
> > > > > > To: nutch-user@lucene.apache.org
> > > > > >
> > > > > > Adam,
> > > > > >
> > > > > > > Could you please tell us what the http and https entries look like in the
> > > > > > > crawlDB (using readdb -url)?
> > > > > >
> > > > > > J.
> > > > > > --
> > > > > > DigitalPebble Ltd
> > > > > > http://www.digitalpebble.com
> > > > > >
> > > > > > > On 13 March 2010 04:29, BELLINI ADAM <mbellil@msn.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > > Does no one have an answer?!
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > From: mbellil@msn.com
> > > > > > > > To: nutch-user@lucene.apache.org; millebii@gmail.com
> > > > > > > > Subject: RE: Content of redirected urls empty
> > > > > > > > Date: Wed, 10 Mar 2010 21:01:54 +0000
> > > > > > > >
> > > > > > > >
> > > > > > > > > I read a lot of posts regarding redirected URLs but didn't find a solution!
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: mbellil@msn.com
> > > > > > > > > To: nutch-user@lucene.apache.org; millebii@gmail.com
> > > > > > > > > Subject: RE: Content of redirected urls empty
> > > > > > > > > Date: Tue, 9 Mar 2010 16:59:05 +0000
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > hi,
> > > > > > > > >
> > > > > > > > > > I don't know if you found a few minutes to look at my problem :)
> > > > > > > > > >
> > > > > > > > > > but I want to explain it again, maybe it wasn't clear:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I have HTTP pages redirected to HTTPS (but it's the same URL):
> > > > > > > > > >
> > > > > > > > > > HTTP://page1.com   redirected to HTTPS://page1.com
> > > > > > > > > >
> > > > > > > > > > The content of my HTTP page is empty.
> > > > > > > > > > The content of my HTTPS page is not empty.
> > > > > > > > > >
> > > > > > > > > > In my segment I found both URLs (HTTP and HTTPS); the content of the HTTPS page is not empty,
> > > > > > > > > >
> > > > > > > > > > but in my index I found the HTTP one with the empty content.
> > > > > > > > > >
> > > > > > > > > > Is there a way to tell Nutch to index the URL with the non-empty content? Or why doesn't Nutch index the target URL rather than the empty (origin) one?
> > > > > > > > >
> > > > > > > > > thx a lot
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: mbellil@msn.com
> > > > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > > > > Subject: RE: Content of redirected urls empty
> > > > > > > > > > Date: Mon, 8 Mar 2010 17:08:06 +0000
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > I'm sorry... I just checked twice, and in my index I have the original URL, which is the HTTP one with the empty content... but it doesn't index the HTTPS one. And I am using the Solr index.
> > > > > > > > > > thx
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: mbellil@msn.com
> > > > > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > > > > > Subject: RE: Content of redirected urls empty
> > > > > > > > > > > Date: Mon, 8 Mar 2010 17:01:34 +0000
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Hi, I've just dumped my segments and found that I have both URLs: the original one (HTTP) with an empty content and the REDIRECTED TO or DESTINATION URL (HTTPS) with NON EMPTY content!
> > > > > > > > > > > >
> > > > > > > > > > > > But in my search I found only the HTTPS URL with an empty content! Logically the content of the HTTPS URL is not empty!
> > > > > > > > > > > > It's just mixing the HTTPS URL with the content of the HTTP one.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Our redirect is done by Java code, response.sendRedirect(…), so it seems to be an HTTP redirect, right?
> > > > > > > > > > >
> > > > > > > > > > > thx for helping me :)
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > Date: Mon, 8 Mar 2010 15:51:34 +0100
> > > > > > > > > > > > > From: ab@getopt.org
> > > > > > > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > > > > > > Subject: Re: Content of redirected urls empty
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 2010-03-08 14:55, BELLINI ADAM wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Any ideas, guys?
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >> From: mbellil@msn.com
> > > > > > > > > > > > >> To: nutch-user@lucene.apache.org
> > > > > > > > > > > > > >> Subject: Content of redirected urls empty
> > > > > > > > > > > > > >> Date: Fri, 5 Mar 2010 22:01:05 +0000
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> hi,
> > > > > > > > > > > > > >> the content of my redirected URLs is empty... but they still have the other metadata...
> > > > > > > > > > > > > >> I have an HTTP URL that is redirected to HTTPS.
> > > > > > > > > > > > > >> In my index I find the HTTP URL but with an empty content...
> > > > > > > > > > > > > >> Could you explain it, please?
> > > > > > > > > > > >
> > > > > > > > > > > > > There are two ways to redirect - one is with protocol, and the other is
> > > > > > > > > > > > > with content (either meta refresh, or javascript).
> > > > > > > > > > > > >
> > > > > > > > > > > > > When you dump the segment, is there really no content for the redirected
> > > > > > > > > > > > > url?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Andrzej Bialecki     <><
> > > > > > > > > > > > >   ___. ___ ___ ___ _ _   __________________________________
> > > > > > > > > > > > > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > > > > > > > > > > > > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > > > > > > > > > > > > http://www.sigram.com  Contact: info at sigram dot com
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > >
> > >
> > 
> > 
> > 
> > -- 
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
>  		 	   		  
 		 	   		  