nutch-user mailing list archives

From Marek Bachmann <m.bachm...@uni-kassel.de>
Subject Re: Question about solrclean
Date Tue, 19 Jul 2011 14:16:14 GMT
Hi Markus,

just for the record:

today I ran the test for deleting 404 pages in Solr again. This time the
URL had not disappeared from the crawldb.

It works fine. Solrclean removed the page from the index as expected.

But it also always wants to remove pages that could never be fetched and
are therefore not in the index. I think that is a known problem, though.
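
For anyone who wants to reproduce the Solr side by hand: the cleanup
boils down to a delete-by-id per gone URL plus a commit. A minimal SolrJ
sketch (my own illustration, not Nutch's SolrClean code; the server URL
and the id are just the ones from this thread):

import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class ManualSolrClean {
  public static void main(String[] args) throws Exception {
    // same Solr URL as used for solrindex/solrclean below
    SolrServer solr = new CommonsHttpSolrServer("http://hrz-vm180:8983/solr");

    // in Nutch the ids come from crawldb entries with status db_gone;
    // Nutch uses the page URL as the Solr document id
    List<String> goneIds = Arrays.asList(
        "http://www.uni-kassel.de/its-baustelle/datendienste.html");

    solr.deleteById(goneIds);  // queue the deletes
    solr.commit();             // nothing leaves the index before the commit
  }
}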

On 18.07.2011 16:31, Marek Bachmann wrote:
> On 18.07.2011 16:04, Markus Jelsma wrote:
>> A 404-URL should not disappear from the CrawlDB, no matter what unless
>> filtered via URL filters. Can you check? Perhaps something else is
>> going on.
>
> I'll start the process over again :)
>
>>
>> On Monday 18 July 2011 15:59:22 Marek Bachmann wrote:
>>> On 18.07.2011 15:43, Markus Jelsma wrote:
>>>> On Monday 18 July 2011 15:13:41 Marek Bachmann wrote:
>>>>> Hi List,
>>>>>
>>>>> I have a small test set for working with Nutch and Solr. I wanted to
>>>>> see if it is possible to delete pages from the Solr index after Nutch
>>>>> has fetched them with a 404.
>>>>>
>>>>> As far as I know, there is a command "solrclean" which should handle
>>>>> this task. It should go through the crawldb and delete all URLs that
>>>>> are marked as gone.
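>>>>>
>>>>> To make concrete what I mean: roughly this selection (a quick
>>>>> local-mode sketch of my own, not the actual SolrClean job, which as
>>>>> far as I know runs as MapReduce; the part-00000 path is just what a
>>>>> local crawl produces here):
>>>>>
>>>>> import org.apache.hadoop.conf.Configuration;
>>>>> import org.apache.hadoop.fs.FileSystem;
>>>>> import org.apache.hadoop.fs.Path;
>>>>> import org.apache.hadoop.io.SequenceFile;
>>>>> import org.apache.hadoop.io.Text;
>>>>> import org.apache.nutch.crawl.CrawlDatum;
>>>>>
>>>>> public class ListGoneUrls {
>>>>>   public static void main(String[] args) throws Exception {
>>>>>     Configuration conf = new Configuration();
>>>>>     FileSystem fs = FileSystem.get(conf);
>>>>>     // one crawldb part (a MapFile; its "data" file is a plain
>>>>>     // SequenceFile of Text -> CrawlDatum)
>>>>>     Path part = new Path("crawl/crawldb/current/part-00000/data");
>>>>>
>>>>>     SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
>>>>>     Text url = new Text();
>>>>>     CrawlDatum datum = new CrawlDatum();
>>>>>     while (reader.next(url, datum)) {
>>>>>       if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
>>>>>         System.out.println(url);  // candidate for deletion from Solr
>>>>>       }
>>>>>     }
>>>>>     reader.close();
>>>>>   }
>>>>> }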
>>>>>
>>>>> But for some reason it doesn't work right in my case:
>>>>>
>>>>> I made a crawl over a set of pages. All of them were fetchable
>>>>> (200). The total count of URLs was 1999. I indexed them successfully
>>>>> into Solr.
>>>>>
>>>>> After that, I wanted to know what would happen after a recrawl if
>>>>> pages disappear. So I logged into the CMS and deleted a category of
>>>>> pages.
>>>>>
>>>>> After a recrawl, updatedb, invertlinks etc., my crawldb looked like
>>>>> this:
>>>>>
>>>>> root@hrz-vm180:/home/nutchServer/intranet_nutch/runtime/local/bin#
>>>>> ./nutch readdb crawl/crawldb/ -stats
>>>>> CrawlDb statistics start: crawl/crawldb/
>>>>> Statistics for CrawlDb: crawl/crawldb/
>>>>> TOTAL urls: 1999
>>>>> retry 0: 1999
>>>>> min score: 0.0
>>>>> avg score: 0.04724062
>>>>> max score: 7.296
>>>>> status 3 (db_gone): 169
>>>>> status 6 (db_notmodified): 1830
>>>>> CrawlDb statistics: done
>>>>>
>>>>> That's what I had expected. 169 pages are gone. Fine.
>>>>>
>>>>> Next I ran solrclean and solrdedup.
>>>>>
>>>>> root@hrz-vm180:/home/nutchServer/intranet_nutch/runtime/local/bin#
>>>>> ./nutch solrclean crawl/crawldb/ http://hrz-vm180:8983/solr
>>>>> SolrClean: starting at 2011-07-18 15:04:51
>>>>> SolrClean: deleting 169 documents
>>>>> SolrClean: deleted a total of 169 documents
>>>>> SolrClean: finished at 2011-07-18 15:04:54, elapsed: 00:00:02
>>>>>
>>>>> So SolrClean says that it deleted all 169 documents that are gone.
>>>>>
>>>>> But when I query for the word "Datendienste", Solr still responds
>>>>> with pages that are actually gone. I'll show you an example:
>>>>>
>>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>>> <response>
>>>>>   <lst name="responseHeader">
>>>>>     <int name="status">0</int>
>>>>>     <int name="QTime">2</int>
>>>>>     <lst name="params">
>>>>>       <str name="indent">on</str>
>>>>>       <str name="start">0</str>
>>>>>       <str name="q">id:"http://www.uni-kassel.de/its-baustelle/datendienste.html"</str>
>>>>>       <str name="version">2.2</str>
>>>>>       <str name="rows">10</str>
>>>>>     </lst>
>>>>>   </lst>
>>>>>   <result name="response" numFound="1" start="0">
>>>>>     <doc>
>>>>>       <float name="boost">1.7714221</float>
>>>>>       <str name="content">IT Servicezentrum: Datendienste (...)</str>
>>>>>       <long name="contentLength">3947</long>
>>>>>       <date name="date">2011-07-11T14:26:33.933Z</date>
>>>>>       <str name="digest">06c077b62a8012772e6365333c74312d</str>
>>>>>       <str name="id">http://www.uni-kassel.de/its-baustelle/datendienste.html</str>
>>>>>       <str name="segment">20110708154848</str>
>>>>>       <str name="title">IT Servicezentrum: Datendienste</str>
>>>>>       <date name="tstamp">2011-07-11T14:26:33.933Z</date>
>>>>>       <arr name="type"><str>text/html</str><str>text</str><str>html</str></arr>
>>>>>       <str name="url">http://www.uni-kassel.de/its-baustelle/datendienste.html</str>
>>>>>     </doc>
>>>>>   </result>
>>>>> </response>
>>>>>
>>>>> After that I checked the url
>>>>> "http://www.uni-kassel.de/its-baustelle/datendienste.html" in the
>>>>> crawldb and got:
>>>>>
>>>>> ./nutch readdb crawl/crawldb/ -url
>>>>> http://www.uni-kassel.de/its-baustelle/datendienste.html
>>>>> URL: http://www.uni-kassel.de/its-baustelle/datendienste.html
>>>>> not found
>>>>
>>>> This means (IIRC) the URL is not in the CrawlDB and whatever is not in
>>>> the CrawlDB cannot be removed.
>>>
>>> Argh, ok, I'll test that. I interpreted the "not found" as the HTTP
>>> status instead of "not found in the db"... :-/
>>>
>>> ... You are right, I dumped the db and the URL really isn't there
>>> anymore... Guess for now I have to increase the number of retries for
>>> 404 pages.
>>>
>>> That means, if I configure Nutch so that it deletes URLs from the db
>>> after a number of retries, it would no longer be possible to delete
>>> these pages from Solr automatically?
>>>
>>>> Check the Solr log and see if it actually receives
>>>> the delete commands. Did you issue a commit as well?
>>>
>>> The command with the list of 169 elements is sent to Solr, and after
>>> that it commits as well.
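>>>
>>> An easy way to double-check on the Solr side would be to query the id
>>> directly after the commit, roughly like this (a quick SolrJ sketch of
>>> my own; the URL is the one from the example above):
>>>
>>> import org.apache.solr.client.solrj.SolrQuery;
>>> import org.apache.solr.client.solrj.SolrServer;
>>> import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
>>> import org.apache.solr.client.solrj.response.QueryResponse;
>>>
>>> public class CheckDeleted {
>>>   public static void main(String[] args) throws Exception {
>>>     SolrServer solr = new CommonsHttpSolrServer("http://hrz-vm180:8983/solr");
>>>     // query the exact document id (Nutch uses the page URL as the id)
>>>     SolrQuery q = new SolrQuery(
>>>         "id:\"http://www.uni-kassel.de/its-baustelle/datendienste.html\"");
>>>     QueryResponse rsp = solr.query(q);
>>>     // zero hits would mean the delete plus commit really went through
>>>     System.out.println("hits: " + rsp.getResults().getNumFound());
>>>   }
>>> }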
>>>
>>> Thank you very much :-)
>>>
>>>>> Now I am wondering why the page is still in the Solr index.
>>>>>
>>>>> Thank you
>>
>

