<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>nutch-user@lucene.apache.org Archives</title>
<link rel="self" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/?format=atom"/>
<link href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/"/>
<id>http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/</id>
<updated>2009-12-08T12:28:42Z</updated>
<entry>
<title>RE: recrawl.sh stopped at depth 7/10 without error</title>
<author><name>BELLINI ADAM &lt;mbellil@msn.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3cSNT106-W1604F90C7F3DED07FE4719AA900@phx.gbl%3e"/>
<id>urn:uuid:%3cSNT106-W1604F90C7F3DED07FE4719AA900@phx-gbl%3e</id>
<updated>2009-12-07T21:35:14Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>

Yes, I've just tested nohup and it works :)
Thanks to all.







</pre>
</div>
</content>
</entry>
<entry>
<title>Re: recrawl.sh stopped at depth 7/10 without error</title>
<author><name>MilleBii &lt;millebii@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c376d9f4a0912071026g752fc370hdd65f4d5f17f9c5b@mail.gmail.com%3e"/>
<id>urn:uuid:%3c376d9f4a0912071026g752fc370hdd65f4d5f17f9c5b@mail-gmail-com%3e</id>
<updated>2009-12-07T18:26:42Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
As an alternative to crontab, I use the nohup command to keep my jobs running.
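
A minimal sketch of that approach, assuming a POSIX shell. The "sleep 30"
command is only a stand-in for the real crawl script, and the log file is a
temporary file created for the demo; neither comes from this thread.

<![CDATA[
```shell
#!/bin/sh
# Start a long-running job that ignores the hangup signal (SIGHUP), which is
# what a job receives when its terminal or SSH session goes away.
# "sleep 30" is a stand-in for the real crawl command.
logfile=$(mktemp)
nohup sleep 30 > "$logfile" 2>&1 &
job_pid=$!

# Simulate the terminal closing by sending SIGHUP to the job.
sleep 1
kill -HUP "$job_pid"
sleep 1

# kill -0 sends no signal; it only checks that the process still exists.
if kill -0 "$job_pid" 2>/dev/null; then
    job_status="still running"
else
    job_status="gone"
fi
echo "job $job_pid after SIGHUP: $job_status"

# Clean up the demo job and its log.
kill "$job_pid"
wait "$job_pid" 2>/dev/null || true
rm -f "$logfile"
```
]]>

For a real job, replace "sleep 30" with your own command and keep the log
file instead of deleting it.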



-- 
-MilleBii-


</pre>
</div>
</content>
</entry>
<entry>
<title>RE: recrawl.sh stopped at depth 7/10 without error</title>
<author><name>BELLINI ADAM &lt;mbellil@msn.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3cSNT106-W487D48727FBF15CFD64DF6AA900@phx.gbl%3e"/>
<id>urn:uuid:%3cSNT106-W487D48727FBF15CFD64DF6AA900@phx-gbl%3e</id>
<updated>2009-12-07T18:05:23Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>

Thanks, Fuad, for the info. Yes, I was just closing my laptop without exiting the SSH session.
But now I have it running from cron and it didn't stop :)
Thanks again.





</pre>
</div>
</content>
</entry>
<entry>
<title>RE: recrawl.sh stopped at depth 7/10 without error</title>
<author><name>&quot;Fuad Efendi&quot; &lt;fuad@efendi.ca&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c000901ca7766$ef5057f0$cdf107d0$@ca%3e"/>
<id>urn:uuid:%3c000901ca7766$ef5057f0$cdf107d0$@ca%3e</id>
<updated>2009-12-07T17:58:48Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
&gt;crawl.log 2&gt;&amp;1 &amp;

You forgot 2&gt;&amp;1, which redirects error output (stderr) into the same log file.

Also, you need to close the SSH session _politely_ by executing "exit".
Without it, the pipe is broken and the OS will kill the process.
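
To illustrate the redirection point, a minimal sketch: noisy_job is a
hypothetical stand-in for the crawl command, and the log files are temporary
files created for the demo. It only shows that the error line reaches the log
when 2&gt;&amp;1 is present.

<![CDATA[
```shell
#!/bin/sh
# noisy_job is a hypothetical stand-in for the crawl command:
# it writes one line to stdout and one line to stderr.
noisy_job() {
    echo "fetching page"
    echo "fetch failed" >&2
}

tmpdir=$(mktemp -d)

# Without 2>&1 only stdout is captured; stderr still goes to the terminal.
noisy_job > "$tmpdir/stdout_only.log"

# With 2>&1 stderr is sent to the same place as stdout, i.e. the log file.
noisy_job > "$tmpdir/both.log" 2>&1

# grep -c '' counts the lines in each log.
stdout_only_lines=$(grep -c '' "$tmpdir/stdout_only.log")
both_lines=$(grep -c '' "$tmpdir/both.log")
echo "lines logged without 2>&1: $stdout_only_lines"
echo "lines logged with 2>&1:    $both_lines"

rm -rf "$tmpdir"
```
]]>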


Fuad Efendi
+1 416-993-2060
http://www.tokenizer.ca
Data Mining, Vertical Search


&gt; -----Original Message-----
&gt; From: BELLINI ADAM [mailto:mbellil@msn.com]
&gt; Sent: December-07-09 12:01 PM
&gt; To: nutch-user@lucene.apache.org
&gt; Subject: RE: recrawl.sh stopped at depth 7/10 without error
&gt; 
&gt; 
&gt; 
&gt; 
&gt; hi,
&gt; 
&gt; Maybe I found my problem. It is not Nutch's fault: I believed that when
&gt; running the crawl command as a background process, closing my console
&gt; would not stop the process, but it seems that it really does kill it.
&gt; 
&gt; 
&gt; I launched the process like this:
&gt; 
&gt;   ./bin/nutch crawl urls -dir crawl -depth 10 &gt; crawl.log &amp;
&gt; 
&gt; But even with the '&amp;' character, closing my console kills the
&gt; process.
&gt; 
&gt; thx
&gt; 
&gt; &gt; Date: Mon, 7 Dec 2009 19:00:37 +0800
&gt; &gt; Subject: Re: recrawl.sh stopped at depth 7/10 without error
&gt; &gt; From: yeahyf@gmail.com
&gt; &gt; To: nutch-user@lucene.apache.org
&gt; &gt;
&gt; &gt; I still want to know the reason.
&gt; &gt;
&gt; &gt; 2009/12/2 BELLINI ADAM &lt;mbellil@msn.com&gt;
&gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt; hi,
&gt; &gt; &gt;
&gt; &gt; &gt; Any ideas, guys?
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt; thanx
&gt; &gt; &gt;
&gt; &gt; &gt; &gt; From: mbellil@msn.com
&gt; &gt; &gt; &gt; To: nutch-user@lucene.apache.org
&gt; &gt; &gt; &gt; Subject: RE: recrawl.sh stopped at depth 7/10 without error
&gt; &gt; &gt; &gt; Date: Fri, 27 Nov 2009 20:11:12 +0000
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; hi,
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; this is the main loop of my recrawl.sh
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; do
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt;   echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
&gt; &gt; &gt; &gt;   $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN \
&gt; &gt; &gt; &gt;       -adddays $adddays
&gt; &gt; &gt; &gt;   if [ $? -ne 0 ]
&gt; &gt; &gt; &gt;   then
&gt; &gt; &gt; &gt;     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
&gt; &gt; &gt; &gt;     break
&gt; &gt; &gt; &gt;   fi
&gt; &gt; &gt; &gt;   segment=`ls -d $crawl/segments/* | tail -1`
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt;   $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
&gt; &gt; &gt; &gt;   if [ $? -ne 0 ]
&gt; &gt; &gt; &gt;   then
&gt; &gt; &gt; &gt;     echo "runbot: fetch $segment at depth `expr $i + 1` failed."
&gt; &gt; &gt; &gt;     echo "runbot: Deleting segment $segment."
&gt; &gt; &gt; &gt;     rm $RMARGS $segment
&gt; &gt; &gt; &gt;     continue
&gt; &gt; &gt; &gt;   fi
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt;   $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; done
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; echo "----- Merge Segments (Step 3 of $steps) -----"
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; In my log file I never find the message
&gt; &gt; &gt; &gt; "----- Merge Segments (Step 3 of $steps) -----", so the loop is
&gt; &gt; &gt; &gt; interrupted and the process stops.
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; I don't understand why it stops at depth 7 without any errors!
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; From: mbellil@msn.com
&gt; &gt; &gt; &gt; &gt; To: nutch-user@lucene.apache.org
&gt; &gt; &gt; &gt; &gt; Subject: recrawl.sh stopped at depth 7/10 without error
&gt; &gt; &gt; &gt; &gt; Date: Wed, 25 Nov 2009 15:43:33 +0000
&gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; hi,
&gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; I'm running recrawl.sh and it stops every time at depth 7/10 without
&gt; &gt; &gt; &gt; &gt; any error! But when I run bin/crawl with the same crawl-urlfilter and
&gt; &gt; &gt; &gt; &gt; the same seeds file, it finishes smoothly in 1h50.
&gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; I checked hadoop.log and don't find any error there; I just find
&gt; &gt; &gt; &gt; &gt; the last URL it was parsing.
&gt; &gt; &gt; &gt; &gt; Does fetching or crawling have a timeout?
&gt; &gt; &gt; &gt; &gt; My recrawl takes 2 hours before it stops. I set the fetch interval to
&gt; &gt; &gt; &gt; &gt; 24 hours and I'm running generate with adddays = 1.
&gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; best regards
&gt; &gt; &gt; &gt; &gt;




</pre>
</div>
</content>
</entry>
<entry>
<title>OR support</title>
<author><name>BrunoWL &lt;bwotikoski@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c26680899.post@talk.nabble.com%3e"/>
<id>urn:uuid:%3c26680899-post@talk-nabble-com%3e</id>
<updated>2009-12-07T17:37:38Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>

Hi!
Has anybody successfully added search with the "OR" operator to Nutch 1.0?
I found a patch for the 0.9 version, but it doesn't work.

thanks.
-- 
View this message in context: http://old.nabble.com/OR-support-tp26680899p26680899.html
Sent from the Nutch - User mailing list archive at Nabble.com.



</pre>
</div>
</content>
</entry>
<entry>
<title>RE: recrawl.sh stopped at depth 7/10 without error</title>
<author><name>BELLINI ADAM &lt;mbellil@msn.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3cSNT106-W540D10E44B7B9EA162E350AA900@phx.gbl%3e"/>
<id>urn:uuid:%3cSNT106-W540D10E44B7B9EA162E350AA900@phx-gbl%3e</id>
<updated>2009-12-07T17:08:03Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>


I fixed it by putting it in crontab, and now I can sleep without thinking about it :)
Thank you very much.



&gt; Date: Mon, 7 Dec 2009 12:03:25 -0500
&gt; From: ptomblin@gmail.com
&gt; To: nutch-user@lucene.apache.org
&gt; Subject: RE: recrawl.sh stopped at depth 7/10 without error
&gt; 
&gt; Try starting it with nohup.  'man nohup' for details.
&gt; 
&gt; -- Sent from my Palm Prē
&gt; BELLINI ADAM wrote:
&gt; 
&gt; hi,
&gt; 
&gt; Maybe I found my problem, and it's not Nutch's fault. I believed that a crawl command started
&gt; as a background process would keep running after I closed my console, but it seems that
&gt; closing the console really does kill the process.
&gt; 
&gt; I launched the process like this: ./bin/nutch crawl urls -dir crawl -depth 10 &gt; crawl.log &amp;
&gt; 
&gt; But even with the '&amp;' character, closing my console kills the process.
&gt; 
&gt; thx
&gt; 

</pre>
</div>
</content>
</entry>
<entry>
<title>RE: recrawl.sh stopped at depth 7/10 without error</title>
<author><name>&quot;Paul Tomblin&quot; &lt;ptomblin@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c4b1d3579.0703c00a.59cc.ffffdd11@mx.google.com%3e"/>
<id>urn:uuid:%3c4b1d3579-0703c00a-59cc-ffffdd11@mx-google-com%3e</id>
<updated>2009-12-07T17:03:25Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Try starting it with nohup.  'man nohup' for details.

-- Sent from my Palm Prē
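
<![CDATA[A minimal sketch of the nohup suggestion above, assuming a POSIX shell. The names
start_detached and DETACHED_PID are hypothetical and not part of Nutch. The idea is to run
the command under nohup in the background and to send both stdout and stderr to a log file,
so that nothing stays attached to the terminal when the console or SSH session closes.

```shell
#!/bin/sh
# start_detached LOGFILE COMMAND [ARGS...]
# Hypothetical helper, not part of Nutch. Runs COMMAND under nohup in the
# background so that it ignores the hangup signal sent when the terminal
# closes. stdout and stderr are both appended to LOGFILE.
start_detached() {
    log=$1
    shift
    nohup "$@" >>"$log" 2>&1 &
    # PID of the background job, kept so the caller can check on it later.
    DETACHED_PID=$!
}
```

For the crawl from this thread the call might look like this (with the depth option
written as -depth 10):

  start_detached crawl.log ./bin/nutch crawl urls -dir crawl -depth 10
]]>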
BELLINI ADAM wrote:

hi,

Maybe I found my problem, and it's not Nutch's fault. I believed that a crawl command started
as a background process would keep running after I closed my console, but it seems that
closing the console really does kill the process.

I launched the process like this: ./bin/nutch crawl urls -dir crawl -depth 10 &gt; crawl.log &amp;

But even with the '&amp;' character, closing my console kills the process.

thx



 		 	   		  



</pre>
</div>
</content>
</entry>
<entry>
<title>RE: recrawl.sh stopped at depth 7/10 without error</title>
<author><name>BELLINI ADAM &lt;mbellil@msn.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3cSNT106-W48A79F2552103D0A9D7DB8AA900@phx.gbl%3e"/>
<id>urn:uuid:%3cSNT106-W48A79F2552103D0A9D7DB8AA900@phx-gbl%3e</id>
<updated>2009-12-07T17:01:14Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>



hi,

Maybe I found my problem, and it's not Nutch's fault. I believed that a crawl command started
as a background process would keep running after I closed my console, but it seems that
closing the console really does kill the process.

I launched the process like this: ./bin/nutch crawl urls -dir crawl -depth 10 &gt; crawl.log &amp;

But even with the '&amp;' character, closing my console kills the process.

thx

&gt; Date: Mon, 7 Dec 2009 19:00:37 +0800
&gt; Subject: Re: recrawl.sh stopped at depth 7/10 without error
&gt; From: yeahyf@gmail.com
&gt; To: nutch-user@lucene.apache.org
&gt; 
&gt; I sill want to  know the reason.
&gt; 
&gt; 2009/12/2 BELLINI ADAM &lt;mbellil@msn.com&gt;
&gt; 
&gt; &gt;
&gt; &gt; hi,
&gt; &gt;
&gt; &gt; anay idea guys ??
&gt; &gt;
&gt; &gt;
&gt; &gt;
&gt; &gt; thanx
&gt; &gt;
&gt; &gt; &gt; From: mbellil@msn.com
&gt; &gt; &gt; To: nutch-user@lucene.apache.org
&gt; &gt; &gt; Subject: RE: recrawl.sh stopped at depth 7/10 without error
&gt; &gt; &gt; Date: Fri, 27 Nov 2009 20:11:12 +0000
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt; hi,
&gt; &gt; &gt;
&gt; &gt; &gt; this is the main loop of my recrawl.sh
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt; do
&gt; &gt; &gt;
&gt; &gt; &gt;   echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
&gt; &gt; &gt;   $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN \
&gt; &gt; &gt;       -adddays $adddays
&gt; &gt; &gt;   if [ $? -ne 0 ]
&gt; &gt; &gt;   then
&gt; &gt; &gt;     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
&gt; &gt; &gt;     break
&gt; &gt; &gt;   fi
&gt; &gt; &gt;   segment=`ls -d $crawl/segments/* | tail -1`
&gt; &gt; &gt;
&gt; &gt; &gt;   $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
&gt; &gt; &gt;   if [ $? -ne 0 ]
&gt; &gt; &gt;   then
&gt; &gt; &gt;     echo "runbot: fetch $segment at depth `expr $i + 1` failed."
&gt; &gt; &gt;     echo "runbot: Deleting segment $segment."
&gt; &gt; &gt;     rm $RMARGS $segment
&gt; &gt; &gt;     continue
&gt; &gt; &gt;   fi
&gt; &gt; &gt;
&gt; &gt; &gt;   $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
&gt; &gt; &gt;
&gt; &gt; &gt; done
&gt; &gt; &gt;
&gt; &gt; &gt; echo "----- Merge Segments (Step 3 of $steps) -----"
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt; in my log file i never find the message "----- Merge Segments (Step 3 of
&gt; &gt; &gt; $steps) -----" ! so it breaks the loop and stops the process.
&gt; &gt; &gt;
&gt; &gt; &gt; i dont understand why it stops at depth 7 without any errors !
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt; &gt; From: mbellil@msn.com
&gt; &gt; &gt; &gt; To: nutch-user@lucene.apache.org
&gt; &gt; &gt; &gt; Subject: recrawl.sh stopped at depth 7/10 without error
&gt; &gt; &gt; &gt; Date: Wed, 25 Nov 2009 15:43:33 +0000
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; hi,
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; i'm running recrawl.sh and it stops every time at depth 7/10 without
&gt; &gt; &gt; &gt; any error ! but when run the bin/crawl with the same crawl-urlfilter and the
&gt; &gt; &gt; &gt; same seeds file it finishs softly in 1h50
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; i checked the hadoop.log, and dont find any error there...i just find
&gt; &gt; &gt; &gt; the last url it was parsing
&gt; &gt; &gt; &gt; do fetching or crawling has a timeout ?
&gt; &gt; &gt; &gt; my recrawl takes 2 hours before it stops. i set the time fetch interval
&gt; &gt; &gt; &gt; 24 hours and i'm running the generate with adddays = 1
&gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; best regards

</pre>
</div>
</content>
</entry>
<entry>
<title>RE: How to successfully crawl and index office 2007 documents in Nutch 1.0</title>
<author><name>Rupesh Mankar &lt;rupesh_mankar@persistent.co.in&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c6B75622C2FF30547B96A565D6A6B0B70C4953CDE87@EXCHANGE.persistent.co.in%3e"/>
<id>urn:uuid:%3c6B75622C2FF30547B96A565D6A6B0B70C4953CDE87@EXCHANGE-persistent-co-in%3e</id>
<updated>2009-12-07T12:41:21Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Is there a ready-made plug-in for Office 2007 documents available, or do I have to write
my own?


-----Original Message-----
From: yangfeng [mailto:yeahyf@gmail.com] 
Sent: Monday, December 07, 2009 4:35 PM
To: nutch-user@lucene.apache.org
Subject: Re: How to successfully crawl and index office 2007 documents in Nutch 1.0

docx files need to be parsed; a plugin can be used to parse them. You can get some
guidance from the parse-html plugin and similar plugins.

2009/12/4 Rupesh Mankar &lt;rupesh_mankar@persistent.co.in&gt;

&gt; Hi,
&gt;
&gt; I am new to Nutch. I want to crawl and search office 2007 documents (.docx,
&gt; .pptx etc) from Nutch. But when I try to crawl, crawler throws following
&gt; error:
&gt;
&gt; fetching http://10.88.45.140:8081/tutorial/Office-2007-document.docx
&gt; Error parsing: http://10.88.45.140:8081/tutorial/Office-2007-document.docx:
&gt; org.apache.nutch.parse.ParseException: parser not found for
&gt; contentType=application/zip url=
&gt; http://10.88.45.140:8081/tutorial/Office-2007-document.docx
&gt;        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
&gt;        at
&gt; org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
&gt;        at
&gt; org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
&gt;
&gt; When I add the zip plugin in nutch-site.xml under plugin.includes, crawling
&gt; succeeds but searches return nothing.
&gt;
&gt; How can we successfully crawl and search contents of office 2007 documents?
&gt;
&gt; Thanks,
&gt; Rupesh
&gt;
&gt; DISCLAIMER
&gt; ==========
&gt; This e-mail may contain privileged and confidential information which is
&gt; the property of Persistent Systems Ltd. It is intended only for the use of
&gt; the individual or entity to which it is addressed. If you are not the
&gt; intended recipient, you are not authorized to read, retain, copy, print,
&gt; distribute or use this message. If you have received this communication in
&gt; error, please notify the sender and delete all copies of this message.
&gt; Persistent Systems Ltd. does not accept any liability for virus infected
&gt; mails.
&gt;



</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Nutch 1.0 wml plugin</title>
<author><name>Andrzej Bialecki &lt;ab@getopt.org&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c4B1CEC19.3060001@getopt.org%3e"/>
<id>urn:uuid:%3c4B1CEC19-3060001@getopt-org%3e</id>
<updated>2009-12-07T11:50:49Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
yangfeng wrote:
&gt; I have completed a plugin for parsing WML (Wireless Markup Language). I
&gt; hope to add it to Lucene; what should I do?
&gt; 

The best long-term option would be to submit this work to the Tika 
project - see http://lucene.apache.org/tika/. If you already implemented 
this as a Nutch plugin, please create a JIRA issue in Nutch, and attach 
the patch.


-- 
Best regards,
Andrzej Bialecki     &lt;&gt;&lt;
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



</pre>
</div>
</content>
</entry>
<entry>
<title>Fetched links contain html</title>
<author><name>Kirk Gillock &lt;pk@isara.org&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c4B1CEB65.3010303@isara.org%3e"/>
<id>urn:uuid:%3c4B1CEB65-3010303@isara-org%3e</id>
<updated>2009-12-07T11:47:49Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hello fellow Nutch users,

In a few days we'll start crawling a long list of Thai websites. With 
previous crawls we noticed there were A LOT of poorly formatted html 
pages and the crawler would sometimes fetch links that contain html code 
(ex: http://www.website.com/news/index.php&lt;/ul&gt; ). How can we regex 
those URLs so that the html code (&lt;/ul&gt;) is removed? Would we use the 
regex-normalizer.xml file? If so, what would the code look like?

Thanks in advance,
Kirk Gillock
Isara Charity Foundation
Nong Khai, Thailand
http://www.isara.org


</pre>
</div>
</content>
</entry>
<entry>
<title>Nutch 1.0 wml plugin</title>
<author><name>yangfeng &lt;yeahyf@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c3b13e4710912070313s5af033c9u561dc74cff465ee8@mail.gmail.com%3e"/>
<id>urn:uuid:%3c3b13e4710912070313s5af033c9u561dc74cff465ee8@mail-gmail-com%3e</id>
<updated>2009-12-07T11:13:35Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I have completed a plugin for parsing WML (Wireless Markup Language). I
hope to add it to Lucene; what should I do?


</pre>
</div>
</content>
</entry>
<entry>
<title>Re: How to successfully crawl and index office 2007 documents in Nutch 1.0</title>
<author><name>yangfeng &lt;yeahyf@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c3b13e4710912070305q7982c4c9rfe360af507b6f749@mail.gmail.com%3e"/>
<id>urn:uuid:%3c3b13e4710912070305q7982c4c9rfe360af507b6f749@mail-gmail-com%3e</id>
<updated>2009-12-07T11:05:30Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
docx files need to be parsed; a plugin can be used to parse them. You can get some
guidance from the parse-html plugin and similar plugins.

2009/12/4 Rupesh Mankar &lt;rupesh_mankar@persistent.co.in&gt;

&gt; Hi,
&gt;
&gt; I am new to Nutch. I want to crawl and search office 2007 documents (.docx,
&gt; .pptx etc) from Nutch. But when I try to crawl, crawler throws following
&gt; error:
&gt;
&gt; fetching http://10.88.45.140:8081/tutorial/Office-2007-document.docx
&gt; Error parsing: http://10.88.45.140:8081/tutorial/Office-2007-document.docx:
&gt; org.apache.nutch.parse.ParseException: parser not found for
&gt; contentType=application/zip url=
&gt; http://10.88.45.140:8081/tutorial/Office-2007-document.docx
&gt;        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
&gt;        at
&gt; org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
&gt;        at
&gt; org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
&gt;
&gt; When I add the zip plugin in nutch-site.xml under plugin.includes, crawling
&gt; succeeds but searches return nothing.
&gt;
&gt; How can we successfully crawl and search contents of office 2007 documents?
&gt;
&gt; Thanks,
&gt; Rupesh
&gt;


</pre>
</div>
</content>
</entry>
<entry>
<title>Re: recrawl.sh stopped at depth 7/10 without error</title>
<author><name>yangfeng &lt;yeahyf@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c3b13e4710912070300q60158713y397c9fb8b8ae8050@mail.gmail.com%3e"/>
<id>urn:uuid:%3c3b13e4710912070300q60158713y397c9fb8b8ae8050@mail-gmail-com%3e</id>
<updated>2009-12-07T11:00:37Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I still want to know the reason.

2009/12/2 BELLINI ADAM &lt;mbellil@msn.com&gt;

&gt;
&gt; hi,
&gt;
&gt; anay idea guys ??
&gt;
&gt;
&gt;
&gt; thanx
&gt;
&gt; &gt; From: mbellil@msn.com
&gt; &gt; To: nutch-user@lucene.apache.org
&gt; &gt; Subject: RE: recrawl.sh stopped at depth 7/10 without error
&gt; &gt; Date: Fri, 27 Nov 2009 20:11:12 +0000
&gt; &gt;
&gt; &gt;
&gt; &gt;
&gt; &gt; hi,
&gt; &gt;
&gt; &gt; this is the main loop of my recrawl.sh
&gt; &gt;
&gt; &gt;
&gt; &gt; do
&gt; &gt;
&gt; &gt;   echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
&gt; &gt;   $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN \
&gt; &gt;       -adddays $adddays
&gt; &gt;   if [ $? -ne 0 ]
&gt; &gt;   then
&gt; &gt;     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
&gt; &gt;     break
&gt; &gt;   fi
&gt; &gt;   segment=`ls -d $crawl/segments/* | tail -1`
&gt; &gt;
&gt; &gt;   $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
&gt; &gt;   if [ $? -ne 0 ]
&gt; &gt;   then
&gt; &gt;     echo "runbot: fetch $segment at depth `expr $i + 1` failed."
&gt; &gt;     echo "runbot: Deleting segment $segment."
&gt; &gt;     rm $RMARGS $segment
&gt; &gt;     continue
&gt; &gt;   fi
&gt; &gt;
&gt; &gt;   $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
&gt; &gt;
&gt; &gt; done
&gt; &gt;
&gt; &gt; echo "----- Merge Segments (Step 3 of $steps) -----"
&gt; &gt;
&gt; &gt;
&gt; &gt;
&gt; &gt; in my log file i never find the message "----- Merge Segments (Step 3 of
&gt; &gt; $steps) -----" ! so it breaks the loop and stops the process.
&gt; &gt;
&gt; &gt; i dont understand why it stops at depth 7 without any errors !
&gt; &gt;
&gt; &gt;
&gt; &gt; &gt; From: mbellil@msn.com
&gt; &gt; &gt; To: nutch-user@lucene.apache.org
&gt; &gt; &gt; Subject: recrawl.sh stopped at depth 7/10 without error
&gt; &gt; &gt; Date: Wed, 25 Nov 2009 15:43:33 +0000
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt; hi,
&gt; &gt; &gt;
&gt; &gt; &gt; i'm running recrawl.sh and it stops every time at depth 7/10 without
&gt; &gt; &gt; any error ! but when run the bin/crawl with the same crawl-urlfilter and the
&gt; &gt; &gt; same seeds file it finishs softly in 1h50
&gt; &gt; &gt;
&gt; &gt; &gt; i checked the hadoop.log, and dont find any error there...i just find
&gt; &gt; &gt; the last url it was parsing
&gt; &gt; &gt; do fetching or crawling has a timeout ?
&gt; &gt; &gt; my recrawl takes 2 hours before it stops. i set the time fetch interval
&gt; &gt; &gt; 24 hours and i'm running the generate with adddays = 1
&gt; &gt; &gt;
&gt; &gt; &gt; best regards


</pre>
</div>
</content>
</entry>
<entry>
<title>Re: newbie questions</title>
<author><name>yangfeng &lt;yeahyf@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c3b13e4710912070258h6b97f142g38df7fda0e921ab4@mail.gmail.com%3e"/>
<id>urn:uuid:%3c3b13e4710912070258h6b97f142g38df7fda0e921ab4@mail-gmail-com%3e</id>
<updated>2009-12-07T10:58:56Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
You should add the property below:

 &lt;property&gt;
   &lt;name&gt;hadoop.job.ugi&lt;/name&gt;
   &lt;value&gt;rider,iamsolomon&lt;/value&gt;
 &lt;/property&gt;

it's ok!

2009/12/1 Mischa Tuffield &lt;mischa.tuffield@garlik.com&gt;

&gt; Hello Brian,
&gt;
&gt; Getting a response from another newbie here, so I could be wrong (do excuse
&gt; if I am).
&gt;
&gt; If you are attempting to run a search index from the filesystem you need to
&gt; have the following in your nutch-site.xml :
&gt;
&gt;  &lt;property&gt;
&gt;    &lt;name&gt;fs.default.name&lt;/name&gt;
&gt;    &lt;value&gt;file:///&lt;/value&gt;
&gt;  &lt;/property&gt;
&gt;
&gt; The fs.default.name property is required in nutch-site.xml when you build your
&gt; .war file for deployment to Tomcat. It should be accompanied by the config
&gt; below, which should point to the directory where your index has been copied
&gt; to; in my case it looks something like this:
&gt;
&gt;  &lt;property&gt;
&gt;   &lt;name&gt;searcher.dir&lt;/name&gt;
&gt;   &lt;value&gt;/home/nutch/nutch/service/crawl&lt;/value&gt;
&gt;   &lt;description&gt;
&gt;   Path to root of crawl.  This directory is searched (in
&gt;   order) for either the file search-servers.txt, containing a list of
&gt;   distributed search servers, or the directory "index" containing
&gt;   merged indexes, or the directory "segments" containing segment
&gt;   indexes.
&gt;   &lt;/description&gt;
&gt;  &lt;/property&gt;
&gt;
&gt; Regarding your second question :
&gt;
&gt; bin/nutch readdb yourcrawldir/crawldb -dump -format csv
&gt;
&gt; Gives you a nice flat file serialisation of your crawl database.
&gt;
&gt; I hope this helps,
&gt;
&gt; Mischa
&gt; On 1 Dec 2009, at 08:44, brian wrote:
&gt;
&gt; &gt; also, I would like to know how to extract flat text files of the crawl
&gt; &gt; data.
&gt;
&gt; ___________________________________
&gt; Mischa Tuffield
&gt; Email: mischa.tuffield@garlik.com
&gt; Homepage - http://mmt.me.uk/
&gt; Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
&gt; +44(0)20 8973 2465  http://www.garlik.com/
&gt; Registered in England and Wales 535 7233 VAT # 849 0517 11
&gt; Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
&gt;
&gt;


</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Fetch failing ?</title>
<author><name>MilleBii &lt;millebii@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c376d9f4a0912061435l31887da3n5a019706ebb00af2@mail.gmail.com%3e"/>
<id>urn:uuid:%3c376d9f4a0912061435l31887da3n5a019706ebb00af2@mail-gmail-com%3e</id>
<updated>2009-12-06T22:35:21Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
New and longer run ... I get plenty of  :  failed with:
java.lang.OutOfMemoryError: Java heap space
Fetching still goes on; not sure if this is the expected behavior.


2009/12/6 MilleBii &lt;millebii@gmail.com&gt;

&gt; Works fine and my memory problem had to do with the fact that I had too
&gt; many threads...
&gt;
&gt; 2009/12/5 MilleBii &lt;millebii@gmail.com&gt;
&gt;
&gt;&gt; Thx again Julien,
&gt;&gt;
&gt;&gt; Yes, I'm going to buy myself the Hadoop book, because I thought I could do
&gt;&gt; without it, but I realize that I need to make good use of Hadoop.
&gt;&gt;
&gt;&gt; Didn't know you could split fetching &amp; parsing: so I suppose you just
&gt;&gt; issue nutch fetch &lt;segment&gt; -noParsing, followed by nutch parse &lt;segment&gt;.
&gt;&gt; I will try that on my next run.
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; 2009/12/5 Julien Nioche &lt;lists.digitalpebble@gmail.com&gt;
&gt;&gt;
&gt;&gt;&gt; HADOOP_HEAPSIZE specifies the memory to be used by the Hadoop daemons and
&gt;&gt;&gt; does NOT affect the memory used for the map/reduce jobs. Maybe you
&gt;&gt;&gt; should invest a bit of time reading about Hadoop first?
&gt;&gt;&gt;
&gt;&gt;&gt; As for your memory problem, it could be due to the parsing and not the
&gt;&gt;&gt; fetching. If you don't already do so, I suggest that you separate the
&gt;&gt;&gt; fetching from the parsing. First, that will tell you which part fails; and
&gt;&gt;&gt; if it does fail in the parsing, you will not need to refetch the content.
&gt;&gt;&gt;
&gt;&gt;&gt; J.
&gt;&gt;&gt;
&gt;&gt;&gt; 2009/12/5 MilleBii &lt;millebii@gmail.com&gt;
&gt;&gt;&gt;
&gt;&gt;&gt; &gt; My fetch cycle failed on the following initial error :
&gt;&gt;&gt; &gt;
&gt;&gt;&gt; &gt; java.io.IOException: Task process exit with nonzero status of 65.
&gt;&gt;&gt; &gt;        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
&gt;&gt;&gt; &gt;
&gt;&gt;&gt; &gt; Then it makes a second attempt and after 3 hours I hit this error
&gt;&gt;&gt; &gt; (although I had doubled HADOOP_HEAPSIZE):
&gt;&gt;&gt; &gt;
&gt;&gt;&gt; &gt; java.lang.OutOfMemoryError: GC overhead limit exceeded
&gt;&gt;&gt; &gt;
&gt;&gt;&gt; &gt;
&gt;&gt;&gt; &gt; Any idea what the initial error is or could be ?
&gt;&gt;&gt; &gt; For the second one, I'm going to reduce the number of threads... but I'm
&gt;&gt;&gt; &gt; wondering if there could be a memory leak? And I don't know how to trace
&gt;&gt;&gt; &gt; that.
&gt;&gt;&gt; &gt;
&gt;&gt;&gt; &gt; --
&gt;&gt;&gt; &gt; -MilleBii-
&gt;&gt;&gt; &gt;
&gt;&gt;&gt;
&gt;&gt;&gt;
&gt;&gt;&gt;
&gt;&gt;&gt; --
&gt;&gt;&gt; DigitalPebble Ltd
&gt;&gt;&gt; http://www.digitalpebble.com
&gt;&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; --
&gt;&gt; -MilleBii-
&gt;&gt;
&gt;
&gt;
&gt;
&gt; --
&gt; -MilleBii-
&gt;



-- 
-MilleBii-


</pre>
</div>
</content>
</entry>
<entry>
<title>Nutch 1.0 ms-powerpoint plugin</title>
<author><name>&quot;Joe Bell&quot; &lt;joe.bell@prodeasystems.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c56DA5079467D1C48B857C6FE59D5D4FC03345130@prodeaserve.prodea.local%3e"/>
<id>urn:uuid:%3c56DA5079467D1C48B857C6FE59D5D4FC03345130@prodeaserve-prodea-local%3e</id>
<updated>2009-12-06T18:24:14Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi - this is my first post to the nutch mailing list, please let me know
if I commit any list protocol errors.

 

I'm currently using Nutch 1.0 with the Powerpoint plugin enabled and can
verify that Nutch is indeed pulling in the entire file for passing off
to the parser (i.e., I've set the content limit to -1 to get the full
file).  However it appears that most Powerpoint files with any
complexity (they use a template, have tables, images, etc.) do not get
indexed at all.  In one case I created a new file with one "title" slide
and the title text was recognized but the subtitle text directly
underneath was not.

 

My question is whether I'm missing something that has already been
covered (like for example,
http://issues.apache.org/jira/browse/NUTCH-463, though I don't see any
logs indicating issues in my crawl) or that this is a known defect in
the existing Powerpoint plugin?  It goes without saying that I'd very
much like to be able to completely index Powerpoint slides as this is
going to be the most common document type on my site.

 

Thanks,

Joe

 





</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Fetch failing ?</title>
<author><name>MilleBii &lt;millebii@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c376d9f4a0912061007x5b034036u5913fb215c32b007@mail.gmail.com%3e"/>
<id>urn:uuid:%3c376d9f4a0912061007x5b034036u5913fb215c32b007@mail-gmail-com%3e</id>
<updated>2009-12-06T18:07:17Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Works fine and my memory problem had to do with the fact that I had too many
threads...

2009/12/5 MilleBii &lt;millebii@gmail.com&gt;

&gt; Thx again Julien,
&gt;
&gt; Yes, I'm going to buy myself the Hadoop book, because I thought I could do
&gt; without it, but I realize that I need to make good use of Hadoop.
&gt;
&gt; Didn't know you could split fetching &amp; parsing: so I suppose you just
&gt; issue nutch fetch &lt;segment&gt; -noParsing, followed by nutch parse &lt;segment&gt;.
&gt; I will try that on my next run.
&gt;
&gt;
&gt;
&gt; 2009/12/5 Julien Nioche &lt;lists.digitalpebble@gmail.com&gt;
&gt;
&gt;&gt; HADOOP_HEAPSIZE specifies the memory to be used by the Hadoop daemons and
&gt;&gt; does NOT affect the memory used for the map/reduce jobs. Maybe you should
&gt;&gt; invest a bit of time reading about Hadoop first?
&gt;&gt;
&gt;&gt; As for your memory problem, it could be due to the parsing and not the
&gt;&gt; fetching. If you don't already do so, I suggest that you separate the
&gt;&gt; fetching from the parsing. First, that will tell you which part fails; and
&gt;&gt; if it does fail in the parsing, you will not need to refetch the content.
&gt;&gt;
&gt;&gt; J.
&gt;&gt;
&gt;&gt; 2009/12/5 MilleBii &lt;millebii@gmail.com&gt;
&gt;&gt;
&gt;&gt; &gt; My fetch cycle failed on the following initial error :
&gt;&gt; &gt;
&gt;&gt; &gt; java.io.IOException: Task process exit with nonzero status of 65.
&gt;&gt; &gt;        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
&gt;&gt; &gt;
&gt;&gt; &gt; Then it makes a second attempt and after 3 hours I hit this error
&gt;&gt; &gt; (although I had doubled HADOOP_HEAPSIZE):
&gt;&gt; &gt;
&gt;&gt; &gt; java.lang.OutOfMemoryError: GC overhead limit exceeded
&gt;&gt; &gt;
&gt;&gt; &gt;
&gt;&gt; &gt; Any idea what the initial error is or could be ?
&gt;&gt; &gt; For the second one, I'm going to reduce the number of threads... but I'm
&gt;&gt; &gt; wondering if there could be a memory leak? And I don't know how to trace
&gt;&gt; &gt; that.
&gt;&gt; &gt;
&gt;&gt; &gt; --
&gt;&gt; &gt; -MilleBii-
&gt;&gt; &gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; --
&gt;&gt; DigitalPebble Ltd
&gt;&gt; http://www.digitalpebble.com
&gt;&gt;
&gt;
&gt;
&gt;
&gt; --
&gt; -MilleBii-
&gt;



-- 
-MilleBii-


</pre>
</div>
</content>
</entry>
<entry>
<title>Configurable depth for fetcher queue ?</title>
<author><name>MilleBii &lt;millebii@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c376d9f4a0912061005w53db0a5avad54bc8197e0155a@mail.gmail.com%3e"/>
<id>urn:uuid:%3c376d9f4a0912061005w53db0a5avad54bc8197e0155a@mail-gmail-com%3e</id>
<updated>2009-12-06T18:05:50Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Currently the fetcher queue depth is hardcoded to 50... however, one needs to keep
#Threads x Depth below a certain number, otherwise the fetcher spends its life
managing the queues and CPU becomes the limiting factor.
This is what was creating my L-shaped bandwidth usage.


I had to patch it by hand; however, I suggest we create a conf item to enable
different combinations of #Threads x Depth.

Or we could also set it the opposite way, which is to define a
"max.fetchqueue.totalSize" and derive the depth as totalSize/#Threads.
Of course, max.fetchqueue.totalSize is very dependent on the system
specs, but this is what you want to control.

-- 
-MilleBii-


</pre>
</div>
</content>
</entry>
<entry>
<title>Nutch Hadoop 0.20 - Exception</title>
<author><name>Eran Zinman &lt;zzeran@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c42b8f64b0912060551p6a9ff160g2fbe00c4498b3df@mail.gmail.com%3e"/>
<id>urn:uuid:%3c42b8f64b0912060551p6a9ff160g2fbe00c4498b3df@mail-gmail-com%3e</id>
<updated>2009-12-06T13:51:53Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi,

Just upgraded to the latest version of Nutch with Hadoop 0.20.

I'm getting the following exception in the namenode log and DFS doesn't
start:

2009-12-06 15:48:32,523 ERROR namenode.NameNode - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:235)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:56)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:220)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:202)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.&lt;init&gt;(NameNode.java:279)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

Any help will be appreciated ... quite stuck with this.

Thanks,
Eran


</pre>
</div>
</content>
</entry>
<entry>
<title>RE: Indexing with solrindexer -&gt; OutOfMemoryError</title>
<author><name>BELLINI ADAM &lt;mbellil@msn.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3cSNT106-W4879250EC7D82FEED23B1AAA910@phx.gbl%3e"/>
<id>urn:uuid:%3cSNT106-W4879250EC7D82FEED23B1AAA910@phx-gbl%3e</id>
<updated>2009-12-06T05:26:15Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>

Hi,
You have to make your segments smaller than that; just split every segment into smaller pieces.




&gt; Subject: Indexing with solrindexer -&gt; OutOfMemoryError
&gt; From: felizimm@gmx.de
&gt; To: nutch-user@lucene.apache.org
&gt; Date: Sun, 6 Dec 2009 01:35:04 +0100
&gt; 
&gt; Hi,
&gt; 
&gt; when trying to index four segments (~5 GB) with solrindexer, I get this
&gt; error in hadoop.log. There is no error in the logs of Tomcat, where I
&gt; deployed Solr. I crawled with "crawl"-command.
&gt; 
&gt; I've read that increasing the Hadoop heap space will change nothing.
&gt; What can I do?
&gt; 
&gt; Thanks for help!
&gt; Felix.
&gt; 
&gt; 
&gt; 2009-12-06 00:21:51,061 WARN  mapred.LocalJobRunner - job_local_0001
&gt; java.lang.OutOfMemoryError: Java heap space
&gt;         at java.util.Arrays.copyOf(Arrays.java:2882)
&gt;         at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
&gt;         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
&gt;         at java.lang.StringBuffer.append(StringBuffer.java:320)
&gt;         at java.io.StringWriter.write(StringWriter.java:60)
&gt;         at org.apache.solr.common.util.XML.escape(XML.java:180)
&gt;         at org.apache.solr.common.util.XML.escapeCharData(XML.java:78)
&gt;         at org.apache.solr.common.util.XML.writeXML(XML.java:148)
&gt;         at org.apache.solr.client.solrj.util.ClientUtils.writeXML(ClientUtils.java:117)
&gt;         at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:169)
&gt;         at org.apache.solr.client.solrj.request.UpdateRequest.getContentStreams(UpdateRequest.java:160)
&gt;         at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:191)
&gt;         at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183)
&gt;         at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:217)
&gt;         at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:48)
&gt;         at org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:58)
&gt;         at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:54)
&gt;         at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:44)
&gt;         at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
&gt;         at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158)
&gt;         at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
&gt;         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
&gt;         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
&gt; 2009-12-06 00:21:51,650 FATAL solr.SolrIndexer - SolrIndexer: java.io.IOException: Job failed!
&gt;         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
&gt;         at org.apache.nutch.indexer.solr.SolrIndexer.indexSolr(SolrIndexer.java:73)
&gt;         at org.apache.nutch.indexer.solr.SolrIndexer.run(SolrIndexer.java:95)
&gt;         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
&gt;         at org.apache.nutch.indexer.solr.SolrIndexer.main(SolrIndexer.java:104)
&gt; 
&gt; 
&gt; 
 		 	   		  

</pre>
</div>
</content>
</entry>
<entry>
<title>Indexing with solrindexer -&gt; OutOfMemoryError</title>
<author><name>Felix Zimmermann &lt;felizimm@gmx.de&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c1260059704.2444.112.camel@uhu%3e"/>
<id>urn:uuid:%3c1260059704-2444-112-camel@uhu%3e</id>
<updated>2009-12-06T00:35:04Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi,

when trying to index four segments (~5 GB) with solrindexer, I get this
error in hadoop.log. There is no error in the logs of Tomcat, where I
deployed Solr. I crawled with "crawl"-command.

I've read that increasing the Hadoop heap space will change nothing.
What can I do?

Thanks for help!
Felix.


2009-12-06 00:21:51,061 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
        at java.lang.StringBuffer.append(StringBuffer.java:320)
        at java.io.StringWriter.write(StringWriter.java:60)
        at org.apache.solr.common.util.XML.escape(XML.java:180)
        at org.apache.solr.common.util.XML.escapeCharData(XML.java:78)
        at org.apache.solr.common.util.XML.writeXML(XML.java:148)
        at org.apache.solr.client.solrj.util.ClientUtils.writeXML(ClientUtils.java:117)
        at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:169)
        at org.apache.solr.client.solrj.request.UpdateRequest.getContentStreams(UpdateRequest.java:160)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:191)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183)
        at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:217)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:48)
        at org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:58)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:54)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:44)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-12-06 00:21:51,650 FATAL solr.SolrIndexer - SolrIndexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.indexer.solr.SolrIndexer.indexSolr(SolrIndexer.java:73)
        at org.apache.nutch.indexer.solr.SolrIndexer.run(SolrIndexer.java:95)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.solr.SolrIndexer.main(SolrIndexer.java:104)





</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Fetch failing ?</title>
<author><name>MilleBii &lt;millebii@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c376d9f4a0912050417ud6e1df0u7836ff5c01731ccd@mail.gmail.com%3e"/>
<id>urn:uuid:%3c376d9f4a0912050417ud6e1df0u7836ff5c01731ccd@mail-gmail-com%3e</id>
<updated>2009-12-05T12:17:45Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Thx again Julien,

Yes, I'm going to buy myself the Hadoop book, because I thought I could do
without it, but I realize that I need to make good use of Hadoop.

Didn't know you could split fetching &amp; parsing: so I suppose you just issue
nutch fetch &lt;segment&gt; -noParsing, followed by nutch parse &lt;segment&gt;. I will
try that on my next run.



2009/12/5 Julien Nioche &lt;lists.digitalpebble@gmail.com&gt;

&gt; HADOOP_HEAPSIZE specifies the memory to be used by the Hadoop daemons and
&gt; does NOT affect the memory used for the map/reduce jobs. Maybe you should
&gt; invest a bit of time reading about Hadoop first?
&gt;
&gt; As for your memory problem, it could be due to the parsing and not the
&gt; fetching. If you don't already do so, I suggest that you separate the
&gt; fetching from the parsing. First, that will tell you which part fails; and
&gt; if it does fail in the parsing, you will not need to refetch the content.
&gt;
&gt; J.
&gt;
&gt; 2009/12/5 MilleBii &lt;millebii@gmail.com&gt;
&gt;
&gt; &gt; My fetch cycle failed on the following initial error :
&gt; &gt;
&gt; &gt; java.io.IOException: Task process exit with nonzero status of 65.
&gt; &gt;        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
&gt; &gt;
&gt; &gt; Then it makes a second attempt and after 3 hours I hit this error
&gt; &gt; (although I had doubled HADOOP_HEAPSIZE):
&gt; &gt;
&gt; &gt; java.lang.OutOfMemoryError: GC overhead limit exceeded
&gt; &gt;
&gt; &gt;
&gt; &gt; Any idea what the initial error is or could be ?
&gt; &gt; For the second one, I'm going to reduce the number of threads... but I'm
&gt; &gt; wondering if there could be a memory leak? And I don't know how to trace
&gt; &gt; that.
&gt; &gt;
&gt; &gt; --
&gt; &gt; -MilleBii-
&gt; &gt;
&gt;
&gt;
&gt;
&gt; --
&gt; DigitalPebble Ltd
&gt; http://www.digitalpebble.com
&gt;



-- 
-MilleBii-


</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Fetch failing ?</title>
<author><name>Julien Nioche &lt;lists.digitalpebble@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c16d405e0912050356u405e47f5u941fe43307676371@mail.gmail.com%3e"/>
<id>urn:uuid:%3c16d405e0912050356u405e47f5u941fe43307676371@mail-gmail-com%3e</id>
<updated>2009-12-05T11:56:50Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
HADOOP_HEAPSIZE specifies the memory to be used by the Hadoop daemons and
does NOT affect the memory used for the map/reduce jobs. Maybe you should
invest a bit of time reading about Hadoop first?

As for your memory problem, it could be due to the parsing and not the
fetching. If you don't already do so, I suggest that you separate the
fetching from the parsing. First, that will tell you which part fails; and
if it does fail in the parsing, you will not need to refetch the content.

J.

2009/12/5 MilleBii &lt;millebii@gmail.com&gt;

&gt; My fetch cycle failed on the following initial error :
&gt;
&gt; java.io.IOException: Task process exit with nonzero status of 65.
&gt;        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
&gt;
&gt; Then it makes a second attempt and after 3 hours I hit this error
&gt; (although I had doubled HADOOP_HEAPSIZE):
&gt;
&gt; java.lang.OutOfMemoryError: GC overhead limit exceeded
&gt;
&gt;
&gt; Any idea what the initial error is or could be ?
&gt; For the second one, I'm going to reduce the number of threads... but I'm
&gt; wondering if there could be a memory leak? And I don't know how to trace that.
&gt;
&gt; --
&gt; -MilleBii-
&gt;



-- 
DigitalPebble Ltd
http://www.digitalpebble.com


</pre>
</div>
</content>
</entry>
<entry>
<title>Fetch failing ?</title>
<author><name>MilleBii &lt;millebii@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c376d9f4a0912050050p55bb676bs3f387db1cd65d118@mail.gmail.com%3e"/>
<id>urn:uuid:%3c376d9f4a0912050050p55bb676bs3f387db1cd65d118@mail-gmail-com%3e</id>
<updated>2009-12-05T08:50:34Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
My fetch cycle failed on the following initial error :

java.io.IOException: Task process exit with nonzero status of 65.
	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)

Then it makes a second attempt and after 3 hours I hit this error
(although I had doubled HADOOP_HEAPSIZE):

java.lang.OutOfMemoryError: GC overhead limit exceeded


Any idea what the initial error is or could be ?
For the second one, I'm going to reduce the number of threads... but I'm
wondering if there could be a memory leak? And I don't know how to trace that.

-- 
-MilleBii-


</pre>
</div>
</content>
</entry>
<entry>
<title>Re: How to drop page content at fetch stages ?</title>
<author><name>MilleBii &lt;millebii@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c376d9f4a0912050042v71b23846l23a63c4214910e60@mail.gmail.com%3e"/>
<id>urn:uuid:%3c376d9f4a0912050042v71b23846l23a63c4214910e60@mail-gmail-com%3e</id>
<updated>2009-12-05T08:42:11Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Thx, a bit too complex for me right now. I don't -yet- fully understand this
map/reduce technique.
But I'll keep the idea for a future development.

2009/12/4 Dennis Kubes &lt;kubes@apache.org&gt;

&gt; Sorry, segments, not indexes.
&gt;
&gt;
&gt; Dennis Kubes wrote:
&gt;
&gt;&gt; You would need to write a custom MapReduce job to run through the indexes
&gt;&gt; and only keep the ones identified by your plugin.  Be sure to update the
&gt;&gt; CrawlDb with the extracted urls before you drop the content from the
&gt;&gt; segments.
&gt;&gt;
&gt;&gt; Dennis
&gt;&gt;
&gt;&gt; MilleBii wrote:
&gt;&gt;
&gt;&gt;&gt; Hi guys,
&gt;&gt;&gt;
&gt;&gt;&gt; I'm looking at whether I can reduce the disk space occupied by my segments.
&gt;&gt;&gt; I have implemented a topical-scoring plugin... this means I know at that
&gt;&gt;&gt; step whether I should keep a page's content or not.
&gt;&gt;&gt; Is there a way to drop some pages' content after parsing it, but of course
&gt;&gt;&gt; keep the links, because I want to follow the graph?
&gt;&gt;&gt;
&gt;&gt;&gt; PS: Prune is not an option for me because it only cleans up the indexes, not
&gt;&gt;&gt; the segments, and my indexer does that clean-up very well.
&gt;&gt;&gt;
&gt;&gt;&gt;


-- 
-MilleBii-


</pre>
</div>
</content>
</entry>
<entry>
<title>Nutch - create my own repository</title>
<author><name>Eran Zinman &lt;zzeran@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c42b8f64b0912050041ke8d81e0ndc295c1c2a259718@mail.gmail.com%3e"/>
<id>urn:uuid:%3c42b8f64b0912050041ke8d81e0ndc295c1c2a259718@mail-gmail-com%3e</id>
<updated>2009-12-05T08:41:02Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi,

I'm developing my own set of tools, plugins and some minor code changes to
Nutch.

I still want to get updates from the main Nutch repository, but I would
like to keep my own SVN for tracking my local code changes.

I'm using plain command-line SVN (I have no experience with Git) to track my
changes.

My question is - can I create a branch from the main repository to my own
repository, which will only track my changes and keep getting updates from
Nutch main repository with easy merge?

Thanks,
Eran


</pre>
</div>
</content>
</entry>
<entry>
<title>Re: unsubscribe from nutch-user</title>
<author><name>M S Ram &lt;msram@cse.iitk.ac.in&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c4B1A12FB.9040505@cse.iitk.ac.in%3e"/>
<id>urn:uuid:%3c4B1A12FB-9040505@cse-iitk-ac-in%3e</id>
<updated>2009-12-05T07:59:55Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I did it many times. But I am still receiving these mails.

prashant ullegaddi wrote:
&gt; Take a look at it:
&gt;
&gt; http://lucene.apache.org/nutch/mailing_lists.html
&gt;
&gt; or
&gt; probably sending a blank mail to:
&gt; nutch-user-unsubscribe@lucene.apache.org should also
&gt; work.
&gt;
&gt; Thanks,
&gt; Prashant.
&gt;
&gt; On Fri, Dec 4, 2009 at 8:30 PM, M S Ram &lt;msram@cse.iitk.ac.in&gt; wrote:
&gt;
&gt;   
&gt;&gt; Same here. Please remove my ID also from the mailing list.
&gt;&gt;
&gt;&gt; Thanks,
&gt;&gt; MSR
&gt;&gt;
&gt;&gt; rengan xu wrote:
&gt;&gt;
&gt;&gt;     
&gt;&gt;&gt; To whom it may concern,
&gt;&gt;&gt;
&gt;&gt;&gt; Hello! Because I will use this E-mail for special purpose. I will use
&gt;&gt;&gt; another E-mail to subscribe in nutch-user. So I want to unsubscribe from
&gt;&gt;&gt; nutch-user.
&gt;&gt;&gt;
&gt;&gt;&gt; Thank you!
&gt;&gt;&gt;
&gt;&gt;&gt;
&gt;&gt;&gt;
&gt;&gt;&gt;
&gt;&gt;&gt;       
&gt;&gt;
&gt;&gt;     
&gt;
&gt;
&gt;   




</pre>
</div>
</content>
</entry>
<entry>
<title>Nutch image extraction</title>
<author><name>manishkbawne &lt;manish.bawne@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c26653542.post@talk.nabble.com%3e"/>
<id>urn:uuid:%3c26653542-post@talk-nabble-com%3e</id>
<updated>2009-12-05T07:36:18Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>

Hi, 
I am using Nutch to crawl data from the web. Now I want to extract the
images using Nutch. Can somebody please suggest a way to do that,
or point me to a URL?

Regards,
Manish Bawne
Software Engineer
Biz Integra Systems Pvt Ltd
http://www.bizhandel.com

-- 
View this message in context: http://old.nabble.com/Nutch-image-extraction-tp26653542p26653542.html
Sent from the Nutch - User mailing list archive at Nabble.com.



</pre>
</div>
</content>
</entry>
<entry>
<title>Re: How to drop page content at fetch stages ?</title>
<author><name>Dennis Kubes &lt;kubes@apache.org&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c4B199376.2060705@apache.org%3e"/>
<id>urn:uuid:%3c4B199376-2060705@apache-org%3e</id>
<updated>2009-12-04T22:55:50Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Sorry, segments, not indexes.

Dennis Kubes wrote:
&gt; You would need to write a custom MapReduce job to run through the 
&gt; indexes and only keep the ones identified by your plugin.  Be sure to 
&gt; update the CrawlDb with the extracted urls before you drop the content 
&gt; from the segments.
&gt; 
&gt; Dennis
&gt; 
&gt; MilleBii wrote:
&gt;&gt; Hi guys,
&gt;&gt;
&gt;&gt; I'm looking if I can optimize the size occupied on disk by my segments.
&gt;&gt; I have implemented a topical-scoring plugin... this means I know at that
&gt;&gt; steps if I should keep that page content or not.
&gt;&gt; Is there a way to drop some pages content after parsing it, but of course
&gt;&gt; keep the links because I want to follow the graph ?
&gt;&gt;
&gt;&gt; PS: Prune is no option to me because it only cleans up the indexes, 
&gt;&gt; not the
&gt;&gt; segments and my indexer does that clean-up very well.
&gt;&gt;


</pre>
</div>
</content>
</entry>
<entry>
<title>Re: How to drop page content at fetch stages ?</title>
<author><name>Dennis Kubes &lt;kubes@apache.org&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c4B199183.2090404@apache.org%3e"/>
<id>urn:uuid:%3c4B199183-2090404@apache-org%3e</id>
<updated>2009-12-04T22:47:31Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
You would need to write a custom MapReduce job to run through the 
indexes and keep only the ones identified by your plugin.  Be sure to 
update the CrawlDb with the extracted urls before you drop the content 
from the segments.
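
A rough sketch of that ordering (record the links first, then drop the
content), written in plain Python rather than as a real Hadoop job. The
record layout ("content", "outlinks", "keep") is made up for illustration
and is not the Nutch segment format; the "keep" flag stands for whatever
the scoring plugin decides.

```python
# Sketch only: plain dicts stand in for segment records.
# The field names ("content", "outlinks", "keep") are hypothetical.

def collect_outlinks(records):
    """Gather every extracted URL so the crawl database can be updated first."""
    links = set()
    for _url, record in records:
        links.update(record["outlinks"])
    return links


def drop_unwanted_content(records):
    """Map-style pass: blank the content of rejected pages, keep their links."""
    for url, record in records:
        if record["keep"]:
            yield url, record
        else:
            # Copy the record so the input is left untouched.
            yield url, dict(record, content=None)


segment = [
    ("http://a.example/", {"content": "on topic",
                           "outlinks": ["http://b.example/"], "keep": True}),
    ("http://b.example/", {"content": "off topic",
                           "outlinks": ["http://c.example/"], "keep": False}),
]

known_links = collect_outlinks(segment)          # step 1: record the links
slimmed = dict(drop_unwanted_content(segment))   # step 2: drop the content
```

In a real job the first step would be the normal CrawlDb update and only
the second step would be custom code.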

Dennis

MilleBii wrote:
&gt; Hi guys,
&gt; 
&gt; I'm looking if I can optimize the size occupied on disk by my segments.
&gt; I have implemented a topical-scoring plugin... this means I know at that
&gt; steps if I should keep that page content or not.
&gt; Is there a way to drop some pages content after parsing it, but of course
&gt; keep the links because I want to follow the graph ?
&gt; 
&gt; PS: Prune is no option to me because it only cleans up the indexes, not the
&gt; segments and my indexer does that clean-up very well.
&gt; 


</pre>
</div>
</content>
</entry>
<entry>
<title>How to drop page content at fetch stages ?</title>
<author><name>MilleBii &lt;millebii@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c376d9f4a0912041418i620d5f89n2c74a27f63d0d91f@mail.gmail.com%3e"/>
<id>urn:uuid:%3c376d9f4a0912041418i620d5f89n2c74a27f63d0d91f@mail-gmail-com%3e</id>
<updated>2009-12-04T22:18:23Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi guys,

I'm looking at whether I can reduce the disk space occupied by my segments.
I have implemented a topical-scoring plugin, which means that at that
step I know whether I should keep a page's content or not.
Is there a way to drop some pages' content after parsing, while of course
keeping the links, because I want to follow the graph?

PS: Prune is not an option for me because it only cleans up the indexes, not the
segments, and my indexer already does that clean-up very well.

-- 
-MilleBii-


</pre>
</div>
</content>
</entry>
<entry>
<title>Re: What is the best choice: nutch/lucene or nutch/solr?</title>
<author><name>Otis Gospodnetic &lt;ogjunk-nutch@yahoo.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c201852.33120.qm@web50304.mail.re2.yahoo.com%3e"/>
<id>urn:uuid:%3c201852-33120-qm@web50304-mail-re2-yahoo-com%3e</id>
<updated>2009-12-04T20:20:26Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Sounds like Nutch for crawling to gather the data, then custom tools to read the gathered data,
call the KV store, construct SolrInputDocuments, and index those to Solr.  That is if you want Solr
and not Lucene, which is a bigger question that I can't answer without knowing the details.
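
As a rough illustration of the pipeline described above, here is a small
Python sketch. Plain dicts stand in for the crawl output, the key-value
store and the documents sent to the indexer. The field names, and the
idea that the store is keyed by page URL, are assumptions made for the
example only.

```python
# Sketch only: dicts stand in for the crawl output, the key-value store and
# the documents handed to the indexer. All field names are hypothetical.

def build_documents(crawled_pages, kv_store):
    """Join each crawled page with its key-value record, if one exists."""
    documents = []
    for page in crawled_pages:
        doc = {"id": page["url"], "text": page["text"]}
        # Assumption: the key-value store is keyed by the page URL.
        doc.update(kv_store.get(page["url"], {}))
        documents.append(doc)
    return documents


pages = [
    {"url": "http://a.example/", "text": "crawled text A"},
    {"url": "http://b.example/", "text": "crawled text B"},
]
store = {"http://a.example/": {"category": "news"}}

docs = build_documents(pages, store)
```

Each resulting dict corresponds to one document to index; pages with no
entry in the store are indexed with the crawled fields only.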

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----
&gt; From: Mr Hadoop &lt;mrhadoop@gmail.com&gt;
&gt; To: nutch-user@lucene.apache.org
&gt; Sent: Fri, December 4, 2009 2:51:47 PM
&gt; Subject: What is the best choice: nutch/lucene or nutch/solr?
&gt; 
&gt; I am going over mailing list and still didn't find an answer.
&gt; 
&gt; For a project, I need to crawl the web, index it and merge that content with
&gt; another site's content which is stored inside the key-value storage system.
&gt; 
&gt; What is the best approach to merge these two contents in to a lucene index,
&gt; solr index or keep the index separate but merge during the search query
&gt; results?



</pre>
</div>
</content>
</entry>
<entry>
<title>What is the best choice: nutch/lucene or nutch/solr?</title>
<author><name>Mr Hadoop &lt;mrhadoop@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c25e5b37b0912041151v6f6e4d76p53610cc80dd0cc64@mail.gmail.com%3e"/>
<id>urn:uuid:%3c25e5b37b0912041151v6f6e4d76p53610cc80dd0cc64@mail-gmail-com%3e</id>
<updated>2009-12-04T19:51:47Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I have gone through the mailing list and still haven't found an answer.

For a project, I need to crawl the web, index it, and merge that content with
another site's content, which is stored in a key-value storage system.

What is the best approach: merge the two sets of content into a Lucene index,
merge them into a Solr index, or keep the indexes separate and merge the results
at search time?


</pre>
</div>
</content>
</entry>
<entry>
<title>RE: Problems with a new Installation of Nutch</title>
<author><name>&quot;Tom Landvoigt&quot; &lt;tom.landvoigt@linklift.de&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c5695272CB1DAFC4689A5F4BFC31A60EC6D4431@adenauer.synserver.ads%3e"/>
<id>urn:uuid:%3c5695272CB1DAFC4689A5F4BFC31A60EC6D4431@adenauer-synserver-ads%3e</id>
<updated>2009-12-04T19:15:57Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi,

Does anyone know which packages I have to install on SUSE to get Nutch running?

I have another installation of Nutch where everything is fine, so I copied the whole installation.
It is also SUSE Linux, but it is 64-bit and I didn't install it myself.

But I get the same problem.
So far I have installed the following packages:
Tomcat 6
Openjdk devel 1.6
Sun java devel 1.6
Ant 1.7

Now it is enough for today.

Hope someone can help.

Tom

-----Original Message-----
From: MilleBii [mailto:millebii@gmail.com] 
Sent: Friday, 4 December 2009 17:31
To: nutch-user@lucene.apache.org
Subject: Re: Problems with a new Installation of Nutch

I don't know that hadoop uses tomcat... But I think it uses Jetty
instead. The nodes communicate via http: so you need some kind of web
server... And for monitorin its the best way

2009/12/4, Tom Landvoigt &lt;tom.landvoigt@linklift.de&gt;:
&gt; Hi,
&gt;
&gt; I don't have tomcat on this system because I don't want to use the
&gt; websearch. But if it is necessary for hadoop what I don't think I will
&gt; install it.
&gt;
&gt; nutch@ip-10-224-113-210:/nutch/search&gt; ./bin/hadoop fs -ls /
&gt; Found 1 items
&gt; -rw-r--r--   2 nutch supergroup          0 2009-12-04 14:04 /url.txt
&gt; nutch@ip-10-224-113-210:/nutch/search&gt;
&gt;
&gt; I get the normal answer but the file is empty.
&gt;
&gt; -----Original Message-----
&gt; From: MilleBii [mailto:millebii@gmail.com]
&gt; Sent: Friday, 4 December 2009 15:06
&gt; To: nutch-user@lucene.apache.org
&gt; Subject: Re: Problems with a new Installation of Nutch
&gt;
&gt; Did you check with the web interface ? It gives a lot of info you can
&gt; even browse the file system.
&gt;
&gt; Try hadoop fs -ls to see what it gives you ?
&gt;
&gt; 2009/12/4, Tom Landvoigt &lt;tom.landvoigt@linklift.de&gt;:
&gt;&gt; Hallo,
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; I hope someone can help me.
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; I installed nutch on 2 Amazon EC2 computers. Everything is fine but I
&gt;&gt; can't put data in the hdfs.
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; I formatted the namenode and start the hdfs with start all.
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;  All  java processes start properly, but when I want to make hadoop fs
&gt;&gt; -put something / I get these logs:
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; nutch@bla:/nutch/search&gt; ./bin/hadoop fs -put
&gt;&gt; /tmp/hadoop-nutch-tasktracker.pid blub
&gt;&gt;
&gt;&gt; put: Protocol not available
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; DATA NODE LOG on the master:
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:15,566 INFO  http.HttpServer - Version Jetty/5.1.4
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:15,582 INFO  util.Credential - Checking Resource
&gt;&gt; aliases
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:16,483 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@e45b5e
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:16,614 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/static,/static]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:16,882 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@1284fd4
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:16,883 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/logs,/logs]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:17,827 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@39c8c1
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:17,849 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/,/]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:18,485 INFO  http.SocketListener - Started
&gt;&gt; SocketListener on 0.0.0.0:50075
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:18,485 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.Server@36527f
&gt;&gt;
&gt;&gt; 2009-12-04 12:54:20,745 ERROR datanode.DataNode -
&gt;&gt; DatanodeRegistration(10.224.113.210:50010,
&gt;&gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&gt;&gt; infoPort=50075, ipcPort=50020):DataXceiver
&gt;&gt; java.io.EOFException
&gt;&gt;     at java.io.DataInputStream.readShort(DataInputStream.java:315)
&gt;&gt;     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&gt;&gt;     at java.lang.Thread.run(Thread.java:636)
&gt;&gt;
&gt;&gt; 2009-12-04 12:54:20,746 ERROR datanode.DataNode -
&gt;&gt; DatanodeRegistration(10.224.113.210:50010,
&gt;&gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&gt;&gt; infoPort=50075, ipcPort=50020):DataXceiver
&gt;&gt; java.io.EOFException
&gt;&gt;     at java.io.DataInputStream.readShort(DataInputStream.java:315)
&gt;&gt;     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&gt;&gt;     at java.lang.Thread.run(Thread.java:636)
&gt;&gt;
&gt;&gt; 2009-12-04 12:54:20,747 ERROR datanode.DataNode -
&gt;&gt; DatanodeRegistration(10.224.113.210:50010,
&gt;&gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&gt;&gt; infoPort=50075, ipcPort=50020):DataXceiver
&gt;&gt; java.io.EOFException
&gt;&gt;     at java.io.DataInputStream.readShort(DataInputStream.java:315)
&gt;&gt;     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&gt;&gt;     at java.lang.Thread.run(Thread.java:636)
&gt;&gt;
&gt;&gt; 2009-12-04 12:54:20,747 ERROR datanode.DataNode -
&gt;&gt; DatanodeRegistration(10.224.113.210:50010,
&gt;&gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&gt;&gt; infoPort=50075, ipcPort=50020):DataXceiver
&gt;&gt; java.io.EOFException
&gt;&gt;     at java.io.DataInputStream.readShort(DataInputStream.java:315)
&gt;&gt;     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&gt;&gt;     at java.lang.Thread.run(Thread.java:636)
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; NAME NODE LOG
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:11,539 INFO  http.HttpServer - Version Jetty/5.1.4
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:11,573 INFO  util.Credential - Checking Resource
&gt;&gt; aliases
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:12,488 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@19fe451
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:12,565 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/static,/static]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:12,891 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@1570945
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:12,891 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/logs,/logs]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:13,569 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@11410e5
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:13,582 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/,/]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:13,613 INFO  http.SocketListener - Started
&gt;&gt; SocketListener on 0.0.0.0:50070
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:13,613 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.Server@173ec72
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; SECONDARY NAMENODE LOG
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:19,163 INFO  http.HttpServer - Version Jetty/5.1.4
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:19,207 INFO  util.Credential - Checking Resource
&gt;&gt; aliases
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:20,365 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@174d93a
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:20,454 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/static,/static]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:21,396 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@31f2a7
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:21,396 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/logs,/logs]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:21,533 INFO  servlet.XMLConfiguration - No
&gt;&gt; WEB-INF/web.xml in file:/mnt/nutch/nutch-1.0/webapps/secondary. Serving
&gt;&gt; files and default/dynamic servlets only
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:22,206 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@383118
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:22,785 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/,/]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:22,787 INFO  http.SocketListener - Started
&gt;&gt; SocketListener on 0.0.0.0:50090
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:22,787 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.Server@297ffb
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:22,787 WARN  namenode.SecondaryNameNode - Checkpoint
&gt;&gt; Period   :3600 secs (60 min)
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:22,787 WARN  namenode.SecondaryNameNode - Log Size
&gt;&gt; Trigger    :67108864 bytes (65536 KB)
&gt;&gt;
&gt;&gt; 2009-12-04 12:55:23,908 WARN  namenode.SecondaryNameNode - Checkpoint
&gt;&gt; done. New Image Size: 1056
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; HADOOP LOG
&gt;&gt;
&gt;&gt; 2009-12-04 12:54:20,708 WARN  hdfs.DFSClient - DataStreamer Exception:
&gt;&gt; java.io.IOException: Unable to create new block.
&gt;&gt;     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722)
&gt;&gt;     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
&gt;&gt;     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; 2009-12-04 12:54:20,709 WARN  hdfs.DFSClient - Error Recovery for block
&gt;&gt; blk_5506837520665828594_1002 bad datanode[0] nodes == null
&gt;&gt;
&gt;&gt; 2009-12-04 12:54:20,709 WARN  hdfs.DFSClient - Could not get block
&gt;&gt; locations. Source file "/user/nutch/blub/hadoop-nutch-tasktracker.pid" -
&gt;&gt; Aborting...
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; DATA NODE LOG on the slave
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:49,433 INFO  http.HttpServer - Version Jetty/5.1.4
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:49,438 INFO  util.Credential - Checking Resource
&gt;&gt; aliases
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:50,288 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@e45b5e
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:50,357 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/static,/static]
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:50,555 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@2016b0
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:50,555 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/logs,/logs]
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:50,816 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@118278a
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:50,820 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/,/]
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:50,849 INFO  http.SocketListener - Started
&gt;&gt; SocketListener on 0.0.0.0:50075
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:50,849 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.Server@b02928
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; HADOOP SITE XML
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;fs.default.name&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;hdfs://(yes here is the right ip):9000&lt;/value&gt;
&gt;&gt;
&gt;&gt;   &lt;description&gt;
&gt;&gt;
&gt;&gt;     The name of the default file system. Either the literal string
&gt;&gt;
&gt;&gt;     "local" or a host:port for NDFS.
&gt;&gt;
&gt;&gt;   &lt;/description&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;!-- Specifies where the JobTracker (which coordinates the MapReduce jobs)
&gt;&gt; can be found. --&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;mapred.job.tracker&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;hdfs://(here to):9001&lt;/value&gt;
&gt;&gt;
&gt;&gt;   &lt;description&gt;
&gt;&gt;
&gt;&gt;     The host and port that the MapReduce job tracker runs at. If
&gt;&gt;
&gt;&gt;     "local", then jobs are run in-process as a single map and
&gt;&gt;
&gt;&gt;     reduce task.
&gt;&gt;
&gt;&gt;   &lt;/description&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;!-- Specifies how many map jobs may run at the same time --&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;mapred.tasktracker.map.tasks.maximum&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;2&lt;/value&gt;
&gt;&gt;
&gt;&gt;   &lt;description&gt;
&gt;&gt;
&gt;&gt;     define mapred.map tasks to be number of slave hosts
&gt;&gt;
&gt;&gt;   &lt;/description&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;!-- Specifies how many reduce jobs may run at the same time --&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;mapred.tasktracker.reduce.tasks.maximum&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;2&lt;/value&gt;
&gt;&gt;
&gt;&gt;   &lt;description&gt;
&gt;&gt;
&gt;&gt;     define mapred.reduce tasks to be number of slave hosts
&gt;&gt;
&gt;&gt;   &lt;/description&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;mapred.child.java.opts&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;-Xmx1500m&lt;/value&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;mapred.jobtracker.restart.recover&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;true&lt;/value&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;!-- The next settings specify where the Hadoop FS stores its files
&gt;&gt; on the hard disk of each instance. --&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;dfs.name.dir&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;/nutch/filesystem/name&lt;/value&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;dfs.data.dir&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;/nutch/filesystem/data&lt;/value&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;mapred.system.dir&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;/nutch/filesystem/mapreduce/system&lt;/value&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;mapred.local.dir&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;/nutch/filesystem/mapreduce/local&lt;/value&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;!-- Specifies how many replicas of a file must exist in the file system
&gt;&gt; for it to be reachable. Initially 1 --&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;dfs.replication&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;2&lt;/value&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; I hope someone can help me.
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; Thanks
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; Tom
&gt;&gt;
&gt;&gt;
&gt;
&gt;
&gt; --
&gt; -MilleBii-
&gt;


-- 
-MilleBii-

</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Problems with a new Installation of Nutch</title>
<author><name>MilleBii &lt;millebii@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c376d9f4a0912040830y77b8bdeflea2a000fc4936106@mail.gmail.com%3e"/>
<id>urn:uuid:%3c376d9f4a0912040830y77b8bdeflea2a000fc4936106@mail-gmail-com%3e</id>
<updated>2009-12-04T16:30:58Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I don't think Hadoop uses Tomcat... I think it uses Jetty
instead. The nodes communicate via HTTP, so you need some kind of web
server... And for monitoring it's the best way.

2009/12/4, Tom Landvoigt &lt;tom.landvoigt@linklift.de&gt;:
&gt; Hi,
&gt;
&gt; I don't have tomcat on this system because I don't want to use the
&gt; websearch. But if it is necessary for hadoop what I don't think I will
&gt; install it.
&gt;
&gt; nutch@ip-10-224-113-210:/nutch/search&gt; ./bin/hadoop fs -ls /
&gt; Found 1 items
&gt; -rw-r--r--   2 nutch supergroup          0 2009-12-04 14:04 /url.txt
&gt; nutch@ip-10-224-113-210:/nutch/search&gt;
&gt;
&gt; I get the normal answer but the file is empty.
&gt;
&gt; -----Original Message-----
&gt; From: MilleBii [mailto:millebii@gmail.com]
&gt; Sent: Friday, 4 December 2009 15:06
&gt; To: nutch-user@lucene.apache.org
&gt; Subject: Re: Problems with a new Installation of Nutch
&gt;
&gt; Did you check with the web interface ? It gives a lot of info you can
&gt; even browse the file system.
&gt;
&gt; Try hadoop fs -ls to see what it gives you ?
&gt;
&gt; 2009/12/4, Tom Landvoigt &lt;tom.landvoigt@linklift.de&gt;:
&gt;&gt; Hallo,
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; I hope someone can help me.
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; I installed nutch on 2 Amazon EC2 computers. Everything is fine but I
&gt;&gt; can't put data in the hdfs.
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; I formatted the namenode and start the hdfs with start all.
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;  All  java processes start properly, but when I want to make hadoop fs
&gt;&gt; -put something / I get these logs:
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; nutch@bla:/nutch/search&gt; ./bin/hadoop fs -put
&gt;&gt; /tmp/hadoop-nutch-tasktracker.pid blub
&gt;&gt;
&gt;&gt; put: Protocol not available
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; DATA NODE LOG on the master:
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:15,566 INFO  http.HttpServer - Version Jetty/5.1.4
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:15,582 INFO  util.Credential - Checking Resource
&gt;&gt; aliases
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:16,483 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@e45b5e
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:16,614 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/static,/static]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:16,882 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@1284fd4
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:16,883 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/logs,/logs]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:17,827 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@39c8c1
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:17,849 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/,/]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:18,485 INFO  http.SocketListener - Started
&gt;&gt; SocketListener on 0.0.0.0:50075
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:18,485 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.Server@36527f
&gt;&gt;
&gt;&gt; 2009-12-04 12:54:20,745 ERROR datanode.DataNode -
&gt;&gt; DatanodeRegistration(10.224.113.210:50010,
&gt;&gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&gt;&gt; infoPort=50075, ipcPort=50020):DataXceiver
&gt;&gt; java.io.EOFException
&gt;&gt;     at java.io.DataInputStream.readShort(DataInputStream.java:315)
&gt;&gt;     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&gt;&gt;     at java.lang.Thread.run(Thread.java:636)
&gt;&gt;
&gt;&gt; 2009-12-04 12:54:20,746 ERROR datanode.DataNode -
&gt;&gt; DatanodeRegistration(10.224.113.210:50010,
&gt;&gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&gt;&gt; infoPort=50075, ipcPort=50020):DataXceiver
&gt;&gt; java.io.EOFException
&gt;&gt;     at java.io.DataInputStream.readShort(DataInputStream.java:315)
&gt;&gt;     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&gt;&gt;     at java.lang.Thread.run(Thread.java:636)
&gt;&gt;
&gt;&gt; 2009-12-04 12:54:20,747 ERROR datanode.DataNode -
&gt;&gt; DatanodeRegistration(10.224.113.210:50010,
&gt;&gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&gt;&gt; infoPort=50075, ipcPort=50020):DataXceiver
&gt;&gt; java.io.EOFException
&gt;&gt;     at java.io.DataInputStream.readShort(DataInputStream.java:315)
&gt;&gt;     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&gt;&gt;     at java.lang.Thread.run(Thread.java:636)
&gt;&gt;
&gt;&gt; 2009-12-04 12:54:20,747 ERROR datanode.DataNode -
&gt;&gt; DatanodeRegistration(10.224.113.210:50010,
&gt;&gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&gt;&gt; infoPort=50075, ipcPort=50020):DataXceiver
&gt;&gt; java.io.EOFException
&gt;&gt;     at java.io.DataInputStream.readShort(DataInputStream.java:315)
&gt;&gt;     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&gt;&gt;     at java.lang.Thread.run(Thread.java:636)
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; NAME NODE LOG
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:11,539 INFO  http.HttpServer - Version Jetty/5.1.4
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:11,573 INFO  util.Credential - Checking Resource
&gt;&gt; aliases
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:12,488 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@19fe451
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:12,565 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/static,/static]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:12,891 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@1570945
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:12,891 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/logs,/logs]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:13,569 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@11410e5
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:13,582 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/,/]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:13,613 INFO  http.SocketListener - Started
&gt;&gt; SocketListener on 0.0.0.0:50070
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:13,613 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.Server@173ec72
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; SECONDARY NAMENODE LOG
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:19,163 INFO  http.HttpServer - Version Jetty/5.1.4
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:19,207 INFO  util.Credential - Checking Resource
&gt;&gt; aliases
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:20,365 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@174d93a
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:20,454 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/static,/static]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:21,396 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@31f2a7
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:21,396 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/logs,/logs]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:21,533 INFO  servlet.XMLConfiguration - No
&gt;&gt; WEB-INF/web.xml in file:/mnt/nutch/nutch-1.0/webapps/secondary. Serving
&gt;&gt; files and default/dynamic servlets only
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:22,206 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@383118
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:22,785 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/,/]
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:22,787 INFO  http.SocketListener - Started
&gt;&gt; SocketListener on 0.0.0.0:50090
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:22,787 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.Server@297ffb
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:22,787 WARN  namenode.SecondaryNameNode - Checkpoint
&gt;&gt; Period   :3600 secs (60 min)
&gt;&gt;
&gt;&gt; 2009-12-04 12:50:22,787 WARN  namenode.SecondaryNameNode - Log Size
&gt;&gt; Trigger    :67108864 bytes (65536 KB)
&gt;&gt;
&gt;&gt; 2009-12-04 12:55:23,908 WARN  namenode.SecondaryNameNode - Checkpoint
&gt;&gt; done. New Image Size: 1056
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; HADOOP LOG
&gt;&gt;
&gt;&gt; 2009-12-04 12:54:20,708 WARN  hdfs.DFSClient - DataStreamer Exception:
&gt;&gt; java.io.IOException: Unable to create new block.
&gt;&gt;     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722)
&gt;&gt;     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
&gt;&gt;     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; 2009-12-04 12:54:20,709 WARN  hdfs.DFSClient - Error Recovery for block
&gt;&gt; blk_5506837520665828594_1002 bad datanode[0] nodes == null
&gt;&gt;
&gt;&gt; 2009-12-04 12:54:20,709 WARN  hdfs.DFSClient - Could not get block
&gt;&gt; locations. Source file "/user/nutch/blub/hadoop-nutch-tasktracker.pid" -
&gt;&gt; Aborting...
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; DATA NODE LOG on the slave
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:49,433 INFO  http.HttpServer - Version Jetty/5.1.4
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:49,438 INFO  util.Credential - Checking Resource
&gt;&gt; aliases
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:50,288 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@e45b5e
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:50,357 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/static,/static]
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:50,555 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@2016b0
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:50,555 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/logs,/logs]
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:50,816 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.servlet.WebApplicationHandler@118278a
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:50,820 INFO  util.Container - Started
&gt;&gt; WebApplicationContext[/,/]
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:50,849 INFO  http.SocketListener - Started
&gt;&gt; SocketListener on 0.0.0.0:50075
&gt;&gt;
&gt;&gt; 2009-12-04 12:49:50,849 INFO  util.Container - Started
&gt;&gt; org.mortbay.jetty.Server@b02928
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; HADOOP SITE XML
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;fs.default.name&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;hdfs://(yes here is the right ip):9000&lt;/value&gt;
&gt;&gt;
&gt;&gt;   &lt;description&gt;
&gt;&gt;
&gt;&gt;     The name of the default file system. Either the literal string
&gt;&gt;
&gt;&gt;     "local" or a host:port for NDFS.
&gt;&gt;
&gt;&gt;   &lt;/description&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;!-- Specifies where the JobTracker (which coordinates the MapReduce jobs)
&gt;&gt; can be found. --&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;mapred.job.tracker&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;hdfs://(here to):9001&lt;/value&gt;
&gt;&gt;
&gt;&gt;   &lt;description&gt;
&gt;&gt;
&gt;&gt;     The host and port that the MapReduce job tracker runs at. If
&gt;&gt;
&gt;&gt;     "local", then jobs are run in-process as a single map and
&gt;&gt;
&gt;&gt;     reduce task.
&gt;&gt;
&gt;&gt;   &lt;/description&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;!-- Specifies how many map jobs may run concurrently --&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;mapred.tasktracker.map.tasks.maximum&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;2&lt;/value&gt;
&gt;&gt;
&gt;&gt;   &lt;description&gt;
&gt;&gt;
&gt;&gt;     define mapred.map tasks to be number of slave hosts
&gt;&gt;
&gt;&gt;   &lt;/description&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;!-- Specifies how many reduce jobs may run concurrently --&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;mapred.tasktracker.reduce.tasks.maximum&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;2&lt;/value&gt;
&gt;&gt;
&gt;&gt;   &lt;description&gt;
&gt;&gt;
&gt;&gt;     define mapred.reduce tasks to be number of slave hosts
&gt;&gt;
&gt;&gt;   &lt;/description&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;mapred.child.java.opts&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;-Xmx1500m&lt;/value&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;mapred.jobtracker.restart.recover&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;true&lt;/value&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;!-- The next settings specify where the Hadoop FS stores its files
&gt;&gt; on each instance's disk. --&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;dfs.name.dir&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;/nutch/filesystem/name&lt;/value&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;dfs.data.dir&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;/nutch/filesystem/data&lt;/value&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;mapred.system.dir&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;/nutch/filesystem/mapreduce/system&lt;/value&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;mapred.local.dir&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;/nutch/filesystem/mapreduce/local&lt;/value&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; &lt;!-- Specifies how many replicas of a file must exist in the file system
&gt;&gt; for it to be reachable. Initially 1 --&gt;
&gt;&gt;
&gt;&gt; &lt;property&gt;
&gt;&gt;
&gt;&gt;   &lt;name&gt;dfs.replication&lt;/name&gt;
&gt;&gt;
&gt;&gt;   &lt;value&gt;2&lt;/value&gt;
&gt;&gt;
&gt;&gt; &lt;/property&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; I hope someone can help me.
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; Thanks
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt; Tom
&gt;&gt;
&gt;&gt;
&gt;
&gt;
&gt; --
&gt; -MilleBii-
&gt;


-- 
-MilleBii-


</pre>
</div>
</content>
</entry>
<entry>
<title>RE: How to force recrawl of everything</title>
<author><name>&quot;Peters, Vijaya&quot; &lt;Vijaya_Peters@sra.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c49D143542281E54ABB884F30D140B20913DA27BE@sraex2.sra.com%3e"/>
<id>urn:uuid:%3c49D143542281E54ABB884F30D140B20913DA27BE@sraex2-sra-com%3e</id>
<updated>2009-12-04T15:36:09Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>

Running:
bin/nutch readdb crawldb -url &lt;url&gt; I got the following exception.
Also, how do I force a recrawl in Nutch 1.0?


Exception in thread "main" java.lang.ArithmeticException: / by zero
        at org.apache.hadoop.mapred.lib.HashPartitioner.getPartition(HashPartitioner.java:32)
        at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:104)
        at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:380)
        at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:386)
        at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:511)
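
A possible reading of this trace, sketched under assumptions: HashPartitioner
picks a partition as the key hash modulo the number of partitions, so
"/ by zero" would mean the list of opened readers was empty, for example
because the crawldb path given to readdb does not point at a directory that
holds current/part-* data. The Python below is only an illustration;
get_partition, get_entry and the dict-based readers are hypothetical
stand-ins, not Nutch or Hadoop code.

```python
# Sketch of hash partitioning as the trace suggests it is used:
# pick one of N open readers from the key's hash.
def get_partition(key_hash, num_partitions):
    # Clear the sign bit (same effect as masking with Integer.MAX_VALUE),
    # then take the remainder by the number of partitions.
    return (key_hash % 2**31) % num_partitions


def get_entry(readers, key):
    # With an empty reader list the modulo above divides by zero.
    part = get_partition(hash(key), len(readers))
    return readers[part].get(key)


# Hypothetical data: two in-memory "part" readers standing in for MapFiles.
url = "http://example.com/"
readers = [{}, {}]
readers[get_partition(hash(url), len(readers))][url] = "db_fetched"

print(get_entry(readers, url))   # finds the stored entry

try:
    get_entry([], url)           # no part directories were found
except ZeroDivisionError as err:
    print("lookup failed:", err)
```

If that is indeed the cause, passing the full path of the crawldb directory
(for example crawl/crawldb, an assumed layout) may avoid the exception.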

Vijaya Peters
SRA International, Inc.
4350 Fair Lakes Court North
Room 4004
Fairfax, VA  22033
Tel:  703-502-1184

www.sra.com
Named to FORTUNE's "100 Best Companies to Work For" list for 10
consecutive years
Please consider the environment before printing this e-mail
This electronic message transmission contains information from SRA
International, Inc. which may be confidential, privileged or
proprietary.  The information is intended for the use of the individual
or entity named above.  If you are not the intended recipient, be aware
that any disclosure, copying, distribution, or use of the contents of
this information is strictly prohibited.  If you have received this
electronic information in error, please notify us immediately by
telephone at 866-584-2143.
-----Original Message-----
From: reinhard schwab [mailto:reinhard.schwab@aon.at] 
Sent: Friday, December 04, 2009 8:32 AM
To: nutch-user@lucene.apache.org
Subject: Re: How to force recrawl of everything

Peters, Vijaya wrote:
&gt; I am using Nutch 1.0.  I want to perform a 'clean' crawl.  
&gt;
&gt;  
&gt;
&gt; I see the force option in this patch:  NUTCH-601v1.0.patch
&gt;
&gt; &lt;https://issues.apache.org/jira/secure/attachment/12375717/NUTCH-601v1.0.patch&gt;
&gt;
&gt; Do I have to make those code changes, or does Nutch 1.0 have another way
&gt; to do this?
&gt;
&gt;  
&gt;
&gt; Also, everytime I do another crawl, I see the same file being fetched
&gt; over and over again. Is it appending the same url over and over to the
&gt;   
which file?
you can check the crawl date of this file with

reinhard@thord:&gt;bin/nutch readdb  &lt;crawldb&gt;   -url &lt;url&gt;


&gt; fetch list?
&gt;
&gt;  
&gt;
&gt; Thanks,
&gt;
&gt; - Vijaya
&gt;
&gt;  
&gt;
&gt;  
&gt;
&gt; Vijaya Peters
&gt; SRA International, Inc.
&gt; 4350 Fair Lakes Court North
&gt; Room 4004
&gt; Fairfax, VA  22033
&gt; Tel:  703-502-1184
&gt;
&gt; www.sra.com &lt;http://www.sra.com/&gt; 
&gt; Named to FORTUNE's "100 Best Companies to Work For" list for 10
&gt; consecutive years
&gt;
&gt; Please consider the environment before printing this e-mail
&gt;
&gt; This electronic message transmission contains information from SRA
&gt; International, Inc. which may be confidential, privileged or
&gt; proprietary.  The information is intended for the use of the individual
&gt; or entity named above.  If you are not the intended recipient, be aware
&gt; that any disclosure, copying, distribution, or use of the contents of
&gt; this information is strictly prohibited.  If you have received this
&gt; electronic information in error, please notify us immediately by
&gt; telephone at 866-584-2143.
&gt;
&gt;  
&gt;
&gt;
&gt;   



</pre>
</div>
</content>
</entry>
<entry>
<title>RE: Problems with a new Installation of Nutch</title>
<author><name>&quot;Tom Landvoigt&quot; &lt;tom.landvoigt@linklift.de&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c5695272CB1DAFC4689A5F4BFC31A60EC6D43F5@adenauer.synserver.ads%3e"/>
<id>urn:uuid:%3c5695272CB1DAFC4689A5F4BFC31A60EC6D43F5@adenauer-synserver-ads%3e</id>
<updated>2009-12-04T15:35:08Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi,

I don't have Tomcat on this system because I don't want to use the web search. But if it is
necessary for Hadoop, which I don't think it is, I will install it.

nutch@ip-10-224-113-210:/nutch/search&gt; ./bin/hadoop fs -ls /
Found 1 items
-rw-r--r--   2 nutch supergroup          0 2009-12-04 14:04 /url.txt
nutch@ip-10-224-113-210:/nutch/search&gt;

I get the normal answer but the file is empty.

-----Original Message-----
From: MilleBii [mailto:millebii@gmail.com] 
Sent: Friday, December 4, 2009 15:06
To: nutch-user@lucene.apache.org
Subject: Re: Problems with a new Installation of Nutch

Did you check with the web interface? It gives a lot of info; you can
even browse the file system.

Try hadoop fs -ls to see what it gives you.

2009/12/4, Tom Landvoigt &lt;tom.landvoigt@linklift.de&gt;:
&gt; Hello,
&gt;
&gt;
&gt;
&gt; I hope someone can help me.
&gt;
&gt;
&gt;
&gt; I installed nutch on 2 Amazon EC2 computers. Everything is fine but I
&gt; can't put data in the hdfs.
&gt;
&gt;
&gt;
&gt; I formatted the namenode and start the hdfs with start all.
&gt;
&gt;
&gt;
&gt;  All  java processes start properly, but when I want to make hadoop fs
&gt; -put something / I get these logs:
&gt;
&gt;
&gt;
&gt;
&gt;
&gt;
&gt;
&gt; nutch@bla:/nutch/search&gt; ./bin/hadoop fs -put
&gt; /tmp/hadoop-nutch-tasktracker.pid blub
&gt;
&gt; put: Protocol not available
&gt;
&gt;
&gt;
&gt; DATA NODE LOG on the master:
&gt;
&gt; 2009-12-04 12:50:15,566 INFO  http.HttpServer - Version Jetty/5.1.4
&gt;
&gt; 2009-12-04 12:50:15,582 INFO  util.Credential - Checking Resource
&gt; aliases
&gt;
&gt; 2009-12-04 12:50:16,483 INFO  util.Container - Started
&gt; org.mortbay.jetty.servlet.WebApplicationHandler@e45b5e
&gt;
&gt; 2009-12-04 12:50:16,614 INFO  util.Container - Started
&gt; WebApplicationContext[/static,/static]
&gt;
&gt; 2009-12-04 12:50:16,882 INFO  util.Container - Started
&gt; org.mortbay.jetty.servlet.WebApplicationHandler@1284fd4
&gt;
&gt; 2009-12-04 12:50:16,883 INFO  util.Container - Started
&gt; WebApplicationContext[/logs,/logs]
&gt;
&gt; 2009-12-04 12:50:17,827 INFO  util.Container - Started
&gt; org.mortbay.jetty.servlet.WebApplicationHandler@39c8c1
&gt;
&gt; 2009-12-04 12:50:17,849 INFO  util.Container - Started
&gt; WebApplicationContext[/,/]
&gt;
&gt; 2009-12-04 12:50:18,485 INFO  http.SocketListener - Started
&gt; SocketListener on 0.0.0.0:50075
&gt;
&gt; 2009-12-04 12:50:18,485 INFO  util.Container - Started
&gt; org.mortbay.jetty.Server@36527f
&gt;
&gt; 2009-12-04 12:54:20,745 ERROR datanode.DataNode -
&gt; DatanodeRegistration(10.224.113.210:50010,
&gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&gt; infoPort=50075, ipcPort=50020):DataXceiver
&gt;
&gt; java.io.EOFException
&gt;
&gt;     at java.io.DataInputStream.readShort(DataInputStream.java:315)
&gt;
&gt;     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&gt;
&gt;     at java.lang.Thread.run(Thread.java:636)
&gt;
&gt; 2009-12-04 12:54:20,746 ERROR datanode.DataNode -
&gt; DatanodeRegistration(10.224.113.210:50010,
&gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&gt; infoPort=50075, ipcPort=50020):DataXceiver
&gt;
&gt; java.io.EOFException
&gt;
&gt;     at java.io.DataInputStream.readShort(DataInputStream.java:315)
&gt;
&gt;     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&gt;
&gt;     at java.lang.Thread.run(Thread.java:636)
&gt;
&gt; 2009-12-04 12:54:20,747 ERROR datanode.DataNode -
&gt; DatanodeRegistration(10.224.113.210:50010,
&gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&gt; infoPort=50075, ipcPort=50020):DataXceiver
&gt;
&gt; java.io.EOFException
&gt;
&gt;     at java.io.DataInputStream.readShort(DataInputStream.java:315)
&gt;
&gt;     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&gt;
&gt;     at java.lang.Thread.run(Thread.java:636)
&gt;
&gt; 2009-12-04 12:54:20,747 ERROR datanode.DataNode -
&gt; DatanodeRegistration(10.224.113.210:50010,
&gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&gt; infoPort=50075, ipcPort=50020):DataXceiver
&gt;
&gt; java.io.EOFException
&gt;
&gt;     at java.io.DataInputStream.readShort(DataInputStream.java:315)
&gt;
&gt;     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&gt;
&gt;     at java.lang.Thread.run(Thread.java:636)
&gt;
&gt;
&gt;
&gt; NAME NODE LOG
&gt;
&gt; 2009-12-04 12:50:11,539 INFO  http.HttpServer - Version Jetty/5.1.4
&gt;
&gt; 2009-12-04 12:50:11,573 INFO  util.Credential - Checking Resource
&gt; aliases
&gt;
&gt; 2009-12-04 12:50:12,488 INFO  util.Container - Started
&gt; org.mortbay.jetty.servlet.WebApplicationHandler@19fe451
&gt;
&gt; 2009-12-04 12:50:12,565 INFO  util.Container - Started
&gt; WebApplicationContext[/static,/static]
&gt;
&gt; 2009-12-04 12:50:12,891 INFO  util.Container - Started
&gt; org.mortbay.jetty.servlet.WebApplicationHandler@1570945
&gt;
&gt; 2009-12-04 12:50:12,891 INFO  util.Container - Started
&gt; WebApplicationContext[/logs,/logs]
&gt;
&gt; 2009-12-04 12:50:13,569 INFO  util.Container - Started
&gt; org.mortbay.jetty.servlet.WebApplicationHandler@11410e5
&gt;
&gt; 2009-12-04 12:50:13,582 INFO  util.Container - Started
&gt; WebApplicationContext[/,/]
&gt;
&gt; 2009-12-04 12:50:13,613 INFO  http.SocketListener - Started
&gt; SocketListener on 0.0.0.0:50070
&gt;
&gt; 2009-12-04 12:50:13,613 INFO  util.Container - Started
&gt; org.mortbay.jetty.Server@173ec72
&gt;
&gt;
&gt;
&gt; SECONDARY NAMENODE LOG
&gt;
&gt; 2009-12-04 12:50:19,163 INFO  http.HttpServer - Version Jetty/5.1.4
&gt;
&gt; 2009-12-04 12:50:19,207 INFO  util.Credential - Checking Resource
&gt; aliases
&gt;
&gt; 2009-12-04 12:50:20,365 INFO  util.Container - Started
&gt; org.mortbay.jetty.servlet.WebApplicationHandler@174d93a
&gt;
&gt; 2009-12-04 12:50:20,454 INFO  util.Container - Started
&gt; WebApplicationContext[/static,/static]
&gt;
&gt; 2009-12-04 12:50:21,396 INFO  util.Container - Started
&gt; org.mortbay.jetty.servlet.WebApplicationHandler@31f2a7
&gt;
&gt; 2009-12-04 12:50:21,396 INFO  util.Container - Started
&gt; WebApplicationContext[/logs,/logs]
&gt;
&gt; 2009-12-04 12:50:21,533 INFO  servlet.XMLConfiguration - No
&gt; WEB-INF/web.xml in file:/mnt/nutch/nutch-1.0/webapps/secondary. Serving
&gt; files and default/dynamic servlets only
&gt;
&gt; 2009-12-04 12:50:22,206 INFO  util.Container - Started
&gt; org.mortbay.jetty.servlet.WebApplicationHandler@383118
&gt;
&gt; 2009-12-04 12:50:22,785 INFO  util.Container - Started
&gt; WebApplicationContext[/,/]
&gt;
&gt; 2009-12-04 12:50:22,787 INFO  http.SocketListener - Started
&gt; SocketListener on 0.0.0.0:50090
&gt;
&gt; 2009-12-04 12:50:22,787 INFO  util.Container - Started
&gt; org.mortbay.jetty.Server@297ffb
&gt;
&gt; 2009-12-04 12:50:22,787 WARN  namenode.SecondaryNameNode - Checkpoint
&gt; Period   :3600 secs (60 min)
&gt;
&gt; 2009-12-04 12:50:22,787 WARN  namenode.SecondaryNameNode - Log Size
&gt; Trigger    :67108864 bytes (65536 KB)
&gt;
&gt; 2009-12-04 12:55:23,908 WARN  namenode.SecondaryNameNode - Checkpoint
&gt; done. New Image Size: 1056
&gt;
&gt;
&gt;
&gt; HADOOP LOG
&gt;
&gt; 2009-12-04 12:54:20,708 WARN  hdfs.DFSClient - DataStreamer Exception:
&gt; java.io.IOException: Unable to create new block.
&gt;
&gt;     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722)
&gt;
&gt;     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
&gt;
&gt;     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
&gt;
&gt;
&gt;
&gt; 2009-12-04 12:54:20,709 WARN  hdfs.DFSClient - Error Recovery for block
&gt; blk_5506837520665828594_1002 bad datanode[0] nodes == null
&gt;
&gt; 2009-12-04 12:54:20,709 WARN  hdfs.DFSClient - Could not get block
&gt; locations. Source file "/user/nutch/blub/hadoop-nutch-tasktracker.pid" -
&gt; Aborting...
&gt;
&gt;
&gt;
&gt; DATA NODE LOG on the slave
&gt;
&gt; 2009-12-04 12:49:49,433 INFO  http.HttpServer - Version Jetty/5.1.4
&gt;
&gt; 2009-12-04 12:49:49,438 INFO  util.Credential - Checking Resource
&gt; aliases
&gt;
&gt; 2009-12-04 12:49:50,288 INFO  util.Container - Started
&gt; org.mortbay.jetty.servlet.WebApplicationHandler@e45b5e
&gt;
&gt; 2009-12-04 12:49:50,357 INFO  util.Container - Started
&gt; WebApplicationContext[/static,/static]
&gt;
&gt; 2009-12-04 12:49:50,555 INFO  util.Container - Started
&gt; org.mortbay.jetty.servlet.WebApplicationHandler@2016b0
&gt;
&gt; 2009-12-04 12:49:50,555 INFO  util.Container - Started
&gt; WebApplicationContext[/logs,/logs]
&gt;
&gt; 2009-12-04 12:49:50,816 INFO  util.Container - Started
&gt; org.mortbay.jetty.servlet.WebApplicationHandler@118278a
&gt;
&gt; 2009-12-04 12:49:50,820 INFO  util.Container - Started
&gt; WebApplicationContext[/,/]
&gt;
&gt; 2009-12-04 12:49:50,849 INFO  http.SocketListener - Started
&gt; SocketListener on 0.0.0.0:50075
&gt;
&gt; 2009-12-04 12:49:50,849 INFO  util.Container - Started
&gt; org.mortbay.jetty.Server@b02928
&gt;
&gt;
&gt;
&gt; HADOOP SITE XML
&gt;
&gt; &lt;property&gt;
&gt;
&gt;   &lt;name&gt;fs.default.name&lt;/name&gt;
&gt;
&gt;   &lt;value&gt;hdfs://(yes here is the right ip):9000&lt;/value&gt;
&gt;
&gt;   &lt;description&gt;
&gt;
&gt;     The name of the default file system. Either the literal string
&gt;
&gt;     "local" or a host:port for NDFS.
&gt;
&gt;   &lt;/description&gt;
&gt;
&gt; &lt;/property&gt;
&gt;
&gt;
&gt;
&gt; &lt;!-- Specifies where the JobTracker (which coordinates the MapReduce jobs)
&gt; can be found. --&gt;
&gt;
&gt; &lt;property&gt;
&gt;
&gt;   &lt;name&gt;mapred.job.tracker&lt;/name&gt;
&gt;
&gt;   &lt;value&gt;hdfs://(here to):9001&lt;/value&gt;
&gt;
&gt;   &lt;description&gt;
&gt;
&gt;     The host and port that the MapReduce job tracker runs at. If
&gt;
&gt;     "local", then jobs are run in-process as a single map and
&gt;
&gt;     reduce task.
&gt;
&gt;   &lt;/description&gt;
&gt;
&gt; &lt;/property&gt;
&gt;
&gt;
&gt;
&gt; &lt;!-- Specifies how many map jobs may run concurrently --&gt;
&gt;
&gt; &lt;property&gt;
&gt;
&gt;   &lt;name&gt;mapred.tasktracker.map.tasks.maximum&lt;/name&gt;
&gt;
&gt;   &lt;value&gt;2&lt;/value&gt;
&gt;
&gt;   &lt;description&gt;
&gt;
&gt;     define mapred.map tasks to be number of slave hosts
&gt;
&gt;   &lt;/description&gt;
&gt;
&gt; &lt;/property&gt;
&gt;
&gt;
&gt;
&gt; &lt;!-- Specifies how many reduce jobs may run concurrently --&gt;
&gt;
&gt; &lt;property&gt;
&gt;
&gt;   &lt;name&gt;mapred.tasktracker.reduce.tasks.maximum&lt;/name&gt;
&gt;
&gt;   &lt;value&gt;2&lt;/value&gt;
&gt;
&gt;   &lt;description&gt;
&gt;
&gt;     define mapred.reduce tasks to be number of slave hosts
&gt;
&gt;   &lt;/description&gt;
&gt;
&gt; &lt;/property&gt;
&gt;
&gt;
&gt;
&gt; &lt;property&gt;
&gt;
&gt;   &lt;name&gt;mapred.child.java.opts&lt;/name&gt;
&gt;
&gt;   &lt;value&gt;-Xmx1500m&lt;/value&gt;
&gt;
&gt; &lt;/property&gt;
&gt;
&gt;
&gt;
&gt; &lt;property&gt;
&gt;
&gt;   &lt;name&gt;mapred.jobtracker.restart.recover&lt;/name&gt;
&gt;
&gt;   &lt;value&gt;true&lt;/value&gt;
&gt;
&gt; &lt;/property&gt;
&gt;
&gt;
&gt;
&gt; &lt;!-- The next settings specify where the Hadoop FS stores its files
&gt; on each instance's disk. --&gt;
&gt;
&gt; &lt;property&gt;
&gt;
&gt;   &lt;name&gt;dfs.name.dir&lt;/name&gt;
&gt;
&gt;   &lt;value&gt;/nutch/filesystem/name&lt;/value&gt;
&gt;
&gt; &lt;/property&gt;
&gt;
&gt;
&gt;
&gt; &lt;property&gt;
&gt;
&gt;   &lt;name&gt;dfs.data.dir&lt;/name&gt;
&gt;
&gt;   &lt;value&gt;/nutch/filesystem/data&lt;/value&gt;
&gt;
&gt; &lt;/property&gt;
&gt;
&gt;
&gt;
&gt; &lt;property&gt;
&gt;
&gt;   &lt;name&gt;mapred.system.dir&lt;/name&gt;
&gt;
&gt;   &lt;value&gt;/nutch/filesystem/mapreduce/system&lt;/value&gt;
&gt;
&gt; &lt;/property&gt;
&gt;
&gt;
&gt;
&gt; &lt;property&gt;
&gt;
&gt;   &lt;name&gt;mapred.local.dir&lt;/name&gt;
&gt;
&gt;   &lt;value&gt;/nutch/filesystem/mapreduce/local&lt;/value&gt;
&gt;
&gt; &lt;/property&gt;
&gt;
&gt;
&gt;
&gt; &lt;!-- Specifies how many replicas of a file must exist in the file system
&gt; for it to be reachable. Initially 1 --&gt;
&gt;
&gt; &lt;property&gt;
&gt;
&gt;   &lt;name&gt;dfs.replication&lt;/name&gt;
&gt;
&gt;   &lt;value&gt;2&lt;/value&gt;
&gt;
&gt; &lt;/property&gt;
&gt;
&gt;
&gt;
&gt; I hope someone can help me.
&gt;
&gt;
&gt;
&gt; Thanks
&gt;
&gt;
&gt;
&gt; Tom
&gt;
&gt;


-- 
-MilleBii-

</pre>
</div>
</content>
</entry>
<entry>
<title>unsubscribe from nutch-user</title>
<author><name>&quot;Lukas, Ray&quot; &lt;Ray.Lukas@idearc.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3c6165226BDD41964D80E23DDFF26DD5C70D07A615@dfw2w2smail5.na1.vis.verizon.com%3e"/>
<id>urn:uuid:%3c6165226BDD41964D80E23DDFF26DD5C70D07A615@dfw2w2smail5-na1-vis-verizon-com%3e</id>
<updated>2009-12-04T15:07:02Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Well three is a charm.. I need to move these to a different email as
well.. Please if you could.. Could we also remove this email address as
well.. 
Thanks 
ray

-----Original Message-----
From: M S Ram [mailto:msram@cse.iitk.ac.in] 
Sent: Friday, December 04, 2009 10:01 AM
To: nutch-user@lucene.apache.org
Subject: Re: unsubscribe from nutch-user

Same here. Please remove my ID also from the mailing list.

Thanks,
MSR

rengan xu wrote:
&gt; To whom it may concern,
&gt;
&gt; Hello! Because I will use this E-mail for special purpose. I will use
&gt; another E-mail to subscribe in nutch-user. So I want to unsubscribe from
&gt; nutch-user.
&gt;
&gt; Thank you!
&gt;
&gt;
&gt;   




</pre>
</div>
</content>
</entry>
<entry>
<title>Re: unsubscribe from nutch-user</title>
<author><name>prashant ullegaddi &lt;prashullegaddi@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200912.mbox/%3cac6e61fc0912040706j5f1dda5cr8541af2d04062ed1@mail.gmail.com%3e"/>
<id>urn:uuid:%3cac6e61fc0912040706j5f1dda5cr8541af2d04062ed1@mail-gmail-com%3e</id>
<updated>2009-12-04T15:06:36Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Take a look at it:

http://lucene.apache.org/nutch/mailing_lists.html

or
probably sending a blank mail to:
nutch-user-unsubscribe@lucene.apache.org should also
work.

Thanks,
Prashant.

On Fri, Dec 4, 2009 at 8:30 PM, M S Ram &lt;msram@cse.iitk.ac.in&gt; wrote:

&gt; Same here. Please remove my ID also from the mailing list.
&gt;
&gt; Thanks,
&gt; MSR
&gt;
&gt; rengan xu wrote:
&gt;
&gt;&gt; To whom it may concern,
&gt;&gt;
&gt;&gt; Hello! Because I will use this E-mail for special purpose. I will use
&gt;&gt; another E-mail to subscribe in nutch-user. So I want to unsubscribe from
&gt;&gt; nutch-user.
&gt;&gt;
&gt;&gt; Thank you!
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;
&gt;
&gt;


-- 
Thanks,
Prashant Ullegaddi,
Search and Information Extraction Lab,
IIIT-Hyderabad, INDIA.


</pre>
</div>
</content>
</entry>
</feed>
