hadoop-common-user mailing list archives

From Safdar Kureishy <safdar.kurei...@gmail.com>
Subject SolrIndex eats up lots of disk space for intermediate data
Date Sat, 23 Jun 2012 14:44:34 GMT

I couldn't find an answer to this question online, so I'm posting to the
mailing list.

I've got a crawl of about 10M *fetched* pages (the crawl db has about 50M
pages in total, since it includes fetched + failed + unfetched pages). I've
also got a freshly updated linkdb and webgraphdb (after running linkrank).
I'm trying to index the fetched pages (content + anchor links) using
solrindex.

When I launch the "bin/nutch solrindex <solrurl> <crawldb> -linkdb <linkdb>
-dir <segmentsdir>" command, the disk space utilization really jumps.
Before running the solrindex stage, I had about 50% of disk space remaining
for HDFS on my nodes (5 nodes) -- I had consumed about 100G and had about
100G left over. However, while the solrindex phase runs, disk utilization
climbs to nearly 100% by the end of the map phase, and the available HDFS
space drops below 1%. Running "hadoop dfsadmin -report" shows that the jump
in storage is non-DFS data (i.e., intermediate data) and that it happens
during the map phase of the IndexerMapReduce job (solrindex).

What can I do to reduce the intermediate data generated by solrindex? Are
there any configuration settings I should change? I'm using all the
defaults for the indexing phase, and I'm not using any custom plugins.

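For context, the kind of setting I mean is something like enabling compression of intermediate map output in mapred-site.xml. This is a sketch only (property names are from the Hadoop 1.x generation of configs), and I'm assuming, not asserting, that it would shrink the non-DFS spill data from the IndexerMapReduce map phase:

```xml
<!-- mapred-site.xml (sketch): compress intermediate map output.
     Assumption: the non-DFS growth is map-side spill data, so
     compressing it would lower peak local-disk usage. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
```

If there are better knobs than this (or Nutch-side settings), that's exactly what I'm hoping to learn.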
