hadoop-common-user mailing list archives

From Chris Curtin <curtin.ch...@gmail.com>
Subject NameNode question about lots of small files
Date Thu, 16 Dec 2010 19:25:34 GMT

While researching the 'small files' issues we are having, I couldn't find
anything that explains what I see after a change to our directory layout.

Before: all files were stored in a structure like /source/year/month/day/
where we had dozens of files in each day's directory (and 500+ sources). We
were using a lot more memory than we expected in the NameNode, so we
redesigned the directory structure. Here is the 'before' summary:

*1823121 files and directories, 1754612 blocks = 3577733 total. Heap Size is
1.94 GB / 1.94 GB (100%)*


The Heap Size relative to the # of files was higher than we expected (using
the 150 bytes/file rule of thumb from Cloudera), so we redesigned our approach.
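
For reference, the rule-of-thumb estimate for the 'before' numbers can be
sketched as below. This is only back-of-the-envelope arithmetic: it assumes
the 150 bytes applies to every namespace object (file, directory or block)
rather than only to files, and the function name is made up for illustration.

```python
# Assumption: the 150-byte rule of thumb is charged per namespace object
# (file, directory or block), not per file only.
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_bytes(files_and_dirs, blocks,
                                 bytes_per_object=BYTES_PER_OBJECT):
    """Rough heap estimate: every file, directory and block costs the same."""
    return (files_and_dirs + blocks) * bytes_per_object


# Counts taken from the 'before' summary line.
before_estimate = estimate_namenode_heap_bytes(1823121, 1754612)

# Assumption: "GB" in the summary means 1024**3 bytes.
print(f"estimated: {before_estimate / 1024**3:.2f} GiB vs. 1.94 GB reported")
```

Under those assumptions the estimate comes out at roughly a quarter of the
reported heap, which is the gap that prompted the redesign.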

After: simplified into /source/year_month/ and while there are a lot of
files in the directory, the memory usage dropped significantly:

*1824616 files and directories, 1754612 blocks = 3579228 total. Heap Size is
1.18 GB / 1.74 GB (67%)*
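
To make the comparison concrete, here is a small sketch that works out the
change in object count and the implied heap bytes per object from the two
summary lines. It assumes "GB" means 1024**3 bytes and treats the first Heap
Size figure as the memory attributable to the namespace, which may not be
exactly what the NameNode reports; the helper name is hypothetical.

```python
GIB = 1024 ** 3  # assumption: the summary's "GB" is 1024**3 bytes

# Figures copied from the two summary lines.
before = {"files_dirs": 1823121, "blocks": 1754612, "heap_gb": 1.94}
after = {"files_dirs": 1824616, "blocks": 1754612, "heap_gb": 1.18}


def implied_bytes_per_object(summary):
    """Reported heap divided by the total number of namespace objects."""
    total_objects = summary["files_dirs"] + summary["blocks"]
    return summary["heap_gb"] * GIB / total_objects


extra_objects = after["files_dirs"] - before["files_dirs"]
per_object_before = implied_bytes_per_object(before)
per_object_after = implied_bytes_per_object(after)

print(f"objects added by the reorg: {extra_objects}")
print(f"implied bytes/object before: {per_object_before:.0f}")
print(f"implied bytes/object after:  {per_object_after:.0f}")
```

The object count goes up slightly (the new directories) while the implied
cost per object goes down, so object count alone does not account for the
change.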


This was a surprise, since we hadn't done the file compaction step yet and
already saw a huge drop in memory usage.

What I don't understand is why the memory usage changed. The old structure
is still there (/source/year/month/day) but with no files in the leaf
directories. The reorg process only moved the files to the new structure; a
separate step is going to remove the empty directories. The 'before' numbers
were taken after the cluster had been idle for 4+ hours, so I don't think it
was a GC timing issue.

I'm looking to understand what happened so I can make sure our capacity
calculations based on # of files and # of directories are correct. We're
using: 0.20.2, r911707


