hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: NameNode question about lots of small files
Date Thu, 16 Dec 2010 22:51:51 GMT
Hi Chris,

To have a reasonable understanding of used heap, you need to trigger a
full GC. Otherwise, the heap number on the web UI doesn't actually
tell you live heap.

With the default (non-CMS) collector, a full collection will not run
until it is manually triggered or the heap becomes full.

You can use JConsole to connect and force a GC to get a good
measurement of heap used.
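
As a rough illustration of the same measurement done in code: JConsole's
"Perform GC" button invokes the gc operation on the JVM's Memory MXBean,
and the sketch below calls that bean directly and then reads used heap.
It only measures the JVM it runs in. Pointing it at a running NameNode
would need a remote JMX connection, which is assumed and not shown here,
and the class name LiveHeapProbe is made up for the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Hadoop: measures this JVM's own heap.
public class LiveHeapProbe {

    /** Requests a GC through the Memory MXBean, then returns used heap in bytes. */
    public static long usedHeapAfterGc() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        memory.gc(); // effectively equivalent to System.gc()
        MemoryUsage heap = memory.getHeapMemoryUsage();
        return heap.getUsed();
    }

    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        long before = memory.getHeapMemoryUsage().getUsed();
        long after = usedHeapAfterGc();
        System.out.printf("used before GC: %d bytes, after GC: %d bytes%n",
                before, after);
    }
}
```

The "after GC" figure is the one closer to live heap; the "before" figure
includes garbage that has not been collected yet.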

Keep in mind also that the total heap is more than just the inodes and
blocks. Other things like RPC buffers account for some usage as well.
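
To put a number on the inode and block portion, here is a small sketch
that applies the 150-byte rule of thumb mentioned in the quoted message
to the object counts from the web UI summary. Treating the rule as 150
bytes per object (file, directory, or block) is an assumption made for
this example, and the class and method names are hypothetical.

```java
// Hypothetical helper: back-of-the-envelope NameNode metadata estimate.
public class HeapEstimate {

    /** Estimated bytes = number of namespace objects * assumed bytes per object. */
    public static long estimateBytes(long objects, long bytesPerObject) {
        return objects * bytesPerObject;
    }

    public static void main(String[] args) {
        long beforeObjects = 3577733L; // 1823121 files/dirs + 1754612 blocks
        long afterObjects = 3579228L;  // 1824616 files/dirs + 1754612 blocks
        long bytesPerObject = 150L;    // assumed rule-of-thumb value

        System.out.printf("before: %d bytes%n",
                estimateBytes(beforeObjects, bytesPerObject));
        System.out.printf("after:  %d bytes%n",
                estimateBytes(afterObjects, bytesPerObject));
    }
}
```

Under that assumption both layouts come to roughly half a gigabyte of
metadata, so the estimate barely changes between "before" and "after".
The large swing in the reported heap figure is therefore more likely
uncollected garbage and other usage than a change in namespace size.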

-Todd

On Thu, Dec 16, 2010 at 11:25 AM, Chris Curtin <curtin.chris@gmail.com> wrote:
> Hi,
>
> During our research into the 'small files' issues we are having I didn't
> find anything to explain what I see 'after' a change.
>
> Before: all files were stored in a structure like /source/year/month/day/
> where we had dozens of files in each day's directory (and 500+ sources). We
> were using a lot more memory than we expected in the NameNode so we
> redesigned the directory structure. Here is the 'before' summary:
>
>
> *1823121 files and directories, 1754612 blocks = 3577733 total. Heap Size is
> 1.94 GB / 1.94 GB (100%)*
>
> The Heap Size relative to the # of files was higher than we expected (using
> the 150 bytes/file rule of thumb from Cloudera), so we redesigned our approach.
>
>
>
> After: simplified into /source/year_month/ and while there are a lot of
> files in the directory, the memory usage dropped significantly:
>
> *1824616 files and directories, 1754612 blocks = 3579228 total. Heap Size is
> 1.18 GB / 1.74 GB (67%)*
>
> This was a surprise, since we hadn't done the file compaction step and
> already saw a huge drop in memory usage.
>
>
>
> What I don't understand is why the memory usage changed. The old structure
> is still there (/source/year/month/day) but with no files in the tips. The
> reorg process only moved the files to the new structure; a separate step is
> going to remove the empty directories. The 'before' was after the cluster
> was at idle for 4+ hours so I don't think it was a GC timing issue.
>
>
>
> I'm looking to understand what happened so I can make sure our capacity
> calculations based on # of files and # of directories are correct. We're
> using: 0.20.2, r911707
>
>
>
> Thanks,
>
>
>
> Chris
>



-- 
Todd Lipcon
Software Engineer, Cloudera
