hadoop-hdfs-issues mailing list archives

From "Aaron T. Myers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2092) Create a light inner conf class in DFSClient
Date Fri, 24 Jun 2011 07:30:47 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054285#comment-13054285 ]

Aaron T. Myers commented on HDFS-2092:

bq. We are not concerned about the task attempt. The problem here is for Task Tracker's availability.

Have you actually experienced TTs crashing because conf objects were too large? Or where conf
objects were taking up a substantial portion of the available heap space?

bq. The way conf was designed has its own benefits. At the same time it comes with some disadvantages.
What if a task attempt can run for a day or more? This is not uncommon in our clusters.

I would conjecture that such a task attempt is likely using many MBs or GBs of memory for
the actual work it's doing. Is this patch, which saves a few hundred KBs at the extreme end,
really going to move the needle?

bq. 1. With UGI, conf will be created per user in TT. (Security folks?)

But presumably only for each user who is concurrently running a task attempt on that TT,
so not that many, right? Unless I'm missing something, which is certainly possible.

bq. 2. PIG or any other job can store arbitrary data. Hadoop framework should be able to deal
with it as far as it can. 

No disagreement there.

bq. 3. Last but not least, API should not hold on to client's data.

I see no principled reason the DFSClient "should not hold on to client's data" in the form
of the conf object. If this is actually negatively impacting performance or availability,
then we should certainly fix that, but you haven't demonstrated that yet.

bq. As every job is different, so workloads can be different. So one can't see or hear
all the problems.

Certainly, but we can validate this issue with some testing. Can you please describe what
you did to gather these measurements? What exactly are they actually measuring?

My issue here is that this change is being done purely as an optimization, but it's unclear
to me that negative issues exist without this patch, or that this patch necessarily addresses
those issues. If you can demonstrate those, I'll shut up immediately. :)

> Create a light inner conf class in DFSClient
> --------------------------------------------
>                 Key: HDFS-2092
>                 URL: https://issues.apache.org/jira/browse/HDFS-2092
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client
>    Affects Versions: 0.23.0
>            Reporter: Bharath Mundlapudi
>            Assignee: Bharath Mundlapudi
>             Fix For: 0.23.0
>         Attachments: HDFS-2092-1.patch, HDFS-2092-2.patch
> At present, DFSClient stores a reference to the configuration object. Since these configuration
objects can be pretty big at times, they can bloat processes which have multiple DFSClient objects,
like the TaskTracker. This is an attempt to remove the reference to the conf object in DFSClient.

> This patch creates a light inner conf class and copies the required keys from the Configuration
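
For readers following the thread, the approach in the issue description can be sketched roughly as below. This is not the code from the attached patches. BigConf is a hypothetical stand-in for Hadoop's Configuration, Client stands in for DFSClient, and the key names and defaults are invented for illustration. The sketch only shows the general idea: read the few settings the client needs at construction time and keep no reference to the large object.

```java
import java.util.HashMap;
import java.util.Map;

public class LightConfSketch {

    /** Hypothetical stand-in for a large, job-supplied configuration object. */
    static final class BigConf {
        private final Map<String, String> props = new HashMap<>();

        void set(String key, String value) { props.put(key, value); }

        int getInt(String key, int defaultValue) {
            String v = props.get(key);
            return v == null ? defaultValue : Integer.parseInt(v.trim());
        }

        long getLong(String key, long defaultValue) {
            String v = props.get(key);
            return v == null ? defaultValue : Long.parseLong(v.trim());
        }
    }

    /** Hypothetical client that keeps only a small snapshot of the settings it reads. */
    static final class Client {

        /** Light inner conf: a few primitive fields, no reference back to BigConf. */
        static final class Conf {
            final int socketTimeoutMs;
            final int retries;
            final long blockSizeBytes;

            Conf(BigConf big) {
                // Key names and defaults are made up for this sketch.
                socketTimeoutMs = big.getInt("example.client.socket-timeout-ms", 60_000);
                retries = big.getInt("example.client.retries", 3);
                blockSizeBytes = big.getLong("example.client.block-size", 64L * 1024 * 1024);
            }
        }

        private final Conf conf;

        Client(BigConf big) {
            // Copy what is needed; 'big' itself is not stored, so it can be
            // garbage collected once the caller drops its own reference.
            this.conf = new Conf(big);
        }

        Conf getConf() { return conf; }
    }

    public static void main(String[] args) {
        BigConf big = new BigConf();
        big.set("example.client.retries", "5");
        // A large value the client never reads, standing in for arbitrary job data.
        big.set("job.arbitrary.payload", new String(new char[100_000]));

        Client client = new Client(big);
        // Changes made after construction do not reach the snapshot.
        big.set("example.client.retries", "9");

        System.out.println(client.getConf().retries);
        System.out.println(client.getConf().socketTimeoutMs);
    }
}
```

One trade-off visible in the sketch: because the values are copied once, the client no longer sees later changes to the original configuration. Whether that matters for DFSClient is a question for the patch review, not something this sketch settles.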

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

