hadoop-common-dev mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2866) JobConf should validate key names in well-defined namespaces and warn on misspelling
Date Thu, 21 Feb 2008 07:00:43 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12570961#action_12570961 ]

Aaron Kimball commented on HADOOP-2866:
---------------------------------------

I surveyed the Hadoop trunk source using Eclipse's Java Search, looking for all uses of the
org.apache.hadoop.mapred.JobConf.set* and org.apache.hadoop.mapred.JobConf.get* methods.


I believe the following to be a complete list of key names explicitly set within the Hadoop
Java source.
Entries marked with a "@@@" suffix are keys that are set only in @deprecated methods.

aggregator.descriptor.{i\*} <--- numeric suffix
aggregator.descriptor.num
dfs.datanode.rack
dfs.datanode.startup
dfs.http.address
dfs.namenode.startup
dfs.secondary.http.bindAddress
dfs.umask
fs.default.name
fs.hdfs.impl
hadoop.native.lib
hadoop.pipes.command-file.keep
hadoop.pipes.executable
hadoop.pipes.java.mapper
hadoop.pipes.java.recordreader
hadoop.pipes.java.recordwriter
hadoop.pipes.java.reducer
hadoop.pipes.partitioner
io.compression.codecs
io.seqfile.compression.type
ipc.client.connection.maxidletime
ipc.client.timeout
job.end.notification.url
jobclient.output.filter
keep.failed.task.files
keep.task.files.pattern
map.input.file
map.input.length
map.input.start
mapred.cache.archives
mapred.cache.archives.timestamps
mapred.cache.files
mapred.cache.files.timestamps
mapred.cache.localArchives
mapred.cache.localFiles
mapred.combiner.class
mapred.compress.map.output
mapred.create.symlink
mapred.input.dir
mapred.input.format.class
mapred.input.key.class @@@
mapred.input.value.class @@@
mapred.jar
mapred.job.classpath.archives
mapred.job.classpath.files
mapred.job.history.http.bindAddress
mapred.job.id
mapred.job.name
mapred.job.priority
mapred.job.split.file
mapred.job.tracker
mapred.job.tracker.http.bindAddress
mapred.local.dir
mapred.map.max.attempts
mapred.map.output.compression.codec
mapred.map.output.compression.type
mapred.map.runner.class
mapred.map.task.debug.script
mapred.map.tasks
mapred.map.tasks.speculative.execution
mapred.mapoutput.key.class
mapred.mapoutput.value.class
mapred.mapper.class
mapred.mapper.regex
mapred.max.map.failures.percent
mapred.max.reduce.failures.percent
mapred.max.tracker.failures
mapred.min.split.size
mapred.output.compress
mapred.output.compression.codec
mapred.output.compression.type
mapred.output.dir
mapred.output.format.class
mapred.output.key.class
mapred.output.key.comparator.class
mapred.output.value.class
mapred.output.value.groupfn.class
mapred.partitioner.class
mapred.reduce.max.attempts
mapred.reduce.task.debug.script
mapred.reduce.tasks
mapred.reduce.tasks.speculative.execution
mapred.reducer.class
mapred.reducer.separator
mapred.reducer.sort
mapred.speculative.execution @@@
mapred.task.id
mapred.task.is.map
mapred.task.partition
mapred.task.profile
mapred.task.profile.maps
mapred.task.profile.reduces
mapred.task.tracker.report.bindAddress
mapred.tip.id
mapred.working.dir
sequencefile.filter.class
sequencefile.filter.frequency
sequencefile.filter.regex
session.id
user.name


And the following keys are explicitly retrieved by a get*() method somewhere in the Java source:

aggregate.max.num.unique.values
aggregator.descriptor.{i\*} <--- numeric suffix
aggregator.descriptor.num
create.empty.dir.if.nonexist
dfs.balance.bandwidthPerSec
dfs.block.size
dfs.blockreport.initialDelay
dfs.blockreport.intervalMsec
dfs.client.block.write.retries
dfs.data.dir
dfs.datanode.address
dfs.datanode.bindAddress
dfs.datanode.block.write.timeout.sec
dfs.datanode.dns.interface
dfs.datanode.dns.nameserver
dfs.datanode.du.pct
dfs.datanode.du.reserved
dfs.datanode.http.address
dfs.datanode.info.bindAddress
dfs.datanode.info.port
dfs.datanode.numblocks
dfs.datanode.port
dfs.datanode.rack
dfs.datanode.scan.period.hours
dfs.datanode.simulateddatastorage
dfs.datanode.startup
dfs.default.chunk.view.size
dfs.df.interval
dfs.heartbeat.interval
dfs.hosts
dfs.hosts.exclude
dfs.http.address
dfs.info.bindAddress
dfs.info.port
dfs.max-repl-streams
dfs.max.objects
dfs.name.dir
dfs.namenode.decommission.interval
dfs.namenode.handler.count
dfs.namenode.startup
dfs.network.script
dfs.permissions
dfs.permissions.supergroup
dfs.read.prefetch.size
dfs.replication
dfs.replication.considerLoad
dfs.replication.interval
dfs.replication.max
dfs.replication.min
dfs.replication.pending.timeout.sec
dfs.safemode.extension
dfs.safemode.threshold.pct
dfs.secondary.http.address
dfs.secondary.info.bindAddress
dfs.secondary.info.port
dfs.socket.timeout
dfs.umask
dfs.upgrade.permission
dfs.web.ugi
fs.*.impl <--- wildcard allowed for the URI scheme slot
fs.checkpoint.dir
fs.checkpoint.period
fs.checkpoint.size
fs.default.name
fs.inmemory.size.mb
fs.kfs.metaServerHost
fs.kfs.metaServerPort
fs.local.block.size
fs.s3.awsAccessKeyId
fs.s3.awsSecretAccessKey
fs.s3.buffer.dir
fs.s3.maxRetries
fs.s3.sleepTimeSeconds
fs.trash.interval
hadoop.job.history.location
hadoop.job.history.user.location
hadoop.native.lib
hadoop.pipes.command-file.keep
hadoop.pipes.executable
hadoop.pipes.java.mapper
hadoop.pipes.java.recordreader
hadoop.pipes.java.recordwriter
hadoop.pipes.java.reducer
hadoop.pipes.partitioner
hadoop.rpc.socket.factory.class.* <-- wildcard
hadoop.rpc.socket.factory.class.default
hadoop.socks.server
heartbeat.recheck.interval
io.bytes.per.checksum
io.compression.codec.lzo.buffersize
io.compression.codec.lzo.compressor
io.compression.codec.lzo.decompressor
io.compression.codecs
io.file.buffer.size
io.map.index.skip
io.seqfile.compress.blocksize
io.seqfile.compression.type
io.skip.checksum.errors
io.sort.factor
io.sort.mb
ipc.client.connect.max.retries
ipc.client.connection.maxidletime
ipc.client.idlethreshold
ipc.client.kill.max
ipc.client.maxidletime
ipc.client.tcpnodelay
ipc.client.timeout
ipc.server.listen.queue.size
java.library.path
job.end.notification.url
job.end.retry.attempts
job.end.retry.interval
jobclient.output.filter
keep.failed.task.files
keep.task.files.pattern
key.value.separator.in.input.line
local.cache.size
map.input.file
map.output.key.field.separator
map.output.key.value.fields.spec
map.sort.class
mapred.cache.archives
mapred.cache.archives.timestamps
mapred.cache.files
mapred.cache.files.timestamps
mapred.cache.localArchives
mapred.cache.localFiles
mapred.child.java.opts
mapred.child.tmp
mapred.combiner.class
mapred.compress.map.output
mapred.create.symlink
mapred.data.field.separator
mapred.debug.out.lines
mapred.hosts
mapred.hosts.exclude
mapred.inmem.merge.threshold
mapred.input.dir
mapred.input.format.class
mapred.input.key.class @@@
mapred.input.value.class @@@
mapred.jar
mapred.job.classpath.archives
mapred.job.classpath.files
mapred.job.history.http.bindAddress
mapred.job.id
mapred.job.name
mapred.job.priority
mapred.job.split.file
mapred.job.tracker
mapred.job.tracker.handler.count
mapred.job.tracker.http.address
mapred.job.tracker.info.bindAddress
mapred.job.tracker.info.port
mapred.job.tracker.persist.jobstatus.active
mapred.job.tracker.persist.jobstatus.dir
mapred.job.tracker.persist.jobstatus.hours
mapred.jobtracker.completeuserjobs.maximum
mapred.jobtracker.retirejob.check
mapred.jobtracker.retirejob.interval
mapred.jobtracker.taskalloc.capacitypad
mapred.jobtracker.taskalloc.loadbalance.epsilon
mapred.join.expr
mapred.join.keycomparator
mapred.local.dir
mapred.local.dir.minspacekill
mapred.local.dir.minspacestart
mapred.map.max.attempts
mapred.map.multithreadedrunner.threads
mapred.map.output.compression.codec
mapred.map.output.compression.type
mapred.map.runner.class
mapred.map.task.debug.script
mapred.map.tasks
mapred.map.tasks.speculative.execution
mapred.mapoutput.key.class
mapred.mapoutput.value.class
mapred.mapper.class
mapred.mapper.regex
mapred.mapper.regex.group
mapred.max.map.failures.percent
mapred.max.reduce.failures.percent
mapred.max.tracker.failures
mapred.min.split.size
mapred.output.compress
mapred.output.compression.codec
mapred.output.compression.type
mapred.output.dir
mapred.output.format.class
mapred.output.key.class
mapred.output.key.comparator.class
mapred.output.value.class
mapred.output.value.groupfn.class
mapred.partitioner.class
mapred.reduce.copy.backoff
mapred.reduce.max.attempts
mapred.reduce.parallel.copies
mapred.reduce.task.debug.script
mapred.reduce.tasks
mapred.reduce.tasks.speculative.execution
mapred.reducer.class
mapred.reducer.separator
mapred.reducer.sort
mapred.speculative.execution
mapred.speculative.execution @@@
mapred.submit.replication
mapred.system.dir
mapred.task.id
mapred.task.is.map
mapred.task.partition
mapred.task.profile
mapred.task.profile.maps
mapred.task.profile.reduces
mapred.task.timeout
mapred.task.tracker.http.address
mapred.task.tracker.report.address
mapred.task.tracker.report.bindAddress
mapred.task.tracker.report.port
mapred.tasktracker.dns.interface
mapred.tasktracker.dns.nameserver
mapred.tasktracker.expiry.interval
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
mapred.tasktracker.tasks.maximum
mapred.tip.id
mapred.userlog.limit.kb
mapred.userlog.retain.hours
mapred.working.dir
num.key.fields.for.partition
reduce.output.key.value.fields.spec
sequencefile.filter.class
sequencefile.filter.frequency
sequencefile.filter.regex
session.id
tasktracker.contention.tracking
tasktracker.http.bindAddress
tasktracker.http.port
tasktracker.http.threads
user.jar.file
user.name


This does not cover HBase.

I propose an interface of the following methods:

public class JobConfValidator {

  public JobConfValidator();

  /** @return true if all keys are typed correctly, per the spec in the issue description */
  public boolean validateConfig(JobConf conf, boolean printWarnings);

  /** @return true if this key is typed correctly, per the spec in the issue description */
  public boolean validateKeyName(String key);

  /** @return true if the key name begins with mapred.*, fs.*, etc. */
  public boolean keyIsInReservedNamespace(String key);

}
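
For illustration, a caller (e.g., at job submission time) might use it roughly like this; the
job class and variable names here are hypothetical and assume only the signatures above:

JobConf conf = new JobConf(MyJob.class);             // hypothetical job configuration
JobConfValidator validator = new JobConfValidator();
// Warn about any key in a reserved namespace that is not in the known-keys table.
boolean allKeysRecognized = validator.validateConfig(conf, true);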


This can mostly be implemented by checking the key names in the JobConf against a simple hash map
of known key names and namespace prefixes (mapred., fs., io., etc.).
A little extra code would be needed for the two keys that take arbitrary integer suffixes,
and for the key that allows a wildcard in the middle (fs.\*.impl); a rough sketch follows.
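
A minimal sketch of that lookup table, assuming a hypothetical helper class; the key set and
patterns shown are abbreviated for illustration and would be populated from the full lists above:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class KeyNameTable {

  // Namespace prefixes that Hadoop claims for itself.
  private static final Set<String> RESERVED_PREFIXES = new HashSet<String>(
      Arrays.asList("mapred.", "fs.", "dfs.", "io.", "ipc.", "hadoop."));

  // Exact key names harvested from the survey above (abbreviated here).
  private static final Set<String> KNOWN_KEYS = new HashSet<String>(
      Arrays.asList("mapred.output.dir", "fs.default.name", "io.sort.mb"));

  // Keys with variable parts: integer suffixes and wildcard slots.
  private static final Pattern[] KNOWN_PATTERNS = {
      Pattern.compile("aggregator\\.descriptor\\.\\d+"),
      Pattern.compile("fs\\.[^.]+\\.impl"),
      Pattern.compile("hadoop\\.rpc\\.socket\\.factory\\.class\\..+")
  };

  /** @return true if the key starts with one of the reserved prefixes */
  public static boolean isInReservedNamespace(String key) {
    for (String prefix : RESERVED_PREFIXES) {
      if (key.startsWith(prefix)) {
        return true;
      }
    }
    return false;
  }

  /** @return true if the key is user-defined, an exact known key, or matches a known pattern */
  public static boolean isValidKeyName(String key) {
    if (!isInReservedNamespace(key)) {
      return true;  // user-defined namespace; never warn
    }
    if (KNOWN_KEYS.contains(key)) {
      return true;
    }
    for (Pattern p : KNOWN_PATTERNS) {
      if (p.matcher(key).matches()) {
        return true;
      }
    }
    return false;  // reserved namespace but unrecognized: candidate for a warning
  }
}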

An open question: should every top-level key prefix in the list above be considered a namespace
fully reserved by Hadoop? For example, "session.id" and "user.name" are single-entry namespaces;
is all of session.\* and user.\* reserved?

Awaiting comments on these issues. If people like this approach, I can code it up.





> JobConf should validate key names in well-defined namespaces and warn on misspelling
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-2866
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2866
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Aaron Kimball
>            Priority: Minor
>             Fix For: 0.16.1, 0.17.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> A discussion on the mailing list reveals that some configuration strings in the JobConf are
> deprecated over time and new configuration names replace them:
> e.g., "mapred.output.compression.type" is now replaced with "mapred.map.output.compression.type"
> Programmers who have been manually specifying the former string, however, receive no diagnostic
> output during testing to suggest that their compression type is being silently ignored.
> It would be desirable to notify developers of this change by printing a warning message when
> deprecated configuration names are used in a newer version of Hadoop. More generally, when any
> configuration string in the mapred.\*, fs.\*, dfs.\*, etc. namespaces is provided by a user and
> is not recognized by Hadoop, it is desirable to print a warning to indicate a malformed
> configuration. No warnings should be printed when configuration keys are in user-defined
> namespaces (e.g., "myprogram.mytask.myvalue").

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

