Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hadoop-dev@lucene.apache.org
Message-ID: <19094569.1193443550675.JavaMail.jira@brutus>
Date: Fri, 26 Oct 2007 17:05:50 -0700 (PDT)
From: "Konstantin Shvachko (JIRA)" <jira@apache.org>
To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-1989) Add support for simulated Data
 Nodes  - helpful for testing and performance benchmarking of the Name Node
 without having a large cluster
In-Reply-To: <24543146.1191434150640.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538148 ] 

Konstantin Shvachko commented on HADOOP-1989:
---------------------------------------------

AbstractFSDataset.java
- AbstractFSDataset should be an interface rather than an abstract class, since it does not implement any methods, 
and all methods are declared abstract public. Therefore, it should be called FSDatasetInterface.
I also think it should not depend on FSConstants.
- @author field should be removed. See HADOOP-1147.
- Javadoc: missing descriptions of public methods.
- Long lines.
- Too many line breaks between methods.
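
To illustrate the interface point above, here is a minimal sketch of the change. The method names (getLength, addBlock) and the InMemoryDataset class are placeholders I made up for the example; they are not the actual methods in the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Before (shape used in the patch): an abstract class with only abstract methods.
// abstract class AbstractFSDataset {
//   abstract public long getLength(long blockId);
// }

// After: the same contract expressed as an interface.
// Method names are illustrative assumptions, not the patch's real API.
interface FSDatasetInterface {
  /** Returns the length of the block in bytes, or -1 if the block is unknown. */
  long getLength(long blockId);

  /** Records a block of the given length. */
  void addBlock(long blockId, long length);
}

// Hypothetical implementation. A class implementing the interface
// remains free to extend some other base class.
class InMemoryDataset implements FSDatasetInterface {
  private final Map<Long, Long> lengths = new HashMap<Long, Long>();

  public long getLength(long blockId) {
    Long len = lengths.get(blockId);
    return len == null ? -1L : len;
  }

  public void addBlock(long blockId, long length) {
    lengths.put(blockId, length);
  }
}

class InterfaceSketchDemo {
  public static void main(String[] args) {
    FSDatasetInterface data = new InMemoryDataset();
    data.addBlock(1L, 1024L);
    System.out.println(data.getLength(1L)); // 1024
    System.out.println(data.getLength(2L)); // -1
  }
}
```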

SimulatedFSDataset.java
- redundant import java.io.OutputStream;
- Needs javadoc description of SimulatedFSDataset class.
- needs line break between class and method declarations.
- subclasses BInfo and Storage should be private.
- Storage is not a good name for the class, since it is already used. Maybe SimulatedStorage.
Is it possible to reuse classes like DF or DatanodeInfo here?

DataNode.java
- Javadoc description for method startDataNode().
- You do not need the extra variable sendBlockReportatNextHeartbeat. 
Instead the new method sendBlockReport() should set 
      lastHeartbeat=0;
      lastBlockReport=0;
as it is done in case DatanodeProtocol.DNA_REGISTER:
I'd then call this method scheduleBlockReport(), and it should not be public.
- In readMetadata() both methods 
{code}
        checksumIn = data.getMetaDataInStream(block);
        long fileSize = data.getMetaDataLength(block);
{code}
access the data-node block map, so the map is looked up twice, which is not efficient.
Can it be optimized?
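
The scheduleBlockReport() suggestion above might look roughly like the following. This is a standalone sketch, not DataNode code: the HeartbeatScheduler class, its interval fields and the isHeartbeatDue()/isBlockReportDue()/mark*Sent() helpers are invented for illustration. Only lastHeartbeat, lastBlockReport and the method name scheduleBlockReport() come from the comment.

```java
// Standalone sketch; HeartbeatScheduler and its helper methods are hypothetical.
class HeartbeatScheduler {
  private final long heartbeatIntervalMs;
  private final long blockReportIntervalMs;
  private long lastHeartbeat;
  private long lastBlockReport;

  HeartbeatScheduler(long heartbeatIntervalMs, long blockReportIntervalMs, long now) {
    this.heartbeatIntervalMs = heartbeatIntervalMs;
    this.blockReportIntervalMs = blockReportIntervalMs;
    this.lastHeartbeat = now;
    this.lastBlockReport = now;
  }

  /**
   * Makes both the heartbeat and the block report due on the next check
   * by resetting the two timestamps; no extra boolean flag is needed.
   * Deliberately not public.
   */
  void scheduleBlockReport() {
    lastHeartbeat = 0;
    lastBlockReport = 0;
  }

  boolean isHeartbeatDue(long now) {
    return now - lastHeartbeat >= heartbeatIntervalMs;
  }

  boolean isBlockReportDue(long now) {
    return now - lastBlockReport >= blockReportIntervalMs;
  }

  void markHeartbeatSent(long now) {
    lastHeartbeat = now;
  }

  void markBlockReportSent(long now) {
    lastBlockReport = now;
  }
}

class HeartbeatSchedulerDemo {
  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    HeartbeatScheduler s = new HeartbeatScheduler(3000L, 3600000L, now);
    System.out.println(s.isBlockReportDue(now + 1)); // false
    s.scheduleBlockReport();
    System.out.println(s.isBlockReportDue(now + 1)); // true
  }
}
```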

DataChecksum.java
- Would it be cleaner to have SimulatedFSDataset.getChecksumHeader(checkSum)
rather than DataChecksum.getHeader(), so as to keep all simulated methods
inside the simulated classes?

simulatedStreams
- SimulatedInputStream and SimulatedOutputStream should be private subclasses of
SimulatedFSDataset, because they are not used outside of the dataset directly.
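
A rough sketch of what I mean by keeping the stream private to the dataset. SimulatedDatasetSketch, getBlockInputStream() and FILL_BYTE are assumed names for the example only; the real classes in the patch will differ.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for SimulatedFSDataset.
class SimulatedDatasetSketch {
  static final byte FILL_BYTE = 9; // arbitrary fixed content for simulated blocks

  /** Callers only ever see the plain InputStream type. */
  InputStream getBlockInputStream(long length) {
    return new SimulatedInputStream(length, FILL_BYTE);
  }

  // Private nested class: it cannot be referenced outside the dataset.
  private static class SimulatedInputStream extends InputStream {
    private final long length;
    private final byte fill;
    private long pos = 0;

    SimulatedInputStream(long length, byte fill) {
      this.length = length;
      this.fill = fill;
    }

    @Override
    public int read() throws IOException {
      if (pos >= length) {
        return -1; // end of the simulated block
      }
      pos++;
      return fill & 0xff; // data is generated on the fly, nothing is stored
    }
  }
}

class SimulatedStreamDemo {
  public static void main(String[] args) throws IOException {
    InputStream in = new SimulatedDatasetSketch().getBlockInputStream(3);
    int count = 0;
    while (in.read() != -1) {
      count++;
    }
    System.out.println(count); // 3
  }
}
```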

PendingReplicationBlocks.java
- remove(Block block): an extra empty line is included.

ClusterTestDFS.java
- This is the only place where AbstractFSDataset.getVolumeNames() is used.
I think toString() should be used here instead; getVolumeNames() can then be 
removed from the abstract class.

TestFileCreation.java
- In conf.setBoolean("dfs.datanode.simulateddatastorage", true);
the constant CONFIG_PROPERTY_SIMULATED should either be used in all cases or not used at all.
Maybe it is more consistent with current Hadoop practice to use the config name directly.

TestSetrepIncreasing.java
- testSetrepIncreasingSimulatedStorage(): Tabs are off.
- same constant as in TestFileCreation.

TestSmallBlock.java
- Indentation should be 2 spaces; tabs should be replaced by spaces.

TestInjectionForSimulatedStorage.java
- A lot of redundant imports.
- writeFile(): formatting.

MiniDFSCluster
- The NOTE: in Javadoc for MiniDFSCluster constructor does not make sense any more.
- the line
      if (dataSet.getClass() != SimulatedFSDataset.class)
should probably read
      if (dataSet instanceof SimulatedFSDataset)
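
The two checks differ once SimulatedFSDataset gets a subclass, which the small sketch below shows. All class names in it (FSDatasetLike, DiskDataset, SimDataset, FaultySimDataset) are stand-ins invented for the example. Also note that the existing condition is a != test, so a drop-in replacement that keeps the same branch would presumably be the negated form, !(dataSet instanceof SimulatedFSDataset).

```java
// Hypothetical stand-ins for the real dataset classes.
interface FSDatasetLike {}

class DiskDataset implements FSDatasetLike {}

class SimDataset implements FSDatasetLike {}

// E.g. a simulated dataset that also injects faults.
class FaultySimDataset extends SimDataset {}

class TypeCheckDemo {
  /** Exact-class comparison: a subclass of SimDataset does not match. */
  static boolean isSimulatedByClass(FSDatasetLike d) {
    return d.getClass() == SimDataset.class;
  }

  /** instanceof: SimDataset and all of its subclasses match. */
  static boolean isSimulatedByInstanceof(FSDatasetLike d) {
    return d instanceof SimDataset;
  }

  public static void main(String[] args) {
    FSDatasetLike d = new FaultySimDataset();
    System.out.println(isSimulatedByClass(d));      // false
    System.out.println(isSimulatedByInstanceof(d)); // true
  }
}
```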

TestPRead.java
- methods should be separated by a blank line.

TestReplication.java
- System.out.println() should be removed. LOG should be used instead if necessary.

TestSimulatedFSDataset.java
- redundant import org.apache.hadoop.dfs.AbstractFSDataset.BlockWriteStreams;
- testWriteRead(): bytesAdded is never used


> Add support for simulated Data Nodes  - helpful for testing and performance benchmarking of the Name Node without having a large cluster
> ----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1989
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1989
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Sanjay Radia
>            Priority: Minor
>         Attachments: SimulatedStoragePatchSubmit.txt
>
>
> Proposal is to add an implementation for a Simulated Data Node.
> This will 
>   - allow one to test certain parts of the system (especially the Name Node, protocols) much more easily and efficiently.
>   - allow one to run performance benchmarks on the Name node without having a large cluster.
>   - allow one to inject faults for testing (e.g. one can add random faults based on probability parameters).
> The idea is that the Simulated Data Node will
>  - discard any data written to blocks (but remember the blocks and their sizes)
>  - generate fixed data on the fly when blocks are read (e.g. block is fixed set of bytes or repeated sequence of strings).
> The Simulated Data Node can also be used for fault injection.
> The data node can be parameterized with probabilities that allow one to control:
>   - Delays on reads and writes, creates, etc
>   - IO Exceptions
>  - Loss of blocks 
>  - Failures

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.