From: "Konstantin Shvachko (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Date: Fri, 26 Oct 2007 17:05:50 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-1989) Add support for simulated Data Nodes - helpful for testing and performance benchmarking of the Name Node without having a large cluster
Message-ID: <19094569.1193443550675.JavaMail.jira@brutus>
In-Reply-To: <24543146.1191434150640.JavaMail.jira@brutus>

    [ 
https://issues.apache.org/jira/browse/HADOOP-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538148 ]

Konstantin Shvachko commented on HADOOP-1989:
---------------------------------------------

AbstractFSDataset.java
- AbstractFSDataset should be an interface rather than an abstract class, since it does not implement any methods and all of its methods are declared abstract public. It should therefore be called FSDatasetInterface. I also think it should not depend on FSConstants.
- The @author field should be removed. See HADOOP-1147.
- Javadoc: descriptions of public methods are missing.
- Long lines.
- Too many line breaks between methods.

SimulatedFSDataset.java
- Redundant import java.io.OutputStream;
- Needs a Javadoc description of the SimulatedFSDataset class.
- Needs a line break between class and method declarations.
- Subclasses BInfo and Storage should be private.
- Storage is not a good name for the class, since it is already used. Maybe SimulatedStorage. Is it possible to reuse classes like DF or DatanodeInfo here?

DataNode.java
- Add a Javadoc description for method startDataNode().
- You do not need the extra variable sendBlockReportatNextHeartbeat. Instead, the new method sendBlockReport() should set
  lastHeartbeat=0; lastBlockReport=0;
  as is done in case DatanodeProtocol.DNA_REGISTER:
  I'd then call this method scheduleBlockReport(), and it should not be public.
- In readMetadata() both methods
{code}
checksumIn = data.getMetaDataInStream(block);
long fileSize = data.getMetaDataLength(block);
{code}
  access the data-node block map, which is not efficient. Can it be optimized?

DataChecksum.java
- Would it be cleaner to have SimulatedFSDataset.getChecksumHeader(checkSum) rather than DataChecksum.getHeader(), so as to keep all simulated methods inside the simulated classes?
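
To make the scheduleBlockReport() suggestion for DataNode.java concrete, here is a minimal sketch. It assumes the heartbeat loop compares the current time against lastHeartbeat and lastBlockReport; the ReportTimer class, its interval fields, and the *Due/*Sent helpers are hypothetical stand-ins and not the actual DataNode code.

```java
// Hypothetical stand-in for the timing state a data node keeps; NOT the real DataNode class.
class ReportTimer {
  private final long heartbeatIntervalMs;   // assumed configuration value
  private final long blockReportIntervalMs; // assumed configuration value
  private long lastHeartbeat;
  private long lastBlockReport;

  ReportTimer(long heartbeatIntervalMs, long blockReportIntervalMs, long now) {
    this.heartbeatIntervalMs = heartbeatIntervalMs;
    this.blockReportIntervalMs = blockReportIntervalMs;
    this.lastHeartbeat = now;
    this.lastBlockReport = now;
  }

  // Package-private on purpose: zeroing both timestamps makes the heartbeat
  // and the block report due at the next check, so no extra flag is needed.
  void scheduleBlockReport() {
    lastHeartbeat = 0;
    lastBlockReport = 0;
  }

  boolean heartbeatDue(long now) {
    return now - lastHeartbeat >= heartbeatIntervalMs;
  }

  boolean blockReportDue(long now) {
    return now - lastBlockReport >= blockReportIntervalMs;
  }

  void heartbeatSent(long now) {
    lastHeartbeat = now;
  }

  void blockReportSent(long now) {
    lastBlockReport = now;
  }
}
```

With this shape, a test or command handler only calls scheduleBlockReport(), and the regular loop picks the report up on its next pass.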

simulatedStreams
- SimulatedInputStream and SimulatedOutputStream should be private subclasses of SimulatedFSDataset, because they are not used outside of the dataset directly.

PendingReplicationBlocks.java
- remove(Block block){
  An empty line is included.

ClusterTestDFS.java
- This is the only place where AbstractFSDataset.getVolumeNames() is used. I think toString() should be used here instead; getVolumeNames() can then be removed from the abstract class.

TestFileCreation.java
- In
  conf.setBoolean("dfs.datanode.simulateddatastorage", true);
  the constant CONFIG_PROPERTY_SIMULATED should be used, or not used, consistently in all cases. Maybe it is more consistent with current Hadoop practice to use the config name directly.

TestSetrepIncreasing.java
- testSetrepIncreasingSimulatedStorage(): Tabs are off.
- Same constant as in TestFileCreation.

TestSmallBlock.java
- Tabs should be 2 and replaced by spaces.

TestInjectionForSimulatedStorage.java
- A lot of redundant imports.
- writeFile(): formatting.

MiniDFSCluster
- The NOTE: in the Javadoc for the MiniDFSCluster constructor does not make sense any more.
- The line
  if (dataSet.getClass() != SimulatedFSDataset.class)
  should probably read
  if (dataSet instanceof SimulatedFSDataset)

TestPRead.java
- Methods should be separated by a blank line.

TestReplication.java
- System.out.println() should be removed. LOG should be used instead if necessary.
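
To illustrate why the MiniDFSCluster comment prefers instanceof over a getClass() comparison, here is a small sketch. All type names in it (Dataset, SimulatedDataset, FaultySimulatedDataset, RealDataset, DatasetChecks) are hypothetical stand-ins, not classes from the patch. Note that getClass() != X.class is a negative test, so if the original condition is meant to stay negative, the instanceof form would need a leading negation; the comment above does not say which is intended.

```java
// Hypothetical stand-ins for the dataset abstraction; none of these are Hadoop classes.
interface Dataset { }

class SimulatedDataset implements Dataset { }

// A subclass, e.g. one that might add fault injection later.
class FaultySimulatedDataset extends SimulatedDataset { }

class RealDataset implements Dataset { }

class DatasetChecks {
  // Exact-class comparison: true only for SimulatedDataset itself, not for subclasses.
  static boolean isExactlySimulated(Dataset d) {
    return d.getClass() == SimulatedDataset.class;
  }

  // Type test: true for SimulatedDataset and every subclass of it.
  static boolean isSimulated(Dataset d) {
    return d instanceof SimulatedDataset;
  }
}
```

The two checks agree for SimulatedDataset and RealDataset and differ only for the subclass, which is the case the exact-class comparison silently misses.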

TestSimulatedFSDataset.java
- Redundant import org.apache.hadoop.dfs.AbstractFSDataset.BlockWriteStreams;
- testWriteRead(): bytesAdded is never used.

> Add support for simulated Data Nodes - helpful for testing and performance benchmarking of the Name Node without having a large cluster
> ----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-1989
> URL: https://issues.apache.org/jira/browse/HADOOP-1989
> Project: Hadoop
> Issue Type: Improvement
> Components: dfs
> Reporter: Sanjay Radia
> Priority: Minor
> Attachments: SimulatedStoragePatchSubmit.txt
>
>
> Proposal is to add an implementation for a Simulated Data Node.
> This will
> - allow one to test certain parts of the system (especially the Name Node, protocols) much more easily and efficiently.
> - allow one to run performance benchmarks on the Name node without having a large cluster.
> - inject faults for testing (e.g. one can add random faults based on probability parameters).
> The idea is that the Simulated Data Node will
> - discard any data written to blocks (but remember the blocks and their sizes)
> - generate fixed data on the fly when blocks are read (e.g. block is fixed set of bytes or repeated sequence of strings).
> The Simulated Data Node can also be used for fault injection.
> The data node can be parameterized with probabilities that allow one to control:
> - Delays on reads and writes, creates, etc
> - IO Exceptions
> - Loss of blocks
> - Failures

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
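
The core idea in the quoted proposal (discard written data, remember block sizes, generate fixed data on read) can be sketched roughly as follows. This is only an illustration under stated assumptions: SimulatedBlockStore, its method names, and the fill byte are hypothetical and are not taken from the attached patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a simulated block store; NOT the SimulatedFSDataset from the patch.
class SimulatedBlockStore {
  // Assumed fixed content returned for every byte of every block.
  static final byte FILL_BYTE = 9;

  // Only block id -> length is remembered; the written bytes are dropped.
  private final Map<Long, Long> blockLengths = new HashMap<Long, Long>();

  // "Writes" data by adding its length to the block and discarding the bytes.
  void write(long blockId, byte[] data) {
    blockLengths.merge(blockId, (long) data.length, Long::sum);
  }

  long length(long blockId) {
    Long len = blockLengths.get(blockId);
    if (len == null) {
      throw new IllegalArgumentException("unknown block " + blockId);
    }
    return len;
  }

  // Generates the block content on the fly; assumes the block fits in an int-sized array.
  byte[] read(long blockId) {
    byte[] out = new byte[(int) length(blockId)];
    Arrays.fill(out, FILL_BYTE);
    return out;
  }
}
```

Because nothing but lengths is stored, many such stores can run in one JVM, which is what makes the approach attractive for benchmarking the Name Node without a large cluster.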