From: Apache Wiki
To: Apache Wiki
Reply-To: common-dev@hadoop.apache.org
Date: Wed, 18 May 2011 17:25:26 -0000
Message-ID: <20110518172526.75725.46139@eos.apache.org>
Subject: [Hadoop Wiki] Update of "ProjectSuggestions" by EliCollins

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "ProjectSuggestions" page has been changed by EliCollins.
The comment on this change is: Move research projects to its own page.

http://wiki.apache.org/hadoop/ProjectSuggestions?action=diff&rev1=12&rev2=13

--------------------------------------------------

  <>
  == Research Projects ==
- Here are some research project ideas, engineering ideas for new participants, and areas where domain experts from other fields might add a lot of value by bringing their perspective into the Hadoop discussion.
 
+ Check out the [[HadoopResearchProjects|Hadoop Research Projects]] page.
-  * '''Modeling of block placement and replication policies in HDFS'''
-   * Modeling of the expected time to data loss for a given HDFS cluster, given Hadoop's replication policy and protocols (a rough model of this kind is sketched at the end of this message).
-   * Modeling of erasure codes and other approaches to replication that might offer different space-performance-reliability tradeoffs.
- 
-  * '''HDFS Namespace Expansion'''
- 
-   * Prototyping approaches to scaling the HDFS namespace. Goals: keep it simple; preserve or increase metadata operations per second; support very large numbers of files and blocks (billions to trillions).
- 
-  * '''Hadoop Security Design'''
- 
-   * An end-to-end proposal for how to support authentication and client-side data encryption/decryption, so that large data sets can be stored in a public HDFS and only jobs launched by authenticated users can map-reduce over or browse the data.
-   See HADOOP-xxx.
- 
-  * '''HOD ports to various campus work-queueing systems'''
- 
-   * HOD currently supports Torque and has previously supported Condor. We would like to have ports to whichever system(s) are used on major campuses (SGE, ...).
- 
-  * '''Integration of virtualization (such as Xen) with Hadoop tools'''
- 
-   * How does one integrate sandboxing of arbitrary user code in C++ and other languages in a VM such as Xen with the Hadoop framework? How does this interact with SGE, Torque, and Condor?
- 
-   * As each individual machine gets more and more cores/CPUs, it makes sense to partition each machine into multiple virtual machines. That gives us a number of benefits:
- 
-    * By assigning a virtual machine to a datanode, we effectively isolate the datanode from the load that other processes place on the machine, making the datanode more responsive and reliable.
-    * With multiple virtual machines on each machine, we can lower the granularity of HOD scheduling units, making it possible to schedule multiple tasktrackers on the same machine and improving the overall utilization of the whole cluster.
-    * With virtualization, we can easily snapshot a virtual cluster before releasing it, making it possible to re-activate the same cluster in the future and resume work from the snapshot.
- 
-  * '''Provisioning of long-running services via HOD'''
- 
-   * Work on a computation model for services on the grid. The model would include:
- 
-    * Various tools for defining clients and servers of the service, with at least C++ and Java instantiations of the abstractions
-    * Logical definitions of how to partition work onto a set of servers, i.e. a generalized shard implementation
-    * A few useful abstractions such as locks (exclusive and read-write, with fairness), leader election, and transactions
-    * Various communication models for groups of servers belonging to a service, such as broadcast, unicast, etc.
-    * Tools for assuring QoS and reliability, and for managing pools of servers for a service with spares, etc.
-    * Integration with HDFS for persistence, as well as access to local filesystems
-    * Integration with ZooKeeper so that applications can use the namespace
- 
-  * '''A Hadoop-compatible framework for discovering network topology and identifying and diagnosing hardware that is not functioning correctly'''
- 
-  * '''An improved framework for debugging and performance-optimizing Hadoop and Hadoop Streaming jobs'''
- 
-   * Some suggestions:
- 
-    * A distributed profiler for measuring distributed map-reduce applications. This would be really helpful for grid users. It should provide standard profiler features, e.g. the number of times a method is executed, its execution time, and the number of times a method caused some kind of failure, perhaps accumulated over all instances of the tasks that comprised the application.
- 
-  * '''MapReduce performance enhancements'''
- 
-   * How can we improve the performance of the standard Hadoop sort benchmarks?
- 
-  * '''Sort and shuffle optimization in the MR framework'''
- 
-   * Some example directions:
-    * Memory-based shuffling in the MR framework
-    * Combining the results of several maps on a rack or node before the shuffle. This can reduce seek work and intermediate storage.
- 
-  * '''Workload characterization from various Hadoop sites'''
- 
-   * A framework for capturing workload statistics and replaying workload simulations to allow the assessment of framework improvements.
- 
-  * '''Other ideas on how to improve the framework's performance or stability'''
- 
-  * '''Benchmark suite for Data Intensive Supercomputing'''
- 
-   * Scientific computing research and software have benefited tremendously from the availability of benchmark suites such as the NAS Parallel Benchmarks, a suite of kernels and applications ranging from EP (embarrassingly parallel) to SP, BT, and LU (reflecting varying degrees of parallelism and communication patterns). A suite of data-intensive supercomputing application benchmarks would present a target that Hadoop (and other map-reduce implementations) should be optimized for.
- 
-  * '''Performance evaluation of existing Locality Sensitive Hashing schemes'''
- 
-   * Research on new hashing schemes for filesystem namespace partitioning: [[http://en.wikipedia.org/wiki/Locality_sensitive_hashing]] (a toy locality-sensitive hash is sketched at the end of this message).
- 
-  * '''An alternate view of files as a collection of blocks'''
- 
-   * Propose an API and sample use cases for a file as a repository of blocks, where a user can add and delete blocks at arbitrary parts of a file. This would allow holes in files and moving blocks from one file to another. How does this reconcile with the sequence-of-bytes view of a file? Such an approach may encourage new styles of applications. (A hypothetical API along these lines is sketched at the end of this message.)
-   * To push a bit more in a research direction: UNIX file systems are managed as a sequence of bytes but are usually (and in Hadoop's case exclusively) used as a sequence of records. If the filesystem participates in record management (as mainframes do, for example), you can get some nice semantic and performance improvements.
 
  <>
  == Tool Investigations ==
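
As a concrete starting point for the "expected time to data loss" modeling item under Research Projects above, here is a minimal back-of-the-envelope sketch in Java. It assumes independent, exponentially distributed node failures and plain r-way random replication; every parameter value (cluster size, block count, MTTF, re-replication window) is invented for illustration, and it ignores rack-aware placement and correlated failures, so it is not a description of Hadoop's actual behavior.

{{{
/**
 * Rough annual data-loss estimate for an HDFS-like cluster with r-way
 * replication.  All numbers below are illustrative assumptions, not
 * measurements of any real cluster.
 */
public class DataLossSketch {
    public static void main(String[] args) {
        int nodes = 1000;                    // DataNodes (assumed)
        long blocks = 50_000_000L;           // distinct blocks stored (assumed)
        int replication = 3;                 // dfs.replication (assumed)
        double nodeMttfHours = 4.0 * 8760;   // ~4 years mean time to failure per node (assumed)
        double recoveryHours = 1.0;          // time to re-replicate a failed node's blocks (assumed)

        double lambda = 1.0 / nodeMttfHours;                  // per-node failure rate per hour
        double failuresPerYear = nodes * 8760 * lambda;       // expected node failures per year
        double replicasPerNode = (double) blocks * replication / nodes;

        // Probability that one specific surviving replica's node also fails
        // before re-replication completes (exponential failure assumption).
        double pPeerLost = 1.0 - Math.exp(-lambda * recoveryHours);

        // A block is lost when all remaining replicas disappear inside the window.
        double pBlockLost = Math.pow(pPeerLost, replication - 1);

        double expectedBlocksLostPerYear = failuresPerYear * replicasPerNode * pBlockLost;
        double pAnyLossPerYear = 1.0 - Math.exp(-expectedBlocksLostPerYear);

        System.out.printf("Expected node failures/year: %.1f%n", failuresPerYear);
        System.out.printf("Expected blocks lost/year:   %.6f%n", expectedBlocksLostPerYear);
        System.out.printf("P(any block lost in a year): %.6f%n", pAnyLossPerYear);
    }
}
}}}

Extending a model like this toward the real placement policy (rack awareness, pipelined re-replication, correlated rack and switch failures) is essentially what the project item asks for.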
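The locality-sensitive hashing item above is only a pointer, so here is a toy SimHash (random-hyperplane-style LSH) over path components, meant to make the "namespace partitioning" angle concrete. It is not taken from Hadoop or from any published partitioning scheme, and the bucketing rule at the end is a made-up example.

{{{
/**
 * Toy SimHash over filesystem path components.  Paths that share many
 * components tend to get fingerprints with small Hamming distance, so
 * related parts of the namespace tend to land in the same partition.
 */
public class NamespaceSimHash {

    private static final int BITS = 64;

    /** 64-bit FNV-1a hash of a single path component. */
    private static long fnv1a(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** SimHash fingerprint over the path's components. */
    public static long fingerprint(String path) {
        int[] counts = new int[BITS];
        for (String component : path.split("/")) {
            if (component.isEmpty()) continue;
            long h = fnv1a(component);
            for (int bit = 0; bit < BITS; bit++) {
                counts[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fp = 0L;
        for (int bit = 0; bit < BITS; bit++) {
            if (counts[bit] > 0) fp |= 1L << bit;
        }
        return fp;
    }

    /**
     * Bucket a path into one of 16 partitions using the top 4 fingerprint bits.
     * A real scheme would likely use several bit-bands; this is probabilistic,
     * so similar paths usually (not always) share a bucket.
     */
    public static int partition(String path) {
        return (int) (fingerprint(path) >>> (BITS - 4));
    }

    public static void main(String[] args) {
        String a = "/user/alice/logs/2011/05/18/part-00000";
        String b = "/user/alice/logs/2011/05/18/part-00001";
        String c = "/tmp/other/unrelated/file";
        System.out.println("Hamming(a,b) = " + Long.bitCount(fingerprint(a) ^ fingerprint(b)));
        System.out.println("Hamming(a,c) = " + Long.bitCount(fingerprint(a) ^ fingerprint(c)));
        System.out.println("partition(a) = " + partition(a));
        System.out.println("partition(b) = " + partition(b));
    }
}
}}}

Whether this kind of locality actually helps metadata load balancing (a hot directory concentrating on one partition, for instance) is exactly the evaluation the item suggests.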
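For the "file as a repository of blocks" item above, here is one way such an API might look. None of these types exist in Hadoop; every interface and method name below is hypothetical and is only meant to make the proposed use cases (splicing blocks in and out, moving blocks between files without copying) concrete.

{{{
import java.io.IOException;
import java.util.List;

/** Hypothetical block-repository view of a file (not a real Hadoop API). */
public interface BlockRepositoryFile {

    /** Opaque handle for a block already stored in the filesystem. */
    interface BlockHandle {
        long length();
    }

    /** Blocks currently making up the file, in logical order. */
    List<BlockHandle> listBlocks() throws IOException;

    /** Insert an existing block at the given block index. */
    void insertBlock(int index, BlockHandle block) throws IOException;

    /** Remove the block at the given index, leaving a hole or shifting later blocks down. */
    BlockHandle removeBlock(int index) throws IOException;

    /** Move a block from this file into another file without copying its data. */
    void moveBlockTo(int index, BlockRepositoryFile destination, int destIndex) throws IOException;
}
}}}

One possible way to reconcile this with the existing sequence-of-bytes view is to define the byte stream as the concatenation of the block list, with holes reading as zeros; working out whether that is the right answer is part of the proposed project.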