From: Apache Wiki
To: Apache Wiki
Reply-To: common-dev@hadoop.apache.org
Date: Wed, 18 May 2011 17:25:26 -0000
Message-ID: <20110518172526.75725.46139@eos.apache.org>
Subject: [Hadoop Wiki] Update of "ProjectSuggestions" by EliCollins

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "ProjectSuggestions" page has been changed by EliCollins.
The comment on this change is: Move research projects to its own page.

http://wiki.apache.org/hadoop/ProjectSuggestions?action=diff&rev1=12&rev2=13

--------------------------------------------------

  <>
  == Research Projects ==
- Here are some research project ideas, engineering ideas for new participants, and areas where domain experts from other fields might add a lot of value by bringing their perspective into the Hadoop discussion.
 
+ Check out the [[HadoopResearchProjects|Hadoop Research Projects]] page.
-  * '''Modeling of block placement and replication policies in HDFS'''
-   * Modeling of the expected time to data loss for a given HDFS cluster, given Hadoop's replication policy and protocols (a rough model of this kind is sketched at the end of this message).
-   * Modeling of erasure codes and other approaches to replication that might offer different space-performance-reliability tradeoffs.
- 
-  * '''HDFS Namespace Expansion'''
- 
-   * Prototyping approaches to scaling the HDFS namespace. Goals: keep it simple; preserve or increase metadata operations per second; support very large numbers of files and blocks (billions to trillions).
- 
-  * '''Hadoop Security Design'''
- 
-   * An end-to-end proposal for how to support authentication and client-side data encryption/decryption, so that large data sets can be stored in a public HDFS and only jobs launched by authenticated users can map-reduce over or browse the data.
-   See HADOOP-xxx.
- 
-  * '''HOD ports to various campus work-queueing systems'''
- 
-   * HOD currently supports Torque and has previously supported Condor. We would like to have ports to whichever system(s) are used on major campuses (SGE, ...).
- 
-  * '''Integration of virtualization (such as Xen) with Hadoop tools'''
- 
-   * How does one integrate sandboxing of arbitrary user code in C++ and other languages in a VM such as Xen with the Hadoop framework? How does this interact with SGE, Torque, and Condor?
- 
-   * As each individual machine gets more and more cores/CPUs, it makes sense to partition each machine into multiple virtual machines. That gives us a number of benefits:
- 
-    * By assigning a virtual machine to a datanode, we effectively isolate the datanode from the load that other processes place on the machine, making the datanode more responsive and reliable.
-    * With multiple virtual machines on each machine, we can lower the granularity of HOD scheduling units, making it possible to schedule multiple tasktrackers on the same machine and improving the overall utilization of the whole cluster.
-    * With virtualization, we can easily snapshot a virtual cluster before releasing it, making it possible to re-activate the same cluster in the future and resume work from the snapshot.
- 
-  * '''Provisioning of long-running services via HOD'''
- 
-   * Work on a computation model for services on the grid. The model would include:
- 
-    * Various tools for defining clients and servers of the service, with at least C++ and Java instantiations of the abstractions
-    * Logical definitions of how to partition work onto a set of servers, i.e. a generalized shard implementation
-    * A few useful abstractions such as locks (exclusive and read-write, with fairness), leader election, and transactions
-    * Various communication models for groups of servers belonging to a service, such as broadcast, unicast, etc.
-    * Tools for assuring QoS and reliability, and for managing pools of servers for a service with spares, etc.
-    * Integration with HDFS for persistence, as well as access to local filesystems
-    * Integration with ZooKeeper so that applications can use the namespace
- 
-  * '''A Hadoop-compatible framework for discovering network topology and identifying and diagnosing hardware that is not functioning correctly'''
- 
-  * '''An improved framework for debugging and performance-optimizing Hadoop and Hadoop Streaming jobs'''
- 
-   * Some suggestions:
- 
-    * A distributed profiler for measuring distributed map-reduce applications. This would be really helpful for grid users. It should provide standard profiler features, e.g. the number of times a method is executed, its execution time, and the number of times a method caused some kind of failure, perhaps accumulated over all instances of the tasks that comprised the application.
- 
-  * '''MapReduce performance enhancements'''
- 
-   * How can we improve the performance of the standard Hadoop sort benchmarks?
- 
-  * '''Sort and shuffle optimization in the MR framework'''
- 
-   * Some example directions:
-    * Memory-based shuffling in the MR framework
-    * Combining the results of several maps on a rack or node before the shuffle. This can reduce seek work and intermediate storage.
- 
-  * '''Workload characterization from various Hadoop sites'''
- 
-   * A framework for capturing workload statistics and replaying workload simulations to allow the assessment of framework improvements.
- 
-  * '''Other ideas on how to improve the framework's performance or stability'''
- 
-  * '''Benchmark suite for Data Intensive Supercomputing'''
- 
-   * Scientific computing research and software have benefited tremendously from the availability of benchmark suites such as the NAS Parallel Benchmarks, a suite of kernels and applications ranging from EP (embarrassingly parallel) to SP, BT, and LU (reflecting varying degrees of parallelism and communication patterns). A suite of data-intensive supercomputing application benchmarks would present a target that Hadoop (and other map-reduce implementations) should be optimized for.
- 
-  * '''Performance evaluation of existing Locality Sensitive Hashing schemes'''
- 
-   * Research on new hashing schemes for filesystem namespace partitioning: [[http://en.wikipedia.org/wiki/Locality_sensitive_hashing]] (a toy locality-sensitive hash is sketched at the end of this message).
- 
-  * '''An alternate view of files as a collection of blocks'''
- 
-   * Propose an API and sample use cases for a file as a repository of blocks, where a user can add and delete blocks at arbitrary parts of a file. This would allow holes in files and moving blocks from one file to another. How does this reconcile with the sequence-of-bytes view of a file? Such an approach may encourage new styles of applications. (A hypothetical API along these lines is sketched at the end of this message.)
-   * To push a bit more in a research direction: UNIX file systems are managed as a sequence of bytes but are usually (and in Hadoop's case exclusively) used as a sequence of records. If the filesystem participates in record management (as mainframes do, for example), you can get some nice semantic and performance improvements.
 
  <>
  == Tool Investigations ==
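
As a concrete starting point for the "expected time to data loss" modeling item under Research Projects above, here is a minimal back-of-the-envelope sketch in Java. It assumes independent, exponentially distributed node failures and plain r-way random replication; every parameter value (cluster size, block count, MTTF, re-replication window) is invented for illustration, and it ignores rack-aware placement and correlated failures, so it is not a description of Hadoop's actual behavior.

{{{
/**
 * Rough annual data-loss estimate for an HDFS-like cluster with r-way
 * replication.  All numbers below are illustrative assumptions, not
 * measurements of any real cluster.
 */
public class DataLossSketch {
    public static void main(String[] args) {
        int nodes = 1000;                    // DataNodes (assumed)
        long blocks = 50_000_000L;           // distinct blocks stored (assumed)
        int replication = 3;                 // dfs.replication (assumed)
        double nodeMttfHours = 4.0 * 8760;   // ~4 years mean time to failure per node (assumed)
        double recoveryHours = 1.0;          // time to re-replicate a failed node's blocks (assumed)

        double lambda = 1.0 / nodeMttfHours;                  // per-node failure rate per hour
        double failuresPerYear = nodes * 8760 * lambda;       // expected node failures per year
        double replicasPerNode = (double) blocks * replication / nodes;

        // Probability that one specific surviving replica's node also fails
        // before re-replication completes (exponential failure assumption).
        double pPeerLost = 1.0 - Math.exp(-lambda * recoveryHours);

        // A block is lost when all remaining replicas disappear inside the window.
        double pBlockLost = Math.pow(pPeerLost, replication - 1);

        double expectedBlocksLostPerYear = failuresPerYear * replicasPerNode * pBlockLost;
        double pAnyLossPerYear = 1.0 - Math.exp(-expectedBlocksLostPerYear);

        System.out.printf("Expected node failures/year: %.1f%n", failuresPerYear);
        System.out.printf("Expected blocks lost/year:   %.6f%n", expectedBlocksLostPerYear);
        System.out.printf("P(any block lost in a year): %.6f%n", pAnyLossPerYear);
    }
}
}}}

Extending a model like this toward the real placement policy (rack awareness, pipelined re-replication, correlated rack and switch failures) is essentially what the project item asks for.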
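The locality-sensitive hashing item above is only a pointer, so here is a toy SimHash (random-hyperplane-style LSH) over path components, meant to make the "namespace partitioning" angle concrete. It is not taken from Hadoop or from any published partitioning scheme, and the bucketing rule at the end is a made-up example.

{{{
/**
 * Toy SimHash over filesystem path components.  Paths that share many
 * components tend to get fingerprints with small Hamming distance, so
 * related parts of the namespace tend to land in the same partition.
 */
public class NamespaceSimHash {

    private static final int BITS = 64;

    /** 64-bit FNV-1a hash of a single path component. */
    private static long fnv1a(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** SimHash fingerprint over the path's components. */
    public static long fingerprint(String path) {
        int[] counts = new int[BITS];
        for (String component : path.split("/")) {
            if (component.isEmpty()) continue;
            long h = fnv1a(component);
            for (int bit = 0; bit < BITS; bit++) {
                counts[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fp = 0L;
        for (int bit = 0; bit < BITS; bit++) {
            if (counts[bit] > 0) fp |= 1L << bit;
        }
        return fp;
    }

    /**
     * Bucket a path into one of 16 partitions using the top 4 fingerprint bits.
     * A real scheme would likely use several bit-bands; this is probabilistic,
     * so similar paths usually (not always) share a bucket.
     */
    public static int partition(String path) {
        return (int) (fingerprint(path) >>> (BITS - 4));
    }

    public static void main(String[] args) {
        String a = "/user/alice/logs/2011/05/18/part-00000";
        String b = "/user/alice/logs/2011/05/18/part-00001";
        String c = "/tmp/other/unrelated/file";
        System.out.println("Hamming(a,b) = " + Long.bitCount(fingerprint(a) ^ fingerprint(b)));
        System.out.println("Hamming(a,c) = " + Long.bitCount(fingerprint(a) ^ fingerprint(c)));
        System.out.println("partition(a) = " + partition(a));
        System.out.println("partition(b) = " + partition(b));
    }
}
}}}

Whether this kind of locality actually helps metadata load balancing (a hot directory concentrating on one partition, for instance) is exactly the evaluation the item suggests.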
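For the "file as a repository of blocks" item above, here is one way such an API might look. None of these types exist in Hadoop; every interface and method name below is hypothetical and is only meant to make the proposed use cases (splicing blocks in and out, moving blocks between files without copying) concrete.

{{{
import java.io.IOException;
import java.util.List;

/** Hypothetical block-repository view of a file (not a real Hadoop API). */
public interface BlockRepositoryFile {

    /** Opaque handle for a block already stored in the filesystem. */
    interface BlockHandle {
        long length();
    }

    /** Blocks currently making up the file, in logical order. */
    List<BlockHandle> listBlocks() throws IOException;

    /** Insert an existing block at the given block index. */
    void insertBlock(int index, BlockHandle block) throws IOException;

    /** Remove the block at the given index, leaving a hole or shifting later blocks down. */
    BlockHandle removeBlock(int index) throws IOException;

    /** Move a block from this file into another file without copying its data. */
    void moveBlockTo(int index, BlockRepositoryFile destination, int destIndex) throws IOException;
}
}}}

One possible way to reconcile this with the existing sequence-of-bytes view is to define the byte stream as the concatenation of the block list, with holes reading as zeros; working out whether that is the right answer is part of the proposed project.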