incubator-cassandra-dev mailing list archives

From Priyanka Sharma <sharmapriyan...@gmail.com>
Subject Gsoc2010 proposal (please try this)
Date Fri, 26 Mar 2010 18:31:23 GMT
Hi

I am Priyanka Sharma, a master's student at Vrije University, Amsterdam. My
major is "parallel and distributed systems".
I am interested in participating in GSoC 2010 with Cassandra. I would like to
implement the "demo application for Cassandra" project.
I have pasted my proposal (not yet final) below. I tried to send it as an
attachment, but there was a problem; the list may be filtering
attachments.

A better organized, easier-to-read copy of the proposal is also at:
http://www.few.vu.nl/~psa220/gsoc-proposal.pdf
and my CV is at: http://www.few.vu.nl/~psa220/priyanka_cv.pdf

I would like to have your comments on my proposal so that I can make it
better. Kindly give me some feedback.

========================================================================================================


Cassandra GSoC 2010: Demo application for Cassandra
---------------------------------------------------

Name and Email Address:

    Priyanka Sharma, psa220@few.vu.nl, sharmapriyanka5@gmail.com



Chat/IM IDs and Networks:

    psharma@irc.freenode.net


Bio, Resumé, or C.V.
--------------------

I strongly believe in learning through experimentation and am conscious of
my responsibility to contribute effectively to my endeavors. I relish
working in teams and am confident of my system-level programming skills. I
am always keen to contribute to open source projects. My interest in
research and open source projects led me to work on Security Enhanced Linux
(SELinux, role-based access control). I extended the SELinux framework, and
this project led to two international IEEE publications (for links, see
resume).


Currently, I am pursuing a master's in "parallel and distributed systems" and
have explored the area of distributed systems and databases quite well. I
have worked on several distributed systems, such as the Plan9 OS and
systems developed at INRIA such as Telex. This has given me first-hand insight
into the real issues that can occur, such as consistency, scalability, and
fault tolerance.

I am also writing a position paper on Cassandra in which I am going to
compare it with other data storage systems such as Bigtable and Dynamo, which
will no doubt help me in this project. I have just started using Cassandra and
find it very interesting because of its ease of use and because it is not
"just" key/value storage. It has many useful and interesting properties that
set it apart from other data storage models.


This has increased my motivation to work with Cassandra, and I believe that my
in-depth study of and hands-on experience with distributed systems and storage
systems make me an ideal candidate for this project. I also participated in
GSoC 2009 with the Plan9 Bell Labs group and completed it successfully.


Please find my complete resume attached to this email or at
http://www.few.vu.nl/~psa220/priyanka_cv.pdf

Project Title and Description
-----------------------------
There are many large-scale, real-world applications running on Cassandra, at
Facebook, Twitter, and Digg for example, but they do not show how they store
their data in Cassandra. We need a small and simple application that can
easily demonstrate the features of Cassandra, explain how it differs from
other distributed storage systems, and explain why so many applications are
migrating to Cassandra today. For example, Cassandra uses a "quorum"
((N/2)+1) technique to provide consistency: a write needs acknowledgement
from only a majority of the N replicas, which keeps write operations fast.
Cassandra also uses "eventual consistency" to make data consistent (as does
Amazon Dynamo).
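
As a rough illustration of the quorum arithmetic above, here is a minimal
Python sketch. It is not Cassandra code; the function names are hypothetical,
and N is assumed to be the number of replicas of a piece of data. The second
function uses the standard quorum rule that a read of R replicas is
guaranteed to overlap a write to W replicas when R + W > N.

```python
def quorum(n_replicas: int) -> int:
    """Smallest majority of n_replicas, i.e. floor(N/2) + 1."""
    if n_replicas < 1:
        raise ValueError("need at least one replica")
    return n_replicas // 2 + 1


def read_overlaps_write(n_replicas: int, r: int, w: int) -> bool:
    """True when any r replicas read must include one of the w written."""
    return r + w > n_replicas


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        q = quorum(n)
        # Quorum reads plus quorum writes always overlap.
        print(n, q, read_overlaps_write(n, q, q))
```

With N = 3, a quorum is 2 replicas, so a write can succeed while one replica
is slow or down.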

A wiki is a kind of application that deals with bulk data; it is an
encyclopedia. Managing such a large and constantly changing body of data
requires a lot of effort at the storage level. Maintaining indexes on
different keys such as category, author, date, and ID adds more complexity and
is very challenging. Such a system requires a distributed database with very
efficient search and indexing facilities. For this we can use Cassandra, which
provides good performance in indexing and searching.

Approach
--------

1) Implement a simple and clean demo application: A wiki is a sensible choice
for a simple application. It will provide the main text-editing feature,
login, and additional features such as viewing system information, user
preferences, and recent changes. Any user can edit pages, with the exception
of some pages that can be modified only by authenticated users, i.e. users
who are logged in (give solid example here). I may use Python or PHP to
implement this application.

2) Use the Thrift API: To store data in Cassandra, we need an API through
which our application can talk to Cassandra. We will use the Thrift API, the
most stable and popular option, to interact with Cassandra.

3) Store data in Cassandra: I will find the best way to store data in
Cassandra, meaning a layout that lets us read and write data efficiently.
This involves defining the columns, super columns, column families, and
keyspace, and structuring them so that data retrieval is effective and
performs well. I also have to implement indexing that performs well for both
reads and writes.

4) Add some showcase features to the application: I will add some features to
our application that showcase Cassandra. I will add a search feature to the
wiki application, as Cassandra is good at performing searches. I have to think
about how to implement search internally; for example, I could search on
super columns. Implementing an efficient search algorithm internally will
therefore be challenging. Another feature I would add is categories: each
topic would belong to some "Category", and searches could also be restricted
to a particular category. An example would be "find the documents in a
particular category that were changed in the last week, joining on two
groups."

5) Implement group-based queries: I will provide some grouped query results
using the get_slice() functionality provided by Thrift. For example, a user
may want to see their change log on a per-month or per-week basis. I can then
query Cassandra through the Thrift API (e.g. get_slice) on the basis of the
key. This returns results quickly and also gives the user flexibility.

6)
6.1) Test and demonstrate the application: Once all of the above is in place,
it is important to test that every feature of the application works as
defined. Then I have to demonstrate some of the benefits of a system that
uses Cassandra internally. This calls for some case studies and a comparative
study with other databases. I will test how this system performs compared
with other systems for the same type of application.

6.2) Testing on multiple nodes: Next, I will test my application with
Cassandra deployed on multiple nodes. I will repeat the same read and write
tests and compare the results with the performance of other distributed
databases for the same kind of application.
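
To make steps 3 and 5 more concrete, here is a minimal in-memory sketch in
Python of one possible layout. It is an assumption for illustration only, not
a final schema: the column family names "Pages" and "ChangeLog", the helper
names, and the Row class are all hypothetical, and no Thrift or Cassandra
calls are made. The idea is that each user's change log is one row whose
column names are timestamps, so that a range query over column names (which
is what a slice query provides) returns the changes for a week or a month.

```python
from bisect import bisect_left, bisect_right, insort
from datetime import datetime, timezone


class Row:
    """In-memory stand-in for one row: columns kept sorted by name."""

    def __init__(self):
        self._names = []    # sorted column names
        self._values = {}   # column name -> value

    def insert(self, name, value):
        if name not in self._values:
            insort(self._names, name)
        self._values[name] = value  # same name overwrites the old value

    def slice(self, start, finish):
        """Return (name, value) pairs with start <= name <= finish."""
        lo = bisect_left(self._names, start)
        hi = bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


def ts(year, month, day):
    """UTC midnight of the given date as an integer timestamp."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


# Hypothetical layout: keyspace -> column family -> row key -> Row
keyspace = {"Pages": {}, "ChangeLog": {}}


def record_edit(user, page, when, text):
    # Latest page text lives in the "Pages" column family, keyed by title.
    keyspace["Pages"].setdefault(page, Row()).insert("content", text)
    # Each user's change log is one row; column name = edit timestamp.
    keyspace["ChangeLog"].setdefault(user, Row()).insert(when, page)


def changes_between(user, start, finish):
    row = keyspace["ChangeLog"].get(user)
    return row.slice(start, finish) if row else []


record_edit("priyanka", "Cassandra", ts(2010, 6, 1), "first draft")
record_edit("priyanka", "Thrift", ts(2010, 6, 20), "api notes")
record_edit("priyanka", "Cassandra", ts(2010, 7, 3), "second draft")

june = changes_between("priyanka", ts(2010, 6, 1), ts(2010, 6, 30))
print([page for _, page in june])  # pages edited in June
```

In the real application the dictionary and Row operations would be replaced
by calls through the Thrift API; the point of the sketch is only that
choosing timestamps as column names turns "changes per week or month" into a
single range read on one key.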

Timeline
--------

April 20 - May 23:
Community bonding! Use this time to read about and understand all the
features that could be provided in the application to make it effective (in
the sense of Cassandra).

May 24 - June 06:
Implementation begins! Implement a simple wiki application with some basic
features such as editing documents and creating logins. I may use PHP or
Python for the implementation.

June 07 - June 13:
Integrate the wiki application with Thrift to use Cassandra as the backend.

June 14 - June 30:
Find and implement the best way to represent the data in storage.

July 1 - July 15:
Add some showcase features to the application, such as search and category
search.

July 16:
Mid-term deliverables: Working implementation of an application running on
Cassandra. Which also provide some

July 17 - July 30:
Implement group-based queries for the user profile, such as recent change
logs. Then implement the "join query" feature, where a user can search by
category plus user-based data.

July 31 - Aug 15:
Test the application! Do a comparative study with other databases. If the
application does not look fully featured, add some more features to it.

Aug 16 - Aug 29:
Test the application with Cassandra deployed on multiple nodes.

Aug 30:
Final deliverable: A complete, fully tested application running on Cassandra.


-- 
Thanks & Regards
Priyanka Sharma
