hbase-dev mailing list archives

From Stack <st...@duboce.net>
Subject Re: Google Summer Of Code 2016
Date Wed, 23 Mar 2016 00:05:41 GMT
On Tue, Mar 22, 2016 at 3:04 AM, Talat Uyarer <talat@uyarer.com> wrote:

> Hi All,
>
> I am Talat UYARER. I am a PMC member and committer on Nutch and Gora. I
> have made a few contributions to HBase and want to work on HBase in GSoC
> 2016. As far as I know, you haven't selected any issues for GSoC.
>
>
I didn't sign up for GSoC, Talat. Not sure anyone else did either. Is it too
late for us to participate now?


> I am wondering whether there is anybody who can be a mentor for GSoC in
> HBase.
>
>
I'd mentor you (it'd be easy-peasy -- smile) but I think I've missed the
mentor signup deadline.



> BTW I talked with Enis Soztutar. He offered some topics for GSoC. These
> are:
> - He mentioned that data blocks are stored with PREFIX, FAST_DIFF, etc.
> encodings, but these encodings can only be used in the HFile context. In
> RPC and the WAL we use KeyValueEncoding for cell blocks. He said, "You can
> improve them or use the HFile encodings in RPC and WAL." (He didn't give
> the issue number, but I guessed it is HBASE-12883, "Support block
> encoding based on knowing set of column qualifiers up front".)
>

Sounds like a fine project (Someone was just asking about this offline...)



> - HBASE-14379 Replication V2
> - HBASE-8691 High-Throughput Streaming Scan API
> - HBASE-3529 Native Solr Indexer for HBase (He just mentioned HBase ->
> Solr indexing. I guess it could be this issue.)
>
> Could you help me select topics, or could you suggest another issue?
>
>
All above are good.

Here are a few others, written up for another context:

+ Become an expert in the Jepsen distributed systems test tool: run it
against HBase and HDFS. Analyze results. E.g. see
https://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen
+ Deep dive on HBase compactions. Own it. Review the current options: the
defaults, the experimental, and the stale. Build tooling and surface metrics
that give better insight into the effectiveness of compaction mechanics and
policies. Develop tunings and alternate, new policies. For further credit,
develop a master-orchestrated compaction algorithm.
+ Reimplement HBase append and increment as write-only with rollup on read
or using CRDTs (
https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type)
+ Make the HBase Server async/event driven/SEDA moving it off its current
thread-per-request basis
+ UI: build out more pages and tabs on the HBase master exposing more of
our cluster metrics (make the master into a metrics sink). Extra points for
views, histograms, or dashboards that are both informative AND pretty (D3,
etc.). A good benchmark would be subsuming the Hannibal tool
https://github.com/sentric/hannibal
+ Build an example application on HBase for test and illustration: e.g. use
Jimmy Lin's/The Internet Archive's https://github.com/lintool/warcbase to
load Common Crawl's regular web crawls (https://commoncrawl.org/), or load
HBase with Wikipedia, the Flickr dataset, or any dataset that appeals. Extra
credit for documenting the steps involved and filing issues where the API is
awkward or hard to follow.
+ Add actionable statistics to hbase internals that capture vitals about
the data being served and that we exploit responding to queries; e.g. rough
sizes of rows, column-families, columns-per-row-per-region, etc. For
example, if a client has been stepping sequentially through the data, the
stats would allow us to recognize this state so we could switch to a
different scan type, one that is optimal for a sequential progression.
+ Review and redo our fundamental merge sort, the basis of our read. There
are a few techniques to try such as a "loser tree merge" (
http://sandbox.mc.edu/~bennet/cs402/lec/losedex.html) but ideally we'd make
our merge sort block-based rather than Cell-based. Set yourself up in a rig
and try different Cell formats to get yourself to a cache-friendly Cell
format that maximizes instructions per cycle.
+ Our client is heavy-weight and has accumulated lots of logic over time.
E.g. it is hard to set a single timeout for a request because the client is
layered, each layer with its own running timeouts. At its core is a
mostly-done async engine. Review and finish the async work. Rewrite where it
makes sense after analysis.
+ Our RPC is based on protobuf Service where we plugged in our own RPC
transport. An exploratory PoC putting HBase up on grpc was done by the grpc
team. Bring this project home. Extra points if you reveal a Streaming
Interface between Client and Server.
+ Tiering... if regions are cold, close them so they don't occupy resources
(close files, purge its data from cache...).... reopen when a request comes
in....
+ Dynamic configuration of running HBase
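
To make the append/increment item above a little more concrete, here is a
rough sketch of the two approaches it names. It is a toy model in Python,
not HBase code and not any existing HBase API: the class names RollupCounter
and PNCounter, and all of their methods, are hypothetical and exist only for
illustration. The first class models "write-only with rollup on read" (each
increment is a blind append, and a read sums the outstanding deltas). The
second is a standard state-based PN-counter CRDT, where replicas merge by
taking the per-replica maximum.

```python
from collections import defaultdict


class RollupCounter:
    """Toy model of one counter cell: writes append deltas, reads sum them."""

    def __init__(self):
        # Each entry stands in for one separately written delta (hypothetical).
        self._deltas = []

    def increment(self, amount=1):
        # Write path: append only, no read-modify-write of the current value.
        self._deltas.append(amount)

    def get(self):
        # Read path: roll up every outstanding delta.
        return sum(self._deltas)

    def compact(self):
        # Housekeeping: fold all deltas into a single base value.
        self._deltas = [sum(self._deltas)]


class PNCounter:
    """State-based CRDT counter keeping per-replica increment/decrement totals."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.incs = defaultdict(int)
        self.decs = defaultdict(int)

    def add(self, amount):
        # Each replica only ever grows its own slots.
        if amount >= 0:
            self.incs[self.replica_id] += amount
        else:
            self.decs[self.replica_id] += -amount

    def value(self):
        return sum(self.incs.values()) - sum(self.decs.values())

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for rid, n in other.incs.items():
            self.incs[rid] = max(self.incs[rid], n)
        for rid, n in other.decs.items():
            self.decs[rid] = max(self.decs[rid], n)
```

In the first model the cost moves from the write path to the read path, so
some form of periodic fold (compact() here) is what keeps reads cheap. In the
second, two replicas that have each taken writes can exchange state in either
order and still agree on value().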


St.Ack




> Thanks
> --
> Talat UYARER
>
