hbase-dev mailing list archives

From Enis Söztutar <enis....@gmail.com>
Subject Re: Google Summer Of Code 2016
Date Wed, 23 Mar 2016 00:32:59 GMT
> I didn't sign up for GSOC Talat. Not sure anyone else did either. Is it too
> late for us to participate now?
The ASF participates in GSoC, so HBase can automatically participate, AFAIK.

> I'd mentor you (it'd be easy-peasy -- smile) but I think I've missed the
> mentor signup deadline.

I did not check the deadline, if that is the case, it means this year is

Your list is pretty good. We can POC with Cap'n Proto as well as gRPC.

> > BTW I talked with Enis Soztutar. He offered some topics for GSoC. These
> > are:
> > - He mentioned that data blocks are stored with PREFIX, FAST_DIFF, etc.
> > encodings, but these encodings can only be used in the HFile context. In
> > RPC and WAL we use KeyValueEncoding for Cell Blocks. He said "You can
> > improve them or use the HFile encodings in RPC and WAL". (He didn't give
> > the issue number, but I guessed it is HBASE-12883 Support block
> > encoding based on knowing set of column qualifiers up front.)
> >
> Sounds like a fine project (Someone was just asking about this offline...)
> > - HBASE-14379 Replication V2
> > - HBASE-8691 High-Throughput Streaming Scan API
> > - HBASE-3529 Native Solr Indexer for HBase (He just mentioned HBase ->
> > SOLR indexing. I guess it could be this issue.)
> >
> > Could you help me select topics, or could you offer another issue?
> >
> >
> All above are good.
> Here's a few others made for another context:
> + Become Jepsen distributed systems test tool expert: run it against HBase
> and HDFS. Analyze results. E.g. see
> https://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen
> + Deep dive on hbase Compactions. Own it. Review current options both the
> defaults, experimental, and the stale. Build tooling and surface metrics
> that give better insight on effectiveness of compaction mechanics and
> policies. Develop tunings and alternate, new policies. For further credit,
> develop master-orchestrated compaction algorithm.
> + Reimplement HBase append and increment as write-only with rollup on read
> or using CRDTs (
> https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type)
> + Make the HBase Server async/event driven/SEDA moving it off its current
> thread-per-request basis
> + UI: build out more pages and tabs on the HBase master exposing more of
> our cluster metrics (make the master into a metrics sink). Extra points for
> views, histograms, or dashboards that are both informative AND pretty (D3,
> etc.). A good benchmark would be subsuming the Hannibal tool
> https://github.com/sentric/hannibal
> + Build an example application on HBase for test and illustration: e.g. use
> Jimmy Lin's/The Internet Archive https://github.com/lintool/warcbase to
> load common crawl regular webcrawls https://commoncrawl.org/ or, load hbase
> with wikipedia, the flickr dataset, or any dataset that appeals. Extra
> credit for documenting steps involved and filing issues where API is
> awkward or hard to follow.
> + Add actionable statistics to hbase internals that capture vitals about
> the data being served and that we exploit responding to queries; e.g. rough
> sizes of rows, column-families, columns-per-row-per-region, etc. For
> example, if a client has been stepping sequentially through the data, the
> stats would allow us to recognize this state so we could switch to a different
> scan type; one that is optimal to a sequential progression.
> + Review and redo our fundamental merge sort, the basis of our read. There
> are a few techniques to try such as a "loser tree merge" (
> http://sandbox.mc.edu/~bennet/cs402/lec/losedex.html) but ideally we'd make
> our merge sort block-based rather than Cell-based. Set yourself up in a rig
> and try different Cell formats to get yourself to a cache-friendly Cell
> format that maximizes instructions per cycle.
> + Our client is heavy-weight and has accumulated lots of logic over time.
> E.g. it is hard to set a single timeout for a request because client is
> layered each with its own running timeouts. At its core is a mostly-done
> async engine. Review, and finish the async work. Rewrite where it makes
> sense after analysis.
> + Our RPC is based on protobuf Service where we plugged in our own RPC
> transport. An exploratory PoC putting HBase up on grpc was done by the grpc
> team. Bring this project home. Extra points if you reveal a Streaming
> Interface between Client and Server.
> + Tiering... if regions are cold, close them so they don't occupy resources
> (close files, purge its data from cache...).... reopen when a request comes
> in....
> + Dynamic configuration of running HBase
> St.Ack
> > Thanks
> > --
> > Talat UYARER
> >
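
For the "append and increment as write-only with rollup on read" idea in the list above, here is a rough sketch of what the shape could be. It is a toy in-memory model, not HBase code: the class DeltaCounterStore and its methods are hypothetical names made up for illustration, and a real implementation would have to deal with Cells, timestamps, the WAL and compactions.

```python
from collections import defaultdict


class DeltaCounterStore:
    """Toy model (hypothetical, not an HBase API): increments are
    appended as deltas and only summed when somebody reads."""

    def __init__(self):
        # (row, column) -> list of deltas that have not been rolled up yet
        self._deltas = defaultdict(list)

    def increment(self, row, column, amount=1):
        # Write path: append only. The current value is never read,
        # so there is no read-modify-write cycle on the write side.
        self._deltas[(row, column)].append(amount)

    def get(self, row, column):
        # Read path: roll up every delta recorded for this cell.
        return sum(self._deltas.get((row, column), ()))

    def compact(self, row, column):
        # Optional background step: fold the deltas into a single value
        # so that reads stay cheap. The visible result does not change.
        key = (row, column)
        if key in self._deltas:
            self._deltas[key] = [sum(self._deltas[key])]


store = DeltaCounterStore()
store.increment("row1", "hits")
store.increment("row1", "hits", 4)
total = store.get("row1", "hits")
```

The point of the sketch is only that the write side never needs the current value, which is what would let increments avoid read-modify-write; the cost moves to the read (or to a compaction-like rollup).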
