hbase-dev mailing list archives

From Enis Söztutar <enis....@gmail.com>
Subject Re: Google Summer Of Code 2016
Date Wed, 23 Mar 2016 00:32:59 GMT
>
> I didn't sign up for GSOC Talat. Not sure anyone else did either. Is it too
> late for us to participate now?
>
>
The ASF participates in GSoC, so HBase can participate automatically, AFAIK.


> I'd mentor you (it'd be easy-peasy -- smile) but I think I've missed the
> mentor signup deadline.
>

I did not check the deadline. If that is the case, does it mean this year is
over?

Your list is pretty good. We can do a PoC with Cap'n Proto as well as gRPC.


>
>
> > BTW I talked with Enis Soztutar. He offered some topics for GSoC. These
> > are:
> > - He mentioned that data blocks are stored with PREFIX, FAST_DIFF, etc.
> > encodings, but these encodings can only be used in the HFile context. In
> > RPC and the WAL we use KeyValueEncoding for Cell Blocks. He said, "You can
> > improve them or use the HFile encodings in RPC and WAL." (He didn't give
> > the issue number, but I guessed it is HBASE-12883, Support block
> > encoding based on knowing set of column qualifiers up front.)
> >
>
> Sounds like a fine project (Someone was just asking about this offline...)
>
>
>
> > - HBASE-14379 Replication V2
> > - HBASE-8691 High-Throughput Streaming Scan API
> > - HBASE-3529 Native Solr Indexer for HBase (He just mentioned HBase ->
> > SOLR indexing. I guess it could be this issue.)
> >
> > Could you help me select topics, or could you suggest another issue?
> >
> >
> All above are good.
>
> Here are a few others, made for another context:
>
> + Become Jepsen distributed systems test tool expert: run it against HBase
> and HDFS. Analyze results. E.g. see
> https://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen
> + Deep dive on hbase Compactions. Own it. Review current options both the
> defaults, experimental, and the stale. Build tooling and surface metrics
> that give better insight on effectiveness of compaction mechanics and
> policies. Develop tunings and alternate, new policies. For further credit,
> develop master-orchestrated compaction algorithm.
> + Reimplement HBase append and increment as write-only with rollup on read
> or using CRDTs (
> https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type)
> + Make the HBase Server async/event driven/SEDA moving it off its current
> thread-per-request basis
> + UI: build out more pages and tabs on the HBase master exposing more of
> our cluster metrics (make the master into a metrics sink). Extra points for
> views, histograms, or dashboards that are both informative AND pretty (D3,
> etc.). A good benchmark would be subsuming the Hannibal tool
> https://github.com/sentric/hannibal
> + Build an example application on HBase for test and illustration: e.g. use
> Jimmy Lin's/The Internet Archive https://github.com/lintool/warcbase to
> load common crawl regular webcrawls https://commoncrawl.org/ or, load hbase
> with wikipedia, the flickr dataset, or any dataset that appeals. Extra
> credit for documenting steps involved and filing issues where API is
> awkward or hard to follow.
> + Add actionable statistics to hbase internals that capture vitals about
> the data being served and that we exploit responding to queries; e.g. rough
> sizes of rows, column-families, columns-per-row-per-region, etc. For
> example, if a client has been stepping sequentially through the data, the
> stats would allow us to recognize this state so we could switch to a different
> scan type; one that is optimal to a sequential progression.
> + Review and redo our fundamental merge sort, the basis of our read. There
> are a few techniques to try such as a "loser tree merge" (
> http://sandbox.mc.edu/~bennet/cs402/lec/losedex.html) but ideally we'd make
> our merge sort block-based rather than Cell-based. Set yourself up in a rig
> and try different Cell formats to get yourself to a cache-friendly Cell
> format that maximizes instructions per cycle.
> + Our client is heavy-weight and has accumulated lots of logic over time.
> E.g. it is hard to set a single timeout for a request because the client is
> layered, each layer with its own running timeouts. At its core is a mostly-done
> async engine. Review, and finish the async work. Rewrite where it makes
> sense after analysis.
> + Our RPC is based on protobuf Service where we plugged in our own RPC
> transport. An exploratory PoC putting HBase up on grpc was done by the grpc
> team. Bring this project home. Extra points if you reveal a Streaming
> Interface between Client and Server.
> + Tiering... if regions are cold, close them so they don't occupy resources
> (close files, purge its data from cache...).... reopen when a request comes
> in....
> + Dynamic configuration of running HBase
>
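
To make the "write-only with rollup on read or using CRDTs" item a bit more
concrete, here is a minimal sketch of a PN-counter in Python. It is not HBase
code: the class name, the replica ids ("rs1", "rs2"), and the in-memory dicts
are all assumptions for illustration only. Each writer adds only to its own
slot, the total is computed at read time, and two copies merge by taking the
per-replica maximum.

```python
from collections import defaultdict


class PNCounter:
    """Hypothetical PN-counter: per-replica increment and decrement totals."""

    def __init__(self):
        self.incs = defaultdict(int)  # replica id -> sum of positive deltas
        self.decs = defaultdict(int)  # replica id -> sum of negative deltas, stored positive

    def add(self, replica, delta):
        # A write touches only the writer's own slot, so no
        # read-modify-write of the global value is needed.
        if delta >= 0:
            self.incs[replica] += delta
        else:
            self.decs[replica] += -delta

    def value(self):
        # Rollup on read: the visible value is derived from all slots.
        return sum(self.incs.values()) - sum(self.decs.values())

    def merge(self, other):
        # Per-replica max is commutative, associative, and idempotent,
        # which is what lets copies converge regardless of merge order.
        merged = PNCounter()
        for r in set(self.incs) | set(other.incs):
            merged.incs[r] = max(self.incs.get(r, 0), other.incs.get(r, 0))
        for r in set(self.decs) | set(other.decs):
            merged.decs[r] = max(self.decs.get(r, 0), other.decs.get(r, 0))
        return merged


a, b = PNCounter(), PNCounter()
a.add("rs1", 5)
b.add("rs2", 3)
b.add("rs2", -1)
total = a.merge(b).value()
```

Merging the same copy twice, or in the other order, gives the same total,
which is the property that would matter if increments were replayed or
replicated more than once.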
>
> St.Ack
>
>
>
>
> > Thanks
> > --
> > Talat UYARER
> >
>
