hbase-dev mailing list archives

From Talat Uyarer <ta...@uyarer.com>
Subject Re: Google Summer Of Code 2016
Date Fri, 25 Mar 2016 18:09:54 GMT
Hi all,

I created my GSoC proposal for Block Encoding and Compression for the RPC
Layer [1] (the corresponding JIRA is HBASE-15530 [2]). I would appreciate
it if you could review it and share your comments.

[1] https://docs.google.com/document/d/10MEsmGN5UCh6m-de_nhIG5QYnDRTkmwBTLQ0CRmwOMk/edit?usp=sharing
[2] https://issues.apache.org/jira/browse/HBASE-15530

Thanks

On Tue, Mar 22, 2016 at 6:44 PM, Talat Uyarer <talat@uyarer.com> wrote:
> Hi,
>
> I would appreciate having you as a mentor, Stack :) As far as I know, the
> ASF already participates, so you can still sign up [1]. Last year I was a
> mentor. I just sent an email to private@ and mentors@community.apache.org.
> Would you like to check it?
>
> [1] https://community.apache.org/gsoc.html#prospective-asf-mentors-read-this
>
> 2016-03-22 17:32 GMT-07:00 Enis Söztutar <enis.soz@gmail.com>:
>>>
>>> I didn't sign up for GSoC, Talat. Not sure anyone else did either. Is it
>>> too late for us to participate now?
>>>
>>>
>> The ASF participates in GSoC, so HBase can automatically participate, AFAIK.
>>
>>
>>> I'd mentor you (it'd be easy-peasy -- smile) but I think I've missed the
>>> mentor signup deadline.
>>>
>>
>> I did not check the deadline; if that is the case, does it mean this year
>> is over?
>>
>> Your list is pretty good. We can PoC with Cap'n Proto as well as gRPC.
>>
>>
>>>
>>>
>>> > BTW, I talked with Enis Soztutar. He offered some topics for GSoC. These
>>> > are:
>>> > - He mentioned that data blocks are stored with PREFIX, FAST_DIFF, etc.
>>> > encodings, but these encodings can only be used in the HFile context. In
>>> > RPC and the WAL we use KeyValueEncoding for cell blocks. He said, "You
>>> > can improve them, or use the HFile encodings in RPC and the WAL." (He
>>> > didn't give the issue number, but I guessed it is HBASE-12883, Support
>>> > block encoding based on knowing set of column qualifiers up front.) A
>>> > rough sketch of the prefix-encoding idea follows below.
>>> >
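To make the encoding idea concrete, here is a minimal, self-contained
sketch of what a PREFIX-style encoding does over a sorted run of keys.
All names are illustrative; the real HBase DataBlockEncoder interfaces
differ:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.List;

    // Illustrative PREFIX-style encoder: each key is written as
    // (sharedPrefixLen, suffixLen, suffix) relative to the previous key.
    // Sorted keys share long prefixes, so cell blocks shrink on the wire.
    public final class PrefixBlockSketch {

      public static byte[] encode(List<byte[]> sortedKeys) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        byte[] prev = new byte[0];
        for (byte[] key : sortedKeys) {
          int shared = commonPrefix(prev, key);
          out.writeShort(shared);               // bytes shared with prev key
          out.writeShort(key.length - shared);  // bytes that differ
          out.write(key, shared, key.length - shared);
          prev = key;
        }
        return buf.toByteArray();
      }

      private static int commonPrefix(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length), i = 0;
        while (i < n && a[i] == b[i]) i++;
        return i;
      }
    }

Decoding replays the same walk; the open question for the RPC and WAL
contexts is where the per-block state (the previous key) would live.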
>>>
>>> Sounds like a fine project (Someone was just asking about this offline...)
>>>
>>>
>>>
>>> > - HBASE-14379 Replication V2
>>> > - HBASE-8691 High-Throughput Streaming Scan API
>>> > - HBASE-3529 Native Solr Indexer for HBase (he just mentioned HBase ->
>>> > Solr indexing; I guess it could be this issue)
>>> >
>>> > Could you help me select a topic, or could you suggest another issue?
>>> >
>>> >
>>> All above are good.
>>>
>>> Here's a few others made for another context:
>>>
>>> + Become an expert in the Jepsen distributed systems test tool: run it
>>> against HBase and HDFS. Analyze the results. E.g. see
>>> https://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen
>>> + Deep dive on HBase compactions. Own it. Review the current options: the
>>> defaults, the experimental, and the stale. Build tooling and surface
>>> metrics that give better insight into the effectiveness of compaction
>>> mechanics and policies. Develop tunings and alternate, new policies. For
>>> further credit, develop a master-orchestrated compaction algorithm.
>>> + Reimplement HBase append and increment as write-only with rollup on
>>> read, or using CRDTs
>>> (https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type); a
>>> G-Counter sketch follows after this list.
>>> + Make the HBase server async/event-driven/SEDA, moving it off its current
>>> thread-per-request basis
>>> + UI: build out more pages and tabs on the HBase master exposing more of
>>> our cluster metrics (make the master into a metrics sink). Extra points for
>>> views, histograms, or dashboards that are both informative AND pretty (D3,
>>> etc.). A good benchmark would be subsuming the Hannibal tool
>>> https://github.com/sentric/hannibal
>>> + Build an example application on HBase for test and illustration: e.g.
>>> use Jimmy Lin's / the Internet Archive's warcbase
>>> (https://github.com/lintool/warcbase) to load Common Crawl's regular
>>> webcrawls (https://commoncrawl.org/), or load HBase with Wikipedia, the
>>> Flickr dataset, or any dataset that appeals. Extra credit for documenting
>>> the steps involved and filing issues where the API is awkward or hard to
>>> follow.
>>> + Add actionable statistics to HBase internals that capture vitals about
>>> the data being served and that we can exploit when responding to queries;
>>> e.g. rough sizes of rows, column families, columns-per-row-per-region,
>>> etc. For example, if a client has been stepping sequentially through the
>>> data, the stats would allow us to recognize this state so we could switch
>>> to a different scan type, one that is optimal for a sequential progression
>>> (a detector sketch follows after this list).
>>> + Review and redo our fundamental merge sort, the basis of our read. There
>>> are a few techniques to try, such as a "loser tree merge"
>>> (http://sandbox.mc.edu/~bennet/cs402/lec/losedex.html), but ideally we'd
>>> make our merge sort block-based rather than Cell-based. Set yourself up in
>>> a rig (a baseline merge rig is sketched after this list) and try different
>>> Cell formats to get yourself to a cache-friendly Cell format that
>>> maximizes instructions per cycle.
>>> + Our client is heavy-weight and has accumulated lots of logic over time.
>>> E.g. it is hard to set a single timeout for a request because the client
>>> is layered, each layer with its own running timeouts. At its core is a
>>> mostly-done async engine. Review and finish the async work. Rewrite where
>>> it makes sense after analysis.
>>> + Our RPC is based on protobuf Service where we plugged in our own RPC
>>> transport. An exploratory PoC putting HBase up on grpc was done by the grpc
>>> team. Bring this project home. Extra points if you reveal a Streaming
>>> Interface between Client and Server.
>>> + Tiering: if regions are cold, close them so they don't occupy resources
>>> (close files, purge their data from cache...); reopen when a request
>>> comes in.
>>> + Dynamic configuration of running HBase
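On the CRDT item above, here is a minimal, self-contained G-Counter
sketch in Java (hypothetical names, not HBase APIs): increments are
write-only per-node counts, the value is a rollup at read time, and
merge is a per-node max, so replicas converge without coordination:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical G-Counter (grow-only counter) CRDT: each replica only
    // bumps its own slot, so writes never conflict; the value is the
    // rollup (sum) at read time, and merge is a commutative per-node max.
    public final class GCounter {
      private final String nodeId;
      private final Map<String, Long> counts = new HashMap<>();

      public GCounter(String nodeId) { this.nodeId = nodeId; }

      public void increment(long delta) {   // write-only path
        counts.merge(nodeId, delta, Long::sum);
      }

      public long value() {                 // rollup on read
        return counts.values().stream().mapToLong(Long::longValue).sum();
      }

      public void merge(GCounter other) {   // convergent replica merge
        other.counts.forEach((node, n) -> counts.merge(node, n, Math::max));
      }
    }

An HBase variant would presumably persist each replica's slot as its own
cell and do the rollup in the read path or at compaction time.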
>>>
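For the statistics item, a sketch of the kind of detector meant
(illustrative only, not an existing HBase class): track whether
successive request start keys move strictly forward, and report
"sequential" once a streak builds up:

    import java.util.Arrays;

    // Illustrative per-scanner access-pattern detector: if each request's
    // start key sorts after the previous one, a streak accumulates; past a
    // threshold the server could switch to a scan type tuned for
    // sequential progression (e.g. larger readahead).
    public final class SequentialAccessDetector {
      private static final int THRESHOLD = 16; // consecutive forward reads
      private byte[] lastKey;
      private int streak;

      public void onRequest(byte[] startKey) {
        if (lastKey != null && Arrays.compareUnsigned(startKey, lastKey) > 0) {
          streak++;              // still moving forward through the keyspace
        } else {
          streak = 0;            // a random access resets the streak
        }
        lastKey = startKey.clone();
      }

      public boolean looksSequential() {
        return streak >= THRESHOLD;
      }
    }

(Arrays.compareUnsigned needs Java 9+; unsigned byte order matches how
HBase compares row keys.)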
>>>
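And for the merge sort item, a baseline k-way merge rig built on a
binary heap; a loser tree would slot in where the PriorityQueue sits,
cutting comparisons per emitted element (names again illustrative):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    // Baseline k-way merge over sorted runs, driven by a binary heap. In
    // the optimized version a loser tree replaces the PriorityQueue: it
    // refills with about log2(k) comparisons per emitted element and
    // lives in a flat array, which is friendlier to the cache.
    public final class MergeRig {

      private static final class Head<E extends Comparable<E>>
          implements Comparable<Head<E>> {
        final E value;
        final Iterator<E> rest;
        Head(E value, Iterator<E> rest) { this.value = value; this.rest = rest; }
        @Override public int compareTo(Head<E> o) { return value.compareTo(o.value); }
      }

      public static <E extends Comparable<E>> List<E> merge(List<Iterator<E>> runs) {
        PriorityQueue<Head<E>> heap = new PriorityQueue<>();
        for (Iterator<E> run : runs) {
          if (run.hasNext()) heap.add(new Head<>(run.next(), run));
        }
        List<E> out = new ArrayList<>();
        while (!heap.isEmpty()) {
          Head<E> h = heap.poll();          // smallest head across all runs
          out.add(h.value);
          if (h.rest.hasNext()) heap.add(new Head<>(h.rest.next(), h.rest));
        }
        return out;
      }
    }

Swapping in different Cell representations for E is where the
cache-friendliness experiments would happen.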
>>> St.Ack
>>>
>>>
>>>
>>>
>>> > Thanks
>>> > --
>>> > Talat UYARER
>>> >
>>>
>
>
>
> --
> Talat UYARER
> Website: http://talat.uyarer.com
> Twitter: http://twitter.com/talatuyarer
> LinkedIn: http://tr.linkedin.com/pub/talat-uyarer/10/142/304



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
LinkedIn: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
