www-announce mailing list archives

From Sally Khudairi <...@apache.org>
Subject Success at Apache: A Newbie’s Narrative
Date Mon, 05 Feb 2018 11:00:45 GMT
[this post is available online at https://s.apache.org/A72H ]

by Kuhu Shukla

As I sit at my desk on a rather frosty morning with my coffee, looking up new JIRAs from the
previous day in the Apache Tez project, I feel rather pleased. The latest community release
vote is complete, the bug fixes that we so badly needed are in and the new release that we
tested out internally on our many-thousand-strong cluster is looking good. Today I am looking
at a new stack trace from a different Apache project process and it is hard to miss how much
of the exceptional code I get to look at every day comes from people all around the globe.
A contributor leaves a JIRA comment before he goes on to pick up his kid from soccer practice
while someone else wakes up to find that her effort on a bug fix for the past two months has
finally come to fruition through a binding +1.

Yahoo – which joined AOL, HuffPost, Tumblr, Engadget, and many more brands to form the Verizon
subsidiary Oath last year – has been at the frontier of open source adoption and contribution
since before I was in high school. So while I have no historical trajectories to share, I
do have a story about how I found myself on an epic journey of migrating all of Yahoo's jobs from
Apache MapReduce to Apache Tez, a then-new DAG-based execution engine.

Oath's grid infrastructure is driven through and through by Apache technologies, be it storage
through HDFS, resource management through YARN, job execution frameworks with Tez, or user
interface engines such as Hive, Hue, Pig, Sqoop, Spark, and Storm. Our grid solution is specifically
tailored to Oath's business-critical data pipeline needs using the polymorphic technologies
hosted, developed, and maintained by the Apache community.

On the third day of my job at Yahoo in 2015, I received a YouTube link to "An Introduction
to Apache Tez". I watched it carefully, trying to keep up with all the questions it raised, and recognized
a few names from my academic readings of YARN ACM papers. I continued to ramp up on YARN and
HDFS, the foundational Apache technologies Oath heavily contributes to even today. For the
first few weeks I spent time picking out my favorite (necessary) mailing lists to subscribe
to and setting up a pseudo-distributed Hadoop cluster. I continued to
find my footing with newbie contributions, being ever more careful with whitespace in
my patches. One thing was clear – Tez was the next big thing for us. By the time I could
truly call myself a contributor in the Hadoop community, nearly 80-90% of Yahoo's jobs were
running on Tez. But just like hiking up the Grand Canyon, the last 20% is where all
the pain lies. Being part of the solution to this challenge was a happy prospect, and thankfully
contributing to Tez became a goal for my next quarter.

The next sprint planning meeting ended with me getting my first major Tez assignment – progress
reporting. Progress reporting in Tez was non-existent – "Just needs an API fix," I
thought. Like almost all bugs in this ecosystem, it was not easy. How do you define progress?
How is it different for different kinds of outputs in a graph? The questions were many.
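
The ambiguity is easy to make concrete with a small sketch (hypothetical, not Tez's actual API): one naive definition of progress aggregates per-vertex task counts across the DAG, and it immediately shows what a single scalar hides.

```python
# Hypothetical sketch of why "progress" is a hard question in a DAG
# execution engine. This is an illustration, not Tez's real reporting API.

def dag_progress(vertices):
    """vertices: list of (completed_tasks, total_tasks), one per DAG vertex.

    Returns the fraction of all tasks completed across the whole DAG."""
    total = sum(t for _, t in vertices)
    if total == 0:
        return 0.0
    done = sum(c for c, _ in vertices)
    return done / total

# A three-vertex DAG: the first stage is done, the second is partway,
# the last has not started. The overall number hides *which* stage is
# slow and how outputs differ per vertex – one reason a single scalar
# is a lossy answer to "how far along are we?"
overall = dag_progress([(10, 10), (5, 20), (0, 40)])  # 15 of 70 tasks ≈ 0.214
```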

I, however, did not have to go far to get answers. The Tez community actively came to a newbie's
rescue, finding answers and posing important questions. I started attending the bi-weekly
Tez community sync up calls and asking existing contributors and committers for course correction.
Suddenly the team was much bigger, the goals much more chiseled. This was new to anyone like
me coming from the networking industry, where the most open part of the code is the RFCs
and the implementation details are often hidden. These meetings served as a clean room for
our coding ideas and experiments. Ideas were shared freely, down to which data structure
we should pick and what a future user of Tez would take from it. In between the usual status
updates, extensive knowledge transfers took place.

Oath uses Apache Pig and Apache Hive extensively and most of the urgent requirements and requests
came from Pig and Hive developers and users. Each issue led to a community JIRA, and as we
started running Tez at Oath scale, new feature ideas and bugs around performance and resource
utilization materialized. Every year most of the Hadoop team at Oath travels to the Hadoop
Summit where we meet our cohorts from the Apache community and we stand for hours discussing
the state of the art and what is next for the project. One such discussion set the course
for the next year and a half for me.

We needed an innovative way to shuffle data. Frameworks like MapReduce and Tez have a shuffle
phase in their processing life cycle wherein the data from upstream producers is made available
to downstream consumers. Even though Apache Tez was designed with a feature set corresponding
to optimization requirements in Pig and Hive, the Shuffle Handler Service was retrofitted
from MapReduce at the time of the project's inception. With several thousand jobs on our
clusters leveraging these features in Tez, the Shuffle Handler Service became a clear performance
bottleneck. So as we stood talking about our experience with Tez with our friends from the
community, we decided to implement a new Shuffle Handler for Tez. All the conversation points
were now tracked through an umbrella JIRA, TEZ-3334, and the to-do list was long. I picked a
few JIRAs, and as I started reading through them I realized this was all new code I would get
to contribute to and review. There might be a better way to put this, but to be honest it was just a lot
of fun! All the whiteboards were full, and the team took walks after lunch to discuss how to
go about defining the API. Countless hours were spent debugging hangs while fetching data
and poring over stack traces and Wireshark captures from our test runs. Six months in, we
had the feature on our sandbox clusters. There were moments ranging from sheer frustration
to absolute exhilaration, with high fives, as we continued to address review comments and fix
big and small issues with this evolving feature.
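
The shuffle phase described above can be sketched in miniature (a conceptual illustration, not the actual Tez or MapReduce implementation): upstream producers emit keyed records, and a partitioner routes every record with a given key to the same downstream consumer, which is what makes per-key aggregation possible.

```python
# Conceptual sketch of a shuffle phase: hash-partitioning keyed records
# from upstream producers to downstream consumers. This illustrates the
# idea only; it is not the Tez Shuffle Handler's real code.

from collections import defaultdict

def shuffle(producer_outputs, num_consumers):
    """producer_outputs: one list of (key, value) pairs per upstream producer.

    Routes each record to a partition by hashing its key, so that all
    records sharing a key land on the same downstream consumer."""
    partitions = defaultdict(list)
    for records in producer_outputs:
        for key, value in records:
            partitions[hash(key) % num_consumers].append((key, value))
    return dict(partitions)

# Two upstream producers, two downstream consumers. Both ("a", ...) records
# end up in the same partition, wherever the hash sends them.
parts = shuffle([[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]], 2)
```

Serving these partitions to remote consumers efficiently – over the network, for thousands of concurrent jobs – is where a dedicated Shuffle Handler Service earns its keep.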

As much as owning your code is valued everywhere in the software community, I would never
go so far as to say "I did this!" In truth, "we did!" It is this strong sense of shared ownership
and fluid team structure that makes the open source experience at Apache truly rewarding.
This is just one example. A lot of the work done in Tez was leveraged by the Hive
and Pig communities, and cross-project interaction within Apache made the work ever more interesting
and challenging. Triaging and fixing issues with the Tez rollout led us to hit a 100% migration
score last year and we also rolled the Tez Shuffle Handler Service out to our research clusters.
As of last year we have run around 100 million Tez DAGs with a total of 50 billion tasks over
almost 38,000 nodes.

In 2018 as I move on to explore Hadoop 3.0 as our future release, I hope that if someone outside
the Apache community is reading this, it will inspire and intrigue them to contribute to a
project of their choice. As an astronomy aficionado, going from a newbie Apache contributor
to a newbie Apache committer was very much like looking through my telescope – it has endless
possibilities and challenges you to be your best.

Kuhu Shukla is a software engineer at Oath and earned her Master's in Computer Science at North
Carolina State University. She works on the Big Data Platforms team on Apache Tez, YARN, and
HDFS with a lot of talented Apache PMC members and Committers in Champaign, Illinois. A recent Apache
Tez Committer herself, she continues to contribute to YARN and HDFS, and spoke at the 2017 DataWorks
Hadoop Summit on "Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop". Prior to that
she worked on Juniper Networks' router and switch configuration APIs. She likes to participate
in open source conferences and women in tech events. In her spare time she loves singing Indian
classical and jazz, laughing, whale watching, hiking and peering through her Dobsonian telescope.

= = =

"Success at Apache" is a monthly blog series that focuses on the processes behind why the
ASF "just works". 1) Project Independence https://s.apache.org/CE0V 2) All Carrot and No Stick
https://s.apache.org/ykoG 3) Asynchronous Decision Making https://s.apache.org/PMvk 4) Rule
of the Makers https://s.apache.org/yFgQ 5) JFDI --the unconditional love of contributors https://s.apache.org/4pjM
6) Meritocracy and Me https://s.apache.org/tQQh 7) Learning to Build a Stronger Community
https://s.apache.org/x9Be 8) Meritocracy. https://s.apache.org/DiEo 9) Lowering Barriers to
Open Innovation https://s.apache.org/dAlg 10) Scratch your own itch. https://s.apache.org/Apah
11) What a Long Strange (and Great) Trip It's Been https://s.apache.org/gVuN 12) A Newbie's
Narrative https://s.apache.org/A72H

# # #  

NOTE: you are receiving this message because you are subscribed to the announce@apache.org
distribution list. To unsubscribe, send email from the recipient account to announce-unsubscribe@apache.org
with the word "Unsubscribe" in the subject line. 
