hudi-dev mailing list archives

From: Vinoth Chandar <vin...@apache.org>
Subject: Re: [DISCUSS] Decouple Hudi and Spark - abstract over persistence
Date: Mon, 05 Aug 2019 16:02:54 GMT
Nick, I will respond on the JIRA. Hudi uses the Hadoop FileSystem
abstraction, so it is already decoupled from HDFS per se, and that is how
we can write to cloud stores.
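
As a minimal sketch of what that abstraction buys us (illustrative only,
not Hudi code, and the paths are made up), the URI scheme picks the
FileSystem implementation at runtime:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;

  public class FsSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Same call, different backing store: the scheme (hdfs://, s3a://, ...)
      // selects the FileSystem implementation. Writing to S3 only needs the
      // hadoop-aws module on the classpath, not a code change.
      for (String base : new String[] {
          "hdfs://namenode:8020/warehouse/trips",
          "s3a://my-bucket/warehouse/trips"}) {
        FileSystem fs = FileSystem.get(URI.create(base), conf);
        System.out.println(base + " -> " + fs.getClass().getName());
      }
    }
  }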

+1 to Taher's suggestion.

On Sun, Aug 4, 2019 at 8:37 AM taher koitawala <taherk77@gmail.com> wrote:

> Nick, I request that you please stick to one discussion thread. We have
> three different ones to follow now, and it is difficult to keep track of
> things.
>
> On Sun, Aug 4, 2019, 9:05 PM Semantic Beeng <nick@semanticbeeng.com>
> wrote:
>
>> As part of this refactoring of Hudi, can we also abstract over the
>> persistence layer to allow implementations other than HDFS?
>>
>> See
>> https://issues.apache.org/jira/browse/HUDI-95?focusedCommentId=16899650&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16899650
>> for details.
>>
>> Many thanks for your consideration.
>>
>> Nick
>>
>>
>>
>> On August 4, 2019 at 4:21 AM vino yang <yanghua1127@gmail.com> wrote:
>>
>>
>> Hi Nick,
>>
>> Thank you for your more detailed thoughts; I fully agree with your
>> thoughts about HudiLink, which should also be part of the long-term
>> planning of the Hudi ecosystem.
>>
>>
>> *But I find that we are thinking from different angles and starting
>> points. I pay more attention to the soundness of the existing
>> architecture and whether the dependency on the computing engine is
>> pluggable. Don't get me wrong: I know very well that although we have
>> different perspectives, both views have value for Hudi.*
>> Let me give more details on the discussion I started earlier.
>>
>> Currently, multiple submodules of the Hudi project are tightly coupled to
>> Spark's design and dependencies. You can see that many of the class files
>> contain statements such as "import org.apache.spark.xxx".
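>>
>> For instance, a hypothetical write-path signature of roughly this shape
>> (illustrative only, not the actual Hudi code) shows the problem: Spark
>> types appear in the public API, so every caller is forced onto Spark:
>>
>>   import org.apache.spark.api.java.JavaRDD;
>>   import org.apache.spark.api.java.JavaSparkContext;
>>
>>   // Hypothetical, for illustration: the engine type leaks into the API.
>>   interface EngineCoupledWriteClient<T> {
>>     // Callers must hold a JavaSparkContext and JavaRDD even if they would
>>     // rather run on Flink or Beam. (Return type simplified to String.)
>>     JavaRDD<String> upsert(JavaSparkContext jsc, JavaRDD<T> records,
>>                            String commitTime);
>>   }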
>>
>> I first put forward a discussion: "Integrate Hudi with Apache Flink", and
>> then came up with a discussion: "Decouple Hudi and Spark".
>>
>> I think the word "Integrate" I used for the first discussion may not have
>> been accurate enough. My intention is to make the computing engine used by
>> Hudi pluggable. To Hudi, Spark is just a library; it is not the core of
>> Hudi and should not be strongly coupled with it. The features currently
>> provided by Spark are also available from Flink. But in order to achieve
>> this, we need to decouple Hudi's code from its use of Spark.
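>>
>> Concretely, the decoupling could look like a small engine-neutral
>> abstraction in the client, along the lines of this sketch (the names here
>> are hypothetical, just to illustrate the shape):
>>
>>   import java.util.List;
>>   import java.util.function.Function;
>>   import java.util.stream.Collectors;
>>
>>   // Hypothetical engine abstraction: the client codes against this, and
>>   // hoodie-client-spark / -flink / -beam each supply an implementation
>>   // backed by their native distributed collections.
>>   interface HoodieEngineContext {
>>     <I, O> List<O> map(List<I> input, Function<I, O> fn, int parallelism);
>>   }
>>
>>   // Trivial local implementation; a Spark one would delegate to
>>   // JavaRDD.map, a Flink one to DataStream transformations.
>>   class LocalEngineContext implements HoodieEngineContext {
>>     @Override
>>     public <I, O> List<O> map(List<I> input, Function<I, O> fn,
>>                               int parallelism) {
>>       return input.stream().map(fn).collect(Collectors.toList());
>>     }
>>   }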
>>
>> This makes sense both in terms of architectural soundness and the
>> community ecosystem.
>>
>> Best,
>> Vino
>>
>>
>> Semantic Beeng <nick@semanticbeeng.com> wrote on Sun, Aug 4, 2019 at 2:21 PM:
>>
>> "+1 for both Beam and Flink" - what I propose implies this indeed.
>>
>> But I am working from the desired functionality and a proposed design.
>>
>> (as opposed to starting with refactoring Hudi with the goal of close
>> integration with Flink)
>>
>> I feel this is not necessary, but I am not an expert in the Hudi
>> implementation.
>>
>> But I am pretty sure it is not sufficient for the use cases I have in
>> mind. The gist is using Hudi as a file-based data lake + ML feature store
>> that enables incremental analyses done with a combination of Flink, Beam,
>> Spark, and TensorFlow (see Petastorm from Uber Engineering for an idea).
>>
>> Let us call this HudiLink from now on (think of it as a mediator, not
>> another Hudi).
>>
>> The intuition behind looking at more than Flink is that both Beam and
>> Flink have good design abstractions we might reuse and extend.
>>
>> As I said before, I do not believe in point-to-point integrations.
>>
>> Alternatively / in parallel, if you care to share your use cases, that
>> would be very useful. Working with explicit use cases helps others to
>> relate and to help.
>>
>> Also, if some of you believe in (see) the value of refactoring the Hudi
>> implementation for a hard integration with Flink (but have no time to
>> argue for it), of course please go ahead.
>>
>> That may be a valid bottom-up approach, but I cannot relate to it myself
>> (due to a lack of use cases).
>>
>> I am working on material about HudiLink; if anyone is interested, I may
>> publish it when it is more mature.
>>
>> Hint: this was part of the inspiration: https://eng.uber.com/michelangelo/
>>
>> One well-thought-out use case will get you "in". :-) Kidding, of course.
>>
>> Cheers
>>
>> Nick
>>
>> On August 3, 2019 at 10:55 PM vino yang <yanghua1127@gmail.com> wrote:
>>
>> +1 for both Beam and Flink
>>
>> The first step here is probably to draw out the current hierarchy and
>> figure out what the abstraction points are.
>> In my opinion, the runtime (Spark, Flink) should be handled at the
>> hoodie-client level and just used by hoodie-utilities seamlessly.
>>
>> +1 for Vinoth's opinion; it should be the first step.
>>
>> No matter which computing framework we hope Hudi will integrate with, we
>> need to decouple the Hudi client from Spark.
>>
>> We may need a pure client module named, for example,
>> hoodie-client-core (or hoodie-client-common).
>>
>> Then we could have: hoodie-client-spark, hoodie-client-flink, and
>> hoodie-client-beam.
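>>
>> One possible layout (the split of responsibilities here is just my guess,
>> for illustration):
>>
>>   hudi
>>   |-- hoodie-client-core    <- engine-agnostic table, commit, index logic
>>   |-- hoodie-client-spark   <- implementation backed by RDDs
>>   |-- hoodie-client-flink   <- implementation backed by DataStreams
>>   |-- hoodie-client-beam    <- implementation backed by PCollections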
>>
>> Suneel Marthi <smarthi@apache.org> wrote on Sun, Aug 4, 2019 at 10:45 AM:
>>
>> +1 for Beam -- agree with Semantic Beeng's analysis.
>>
>> On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <taherk77@gmail.com>
>> wrote:
>>
>> So the way to go about this is to file a HIP, chalk out all the classes,
>> and start moving towards a pure client.
>>
>> Secondly, do we want to try Beam?
>>
>> I think there is too much going on here and I'm not able to follow. If we
>> want to try out Beam all along, then I don't think it makes sense to do
>> anything on Flink.
>>
>> On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <nick@semanticbeeng.com>
>> wrote:
>>
>> +1 My money is on this approach.
>>
>> >> The existing abstractions from Beam seem enough for the use cases as I
>> >> imagine them.
>> >>
>> >> Flink also has "dynamic table", "table source" and "table sink", which
>> >> seem like very useful abstractions where Hudi might fit nicely.
>> >>
>> >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
>> >>
>> >> Attached is a screenshot.
>> >>
>> >> This seems to fit with the original premise of Hudi as well.
>> >>
>> >> I am exploring this avenue with a use case that involves "temporal
>> >> joins on streams", which I need for feature extraction.
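>> >>
>> >> For a flavor of it, here is a rough sketch against a recent Flink Table
>> >> API (assuming tables named "Orders" and "Rates" are already registered;
>> >> a Hudi table could eventually back one of them; none of this is Hudi
>> >> code):
>> >>
>> >>   import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>> >>   import org.apache.flink.table.api.Table;
>> >>   import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
>> >>
>> >>   public class TemporalJoinSketch {
>> >>     public static void main(String[] args) {
>> >>       StreamExecutionEnvironment env =
>> >>           StreamExecutionEnvironment.getExecutionEnvironment();
>> >>       StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
>> >>       // Temporal join: enrich each order with the exchange rate that
>> >>       // was valid at the order's event time.
>> >>       Table joined = tEnv.sqlQuery(
>> >>           "SELECT o.orderId, o.amount * r.rate AS converted "
>> >>               + "FROM Orders AS o "
>> >>               + "JOIN Rates FOR SYSTEM_TIME AS OF o.orderTime AS r "
>> >>               + "ON o.currency = r.currency");
>> >>     }
>> >>   }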
>> >>
>> >> If anyone is interested in this or has concrete enough needs and use
>> >> cases, please let me know.
>> >>
>> >> It is best to go from an agreed-upon set of 2-3 use cases.
>> >>
>> >> Cheers
>> >>
>> >> Nick
>> >>
>> >>
>> >> > Also, we do have some Beam experts on the mailing list. Can you
>> >> > please weigh in on the viability of using Beam as the intermediate
>> >> > abstraction here between Spark/Flink?
>> >> > Hudi uses RDD APIs like groupBy, mapToPair, sortAndRepartition,
>> >> > reduceByKey, and countByKey, and it also does custom partitioning a
>> >> > lot.
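>> >>
>> >> To make that list concrete, here is a toy sketch of those Spark
>> >> primitives, with rough Beam counterparts noted in the comments
>> >> (illustrative only, not Hudi code):
>> >>
>> >>   import java.util.Arrays;
>> >>   import org.apache.spark.SparkConf;
>> >>   import org.apache.spark.api.java.JavaPairRDD;
>> >>   import org.apache.spark.api.java.JavaRDD;
>> >>   import org.apache.spark.api.java.JavaSparkContext;
>> >>   import scala.Tuple2;
>> >>
>> >>   public class RddPrimitivesSketch {
>> >>     public static void main(String[] args) {
>> >>       SparkConf conf =
>> >>           new SparkConf().setAppName("sketch").setMaster("local[2]");
>> >>       try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
>> >>         JavaRDD<String> keys = jsc.parallelize(Arrays.asList("a", "b", "a"));
>> >>         // mapToPair ~ Beam MapElements into KV<K, V>
>> >>         JavaPairRDD<String, Integer> pairs =
>> >>             keys.mapToPair(k -> new Tuple2<>(k, 1));
>> >>         // reduceByKey ~ Beam Combine.perKey; countByKey ~ Count.perKey
>> >>         JavaPairRDD<String, Integer> counts = pairs.reduceByKey(Integer::sum);
>> >>         // Custom partitioning (partitionBy with a Partitioner) is the
>> >>         // hard part: Beam deliberately hides sharding from the user.
>> >>         System.out.println(counts.collectAsMap());
>> >>       }
>> >>     }
>> >>   }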
>> >>
