arrow-dev mailing list archives

From Wes McKinney <...@cloudera.com>
Subject Re: A Proposal Apache Incubator Mnemonic as an alternative infra. for Apache Arrow
Date Thu, 31 Mar 2016 16:34:19 GMT
I'm on vacation the week of 4/11 and 4/18, and I'm very interested in
the implications / work that can be done on the C++ side as well, so I
look forward to the mailing list discussion after you meet to talk
through some of the mutual efforts.

Thanks
Wes

On Thu, Mar 31, 2016 at 7:47 AM, Patrick Hunt <phunt@apache.org> wrote:
> fwiw I've seen some projects use hangouts/webex pretty effectively.
>
> Patrick
>
> On Wed, Mar 30, 2016 at 11:15 PM, Wang, Yanping <yanping.wang@intel.com> wrote:
>> Yeah, I was so busy and in a hurry to catch other sessions. We only talked for about 2 minutes :-)
>> After Jacques and Wes's Arrow presentation, someone in the audience asked if Arrow is going to use RDMA. I answered: RDMA is going to be used in the Mnemonic project to support data transfer among nodes and clusters.
>> It makes perfect sense that we position Mnemonic under Arrow to support its use of persistent storage media.
>>
>> Thanks Patrick, Henry, and Taylor G for the guidelines. We can brainstorm ideas on both dev lists and post those ideas in JIRA so developers can see where our projects are heading.
>> Gary and I are located in Portland, Oregon; we usually plan our SC visits 2 weeks ahead.
>>
>> Thanks,
>> Yanping
>>
>>
>> -----Original Message-----
>> From: Jacques Nadeau [mailto:jacques@apache.org]
>> Sent: Wednesday, March 30, 2016 7:34 PM
>> To: dev@arrow.apache.org
>> Cc: dev@mnemonic.incubator.apache.org; dev@mnemonic.apache.org
>> Subject: Re: A Proposal Apache Incubator Mnemonic as an alternative infra. for Apache Arrow
>>
>> Yup. Will do.
>>
>> The discussion today was limited to "let's meet".
>>
>>
>>
>> On Wed, Mar 30, 2016 at 7:13 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
>>
>>> +1
>>>
>>> Discussions should be summarized and brought back to the mailing list(s).
>>> Recommendations are fine, but any decisions should be made on-list.
>>>
>>> -Taylor
>>>
>>> > On Mar 30, 2016, at 8:31 PM, Patrick Hunt <phunt@apache.org> wrote:
>>> >
>>> > Remember that no decisions should be made at the meeting. It's fine to
>>> > have discussions, but those need to be brought back to the community
>>> > before decisions are made. Summarizing for the dev@ mailing list, also
>>> > jiras, etc... are good ways to socialize the issues.
>>> >
>>> > Patrick
>>> >
>>> >> On Wed, Mar 30, 2016 at 5:17 PM, Henry Saputra <henry.saputra@gmail.com> wrote:
>>> >> The communities for both podlings are bigger than the ones who show up at Strata =)
>>> >>
>>> >> Would love to have the summary of the discussions in the dev@ list if
>>> >> indeed some discussions are happening at Strata.
>>> >>
>>> >> - Henry
>>> >>
>>> >> On Wed, Mar 30, 2016 at 5:03 PM, Wang, Yanping <yanping.wang@intel.com> wrote:
>>> >>
>>> >>> Hi, All
>>> >>>
>>> >>> I met with Jacques today at Strata. We think it would be great if the Arrow and Mnemonic communities could have an F2F meeting to talk about our integration.
>>> >>> I have the following two days available: Monday 4/11 in the afternoon, or Friday 4/15.
>>> >>> We can meet at the Intel SC campus.
>>> >>>
>>> >>> Would you let me know if you are able to join us and which day you'd
>>> >>> prefer?
>>> >>>
>>> >>> Thanks
>>> >>> Yanping
>>> >>>
>>> >>>
>>> >>> On Mar 29, 2016, at 4:38 PM, Gary <garyw@apache.org> wrote:
>>> >>>
>>> >>> Yes, I agree with you, and it would be great if we could brainstorm here to collect more ideas about enabling non-volatile memory usage for Apache Arrow through Mnemonic.
>>> >>>
>>> >>> For the questions, my ideas are:
>>> >>>
>>> >>>
>>> >>> - Right now you are using unpooled persistent memory. Does that make sense or does chunking make more sense?
>>> >>>
>>> >>> Gary: I think it could make some sense if developers know that their datasets are very big and they want Apache Arrow to keep most of them in memory for intensive computing, e.g. sort.
>>> >>> Developers can certainly spill their Mnemonic-managed datasets to disk, but this seems a bit inefficient in some scenarios, depending on the concrete application logic.
>>> >>>
>>> >>>
>>> >>> - What do you think is the right way to transition back and forth between persistent and ephemeral memory? What do you think will be the first pattern to be adopted. For example, do you think we should try to use it as a tiered storage for sort spilling (before hitting the disk), or should we use it for caching?
>>> >>> Gary: My 2 cents: the Netty library does not yet seem to provide an elegant switch mechanism for Arrow to use. We could probably change the logic around "initialCapacity > directArena.chunkSize" to control which buffers are put off-heap and which are managed by Mnemonic. Another approach is to let Mnemonic's memory clustering mechanism manage the hybrid memory-like spaces in place of part of the logic in class PooledByteBufAllocatorL.
>>> >>> Regarding sorting, I think it is a typical case of random access to the data, so we should avoid spilling as much as possible.
>>> >>> My 2 cents: in terms of performance, the order could be
>>> >>> all in off-heap if possible > mnemonic used as cache > all in mnemonic using NVMe/disk > off-heap + spilling
>>> >>> In terms of code simplicity, the order would be
>>> >>> all in off-heap if possible > all in mnemonic using NVMe/disk > mnemonic used as cache > off-heap + spilling
>>> >>>
>>> >>> The reason the mode "mnemonic used as cache + spilling" is probably unnecessary is that Mnemonic could provide capacity nearly equivalent to disk.
>>> >>>
>>> >>> Thanks.
>>> >>> Gary.
>>> >>>
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: Jacques Nadeau [mailto:jacques@apache.org]
>>> >>> Sent: Tuesday, March 29, 2016 8:05 AM
>>> >>> To: dev@arrow.apache.org
>>> >>> Subject: Re: A Proposal Apache Incubator Mnemonic as an alternative infra. for Apache Arrow
>>> >>>
>>> >>>
>>> >>>
>>> >>> This is super cool. A couple of questions:
>>> >>>
>>> >>>
>>> >>>
>>> >>> - Right now you are using unpooled persistent memory. Does that make sense or does chunking make more sense?
>>> >>>
>>> >>> - What do you think is the right way to transition back and forth between persistent and ephemeral memory? What do you think will be the first pattern to be adopted. For example, do you think we should try to use it as a tiered storage for sort spilling (before hitting the disk), or should we use it for caching?
>>> >>>
>>> >>>
>>> >>>
>>> >>> I think it will be much easier to think about this in the context of a primary or first use case. Do you have something in mind or should we brainstorm here?
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Wed, Mar 23, 2016 at 7:16 PM, Gary <garyw@apache.org> wrote:
>>> >>>
>>> >>>
>>> >>>
>>> >>>> Hello,
>>> >>>>
>>> >>>> We have created a patch for Apache Arrow that leverages Apache Incubator Mnemonic as an alternative infrastructure for underlying memory resource allocation. You can find it in the forked repo below.
>>> >>>>
>>> >>>> https://github.com/NonVolatileComputing/arrow
>>> >>>>
>>> >>>> In this way, Apache Arrow could gain some structural benefits from the Mnemonic project:
>>> >>>>
>>> >>>>    - Arrow is able to leverage the larger capacity of high-performance hybrid storage devices, e.g. high-end SSD, NVMe.
>>> >>>>
>>> >>>>    - Mnemonic provides a potential opportunity for Arrow to optimize/tune its allocation algorithms as native Arrow-oriented allocation services.
>>> >>>>
>>> >>>>    - The non-volatile features of Mnemonic make it possible for Arrow to share its columnar in-memory data between different applications or across the life cycle of a single application.
>>> >>>>
>>> >>>>    - Arrow could take advantage of upcoming Mnemonic features: memory clustering/DOG (distributed object graph) and massive native computing.
>>> >>>>
>>> >>>>    - Mnemonic helps to reduce the pressure on main memory utilization and its related system-wide overheads.
>>> >>>>
>>> >>>> Our patch is designed to minimize the changes users need to make to use Arrow; please check out the test cases provided by this patch for reference.
>>> >>>>
>>> >>>> Note that we need to put the allocator services in a specified position (indicated by pom.xml) for the Mnemonic-backed Arrow test cases to run, because those services are required for external memory-like device management.
>>> >>>>
>>> >>>> Please give your comments and review feedback for better collaboration between Apache Arrow and Mnemonic. Thanks.
>>> >>>>
>>> >>>> Best Regards.
>>> >>>> Gary.
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>>
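
Gary's suggestion in the quoted thread, branching on "initialCapacity > directArena.chunkSize" to decide whether a buffer goes off-heap or is managed by Mnemonic, could look roughly like the Java sketch below. It is an illustration under stated assumptions, not code from the patch: the class TieredAllocatorSketch, the Tier enum, and the method allocateNonVolatile are hypothetical names; the direction of the comparison (requests larger than the chunk size go to Mnemonic) is assumed; and a plain heap ByteBuffer stands in for Mnemonic-managed memory, since the thread shows no Mnemonic API.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical sketch of a size-threshold switch between two memory tiers.
 * Nothing here is taken from Arrow, Netty, or Mnemonic source code.
 */
final class TieredAllocatorSketch {

    /** The two tiers discussed in the thread. */
    enum Tier { OFF_HEAP, NON_VOLATILE }

    // Stand-in for directArena.chunkSize from the quoted condition.
    private final int chunkSize;

    TieredAllocatorSketch(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    /** Assumption: requests larger than one chunk go to the non-volatile tier. */
    Tier chooseTier(int initialCapacity) {
        return initialCapacity > chunkSize ? Tier.NON_VOLATILE : Tier.OFF_HEAP;
    }

    /** Allocates from whichever tier chooseTier selects. */
    ByteBuffer allocate(int initialCapacity) {
        if (chooseTier(initialCapacity) == Tier.NON_VOLATILE) {
            return allocateNonVolatile(initialCapacity);
        }
        // Ordinary off-heap (direct) memory for small requests.
        return ByteBuffer.allocateDirect(initialCapacity);
    }

    /**
     * Placeholder only: a real integration would request memory from a
     * Mnemonic allocator service here. A heap buffer is used so the
     * sketch runs without external dependencies.
     */
    private ByteBuffer allocateNonVolatile(int capacity) {
        return ByteBuffer.allocate(capacity);
    }
}
```

With a chunk size of 1024, for example, a 16-byte request would be served from direct memory and a 2048-byte request from the placeholder tier. Keeping the decision in a single method such as chooseTier means the policy could later be swapped (for instance, for the caching or spilling modes ranked in the thread) without touching callers.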
