arrow-dev mailing list archives

From "Wang, Yanping" <yanping.w...@intel.com>
Subject Re: A Proposal Apache Incubator Mnemonic as an alternative infra. for Apache Arrow
Date Thu, 31 Mar 2016 00:03:49 GMT
Hi, All

I met with Jacques today at Strata. We think it would be great for the Arrow and Mnemonic communities
to have a face-to-face (F2F) meeting to talk about our integration.
I have the following two days available: Monday afternoon, 4/11, or Friday, 4/15.
We can meet at the Intel SC campus.

Would you let me know if you are able to join us and which day you'd prefer?

Thanks
Yanping


On Mar 29, 2016, at 4:38 PM, Gary <garyw@apache.org> wrote:

Yes, I agree with you, and it would be great if we could brainstorm here to collect more ideas about
enabling non-volatile memory usage for Apache Arrow through Mnemonic.

As for the questions, my ideas are:


- Right now you are using unpooled persistent memory. Does that make sense or does chunking
make more sense?

Gary: I think it could make sense if developers know that their datasets are very big
and they want Apache Arrow to keep most of the data in memory for intensive computing, e.g. sorting.
Developers can certainly spill their Mnemonic-managed datasets to disk, but
this seems a bit inefficient in some scenarios, depending on the concrete application
logic.


- What do you think is the right way to transition back and forth between persistent and ephemeral
memory? What do you think will be the first pattern to be adopted. For example, do you think
we should try to use it as a tiered storage for sort spilling (before hitting the disk), or
should we use it for caching?
Gary: My 2 cents: the Netty library does not yet seem to provide an elegant switching mechanism for
Arrow to use. We could probably change the logic around "initialCapacity > directArena.chunkSize"
to control which buffers are placed off-heap and which are managed by Mnemonic. Another approach is
to let Mnemonic's memory clustering mechanism manage the hybrid memory-like spaces in place of part
of the logic in class PooledByteBufAllocatorL.
Regarding sorting, I think it is a typical case of random access to the data, so we should
avoid spilling as much as possible.
In my view, the performance ranking would be:
all in off-heap if possible > mnemonic used as cache > all in mnemonic using NVMe/disk > off-heap + spilling
and the code simplicity ranking would be:
all in off-heap if possible > all in mnemonic using NVMe/disk > mnemonic used as cache > off-heap + spilling

The mode "mnemonic used as cache + spilling" is probably unnecessary because Mnemonic
could provide capacity nearly equivalent to that of disk.
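
The size-based switch suggested above (routing on a comparison like "initialCapacity > directArena.chunkSize")
could look roughly like the following Java sketch. It is only an illustration under stated assumptions:
the class TieredAllocatorSketch, the Tier enum, and the persistentBackend hook are hypothetical names
that do not exist in Arrow, Netty, or Mnemonic, and a plain function returning a ByteBuffer stands in
for a Mnemonic-managed allocation.

```java
import java.nio.ByteBuffer;
import java.util.function.IntFunction;

/**
 * Illustrative sketch only: routes each allocation by its requested size.
 * All names here are hypothetical; none come from Arrow, Netty, or Mnemonic.
 */
final class TieredAllocatorSketch {

    /** Which backing store served an allocation. */
    enum Tier { OFF_HEAP, PERSISTENT }

    /** A buffer paired with the tier that produced it. */
    static final class Allocation {
        final Tier tier;
        final ByteBuffer buffer;

        Allocation(Tier tier, ByteBuffer buffer) {
            this.tier = tier;
            this.buffer = buffer;
        }
    }

    private final int chunkSize;
    // Assumption: stands in for whatever call would hand out Mnemonic-managed memory.
    private final IntFunction<ByteBuffer> persistentBackend;

    TieredAllocatorSketch(int chunkSize, IntFunction<ByteBuffer> persistentBackend) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        this.persistentBackend = persistentBackend;
    }

    /** Requests larger than chunkSize go to the persistent tier; the rest stay off-heap. */
    Allocation allocate(int initialCapacity) {
        if (initialCapacity < 0) {
            throw new IllegalArgumentException("initialCapacity must not be negative");
        }
        if (initialCapacity > chunkSize) {
            return new Allocation(Tier.PERSISTENT, persistentBackend.apply(initialCapacity));
        }
        return new Allocation(Tier.OFF_HEAP, ByteBuffer.allocateDirect(initialCapacity));
    }
}
```

For example, with new TieredAllocatorSketch(1024, ByteBuffer::allocate), a 4096-byte request would be
served by the stand-in backend, while a 512-byte request would stay in direct (off-heap) memory.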

Thanks.
Gary.


-----Original Message-----
From: Jacques Nadeau [mailto:jacques@apache.org]
Sent: Tuesday, March 29, 2016 8:05 AM
To: dev@arrow.apache.org
Subject: Re: A Proposal Apache Incubator Mnemonic as an alternative infra. for Apache Arrow

This is super cool. A couple of questions:

- Right now you are using unpooled persistent memory. Does that make sense or does chunking
make more sense?

- What do you think is the right way to transition back and forth between persistent and ephemeral
memory? What do you think will be the first pattern to be adopted. For example, do you think
we should try to use it as a tiered storage for sort spilling (before hitting the disk), or
should we use it for caching?

I think it will be much easier to think about this in the context of a primary or first use
case. Do you have something in mind or should we brainstorm here?

On Wed, Mar 23, 2016 at 7:16 PM, Gary <garyw@apache.org> wrote:

> Hello,
>
>    We have created a patch for Apache Arrow that leverages Apache
> Incubator Mnemonic as an alternative infrastructure for allocating the
> underlying memory resources. You can find it in the forked repo below.
>
> https://github.com/NonVolatileComputing/arrow
>
>    In this way, Apache Arrow could gain some structural benefits from
> the Mnemonic project:
>
>     - Arrow is able to leverage the larger capacity of high-performance
> hybrid storage devices, e.g. high-end SSD, NVMe.
>
>     - Mnemonic provides a potential opportunity for Arrow to
> optimize/tune its allocation algorithms as a native Arrow-oriented
> allocation service.
>
>     - The non-volatile features of Mnemonic make it possible for
> Arrow to share its columnar in-memory data between different
> applications or across the life cycle of a single application.
>
>     - Arrow could take advantage of upcoming Mnemonic features such as
> memory clustering/DOG (distributed object graph) and massive native
> computing.
>
>     - Mnemonic helps to reduce the pressure on main memory utilization
> and its related system-wide overheads.
>
>    This patch is designed to minimize the changes users need to make
> to use Arrow; please check out the test cases provided by this patch
> for your reference.
>
>    Note that the allocator services need to be placed in a specified
> location (indicated by pom.xml) for the Mnemonic-backed Arrow test
> cases to run, because those services are required for external
> memory-like device management.
>
>    Please give your comments and review feedback for better
> collaboration between Apache Arrow and Mnemonic. Thanks.
>
> Best Regards.
> Gary.
