asterixdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jianfeng Jia <jianfeng....@gmail.com>
Subject Re: Implement an SerializableVector in Hyracks
Date Tue, 02 Feb 2016 01:08:30 GMT
If I understand correctly, it seems very similar to the IFrame in Hyrack, which is also a container
to store a sequence of record into the ByteBuffer.

I’m not clear about how records are manipulated inside the ValueVectors.
In Hyracks case, we store the pointer(usually a pair of (frameID, recordID)) in one java array
and manipulate the pointers instead of the original records. We mainly want to break the that
one array into multiple small ByteBuffer-based arrays. By doing so, we can reduce the risk
of OOM for a large array and we may also take the memory usage of those pointers into account
for the flush decisions.  @Ted, could you share some insights about how the ValueVectors handles
manipulations? e.g. sort, hashing … etc.  

> On Feb 1, 2016, at 3:16 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> 
> Have you guys looked at the Drill ValueVectors?
> 
> This structure is being spun out as Apache Arrow with multiple interfaces
> and language bindings.
> 
> On Mon, Feb 1, 2016 at 9:56 AM, Jianfeng Jia <jianfeng.jia@gmail.com> wrote:
> 
>> Hi,
>> 
>> We plan to implement an append-only array at first. The main reason is
>> this is how the auxiliary data structure be used so far. Then the
>> implementation is straightforward.
>> 
>> The tree-structured Vector in Scala can save a lot in updating case mainly
>> because of their immutable requirement. It can saving unnecessary data
>> copies comparing to other immutable list when updating. We only allow
>> in-place update. The tree design may be overkill for us.
>> 
>> Xi made on detailed design doc is here:
>> https://docs.google.com/document/d/1bs3JBCxmvuJZmBq_gKzUDt0FZdM9j3d3MPAmC29ZoSA/edit?usp=sharing
>> Any thoughts or comments?
>> 
>> 
>>> On Jan 18, 2016, at 9:52 AM, Chen Li <chenli@gmail.com> wrote:
>>> 
>>> @Xi and Jianfeng: after we come up with the design, let's share it with
>> the
>>> group for an approval before the implementation.
>>> 
>>> Chen
>>> 
>>> On Fri, Jan 15, 2016 at 11:48 AM, Mike Carey <dtabass@gmail.com> wrote:
>>> 
>>>> The accounting is just as critical as the chunking - we should do both
>>>> (together).
>>>> 
>>>> 
>>>> On 1/15/16 9:00 AM, Till Westmann wrote:
>>>> 
>>>>> I don’t have relevant experience on the subject. But I think that it
>>>>> sounds good to avoid arbitrarily long chunks of memory. Especially -
as
>>>>> Jianfeng wrote - it would be good to be able to a) account for this
>> memory
>>>>> and b) to manage it.
>>>>> An interesting question for me would be what the overhead of such a
>>>>> Vector is compared to a simple Java array and as a result where it
>> should
>>>>> be used to replace arrays. (The comparison in [3] only compares
>> different
>>>>> Scala collections, but doesn’t look at plain arrays.)
>>>>> 
>>>>> Cheers,
>>>>> Till
>>>>> 
>>>>> On 14 Jan 2016, at 22:05, Chen Li wrote:
>>>>> 
>>>>> Before we ask Xi to work on this project, it will be good to know if
>>>>>> other people have seen similar problems and agree with this plan.
>>>>>> @Till: can you share some tips?
>>>>>> 
>>>>>> Chen
>>>>>> 
>>>>>> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <jianfeng.jia@gmail.com
>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Devs,
>>>>>>> 
>>>>>>> First of all, Xi Zhang is a Master student at UCI wants to work
with
>> us
>>>>>>> for a while. Welcome Xi!
>>>>>>> 
>>>>>>> We are thinking of making a Frame-based, memory-bound
>>>>>>> SerializableVector at first. We expect this vector can solve
some
>>>>>>> occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
>>>>>>> Though we did a good job on organizing the record-located memory,
the
>>>>>>> OOM exception can still happen while operating the auxiliary
data
>>>>>>> structure. For example in the sort run generator, instead of
moving
>> record
>>>>>>> around we are creating an reference “pointer" array to the
original
>> record.
>>>>>>> However, if the record is small and the size of that int array
will
>> be
>>>>>>> large, then the OOM exception will occur, which is the case of
issue
>> [1].
>>>>>>> 
>>>>>>> One way to solve this problem is to put auxiliary data structures
>> into
>>>>>>> the memory-bounded frame as well. In general, it will be much
easier
>> to ask
>>>>>>> for multiple small memory blocks than one big chunk of memory.
I
>> guess that
>>>>>>> was the same reason why we have “SerializableHashTable” for
>> HashJoin. It
>>>>>>> will be nice to have a more general structure that can be used
by
>> all the
>>>>>>> operators.
>>>>>>> 
>>>>>>> The Frame based Vector idea is inspired by the Scala Vector[2]
which
>>>>>>> looks like a List, but internally it is implemented as a 32-ary
>> tree. The
>>>>>>> performance of it is very stable for variety size of object[3].
It
>> will
>>>>>>> have all the benefits of ArrayList and the LinkedList. In addition,
>> we can
>>>>>>> take the memory usage of the auxiliary structure into the
>> calculation. We
>>>>>>> will work on the detailed design doc later if we are agree on
this
>>>>>>> direction.
>>>>>>> 
>>>>>>> Any thoughts or suggestions? Thank you!
>>>>>>> 
>>>>>>> 
>>>>>>> [1]
>>>>>>> 
>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>>>>>>> <
>>>>>>> 
>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>>> 
>>>>>>> 
>>>>>>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <
>>>>>>> https://bitbucket.org/astrieanna/bitmapped-vector-trie>
>>>>>>> [3]
>>>>>>> 
>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
>>>>>>> <
>>>>>>> 
>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/>
>>>>>>> 
>>>>>>> 
>>>>>>> Best,
>>>>>>> 
>>>>>>> Jianfeng Jia
>>>>>>> PhD Candidate of Computer Science
>>>>>>> University of California, Irvine
>>>>>>> 
>>>>>>> 
>>>> 
>> 
>> 
>> 
>> Best,
>> 
>> Jianfeng Jia
>> PhD Candidate of Computer Science
>> University of California, Irvine
>> 
>> 



Best,

Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message