asterixdb-users mailing list archives

From schul...@informatik.hu-berlin.de
Subject Re: Data in AsterixDB skewing towards one node
Date Sat, 14 Nov 2015 11:01:15 GMT
Unfortunately I no longer have access to my university's cluster, so I
cannot provide any more information on this topic.

Regards, Max


> Any updates on this?  Just curious.  (We've not seen this skew problem
> before that I'm aware of...)
>
> On 11/5/15 10:37 AM, schultze@informatik.hu-berlin.de wrote:
>> Hi Pouria,
>>
>> I found out that the "skew" is already created on dataset creation. When I
>> create a dataset like the lineitem dataset I sent earlier, I can detect a
>> slight increase in the size of the storage directory on only a single
>> node. If I remember correctly, the storage directory on all of the nodes
>> previously increased slightly on dataset creation. Any insertion
>> afterwards is stored on that same node. So it seems like it cannot be an
>> HDFS error.
>>
>> I use two partitions per node controller, as each of the machines has two
>> separate hard disks and I wanted to access as much disk space as possible.
>>
>> For now I have worked with external datasets in HDFS without preloading
>> them. That worked to some extent (I had some system crashes and queries
>> that did not finish).
>>
>> Regards, Max
>>
>>> Max,
>>>
>>> The setting seems Ok.
>>> It may sound silly, but do you mind trying to load some other dataset from
>>> local drives (not from HDFS) to see if the same problem occurs?
>>>
>>> One other question, just out of curiosity: assuming that
>>> '/home/schultzem' is on NFS and '/data/schultzem' is local on each
>>> machine, is there any specific reason that you decided to set 2 partitions
>>> per NC, one on NFS and one on local storage?
>>>
>>> Pouria
>>>
>>>
>>>
>>> On Wed, Nov 4, 2015 at 11:57 AM, <schultze@informatik.hu-berlin.de> wrote:
>>>
>>>> Hi Pouria,
>>>>
>>>> As a sample, here is the creation and loading of the lineitem table:
>>>>
>>>> create dataverse tpch;
>>>>
>>>>      use dataverse tpch;
>>>>      create type LineitemType as closed {
>>>>        orderkey: int32,
>>>>        partkey: int32,
>>>>        suppkey: int32,
>>>>        linenumber: int32,
>>>>        quantity: double,
>>>>        extendedprice: double,
>>>>        discount: double,
>>>>        tax: double,
>>>>        returnflag: string,
>>>>        linestatus: string,
>>>>        shipdate: string,
>>>>        commitdate: string,
>>>>        receiptdate: string,
>>>>        shipinstruct: string,
>>>>        shipmode: string,
>>>>        comment: string}
>>>>
>>>> use dataverse tpch;
>>>> create dataset lineitem(LineitemType) if not exists primary key orderkey, linenumber
>>>>
>>>> use dataverse tpch;
>>>> load dataset lineitem using hdfs
>>>> (("hdfs"="hdfs://192.168.127.21:50040"),
>>>> ("path"="/user/schultzem/lineitem.tbl"),
>>>> ("input-format"="text-input-format"),
>>>> ("format"="delimited-text"),
>>>> ("delimiter"="|"));
>>>>
>>>> Attached to this mail you will find the master configuration .xml file.
>>>>
>>>> Regards, Max
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> - Can you please share (a sample of) the DDL and load statements that you
>>>>>   used?
>>>>> - Which SF do you use with dbgen?
>>>>> - Can you also share your cluster.xml file, so we can see how the NCs and
>>>>>   their IO-Devices are defined?
>>>>>
>>>>> The fact is once you define the primary key for a dataset, AsterixDB uses
>>>>> Hash Partitioning to distribute the data among NCs. The data for TPCH does
>>>>> not really have skew issues in this scheme.
>>>>>
>>>>> Pouria
>>>>>
>>>>> On Wed, Nov 4, 2015 at 11:36 AM, <schultze@informatik.hu-berlin.de> wrote:
>>>>>> Hi Pouria,
>>>>>>
>>>>>> I create internal datasets and load the data by reading record files from
>>>>>> HDFS.
>>>>>>
>>>>>> Regards, Max
>>>>>>
>>>>>>> Hi Max,
>>>>>>>
>>>>>>> Can you please explain this part a bit more:
>>>>>>> "… When I load the external data it is all saved on a single node"
>>>>>>> Are you using "external datasets" or "internal datasets, loaded from files
>>>>>>> on HDFS"?
>>>>>>> The fact is if you are using "external datasets", then AsterixDB does not
>>>>>>> really load anything. It just gets the location of blocks on HDFS and
>>>>>>> remembers them. So in this case, if there is any issue with uniform
>>>>>>> distribution of data files, that is really related to HDFS and not
>>>>>>> AsterixDB. But if you are 'loading' an "internal" dataset by reading
>>>>>>> records from files on HDFS and you see issues with uniform distribution of
>>>>>>> created on-disk components, then that is another issue and could be
>>>>>>> related to AsterixDB.
>>>>>>>
>>>>>>> Pouria
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 4, 2015 at 11:23 AM, <schultze@informatik.hu-berlin.de> wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I have a cluster setup of AsterixDB running 4 nodes, with the first being
>>>>>>>> the master node and a node controller running on each of them. As a test
>>>>>>>> I run TPC-H queries on them, loading the generated TPC-H datasets from a
>>>>>>>> Hadoop distributed file system.
>>>>>>>>
>>>>>>>> When I load the external data it is all saved on a single node. For later
>>>>>>>> querying that means that most of the computations are done by that single
>>>>>>>> node, which slows down the whole query (and makes the distributed
>>>>>>>> computation idea obsolete).
>>>>>>>>
>>>>>>>> By now I have tried to set up the system several times and, interestingly
>>>>>>>> enough, two times I was able to get a fully functional system.
>>>>>>>> Unfortunately I currently cannot reproduce a functional system state, and
>>>>>>>> whenever I try to do a new setup I get the data skewing towards one node.
>>>>>>>>
>>>>>>>> Has that ever happened before? Do you know the reason for this or how to
>>>>>>>> fix that?
>>>>>>>>
>>>>>>>> Regards, Max
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>
>
>
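
For reference, the hash partitioning on the primary key that Pouria describes
earlier in this thread can be sketched roughly as below. This is only an
illustration and not AsterixDB's actual code: the MD5-based hash, the function
names partition_for and partition_counts, the synthetic keys, and the mapping
of partition index to node are all assumptions made for the example. Only the
figures of 4 nodes and 2 partitions per node controller, and the composite key
(orderkey, linenumber), come from the thread.

```python
import hashlib
from collections import Counter

NUM_NODES = 4            # from the thread: 4 node controllers
PARTITIONS_PER_NODE = 2  # from the thread: 2 partitions per NC
NUM_PARTITIONS = NUM_NODES * PARTITIONS_PER_NODE


def partition_for(orderkey: int, linenumber: int) -> int:
    """Map a composite primary key to a partition index.

    The hash used here (MD5 over a text encoding of the key) is an
    assumption for illustration, not the function AsterixDB uses.
    """
    key_bytes = f"{orderkey}|{linenumber}".encode("utf-8")
    digest = hashlib.md5(key_bytes).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def partition_counts(num_orders: int, lines_per_order: int) -> Counter:
    """Count how many synthetic lineitem keys land in each partition."""
    counts = Counter()
    for orderkey in range(1, num_orders + 1):
        for linenumber in range(1, lines_per_order + 1):
            counts[partition_for(orderkey, linenumber)] += 1
    return counts


if __name__ == "__main__":
    counts = partition_counts(num_orders=10_000, lines_per_order=4)
    for partition in range(NUM_PARTITIONS):
        # Hypothetical layout: consecutive partitions share a node.
        node = partition // PARTITIONS_PER_NODE
        print(f"partition {partition} (node {node}): {counts[partition]} records")
```

With a reasonable hash, the per-partition counts come out close to each other,
which is why a dataset whose records all end up on one node points to a setup
or configuration problem rather than to the TPC-H key values themselves.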


