asterixdb-users mailing list archives

From Michael Carey <mjca...@ics.uci.edu>
Subject Re: Data in AsterixDB skewing towards one node
Date Thu, 12 Nov 2015 01:04:28 GMT
Any updates on this?  Just curious.  (We've not seen this skew problem 
before that I'm aware of...)

On 11/5/15 10:37 AM, schultze@informatik.hu-berlin.de wrote:
> Hi Pouria,
>
> I found out that the "skew" is already created on dataset creation. When I
> create a dataset like the lineitem dataset I sent earlier, I can detect a
> slight increase in the size of the storage directory on only a single
> node. If I remember correctly, the storage directory on all of the nodes
> used to increase slightly on dataset creation. Any insertion
> afterwards is stored on that same node. So it seems like it cannot be an
> HDFS error.
>
> I use two partitions per node controller, as each of the machines has two
> separate hard disks and I wanted to access as much disk space as possible.
>
> For now I have worked with external datasets in HDFS without preloading
> them. That worked to some extent (I had some system crashes and queries
> that did not finish).
>
> Regards, Max
>
>> Max,
>>
>> The setting seems OK.
>> It may sound silly, but do you mind trying to load some other dataset from
>> local drives (not from HDFS) to see if the same problem occurs?
>>
>> One other question, just out of curiosity: assuming that
>> '/home/schultzem' is on NFS and '/data/schultzem' is local on each
>> machine, is there any specific reason that you decided to set 2
>> partitions per NC, one on NFS and one on local storage?
>>
>> Pouria
>>
>>
>>
>> On Wed, Nov 4, 2015 at 11:57 AM, <schultze@informatik.hu-berlin.de> wrote:
>>
>>> Hi Pouria,
>>>
>>> as a sample I show you the creation and loading of the lineitem table:
>>>
>>> create dataverse tpch;
>>>
>>> use dataverse tpch;
>>> create type LineitemType as closed {
>>>   orderkey: int32,
>>>   partkey: int32,
>>>   suppkey: int32,
>>>   linenumber: int32,
>>>   quantity: double,
>>>   extendedprice: double,
>>>   discount: double,
>>>   tax: double,
>>>   returnflag: string,
>>>   linestatus: string,
>>>   shipdate: string,
>>>   commitdate: string,
>>>   receiptdate: string,
>>>   shipinstruct: string,
>>>   shipmode: string,
>>>   comment: string
>>> }
>>>
>>> use dataverse tpch;
>>> create dataset lineitem(LineitemType) if not exists
>>>   primary key orderkey, linenumber
>>>
>>> use dataverse tpch;
>>> load dataset lineitem using hdfs
>>> (("hdfs"="hdfs://192.168.127.21:50040"),
>>> ("path"="/user/schultzem/lineitem.tbl"),
>>> ("input-format"="text-input-format"),
>>> ("format"="delimited-text"),
>>> ("delimiter"="|"));
>>>
>>> Attached to this mail you find the master configuration .xml file.
>>>
>>> Regards, Max
>>>
>>>
>>>
>>>
>>>
>>>> - Can you please share (a sample of) the DDL and load statements that
>>>>   you used?
>>>> - Which SF do you use with dbgen?
>>>> - Can you also share your cluster.xml file, so we can see how the NCs
>>>>   and their IO devices are defined?
>>>>
>>>> The fact is, once you define the primary key for a dataset, AsterixDB
>>>> uses hash partitioning to distribute the data among NCs. The data for
>>>> TPC-H does not really have skew issues in this scheme.
>>>>
>>>> Pouria
>>>>
>>>> On Wed, Nov 4, 2015 at 11:36 AM, <schultze@informatik.hu-berlin.de> wrote:
>>>>> Hi Pouria,
>>>>>
>>>>> I create internal datasets and load the data by reading record files
>>>>> from HDFS.
>>>>>
>>>>> Regards, Max
>>>>>
>>>>>> Hi Max,
>>>>>>
>>>>>> Can you please explain this part a bit more:
>>>>>> "… When I load the external data it is all saved on a single node"
>>>>>> Are you using "external datasets" or "internal datasets, loaded from
>>>>>> files on HDFS"?
>>>>>> The fact is, if you are using "external datasets", then AsterixDB does
>>>>>> not really load anything. It just gets the location of the blocks on
>>>>>> HDFS and remembers them. So in this case, if there is any issue with
>>>>>> uniform distribution of data files, that is really related to HDFS and
>>>>>> not AsterixDB. But if you are 'loading' an "internal" dataset by
>>>>>> reading records from files on HDFS and you see issues with uniform
>>>>>> distribution of the created on-disk components, then that is another
>>>>>> issue and could be related to AsterixDB.
>>>>>>
>>>>>> Pouria
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 4, 2015 at 11:23 AM, <schultze@informatik.hu-berlin.de> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have a cluster setup of AsterixDB with 4 nodes, the first being
>>>>>>> the master node and a node controller running on each of them. As a
>>>>>>> test I run TPC-H queries on them, loading the generated TPC-H
>>>>>>> datasets from a Hadoop distributed file system.
>>>>>>>
>>>>>>> When I load the external data it is all saved on a single node. For
>>>>>>> later querying that means that most of the computations are done by
>>>>>>> that single node, which slows down the whole query (and defeats the
>>>>>>> purpose of distributed computation).
>>>>>>>
>>>>>>> By now I have tried to set up the system several times and,
>>>>>>> interestingly enough, two times I ended up with a fully functional
>>>>>>> system. Unfortunately I currently cannot reproduce a functional
>>>>>>> system state, and whenever I try to do a new setup I get the data
>>>>>>> skewing towards one node.
>>>>>>>
>>>>>>> Has that ever happened before? Do you know the reason for this or how
>>>>>>> to fix it?
>>>>>>>
>>>>>>> Regards, Max
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>
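
To illustrate the hash partitioning Pouria describes in the thread above, here is a minimal Python sketch. It is not AsterixDB's actual hash function or placement logic: the MD5-based hash, the helper names partition_for and partition_counts, the synthetic keys, and the mapping of two consecutive partitions to one node are all assumptions made for illustration. Only the cluster shape (4 node controllers, 2 partitions each) and the primary key (orderkey, linenumber) come from the thread. The point is that hashing a composite primary key should spread records roughly evenly, so all data landing on one node suggests a setup problem rather than a property of the TPC-H keys.

```python
import hashlib
from collections import Counter

NUM_NODES = 4            # from the thread: 4 node controllers
PARTITIONS_PER_NODE = 2  # from the thread: two partitions per NC
NUM_PARTITIONS = NUM_NODES * PARTITIONS_PER_NODE


def partition_for(primary_key, num_partitions=NUM_PARTITIONS):
    """Map a primary key tuple to a partition index.

    Uses a stable MD5-based hash purely for illustration; this is an
    assumption, not the hash function AsterixDB uses.
    """
    digest = hashlib.md5(repr(primary_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_counts(keys, num_partitions=NUM_PARTITIONS):
    """Count how many keys fall into each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Synthetic (orderkey, linenumber) pairs standing in for lineitem keys.
keys = [(orderkey, linenumber)
        for orderkey in range(1, 5001)
        for linenumber in range(1, 5)]

counts = partition_counts(keys)
for p, c in enumerate(counts):
    # Assumed layout: partitions 0-1 on node 0, 2-3 on node 1, and so on.
    print(f"partition {p} (node {p // PARTITIONS_PER_NODE}): {c} records")
```

With an even spread, each of the 8 partitions receives a share close to 1/8 of the 20,000 synthetic records. A distribution in which one partition or node holds everything, as reported in this thread, would not be explained by the keys themselves.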

