asterixdb-users mailing list archives

From schul...@informatik.hu-berlin.de
Subject Re: Data in AsterixDB skewing towards one node
Date Wed, 04 Nov 2015 19:57:41 GMT
Hi Pouria,

As a sample, here are the creation and load statements for the lineitem table:

create dataverse tpch;

use dataverse tpch;
create type LineitemType as closed {
  orderkey: int32,
  partkey: int32,
  suppkey: int32,
  linenumber: int32,
  quantity: double,
  extendedprice: double,
  discount: double,
  tax: double,
  returnflag: string,
  linestatus: string,
  shipdate: string,
  commitdate: string,
  receiptdate: string,
  shipinstruct: string,
  shipmode: string,
  comment: string
};

use dataverse tpch;
create dataset lineitem(LineitemType) if not exists primary key orderkey, linenumber;

use dataverse tpch;
load dataset lineitem using hdfs
(("hdfs"="hdfs://192.168.127.21:50040"),
("path"="/user/schultzem/lineitem.tbl"),
("input-format"="text-input-format"),
("format"="delimited-text"),
("delimiter"="|"));

Attached to this mail is the master configuration .xml file.

Regards, Max





> - Can you please share (a sample of) the DDL and load statements that you
> used?
> - Which SF do you use with dbgen?
> - Can you also share your cluster.xml file, so we can see how the
> NCs and their IO devices are defined?
>
> The fact is once you define the primary key for a dataset, AsterixDB uses
> Hash Partitioning to distribute the data among NCs. The data for TPCH does
> not really have skew issues in this scheme.
>
> Pouria
>
> On Wed, Nov 4, 2015 at 11:36 AM, <schultze@informatik.hu-berlin.de> wrote:
>
>> Hi Pouria,
>>
>> I create internal datasets and load the data by reading record files
>> from HDFS.
>>
>> Regards, Max
>>
>> > Hi Max,
>> >
>> > Can you please explain this part a bit more:
>> > "… When I load the external data it is all saved on a single node"
>> >
>> > Are you using "external datasets" or "internal datasets, loaded from
>> > files on HDFS"?
>> > If you are using "external datasets", then AsterixDB does not really
>> > load anything. It just gets the locations of the blocks on HDFS and
>> > remembers them. So in this case, if there is any issue with uniform
>> > distribution of data files, that is really related to HDFS and not
>> > AsterixDB. But if you are 'loading' an "internal" dataset by reading
>> > records from files on HDFS and you see issues with uniform distribution
>> > of the created on-disk components, then that is another issue and could
>> > be related to AsterixDB.
>> >
>> > Pouria
>> >
>> >
>> >
>> > On Wed, Nov 4, 2015 at 11:23 AM, <schultze@informatik.hu-berlin.de> wrote:
>> >
>> >> Hello,
>> >>
>> >> I have an AsterixDB cluster of 4 nodes, with the first being the
>> >> master node and a node controller running on each of them. As a
>> >> test, I run TPC-H queries on the cluster, loading the generated TPC-H
>> >> datasets from a Hadoop distributed file system (HDFS).
>> >>
>> >> When I load the external data it is all saved on a single node. For
>> >> later querying that means that most of the computation is done by
>> >> that single node, which slows down the whole query (and defeats the
>> >> purpose of distributed computation).
>> >>
>> >> So far I have tried to set up the system several times and,
>> >> interestingly enough, twice I ended up with a fully functional
>> >> system. Unfortunately, I currently cannot reproduce a functional
>> >> system state, and whenever I try a new setup the data skews towards
>> >> one node.
>> >>
>> >> Has that ever happened before? Do you know the reason for this or how
>> >> to fix it?
>> >>
>> >> Regards, Max
>> >>
>> >>
>> >
>>
>>
>>
>
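
Note on the hash partitioning Pouria describes above: the following is a minimal Python sketch, not AsterixDB's actual implementation, of how hashing a composite primary key such as (orderkey, linenumber) spreads records over partitions. The MD5-based hash, the function names partition_for and partition_counts, the partition count of 4 (one per node controller, matching the 4-node cluster in the thread), and the synthetic keys are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

# Assumption: one storage partition per node controller (4 nodes in the thread).
NUM_PARTITIONS = 4


def partition_for(orderkey: int, linenumber: int,
                  num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a composite primary key to a partition index by hashing it.

    MD5 is an illustrative stand-in; the real system uses its own hash function.
    """
    key_bytes = f"{orderkey}|{linenumber}".encode("utf-8")
    digest = hashlib.md5(key_bytes).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo
    # the number of partitions.
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_counts(keys, num_partitions: int = NUM_PARTITIONS):
    """Count how many keys land on each partition."""
    counts = Counter(partition_for(o, l, num_partitions) for o, l in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Synthetic keys shaped like lineitem's primary key: 10,000 orders with
# 4 line numbers each (hypothetical data, not generated by dbgen).
keys = [(orderkey, linenumber)
        for orderkey in range(1, 10001)
        for linenumber in range(1, 5)]

counts = partition_counts(keys)
print(counts)  # one count per partition; they should be close to each other
```

In this sketch each partition receives roughly a quarter of the keys, which illustrates why the TPC-H data itself is not expected to cause skew under hash partitioning, and why the DDL, load statements, and cluster.xml are the first things to check.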
