asterixdb-users mailing list archives

From schul...@informatik.hu-berlin.de
Subject Re: Data in AsterixDB skewing towards one node
Date Thu, 05 Nov 2015 18:37:51 GMT
Hi Pouria,

I found out that the "skew" already appears at dataset creation. When I
create a dataset like the lineitem dataset I sent earlier, the storage
directory grows slightly on only a single node. If I remember correctly,
in my earlier setups the storage directory grew slightly on all of the
nodes at dataset creation. Any insertion afterwards is stored on that same
node. So it does not seem to be an HDFS error.
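
A check like this could be scripted by summing the file sizes under the
storage directory on each node before and after the create statement. This
is a minimal sketch, assuming Python is available on the nodes; the helper
names are hypothetical and the storage path has to be supplied per node.

```python
import os


def directory_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so that linked files are not counted twice.
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # The file disappeared while walking; ignore it.
                continue
    return total


def size_delta(before, after):
    """Per-node growth in bytes between two {node: size} snapshots."""
    return {node: after[node] - before.get(node, 0) for node in after}
```

Taking one snapshot per node before the create statement and one after it,
`size_delta` then shows which nodes actually grew.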

I use two partitions per node controller, as each of the machines has two
separate hard disks and I wanted to access as much disk space as possible.
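
With 4 node controllers and two partitions each, that gives 8 partitions in
total. Hash partitioning on the primary key (orderkey, linenumber), as
described in the quoted thread below, should spread records over all of
them. The sketch below only illustrates that expectation: CRC32 is a
stand-in and not AsterixDB's actual hash function, and the keys are
synthetic rather than real TPC-H data.

```python
import zlib
from collections import Counter

NODES = 4                # node controllers in the cluster
PARTITIONS_PER_NODE = 2  # one partition per disk
NUM_PARTITIONS = NODES * PARTITIONS_PER_NODE


def partition_for(orderkey, linenumber, num_partitions=NUM_PARTITIONS):
    """Map a composite primary key to a partition index.

    CRC32 is an assumption used for illustration only.
    """
    key = f"{orderkey}|{linenumber}".encode("utf-8")
    return zlib.crc32(key) % num_partitions


def partition_counts(keys, num_partitions=NUM_PARTITIONS):
    """Count how many keys land in each partition."""
    counts = Counter(partition_for(o, l, num_partitions) for o, l in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Synthetic keys: 5000 orders with 4 line items each.
keys = [(o, l) for o in range(1, 5001) for l in range(1, 5)]
counts = partition_counts(keys)
for partition, count in enumerate(counts):
    print(f"partition {partition} (node {partition // PARTITIONS_PER_NODE}): {count}")
```

If every record ends up on one node instead, the cause is more likely the
set of partitions the dataset was created on than the key distribution.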

For now I have worked with external datasets in HDFS without preloading
them. That worked to some extent (I had some system crashes and some
queries that did not finish).

Regards, Max

> Max,
>
> The setting seems Ok.
> It may sound silly, but do you mind trying to load some other dataset from
> local drives (not from HDFS) to see if the same problem occurs?
>
> One other question, just out of curiosity: assuming that '/home/schultzem'
> is on NFS and '/data/schultzem' is local on each machine, is there any
> specific reason that you decided to set 2 partitions per NC, one on NFS
> and one on local storage?
>
> Pouria
>
>
>
> On Wed, Nov 4, 2015 at 11:57 AM, <schultze@informatik.hu-berlin.de> wrote:
>
>> Hi Pouria,
>>
>> as a sample I show you the creation and loading of the lineitem table:
>>
>> create dataverse tpch;
>>
>>     use dataverse tpch;
>>     create type LineitemType as closed {
>>       orderkey: int32,
>>       partkey: int32,
>>       suppkey: int32,
>>       linenumber: int32,
>>       quantity: double,
>>       extendedprice: double,
>>       discount: double,
>>       tax: double,
>>       returnflag: string,
>>       linestatus: string,
>>       shipdate: string,
>>       commitdate: string,
>>       receiptdate: string,
>>       shipinstruct: string,
>>       shipmode: string,
>>       comment: string}
>>
>> use dataverse tpch;
>> create dataset lineitem(LineitemType) if not exists
>>     primary key orderkey, linenumber
>>
>> use dataverse tpch;
>> load dataset lineitem using hdfs
>> (("hdfs"="hdfs://192.168.127.21:50040"),
>> ("path"="/user/schultzem/lineitem.tbl"),
>> ("input-format"="text-input-format"),
>> ("format"="delimited-text"),
>> ("delimiter"="|"));
>>
>> Attached to this mail you find the master configuration .xml file.
>>
>> Regards, Max
>>
>>
>>
>>
>>
>> > - Can you please share (a sample of) DDL and load statements that you used?
>> > - Which SF do you use with dbgen?
>> > - Can you also share your cluster.xml file, so we can see how the
>> >   NCs and their IO-Devices are defined?
>> >
>> > The fact is once you define the primary key for a dataset, AsterixDB uses
>> > Hash Partitioning to distribute the data among NCs. The data for TPCH does
>> > not really have skew issues in this scheme.
>> >
>> > Pouria
>> >
>> > On Wed, Nov 4, 2015 at 11:36 AM, <schultze@informatik.hu-berlin.de> wrote:
>> >
>> >> Hi Pouria,
>> >>
>> >> I create internal datasets and load the data by reading record files
>> >> from HDFS.
>> >>
>> >> Regards, Max
>> >>
>> >> > Hi Max,
>> >> >
>> >> > Can you please explain this part a bit more:
>> >> > "… When I load the external data it is all saved on a single node"
>> >> >
>> >> > Are you using "external datasets" or "internal datasets, loaded from
>> >> > files on HDFS"?
>> >> > The fact is if you are using "external datasets", then AsterixDB does
>> >> > not really load anything. It just gets the location of blocks on HDFS
>> >> > and remembers them. So in this case, if there is any issue with uniform
>> >> > distribution of data files, that is really related to HDFS and not
>> >> > AsterixDB. But if you are 'loading' an "internal" dataset by reading
>> >> > records from files on HDFS and you see issues with uniform distribution
>> >> > of created on-disk components, then that is another issue and could be
>> >> > related to AsterixDB.
>> >> >
>> >> > Pouria
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Nov 4, 2015 at 11:23 AM, <schultze@informatik.hu-berlin.de> wrote:
>> >> >
>> >> >> Hello,
>> >> >>
>> >> >> I have a cluster setup of AsterixDB running 4 nodes with the first
>> >> >> being the master node and a node controller running on each of them.
>> >> >> As a test I run TPC-H queries on them, loading the generated TPC-H
>> >> >> datasets from a Hadoop distributed file system.
>> >> >>
>> >> >> When I load the external data it is all saved on a single node. For
>> >> >> later querying that means that most of the computations are done by
>> >> >> that single node, which slows down the whole query (and makes the
>> >> >> distributed computation idea obsolete).
>> >> >>
>> >> >> By now I have tried to set up the system several times and,
>> >> >> interestingly enough, two times I ended up with a fully functional
>> >> >> system. Unfortunately I currently cannot reproduce a functional system
>> >> >> state, and whenever I try to do a new setup I get the data skewing
>> >> >> towards one node.
>> >> >>
>> >> >> Has that ever happened before? Do you know the reason for this or how
>> >> >> to fix that?
>> >> >>
>> >> >> Regards, Max
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >
>>
>


