asterixdb-users mailing list archives

From Pouria Pirzadeh <pouria.pirza...@gmail.com>
Subject Re: Data in AsterixDB skewing towards one node
Date Thu, 05 Nov 2015 17:47:23 GMT
Max,

The settings look OK.
It may sound silly, but would you mind loading some other dataset from the
local drives (not from HDFS) to see whether the same problem occurs?
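
For example, something along these lines should work. This is only a sketch:
it assumes a copy of lineitem.tbl has been placed under /data/schultzem on one
of the NC machines, and both that path and the <nc-address> placeholder are my
guesses, not values taken from your setup.

```
use dataverse tpch;

/* Assumption: <nc-address> stands for the address of the NC that
   holds the file; replace it and the path with your real values. */
load dataset lineitem using localfs
(("path"="<nc-address>:///data/schultzem/lineitem.tbl"),
("format"="delimited-text"),
("delimiter"="|"));
```

If the records end up spread evenly after this load, that would point towards
the HDFS load path rather than the partitioning itself.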

One other question, just out of curiosity: assuming that '/home/schultzem'
is on NFS and '/data/schultzem' is local on each machine, is there any
specific reason you decided to set up two partitions per NC, one on NFS
and one on local storage?

Pouria



On Wed, Nov 4, 2015 at 11:57 AM, <schultze@informatik.hu-berlin.de> wrote:

> Hi Pouria,
>
> As a sample, here are the creation and loading statements for the
> lineitem table:
>
> create dataverse tpch;
>
> use dataverse tpch;
> create type LineitemType as closed {
>   orderkey: int32,
>   partkey: int32,
>   suppkey: int32,
>   linenumber: int32,
>   quantity: double,
>   extendedprice: double,
>   discount: double,
>   tax: double,
>   returnflag: string,
>   linestatus: string,
>   shipdate: string,
>   commitdate: string,
>   receiptdate: string,
>   shipinstruct: string,
>   shipmode: string,
>   comment: string
> };
>
> use dataverse tpch;
> create dataset lineitem(LineitemType) if not exists
>   primary key orderkey, linenumber;
>
> use dataverse tpch;
> load dataset lineitem using hdfs
> (("hdfs"="hdfs://192.168.127.21:50040"),
> ("path"="/user/schultzem/lineitem.tbl"),
> ("input-format"="text-input-format"),
> ("format"="delimited-text"),
> ("delimiter"="|"));
>
> Attached to this mail is the master configuration .xml file.
>
> Regards, Max
>
>
>
>
>
> > - Can you please share (a sample of) the DDL and load statements that
> > you used?
> > - Which SF do you use with dbgen?
> > - Can you also share your cluster.xml file, so we can see how the NCs
> > and their IO devices are defined?
> >
> > The fact is that once you define the primary key for a dataset,
> > AsterixDB uses hash partitioning to distribute the data among the NCs.
> > The TPC-H data does not really have skew issues under this scheme.
> >
> > Pouria
> >
> > On Wed, Nov 4, 2015 at 11:36 AM, <schultze@informatik.hu-berlin.de> wrote:
> >
> >> Hi Pouria,
> >>
> >> I create internal datasets and load the data by reading record files
> >> from HDFS.
> >>
> >> Regards, Max
> >>
> >> > Hi Max,
> >> >
> >> > Can you please explain this part a bit more:
> >> > "… When I load the external data it is all saved on a single node"
> >> >
> >> > Are you using "external datasets" or "internal datasets, loaded from
> >> > files on HDFS"?
> >> > The fact is that if you are using "external datasets", AsterixDB does
> >> > not really load anything. It just gets the locations of the blocks on
> >> > HDFS and remembers them. So in this case, any issue with the uniform
> >> > distribution of the data files is related to HDFS and not to
> >> > AsterixDB. But if you are 'loading' an "internal" dataset by reading
> >> > records from files on HDFS and you see issues with the uniform
> >> > distribution of the created on-disk components, then that is another
> >> > issue and could be related to AsterixDB.
> >> >
> >> > Pouria
> >> >
> >> >
> >> >
> >> > On Wed, Nov 4, 2015 at 11:23 AM, <schultze@informatik.hu-berlin.de> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> I have a cluster setup of AsterixDB running on 4 nodes, with the first
> >> >> being the master node and a node controller running on each of them.
> >> >> As a test I run TPC-H queries on them, loading the generated TPC-H
> >> >> datasets from a Hadoop Distributed File System.
> >> >>
> >> >> When I load the external data, it is all saved on a single node. For
> >> >> later querying, that means that most of the computation is done by
> >> >> that single node, which slows down the whole query (and defeats the
> >> >> purpose of distributed computation).
> >> >>
> >> >> So far I have tried to set up the system several times and,
> >> >> interestingly enough, twice I ended up with a fully functional system.
> >> >> Unfortunately, I currently cannot reproduce a functional system state,
> >> >> and whenever I try to do a new setup the data skews towards one node.
> >> >>
> >> >> Has that ever happened before? Do you know the reason for this or how
> >> >> to fix it?
> >> >>
> >> >> Regards, Max
> >> >>
> >> >>
> >> >
> >>
> >>
> >>
> >
>
