asterixdb-users mailing list archives

From Pouria Pirzadeh <pouria.pirza...@gmail.com>
Subject Re: Data in AsterixDB skewing towards one node
Date Wed, 04 Nov 2015 19:41:42 GMT
- Can you please share (a sample of) the DDL and load statements that you
used?
- Which SF (scale factor) do you use with dbgen?
- Can you also share your cluster.xml file, so we can see how the NCs and
their IO devices are defined?

Once you define the primary key for a dataset, AsterixDB uses hash
partitioning on that key to distribute the data among the NCs. The TPC-H
data should not really have skew issues under this scheme.
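
To illustrate why hash partitioning on a dense key should not skew, here is
a minimal sketch in Python. It is not AsterixDB's actual hash function or
code: partition_for, distribution, the use of MD5, and the 4-partition
setup are all assumptions made for this example only.

```python
import hashlib
from collections import Counter

# Hypothetical helper: NOT AsterixDB's real hash function, just a stand-in
# that maps a primary key to one of num_partitions buckets.
def partition_for(key: int, num_partitions: int) -> int:
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def distribution(keys, num_partitions: int) -> list:
    """Count how many keys land in each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# 100,000 sequential integer keys (an assumed stand-in for a TPC-H style
# primary key) spread over 4 partitions, one per NC in this example.
counts = distribution(range(1, 100001), 4)
print(counts)
```

Each of the four counts should come out close to 25,000. If nearly all
records end up on one node instead, the cause is more likely in the cluster
or dataset setup than in the partitioning scheme itself.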

Pouria

On Wed, Nov 4, 2015 at 11:36 AM, <schultze@informatik.hu-berlin.de> wrote:

> Hi Pouria,
>
> I create internal datasets and load the data by reading record files from
> HDFS.
>
> Regards, Max
>
> > Hi Max,
> >
> > Can you please explain this part a bit more:
> > "… When I load the external data it is all saved on a single node"
> >
> > Are you using "external datasets" or "internal datasets, loaded from
> > files on HDFS"?
> > If you are using "external datasets", then AsterixDB does not really
> > load anything. It just gets the location of the blocks on HDFS and
> > remembers them. So in this case, if there is any issue with uniform
> > distribution of data files, that is related to HDFS and not AsterixDB.
> > But if you are 'loading' an "internal" dataset by reading records from
> > files on HDFS and you see issues with uniform distribution of the
> > created on-disk components, then that is another issue and could be
> > related to AsterixDB.
> >
> > Pouria
> >
> >
> >
> > On Wed, Nov 4, 2015 at 11:23 AM, <schultze@informatik.hu-berlin.de> wrote:
> >
> >> Hello,
> >>
> >> I have an AsterixDB cluster setup running 4 nodes, with the first being
> >> the master node and a node controller running on each of them. As a
> >> test, I run TPC-H queries on them, loading the generated TPC-H datasets
> >> from a Hadoop Distributed File System.
> >>
> >> When I load the external data it is all saved on a single node. For
> >> later querying that means that most of the computations are done by
> >> that single node, which slows down the whole query (and makes the
> >> distributed computation idea obsolete).
> >>
> >> By now I have tried to set up the system several times, and
> >> interestingly enough, twice I ended up with a fully functional system.
> >> Unfortunately, I currently cannot reproduce a functional system state,
> >> and whenever I try a new setup the data skews towards one node.
> >>
> >> Has that ever happened before? Do you know the reason for this or how to
> >> fix that?
> >>
> >> Regards, Max
> >>
> >>
> >
>
>
>
