carbondata-user mailing list archives

From Ravindra Pesala <ravi.pes...@gmail.com>
Subject Re: [POSSIBLE BUG] Carbondata 1.1.1 inaccurate results
Date Thu, 24 Aug 2017 02:57:01 GMT
Hi,

I tested this on 1.1.1 using TPC-H tables with 1 GB of generated data, but I
got the result below. I don't have the exact schema you mentioned, so I
verified with the original TPC-H schema.

0: jdbc:hive2://localhost:10000> select count(c_CustKey),count(o_CustKey)
from customer, orders where o_Custkey = c_CustKey;
+-------------------+-------------------+--+
| count(c_CustKey)  | count(o_CustKey)  |
+-------------------+-------------------+--+
| 1500000           | 1500000           |
+-------------------+-------------------+--+


On Parquet with the same data:

0: jdbc:hive2://localhost:10000> select count(c_CustKey),count(o_CustKey)
from customer, orders where o_Custkey = c_CustKey;
+-------------------+-------------------+--+
| count(c_CustKey)  | count(o_CustKey)  |
+-------------------+-------------------+--+
| 1500000           | 1500000           |
+-------------------+-------------------+--+
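
As a cross-check on these counts: in the standard TPC-H schema every o_custkey refers to an existing c_custkey, and orders holds 1,500,000 rows at scale factor 1 (the 1 GB data set), so a join that drops nothing should return one row per orders row, which matches both results above. A minimal sketch of that reasoning in plain Scala, using tiny made-up in-memory keys rather than the real tables; the object, method, and value names (JoinCardinality, joinRowCount, customerKeys, orderCustKeys) are hypothetical, chosen only for illustration:

```scala
object JoinCardinality {
  // Row count of an inner equi-join: every fact-side key contributes as many
  // rows as there are dimension-side rows carrying the same key.
  def joinRowCount(factKeys: Seq[Int], dimKeys: Seq[Int]): Long = {
    val dimCounts: Map[Int, Long] =
      dimKeys.groupBy(identity).map { case (key, rows) => key -> rows.size.toLong }
    factKeys.map(key => dimCounts.getOrElse(key, 0L)).sum
  }

  // Made-up sample data (not TPC-H): unique customer keys, and order rows
  // that all reference an existing customer, as a foreign key would.
  val customerKeys: Seq[Int] = Seq(1, 2, 3, 4)
  val orderCustKeys: Seq[Int] = Seq(1, 1, 2, 4, 4, 4)
}
```

With a valid foreign key, JoinCardinality.joinRowCount(JoinCardinality.orderCustKeys, JoinCardinality.customerKeys) equals the number of order rows; a smaller value would mean the join is losing rows.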


Regards,
Ravindra.

On 23 August 2017 at 19:40, Swapnil Shinde <swapnilushinde@gmail.com> wrote:

> Hello All
>     We are observing incorrect query results with carbondata 1.1.1. Please
> find details below -
>
> *Datasets used -*
>      TPC-H star schema based datasets
> (http://www.cs.umb.edu/~poneil/StarSchemaB.PDF)
> *Query - *
> *     select cCustKey,loCustKey from customer, lineorder where loCustkey =
> cCustKey*
> *How we load data -*
>      We tried loading the data both through a dataframe and through "INSERT"
> statements, and both ways produce incorrect results. I am putting one way here -
>
>
> *-- CREATE CUSTOMER TABLE*
>
> *carbon.sql("CREATE TABLE IF NOT EXISTS customer(cCustKey Int, cName
> string, cAddress string, cCity string, cNation string, cRegion string,
> cPhone string, cMktSegment string, dummy string) STORED BY 'carbondata'")*
>
> *carbon.sql("LOAD DATA INPATH '/xxxx/yyyy/tmp/ssb_raw/customer' INTO TABLE
> customer
> OPTIONS('DELIMITER'='\t','FILEHEADER'='cCustKey,cName,cAddress,cCity,cNation,cRegion,cPhone,cMktsegment,dummy')")*
>
>
>
> *-- CREATE LINEORDER TABLE*
>
> *carbon.sql("CREATE TABLE IF NOT EXISTS lineorder(loOrderkey
> bigint,loLinenumber Int,loCustkey Int,loPartkey Int,loSuppkey
> Int,loOrderdate Int,loOrderpriority String,loShippriority Int,loQuantity
> Int,loExtendedprice Int,loOrdtotalprice Int,loDiscount Int,loRevenue
> Int,loSupplycost Int,loTax Int,loCommitdate Int,loShipmode String,dummy
> String) STORED BY 'carbondata'")*
>
> *carbon.sql("LOAD DATA INPATH '/xxxx/yyyy/tmp/ssb_raw/lineorder' INTO
> TABLE lineorder
> OPTIONS('DELIMITER'='\t','FILEHEADER'='loOrderkey,loLinenumber,loCustkey,loPartkey,loSuppkey,loOrderdate,loOrderpriority,loShippriority,loQuantity,loExtendedprice,loOrdtotalprice,loDiscount,loRevenue,loSupplycost,loTax,loCommitdate,loShipmode,dummy')")*
>
>
> *Results with different versions - *
>
> *   1.1.0 - *Provides correct results for above query. Validated with
> results from parquet.
>
> *   1.1.1 - *Built from this
> <https://github.com/apache/carbondata/tree/apache-carbondata-1.1.1-rc1>.
> Join is missing lots of rows compared to parquet.
>
> *   1.1.1 - *Built from source code available for download
> <https://dist.apache.org/repos/dist/release/carbondata/1.1.1/apache-carbondata-1.1.1-source-release.zip>.
> Join is missing lots of rows compared to parquet.
>
> *      1.2 - *Built from master branch. Generated correct results similar
> to parquet.
>
>
> *Debugging further - *
>
> 1. Row counts for both the lineorder and customer tables are the same in
> carbondata and parquet.
>
> 2. If I compare the key column in carbondata vs parquet, it matches as
> well -
>
>          val cd = carbon.sql("select cCustKey from customer") //.distinct.count -- 30,000,000
>
>          val sp = spark.sql("select cCustKey from pcustomer") //.distinct.count -- 30,000,000
>
>          cd.intersect(sp) // 30,000,000 (carbondata has the same keys as parquet)
>
>
>
>          val cd = carbon.sql("select loCustKey from lineorder") //.distinct.count -- 13,365,986
>
>          val sp = spark.sql("select loCustKey from plineorder") //.distinct.count -- 13,365,986
>
>          cd.intersect(sp) // 13,365,986 (carbondata has the same keys as parquet)
>
>
> The above queries show that the carbondata customer and lineorder tables
> have the same key values as parquet.
>
> However, when we run the above join query, carbondata returns only a very
> small subset of the expected rows. If we run a filter query for any specific
> key, that also returns no results.
>
>
> Not sure why v1.1.1 is producing incorrect results. My guess is that
> carbondata is skipping rows that it shouldn't in v1.1.1.
>
> Any help and suggestions are very much appreciated!! Thanks in advance..
>
>
>
> Thanks
>
> Swapnil Shinde

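The key comparison in the quoted message could be taken one step further by listing exactly which keys the join loses, so that filter queries can then be run on those specific keys. A minimal sketch in plain Scala on made-up in-memory sets; the names (JoinDiagnostics, droppedByJoin, and the sample values) are hypothetical. On the real tables, the same set algebra should be available through Spark's Dataset.intersect (already used in the quoted message) and Dataset.except.

```scala
object JoinDiagnostics {
  // Keys present on both sides of the join should all appear in the join
  // output; whatever is missing from the output was dropped along the way.
  def droppedByJoin(factKeys: Set[Int], dimKeys: Set[Int], joinedKeys: Set[Int]): Set[Int] =
    factKeys.intersect(dimKeys).diff(joinedKeys)

  // Made-up sample: key 3 exists in both tables but is absent from the output.
  val loCustKeys: Set[Int] = Set(1, 2, 3)
  val cCustKeys: Set[Int] = Set(1, 2, 3, 4)
  val keysInJoinOutput: Set[Int] = Set(1, 2)
}
```

An empty result means the join kept every key it should have; a non-empty result gives concrete keys to test with single-key filter queries.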

-- 
Thanks & Regards,
Ravi
