hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eric Sammer <e...@lifeless.net>
Subject Null values in hive output
Date Tue, 05 Jan 2010 00:10:39 GMT
All:

I apologize in advance if this is common. I've searched and I can't find
an explanation.

I'm loading a plain text tab delimited file into a Hive (0.4.1-dev)
table. This file is a small sample set of my full dataset and is the
result of a M/R job, written by TextOutputFormat, if it matters. When I
query the table, a small percentage (a few hundred out of a few million)
of the rows contain null values where as the input file does not contain
any null values. The number of null field records seems to grow
proportionally to the total number of records at a relatively constant
rate. It looks as if it's a SerDe error / misconfiguration of some kind,
but I can't pinpoint anything that would cause the issue.

To confirm, I've done an fs -cat of the file to local disk and used cut
and sort to confirm all fields are properly formatted and populated.
Below is the extended table description along with some additional
information. Any help is greatly appreciated as using Hive for simple
aggregation is saving me a ton of time from having to hand write the M/R
jobs myself.

I'm sure there's something I've done wrong. Unfortunately, I'm in a
situation where I can't deal with any portion of the records being
dumped (part of a reporting system).

Original create:

hive> create table raw_facts ( year int, month int, day int, application
string, company_id int, country_code string, receiver_code_id int,
keyword string, total int ) row format delimited fields terminated by '\t';

(I've also tried row format TEXTFORMAT or whatever it is; all fields
were null - assumed it was because hive was expecting ^A delimited.)

Table:

hive> describe extended raw_facts;
OK
year    int
month   int
day     int
application     string
company_id      int
country_code    string
receiver_code_id        int
keyword string
total   int

Detailed Table Information      Table(tableName:raw_facts,
dbName:default, owner:<snip>, createTime:1262631537, lastAccessTime:0,
retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:year, type:int,
comment:null), FieldSchema(name:month, type:int, comment:null),
FieldSchema(name:day, type:int, comment:null),
FieldSchema(name:application, type:string, comment:null),
FieldSchema(name:company_id, type:int, comment:null),
FieldSchema(name:country_code, type:string, comment:null),
FieldSchema(name:receiver_code_id, type:int, comment:null),
FieldSchema(name:keyword, type:string, comment:null),
FieldSchema(name:total, type:int, comment:null)],
location:hdfs://<snip>/home/hive/warehouse/raw_facts,
inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
parameters:{serialization.format=9,field.delim=      }), bucketCols:[],
sortCols:[], parameters:{}), partitionKeys:[], parameters:{})

Sample (real) rows: (these are tab separated in the file)

2009    12      01      f       98      US      171     test    222
2009    12      01      f       98      US      199     test    222
2009    12      01      f       98      US      220     test    222

Load command used:

hive> load data inpath 'hdfs://<snip>/some/path/out/part-r-00000'
overwrite into table raw_facts ;

Some queries:

hive> select count(1) from raw_facts;
OK
4723253

hive> select count(1) from raw_facts where year is null;
OK
277

hive> select year,count(1) from raw_facts group by year;
OK
NULL    277
2009    4722976

Thanks in advance.
-- 
Eric Sammer
eric@lifless.net
http://esammer.blogspot.com

Mime
View raw message