hive-user mailing list archives

From John Omernik <j...@omernik.com>
Subject ORC Files: Does this get me anything?
Date Wed, 16 Oct 2013 22:42:03 GMT
So I am experimenting with ORC files, and I have a fast little table that
has login events. Out of curiosity, I was wondering: based on what we all
know about ORC files, if I did the below, would the per-file indexing get me
anything? Now, before people complain about small files, let's toss that
aside for now.

I have
set mapred.reduce.tasks=26;
insert into table ogintable
select * from main_table where loginid != ''
distribute by (abs(hash(substring(loginid, 0, 1))) % 26)
sort by loginid;


Basically, I am thinking that if I distribute by the expression above, each
letter will get its own file, and thus the files act as a mini index. Am I
overthinking this? I know if I do just the sort by, I get 3 to 4 files. With
this method I get more files, and since loginid is an extremely common WHERE
clause member, I was thinking this may be a good thing. Maybe I am wrong;
figured I'd send it out to the group to get made fun of/ridiculed in public :)
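
One way to sanity-check the "each letter gets its own file" assumption is to
count how many distinct first characters land in each bucket. This is only a
rough sketch: it reuses the same expression as the insert above, and the
column aliases (bucket, row_count, distinct_first_chars) are names made up
for illustration. If loginids start with anything beyond one 26-letter range
(mixed case, digits, punctuation), some buckets will necessarily hold more
than one first character. The second statement shows the kind of lookup the
layout is meant to help. The value 'alice' is a hypothetical loginid, and
whether the ORC reader actually skips data for it depends on the Hive version
supporting ORC predicate pushdown, which is assumed here rather than verified.

```sql
-- Sketch: how evenly does the distribute-by expression spread the data?
-- A distinct_first_chars value above 1 means several first characters share a file.
select abs(hash(substring(loginid, 0, 1))) % 26 as bucket,
       count(*) as row_count,
       count(distinct substring(loginid, 0, 1)) as distinct_first_chars
from main_table
where loginid != ''
group by abs(hash(substring(loginid, 0, 1))) % 26;

-- Sketch: the typical lookup this layout targets.
-- Assumption: the Hive version in use pushes this filter down to the ORC reader.
set hive.optimize.index.filter=true;
select * from ogintable where loginid = 'alice';  -- 'alice' is a made-up example value
```

Since the rows inside each file are also sorted by loginid, the min/max
statistics ORC keeps for the column should cover narrow, non-overlapping
ranges, which is what would let a reader skip data for a filter like the one
above.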
