On Mon, May 3, 2010 at 12:28 AM, Sachin Bochare <sachin_bochare@persistent.co.in> wrote:

Thanks Edwards.

I was exploring indexing patch and wanted to know how the index table looks like.

Few points which were not mentioned in my earlier posting:

1. I haven't created any index on the test table. So index logic is not involved here. The query is not working on normal table.
2. I checked the jobtracker result in Hadoop web interface and found that Hadoop has read 277 bytes and written 33 bytes. The counters are HDFS_BYTES_READ=277 and HDFS_BYTES_WRITTEN=23. The input file size is 277 bytes and final result is 23 bytes. So Hadoop had returned the correct output. However for some reason Hive hasn't received or returned those results.

I guess it would be a minor code change. I want to identify the code and fix it temporary in my send box. Could someone please point to the module where I should look for this issue?

Regards,
Sachin

 


From: Edward Capriolo [mailto:edlinuxguru@gmail.com]
Sent: Sunday, May 02, 2010 8:08 PM
To: hive-user@hadoop.apache.org
Subject: Re: Select not working with Index patch

 

 

On Sun, May 2, 2010 at 2:29 AM, Sachin Bochare <sachin_bochare@persistent.co.in> wrote:

Hi,

 

I applied index patch available at : https://issues.apache.org/jira/browse/HIVE-678

 

However after applying the indexing patch, simple select statements are not showing any results. The "select *" is working but selecting a specific column is not working. I have pasted an example below which illustrates the problem.

 

The same select is working without the patch on the same metastore_db. The only difference between working code and non-working code is the patch.

 

I used 796926 version of the code. The patch attached in HIVE-678 was created on this version.

 

Following example illustrates the problem:

 

Example with patch code:

-----------------------------

 

=====================================

hive> create table ourtest (empid int, firstname string, lastname string, hoursworked int) partitioned by(dt string, place string) clustered by (empid) sorted by(hoursworked) into 4 buckets row format delimited fields terminated by ',' stored as textfile;
OK
Time taken: 0.307 seconds
hive> LOAD DATA LOCAL INPATH '/root/data/ourtest_data.csv' INTO TABLE ourtest PARTITION(dt='2010-02-27', place='Pune');
Copying data from file:/root/data/ourtest_data.csv
Loading data to table ourtest partition {dt=2010-02-27, place=Pune}
OK
Time taken: 0.753 seconds
hive> select * from ourtest; ---> Select * is working fine.
OK
0       firstname       lastname        0       2010-02-27      Pune
1       firstname1      lastname1       1       2010-02-27      Pune
2       firstname2      lastname2       2       2010-02-27      Pune
3       firstname3      lastname3       3       2010-02-27      Pune
4       firstname4      lastname4       4       2010-02-27      Pune
5       firstname5      lastname5       5       2010-02-27      Pune
6       firstname6      lastname6       6       2010-02-27      Pune
7       firstname7      lastname7       7       2010-02-27      Pune
8       firstname8      lastname8       8       2010-02-27      Pune
9       firstname9      lastname9       9       2010-02-27      Pune
10      firstname10     lastname10      10      2010-02-27      Pune
Time taken: 0.106 seconds
hive> select empid from ourtest; ---> Selecting specific column is not working.
Total MapReduce jobs = 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201002091652_0170, Tracking URL = http://v-hadoop3.persistent.co.in:60030/jobdetails.jsp?jobid=job_201002091652_0170
Kill Command = /root/hadoop-0.20.1/bin/../bin/hadoop job  -Dmapred.job.tracker=v-hadoop3.persistent.co.in:30001 -kill job_201002091652_0170
2010-05-02 08:40:48,951 map = 0%,  reduce =0%
2010-05-02 08:40:58,044 map = 50%,  reduce =0%
2010-05-02 08:40:59,057 map = 100%,  reduce =0%
2010-05-02 08:41:02,067 map = 100%,  reduce =100%
Ended Job = job_201002091652_0170
OK
Time taken: 15.494 seconds
=====================================

 

Example without patch code:

--------------------------------

Example query is working after using without-patch code on the same metastore_db.

 

=====================================

root@v-hadoop3:~/sachin/Hive-796926-Patch# ../Hive-796926/build/dist/bin/hive
Hive history file=/tmp/root/hive_job_log_root_201005020928_924651644.txt
hive> select empid from ourtest;
Total MapReduce jobs = 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201002091652_0190, Tracking URL = http://v-hadoop3.persistent.co.in:60030/jobdetails.jsp?jobid=job_201002091652_0190
Kill Command = /root/hadoop-0.20.1/bin/../bin/hadoop job  -Dmapred.job.tracker=v-hadoop3.persistent.co.in:30001 -kill job_201002091652_0190
2010-05-02 09:29:04,733 map = 0%,  reduce =0%
2010-05-02 09:29:18,799 map = 100%,  reduce =0%
2010-05-02 09:29:21,823 map = 100%,  reduce =100%
Ended Job = job_201002091652_0190
OK
0
1
2
3
4
5
6
7
8
9
10
Time taken: 22.268 seconds

=====================================

 

Can anyone point to what can be the problem here? Which module is a suspect here?

 

Regards,

Sachin

DISCLAIMER ========== This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.


The comments for that issue seem to suggest the patch is not complete yet. For reference 'select *' queries simply read that block data from hdfs so they do not use map-reduce (and thus probably do not use any indexes either.

Edward

DISCLAIMER ========== This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.


First, let me say that code is experiementatal it likely never worked completely at any point.

Hive trunk is fairly active, many things could change. I believe to find what you are looking for you should do it in the opposite manner.

It is the patch that is breaking hive. Hive did not break around your patch. Thus,

I suggest you go though the patch and only apply it incrementatly. First, do NOT modify anything only ADD new class files. Now run some testing maybe sure your query works. Run some Hive unit tests.

Then apply the sections of the patch that alter existing code method by method. Keep testing as you go, at some point your use case will break.

As I said up top, the patch was totally experimental and probably never worked completely. It needs some large refactoring before it comes close to considering. So if you want to hack at it have fun.