From: Dave Brondsema <dbrondsema@geek.net>
To: user@hive.apache.org
Date: Fri, 19 Nov 2010 09:14:14 -0500
Subject: Re: Hive produces very small files despite hive.merge...=true settings

What version of Hadoop are you on?

On Thu, Nov 18, 2010 at 10:48 PM, Leo Alekseyev <dnquark@gmail.com> wrote:
I thought I was running Hive with those changes merged in, but to make
sure, I built the latest trunk version. The behavior changed somewhat
(as in, it runs 2 stages instead of 1), but it still generates the
same number of files (the number of files generated is equal to the
number of the original mappers, so I have no idea what the second stage
is actually doing).

See below for the query and its explain output. Stage 1 always runs; Stage 3
runs if hive.merge.mapfiles=true is set, but it still generates lots
of small files.

The query is kind of large, but in essence it's simply:

insert overwrite table foo partition(bar) select [columns] from
[table] tablesample(bucket 1 out of 10000 on rand()) where
[conditions].
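
Spelled out as runnable HiveQL, the pattern looks roughly like this (a sketch with placeholder names; the real schema is in the query below). The one structural point worth noting is that the dynamic partition column has to be the last expression in the select list, as ds is in the actual query:

insert overwrite table foo partition (bar)
select col_a, col_b, bar    -- dynamic partition column goes last
from src_table tablesample (bucket 1 out of 10000 on rand()) s
where s.bar = '2010-11-05';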


explain insert overwrite table hbase_prefilter3_us_sample partition (ds)
select server_host, client_ip, time_stamp,
concat(server_host, ':', regexp_extract(request_url, '/[^/]+/[^/]+/([^/]+)$', 1)),
referrer, parse_url(referrer, 'HOST'), user_agent, cookie,
geoip_int(client_ip, 'COUNTRY_CODE', './GeoIP.dat'), '', ds
from alogs_master TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rand()) am_s
where am_s.ds = '2010-11-05'
and am_s.request_url rlike '^/img[0-9]+/[0-9]+/[^.]+\.(png|jpg|gif|mp4|swf)$'
and geoip_int(am_s.client_ip, 'COUNTRY_CODE', './GeoIP.dat') = 'US';
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF alogs_master (TOK_TABLESAMPLE 1
10000 (TOK_FUNCTION rand)) am_s)) (TOK_INSERT (TOK_DESTINATION
(TOK_TAB hbase_prefilter3_us_sample (TOK_PARTSPEC (TOK_PARTVAL ds))))
(TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL server_host)) (TOK_SELEXPR
(TOK_TABLE_OR_COL client_ip)) (TOK_SELEXPR (TOK_TABLE_OR_COL
time_stamp)) (TOK_SELEXPR (TOK_FUNCTION concat (TOK_TABLE_OR_COL
server_host) ':' (TOK_FUNCTION regexp_extract (TOK_TABLE_OR_COL
request_url) '/[^/]+/[^/]+/([^/]+)$' 1))) (TOK_SELEXPR
(TOK_TABLE_OR_COL referrer)) (TOK_SELEXPR (TOK_FUNCTION parse_url
(TOK_TABLE_OR_COL referrer) 'HOST')) (TOK_SELEXPR (TOK_TABLE_OR_COL
user_agent)) (TOK_SELEXPR (TOK_TABLE_OR_COL cookie)) (TOK_SELEXPR
(TOK_FUNCTION geoip_int (TOK_TABLE_OR_COL client_ip) 'COUNTRY_CODE'
'./GeoIP.dat')) (TOK_SELEXPR '') (TOK_SELEXPR (TOK_TABLE_OR_COL ds)))
(TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL am_s) ds) '2010-11-05')
(rlike (. (TOK_TABLE_OR_COL am_s) request_url)
'^/img[0-9]+/[0-9]+/[^.]+\.(png|jpg|gif|mp4|swf)$')) (= (TOK_FUNCTION
geoip_int (. (TOK_TABLE_OR_COL am_s) client_ip) 'COUNTRY_CODE'
'./GeoIP.dat') 'US')))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-5 depends on stages: Stage-1, consists of Stage-4, Stage-3
  Stage-4
  Stage-0 depends on stages: Stage-4, Stage-3
  Stage-2 depends on stages: Stage-0
  Stage-3

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        am_s
          TableScan
            alias: am_s
            Filter Operator
              predicate:
                  expr: (((hash(rand()) & 2147483647) % 10000) = 0)
                  type: boolean
              Filter Operator
                predicate:
                    expr: ((request_url rlike '^/img[0-9]+/[0-9]+/[^.]+.(png|jpg|gif|mp4|swf)$') and (GenericUDFGeoIP ( client_ip, 'COUNTRY_CODE', './GeoIP.dat' ) = 'US'))
                    type: boolean
                Filter Operator
                  predicate:
                      expr: (((ds = '2010-11-05') and (request_url rlike '^/img[0-9]+/[0-9]+/[^.]+.(png|jpg|gif|mp4|swf)$')) and (GenericUDFGeoIP ( client_ip, 'COUNTRY_CODE', './GeoIP.dat' ) = 'US'))
                      type: boolean
                  Select Operator
                    expressions:
                          expr: server_host
                          type: string
                          expr: client_ip
                          type: int
                          expr: time_stamp
                          type: int
                          expr: concat(server_host, ':', regexp_extract(request_url, '/[^/]+/[^/]+/([^/]+)$', 1))
                          type: string
                          expr: referrer
                          type: string
                          expr: parse_url(referrer, 'HOST')
                          type: string
                          expr: user_agent
                          type: string
                          expr: cookie
                          type: string
                          expr: GenericUDFGeoIP ( client_ip, 'COUNTRY_CODE', './GeoIP.dat' )
                          type: string
                          expr: ''
                          type: string
                          expr: ds
                          type: string
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10
                    File Output Operator
                      compressed: true
                      GlobalTableId: 1
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          name: hbase_prefilter3_us_sample

  Stage: Stage-5
    Conditional Operator

  Stage: Stage-4
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://namenode.imageshack.us:9000/tmp/hive-hadoop/hive_2010-11-18_17-58-36_843_6726655151866456030/-ext-10000

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            ds
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: hbase_prefilter3_us_sample

  Stage: Stage-2
    Stats-Aggr Operator

  Stage: Stage-3
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://namenode.imageshack.us:9000/tmp/hive-hadoop/hive_2010-11-18_17-58-36_843_6726655151866456030/-ext-10002
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: hbase_prefilter3_us_sample




On Thu, Nov 18, 2010 at 3:44 PM, Ning Zhang <nzhang@fb.com> wrote:
> I see. If you are using dynamic partitions, HIVE-1307 and HIVE-1622 need to be there for merging to take place. HIVE-1307 was committed to trunk on 08/25 and HIVE-1622 was committed on 09/13. The simplest way is to update your Hive trunk and rerun the query. If it still doesn't work, maybe you can post your query and the result of 'explain <query>' and we can take a look.
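>
> For context, a dynamic-partition insert like this one normally runs with session settings along these lines (a generic sketch, not taken from Leo's configuration):
>
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;  -- needed when every partition value is dynamic, as with ds here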
>
> Ning
>
> On Nov 18, 2010, at 2:57 PM, Leo Alekseyev wrote:
>
>> Hi Ning,
>> For the dataset I'm experimenting with, the total size of the output
>> is 2 MB, and the files are at most a few KB in size. My
>> hive.input.format was set to the default HiveInputFormat; however, when I
>> set it to CombineHiveInputFormat, it only made the first stage of the
>> job use fewer mappers. The merge job was *still* filtered out at
>> runtime. I also tried set hive.mergejob.maponly=false; that didn't
>> have any effect.
>>
>> I'm at a bit of a loss as to what to do here. Is there a way to see
>> exactly what's going on, e.g. with debug log levels? Btw, I'm also using
>> dynamic partitions; could that somehow be interfering with the merge
>> job?
>>
>> I'm running a relatively fresh Hive from trunk (built maybe a month ago).
>>
>> --Leo
>>
>> On Thu, Nov 18, 2010 at 1:12 PM, Ning Zhang <nzhang@fb.com> wrote:
>>> The settings look good. The parameter hive.merge.size.smallfiles.avgsize is used to determine at run time whether a merge should be triggered: if the average size of the files in the partition is SMALLER than the parameter and there is more than one file, the merge should be scheduled. Can you check whether your resulting partition also contains any big files? If a very large file is inflating the average, you can set the parameter high enough to compensate.
>>>
>>> Another possibility is that your Hadoop installation does not support CombineHiveInputFormat, which is used for the new merge job. Someone previously reported that the merge failed because of this. If that's the case, you can turn off CombineHiveInputFormat and use the old HiveInputFormat (though it is slower) by setting hive.mergejob.maponly=false.
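>>>
>>> Concretely, that would be something like the following in the CLI (a sketch; the avgsize value is only illustrative and should be sized against your largest files):
>>>
>>> set hive.mergejob.maponly=false;                   -- merge job falls back to the old HiveInputFormat
>>> set hive.merge.size.smallfiles.avgsize=256000000;  -- example of a larger threshold if big files inflate the average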
>>>
>>> Ning
>>> On Nov 17, 2010, at 6:00 PM, Leo Alekseyev wrote:
>>>
>>>> I have jobs that sample (or generate) a small amount of data from a
>>>> large table. At the end, I get e.g. about 3000 or more files of 1 KB
>>>> or so. This becomes a nuisance. How can I make Hive do another pass
>>>> to merge the output? I have the following settings:
>>>>
>>>> hive.merge.mapfiles=true
>>>> hive.merge.mapredfiles=true
>>>> hive.merge.size.per.task=256000000
>>>> hive.merge.size.smallfiles.avgsize=16000000
>>>>
>>>> After setting hive.merge.* to true, Hive started indicating "Total
>>>> MapReduce jobs = 2". However, after generating the
>>>> lots-of-small-files table, Hive says:
>>>> Ended Job = job_201011021934_1344
>>>> Ended Job = 781771542, job is filtered out (removed at runtime).
>>>>
>>>> Is there a way to force the merge, or am I missing something?
>>>> --Leo
>>>
>>>
>
>
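
As a sanity check, applying Ning's trigger rule to the numbers above (back-of-the-envelope, not Hive source): roughly 2 MB spread over some 3000 files is an average of about 700 bytes per file, far below the 16 MB hive.merge.size.smallfiles.avgsize threshold, so the size condition alone should schedule the merge; that points back at the dynamic-partition path and HIVE-1307/HIVE-1622. It can also help to print what a session actually has in effect, since 'set <property>;' with no value echoes the current setting:

set hive.merge.mapfiles;
set hive.merge.mapredfiles;
set hive.merge.size.smallfiles.avgsize;
set hive.input.format;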



--
Dave Brondsema
Software Engineer
Geeknet

www.geek.net