From: Dave Brondsema <dbrondsema@geek.net>
To: user@hive.apache.org
Date: Fri, 19 Nov 2010 09:14:14 -0500
Subject: Re: Hive produces very small files despite hive.merge...=true settings

What version of Hadoop are you on?

On Thu, Nov 18, 2010 at 10:48 PM, Leo Alekseyev <dnquark@gmail.com> wrote:
I thought I was running Hive with those changes merged in, but to make
sure, I built the latest trunk version. The behavior changed somewhat
(as in, it runs 2 stages instead of 1), but it still generates the
same number of files (the number of files generated is equal to the
number of the original mappers, so I have no idea what the second stage
is actually doing).

See below for the query and its explain output. Stage 1 always runs; Stage 3
runs if hive.merge.mapfiles=true is set, but it still generates lots
of small files.

The query is kind of large, but in essence it's simply:

insert overwrite table foo partition(bar) select [columns] from
[table] tablesample(bucket 1 out of 10000 on rand()) where
[conditions].
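
Spelled out as runnable HiveQL, the pattern looks roughly like this (a sketch with placeholder names; the real schema is in the query below). The one structural point worth noting is that the dynamic partition column has to be the last expression in the select list, as ds is in the actual query:

insert overwrite table foo partition (bar)
select col_a, col_b, bar    -- dynamic partition column goes last
from src_table tablesample (bucket 1 out of 10000 on rand()) s
where s.bar = '2010-11-05';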


explain insert overwrite table hbase_prefilter3_us_sample partition (ds)
select server_host, client_ip, time_stamp,
concat(server_host, ':', regexp_extract(request_url, '/[^/]+/[^/]+/([^/]+)$', 1)),
referrer, parse_url(referrer, 'HOST'), user_agent, cookie,
geoip_int(client_ip, 'COUNTRY_CODE', './GeoIP.dat'), '', ds
from alogs_master TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rand()) am_s
where am_s.ds = '2010-11-05'
and am_s.request_url rlike '^/img[0-9]+/[0-9]+/[^.]+\.(png|jpg|gif|mp4|swf)$'
and geoip_int(am_s.client_ip, 'COUNTRY_CODE', './GeoIP.dat') = 'US';
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF alogs_master (TOK_TABLESAMPLE 1
10000 (TOK_FUNCTION rand)) am_s)) (TOK_INSERT (TOK_DESTINATION
(TOK_TAB hbase_prefilter3_us_sample (TOK_PARTSPEC (TOK_PARTVAL ds))))
(TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL server_host)) (TOK_SELEXPR
(TOK_TABLE_OR_COL client_ip)) (TOK_SELEXPR (TOK_TABLE_OR_COL
time_stamp)) (TOK_SELEXPR (TOK_FUNCTION concat (TOK_TABLE_OR_COL
server_host) ':' (TOK_FUNCTION regexp_extract (TOK_TABLE_OR_COL
request_url) '/[^/]+/[^/]+/([^/]+)$' 1))) (TOK_SELEXPR
(TOK_TABLE_OR_COL referrer)) (TOK_SELEXPR (TOK_FUNCTION parse_url
(TOK_TABLE_OR_COL referrer) 'HOST')) (TOK_SELEXPR (TOK_TABLE_OR_COL
user_agent)) (TOK_SELEXPR (TOK_TABLE_OR_COL cookie)) (TOK_SELEXPR
(TOK_FUNCTION geoip_int (TOK_TABLE_OR_COL client_ip) 'COUNTRY_CODE'
'./GeoIP.dat')) (TOK_SELEXPR '') (TOK_SELEXPR (TOK_TABLE_OR_COL ds)))
(TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL am_s) ds) '2010-11-05')
(rlike (. (TOK_TABLE_OR_COL am_s) request_url)
'^/img[0-9]+/[0-9]+/[^.]+\.(png|jpg|gif|mp4|swf)$')) (= (TOK_FUNCTION
geoip_int (. (TOK_TABLE_OR_COL am_s) client_ip) 'COUNTRY_CODE'
'./GeoIP.dat') 'US')))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-5 depends on stages: Stage-1, consists of Stage-4, Stage-3
  Stage-4
  Stage-0 depends on stages: Stage-4, Stage-3
  Stage-2 depends on stages: Stage-0
  Stage-3

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        am_s
          TableScan
            alias: am_s
            Filter Operator
              predicate:
                  expr: (((hash(rand()) & 2147483647) % 10000) = 0)
                  type: boolean
              Filter Operator
                predicate:
                    expr: ((request_url rlike '^/img[0-9]+/[0-9]+/[^.]+.(png|jpg|gif|mp4|swf)$') and (GenericUDFGeoIP ( client_ip, 'COUNTRY_CODE', './GeoIP.dat' ) = 'US'))
                    type: boolean
                Filter Operator
                  predicate:
                      expr: (((ds = '2010-11-05') and (request_url rlike '^/img[0-9]+/[0-9]+/[^.]+.(png|jpg|gif|mp4|swf)$')) and (GenericUDFGeoIP ( client_ip, 'COUNTRY_CODE', './GeoIP.dat' ) = 'US'))
                      type: boolean
                  Select Operator
                    expressions:
                          expr: server_host
                          type: string
                          expr: client_ip
                          type: int
                          expr: time_stamp
                          type: int
                          expr: concat(server_host, ':', regexp_extract(request_url, '/[^/]+/[^/]+/([^/]+)$', 1))
                          type: string
                          expr: referrer
                          type: string
                          expr: parse_url(referrer, 'HOST')
                          type: string
                          expr: user_agent
                          type: string
                          expr: cookie
                          type: string
                          expr: GenericUDFGeoIP ( client_ip, 'COUNTRY_CODE', './GeoIP.dat' )
                          type: string
                          expr: ''
                          type: string
                          expr: ds
                          type: string
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10
                    File Output Operator
                      compressed: true
                      GlobalTableId: 1
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          name: hbase_prefilter3_us_sample

  Stage: Stage-5
    Conditional Operator

  Stage: Stage-4
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://namenode.imageshack.us:9000/tmp/hive-hadoop/hive_2010-11-18_17-58-36_843_6726655151866456030/-ext-10000

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            ds
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: hbase_prefilter3_us_sample

  Stage: Stage-2
    Stats-Aggr Operator

  Stage: Stage-3
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://namenode.imageshack.us:9000/tmp/hive-hadoop/hive_2010-11-18_17-58-36_843_6726655151866456030/-ext-10002
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: hbase_prefilter3_us_sample




On Thu, Nov 18, 2010 at 3:44 PM, Ning Zhang <nzhang@fb.com> wrote:
> I see. If you are using dynamic partitions, HIVE-1307 and HIVE-1622 need to be there for merging to take place. HIVE-1307 was committed to trunk on 08/25 and HIVE-1622 was committed on 09/13. The simplest way is to update your Hive trunk and rerun the query. If it still doesn't work, maybe you can post your query and the result of 'explain <query>' and we can take a look.
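>
> For context, a dynamic-partition insert like this one normally runs with session settings along these lines (a generic sketch, not taken from Leo's configuration):
>
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;  -- needed when every partition value is dynamic, as with ds here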
>
> Ning
>
> On Nov 18, 2010, at 2:57 PM, Leo Alekseyev wrote:
>
>> Hi Ning,
>> For the dataset I'm experimenting with, the total size of the output
>> is 2 MB, and the files are at most a few KB in size. My
>> hive.input.format was set to the default HiveInputFormat; however, when I
>> set it to CombineHiveInputFormat, it only made the first stage of the
>> job use fewer mappers. The merge job was *still* filtered out at
>> runtime. I also tried set hive.mergejob.maponly=false; that didn't
>> have any effect.
>>
>> I'm at a bit of a loss as to what to do here. Is there a way to see
>> exactly what's going on, e.g. with debug log levels? Btw, I'm also using
>> dynamic partitions; could that somehow be interfering with the merge
>> job?
>>
>> I'm running a relatively fresh Hive from trunk (built maybe a month ago).
>>
>> --Leo
>>
>> On Thu, Nov 18, 2010 at 1:12 PM, Ning Zhang <nzhang@fb.com> wrote:
>>> The settings look good. The parameter hive.merge.size.smallfiles.avgsize is used to determine at run time whether a merge should be triggered: if the average size of the files in the partition is SMALLER than the parameter and there is more than one file, the merge should be scheduled. Can you check whether your resulting partition also contains any big files? If a very large file is inflating the average, you can set the parameter high enough to compensate.
>>>
>>> Another possibility is that your Hadoop installation does not support CombineHiveInputFormat, which is used for the new merge job. Someone previously reported that the merge failed because of this. If that's the case, you can turn off CombineHiveInputFormat and use the old HiveInputFormat (though it is slower) by setting hive.mergejob.maponly=false.
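>>>
>>> Concretely, that would be something like the following in the CLI (a sketch; the avgsize value is only illustrative and should be sized against your largest files):
>>>
>>> set hive.mergejob.maponly=false;                   -- merge job falls back to the old HiveInputFormat
>>> set hive.merge.size.smallfiles.avgsize=256000000;  -- example of a larger threshold if big files inflate the average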
>>>
>>> Ning
>>> On Nov 17, 2010, at 6:00 PM, Leo Alekseyev wrote:
>>>
>>>> I have jobs that sample (or generate) a small amount of data from a
>>>> large table. At the end, I get e.g. about 3000 or more files of 1 KB
>>>> or so. This becomes a nuisance. How can I make Hive do another pass
>>>> to merge the output? I have the following settings:
>>>>
>>>> hive.merge.mapfiles=true
>>>> hive.merge.mapredfiles=true
>>>> hive.merge.size.per.task=256000000
>>>> hive.merge.size.smallfiles.avgsize=16000000
>>>>
>>>> After setting hive.merge.* to true, Hive started indicating "Total
>>>> MapReduce jobs = 2". However, after generating the
>>>> lots-of-small-files table, Hive says:
>>>> Ended Job = job_201011021934_1344
>>>> Ended Job = 781771542, job is filtered out (removed at runtime).
>>>>
>>>> Is there a way to force the merge, or am I missing something?
>>>> --Leo
>>>
>>>
>
>
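
As a sanity check, applying Ning's trigger rule to the numbers above (back-of-the-envelope, not Hive source): roughly 2 MB spread over some 3000 files is an average of about 700 bytes per file, far below the 16 MB hive.merge.size.smallfiles.avgsize threshold, so the size condition alone should schedule the merge; that points back at the dynamic-partition path and HIVE-1307/HIVE-1622. It can also help to print what a session actually has in effect, since 'set <property>;' with no value echoes the current setting:

set hive.merge.mapfiles;
set hive.merge.mapredfiles;
set hive.merge.size.smallfiles.avgsize;
set hive.input.format;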



--
Dave Brondsema
Software Engineer
Geeknet

www.geek.net