hive-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: clustered bucket and tablesample
Date Sun, 15 May 2016 20:54:46 GMT
Hi,

OK I tried your table.

0: jdbc:hive2://rhes564:10010/default> describe formatted bucket_x;
OK
+-------------------------------+-----------------------------------------------------------+-----------------------------+--+
|           col_name            |                         data_type                         |           comment           |
+-------------------------------+-----------------------------------------------------------+-----------------------------+--+
| # col_name                    | data_type                                                 | comment                     |
|                               | NULL                                                      | NULL                        |
| classifier                    | string                                                    |                             |
|                               | NULL                                                      | NULL                        |
| # Detailed Table Information  | NULL                                                      | NULL                        |
| Database:                     | test                                                      | NULL                        |
| Owner:                        | hduser                                                    | NULL                        |
| CreateTime:                   | Sun May 15 19:10:52 BST 2016                              | NULL                        |
| LastAccessTime:               | UNKNOWN                                                   | NULL                        |
| Retention:                    | 0                                                         | NULL                        |
| Location:                     | hdfs://rhes564:9000/user/hive/warehouse/test.db/bucket_x  | NULL                        |
| Table Type:                   | MANAGED_TABLE                                             | NULL                        |
| Table Parameters:             | NULL                                                      | NULL                        |
|                               | COLUMN_STATS_ACCURATE                                     | {\"BASIC_STATS\":\"true\"}  |
|                               | numFiles                                                  | 256                         |
|                               | numRows                                                   | 1000                        |
|                               | rawDataSize                                               | 93336                       |
|                               | totalSize                                                 | 57025                       |
|                               | transient_lastDdlTime                                     | 1463338829                  |
|                               | NULL                                                      | NULL                        |
| # Storage Information         | NULL                                                      | NULL                        |
| SerDe Library:                | org.apache.hadoop.hive.ql.io.orc.OrcSerde                 | NULL                        |
| InputFormat:                  | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat           | NULL                        |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat          | NULL                        |
| Compressed:                   | No                                                        | NULL                        |
| Num Buckets:                  | 256                                                       | NULL                        |
| Bucket Columns:               | [classifier]                                              | NULL                        |
| Sort Columns:                 | []                                                        | NULL                        |
| Storage Desc Params:          | NULL                                                      | NULL                        |
|                               | serialization.format                                      | 1                           |
+-------------------------------+-----------------------------------------------------------+-----------------------------+--+


select count(1) from bucket_x;
INFO  : OK
+-------+--+
|  c0   |
+-------+--+
| 1000  |
+-------+--+

Now let us iterate through all 256 buckets with the following shell function.
It generates the script ${IN_FILE}, which we then run against Hive. In my case
it is pretty fast as I use Hive on the Spark engine.

function genrandomnumber {
  # Append one SELECT per bucket to the script file ${IN_FILE}
  integer BUCKETNUMBER=1
  integer BUCKETS=256
  while ((BUCKETNUMBER <= BUCKETS))
  do
     echo "SELECT ${BUCKETNUMBER} AS BucketNumber, COUNT(1) AS Occurance FROM bucket_x TABLESAMPLE(BUCKET ${BUCKETNUMBER} OUT OF ${BUCKETS} ON classifier='cl_900');" >> ${IN_FILE}
     ((BUCKETNUMBER = BUCKETNUMBER + 1))
  done
}
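
For illustration, the first statement the loop appends to ${IN_FILE} is:

SELECT 1 AS BucketNumber, COUNT(1) AS Occurance
FROM bucket_x TABLESAMPLE(BUCKET 1 OUT OF 256 ON classifier='cl_900');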

And the results don't make sense!

+---------------+------------+--+
| bucketnumber  | occurance  |
+---------------+------------+--+
| 1             | 999        |
+---------------+------------+--+
| bucketnumber  | occurance  |
+---------------+------------+--+
| 2             | 1          |
+---------------+------------+--+
| bucketnumber  | occurance  |
+---------------+------------+--+
| 3             | 0          |
+---------------+------------+--+
| bucketnumber  | occurance  |
+---------------+------------+--+
| 4             | 0          |
+---------------+------------+--+

So apparently bucket 1 has 999 entries and bucket 2 has 1.

Let us try it for classifier='cl_103'

+---------------+------------+--+
| bucketnumber  | occurance  |
+---------------+------------+--+
| 1             | 999        |
+---------------+------------+--+
| bucketnumber  | occurance  |
+---------------+------------+--+
| 2             | 1          |
+---------------+------------+--+
| bucketnumber  | occurance  |
+---------------+------------+--+
| 3             | 0          |
+---------------+------------+--+
The same crap. To me, hash partitioning on a string column is unpredictable;
with an integer column it is fine. I believe there is an underlying bug here.
The alternative is to use an integer as a surrogate column for hash
partitioning, like a sequence in Oracle or an identity column in Sybase/MSSQL.
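
Note, from the EXPLAIN output quoted further down, that the sampler hashes
the ON expression itself, e.g.

   predicate: (((hash((id = 10)) & 2147483647) % 256) = 0)

With ON classifier='cl_900' that expression is a boolean, so there are only
two possible hash inputs: false for the 999 rows that do not match, and true
for the single row that does, which would explain the 999/1 split above. A
minimal sketch to check this, assuming the built-in hash() UDF matches what
the sampler uses:

SELECT classifier = 'cl_900' AS matches,
       (hash(classifier = 'cl_900') & 2147483647) % 256 + 1 AS sampled_bucket,
       COUNT(1) AS cnt
FROM bucket_x
GROUP BY classifier = 'cl_900',
         (hash(classifier = 'cl_900') & 2147483647) % 256 + 1;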

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 15 May 2016 at 12:29, no jihun <jeesim2@gmail.com> wrote:

> OK Talebzadeh thanks.
>
> Have you ever tried tablesample with a string value hash?
> Something like tablesample(bucket 1 out of 256 on
> some_field='somedata....')?
>
>
> I wrote a full scenario.
>
> # table creation
> hive> CREATE TABLE `bucket_x` (
>     >   `classifier` string)
>     > CLUSTERED BY ( classifier)
>     > INTO 256 BUCKETS
>     > STORED AS ORC;
> OK
>
>
> # check option before data insert
> hive> set hive.enforce.bucketing;
> hive.enforce.bucketing=true
>
>
> # insert 1,000 rows
> (the number of distinct values is also 1,000)
>
> hive > insert into bucket_x
> values ('cl_0'),('cl_1'),('cl_2'),('cl_3'),('cl_4'),('cl_5'),('cl_6'),('cl_7'),('cl_8'),('cl_9'),('cl_10'),('cl_11'),('cl_12'),('cl_13'),('cl_14'),('cl_15'),('cl_16'),('cl_17'),('cl_18'),('cl_19'),('cl_20'),('cl_21'),('cl_22'),('cl_23'),('cl_24'),('cl_25'),('cl_26'),('cl_27'),('cl_28'),('cl_29'),('cl_30'),('cl_31'),('cl_32'),('cl_33'),('cl_34'),('cl_35'),('cl_36'),('cl_37'),('cl_38'),('cl_39'),('cl_40'),('cl_41'),('cl_42'),('cl_43'),('cl_44'),('cl_45'),('cl_46'),('cl_47'),('cl_48'),('cl_49'),('cl_50'),('cl_51'),('cl_52'),('cl_53'),('cl_54'),('cl_55'),('cl_56'),('cl_57'),('cl_58'),('cl_59'),('cl_60'),('cl_61'),('cl_62'),('cl_63'),('cl_64'),('cl_65'),('cl_66'),('cl_67'),('cl_68'),('cl_69'),('cl_70'),('cl_71'),('cl_72'),('cl_73'),('cl_74'),('cl_75'),('cl_76'),('cl_77'),('cl_78'),('cl_79'),('cl_80'),('cl_81'),('cl_82'),('cl_83'),('cl_84'),('cl_85'),('cl_86'),('cl_87'),('cl_88'),('cl_89'),('cl_90'),('cl_91'),('cl_92'),('cl_93'),('cl_94'),('cl_95'),('cl_96'),('cl_97'),('cl_98'),('cl_99'),('cl_100'),('cl_101'),('cl_102'),('cl_103'),('cl_104'),('cl_105'),('cl_106'),('cl_107'),('cl_108'),('cl_109'),('cl_110'),('cl_111'),('cl_112'),('cl_113'),('cl_114'),('cl_115'),('cl_116'),('cl_117'),('cl_118'),('cl_119'),('cl_120'),('cl_121'),('cl_122'),('cl_123'),('cl_124'),('cl_125'),('cl_126'),('cl_127'),('cl_128'),('cl_129'),('cl_130'),('cl_131'),('cl_132'),('cl_133'),('cl_134'),('cl_135'),('cl_136'),('cl_137'),('cl_138'),('cl_139'),('cl_140'),('cl_141'),('cl_142'),('cl_143'),('cl_144'),('cl_145'),('cl_146'),('cl_147'),('cl_148'),('cl_149'),('cl_150'),('cl_151'),('cl_152'),('cl_153'),('cl_154'),('cl_155'),('cl_156'),('cl_157'),('cl_158'),('cl_159'),('cl_160'),('cl_161'),('cl_162'),('cl_163'),('cl_164'),('cl_165'),('cl_166'),('cl_167'),('cl_168'),('cl_169'),('cl_170'),('cl_171'),('cl_172'),('cl_173'),('cl_174'),('cl_175'),('cl_176'),('cl_177'),('cl_178'),('cl_179'),('cl_180'),('cl_181'),('cl_182'),('cl_183'),('cl_184'),('cl_185'),('cl_186'),('cl_187'),('cl_188'),('cl_189'),('cl_190'),('cl_191'),('cl_192'),('cl_193'),('cl_194'),('cl_195'),('cl_196'),('cl_197'),('cl_198'),('cl_199'),('cl_200'),('cl_201'),('cl_202'),('cl_203'),('cl_204'),('cl_205'),('cl_206'),('cl_207'),('cl_208'),('cl_209'),('cl_210'),('cl_211'),('cl_212'),('cl_213'),('cl_214'),('cl_215'),('cl_216'),('cl_217'),('cl_218'),('cl_219'),('cl_220'),('cl_221'),('cl_222'),('cl_223'),('cl_224'),('cl_225'),('cl_226'),('cl_227'),('cl_228'),('cl_229'),('cl_230'),('cl_231'),('cl_232'),('cl_233'),('cl_234'),('cl_235'),('cl_236'),('cl_237'),('cl_238'),('cl_239'),('cl_240'),('cl_241'),('cl_242'),('cl_243'),('cl_244'),('cl_245'),('cl_246'),('cl_247'),('cl_248'),('cl_249'),('cl_250'),('cl_251'),('cl_252'),('cl_253'),('cl_254'),('cl_255'),('cl_256'),('cl_257'),('cl_258'),('cl_259'),('cl_260'),('cl_261'),('cl_262'),('cl_263'),('cl_264'),('cl_265'),('cl_266'),('cl_267'),('cl_268'),('cl_269'),('cl_270'),('cl_271'),('cl_272'),('cl_273'),('cl_274'),('cl_275'),('cl_276'),('cl_277'),('cl_278'),('cl_279'),('cl_280'),('cl_281'),('cl_282'),('cl_283'),('cl_284'),('cl_285'),('cl_286'),('cl_287'),('cl_288'),('cl_289'),('cl_290'),('cl_291'),('cl_292'),('cl_293'),('cl_294'),('cl_295'),('cl_296'),('cl_297'),('cl_298'),('cl_299'),('cl_300'),('cl_301'),('cl_302'),('cl_303'),('cl_304'),('cl_305'),('cl_306'),('cl_307'),('cl_308'),('cl_309'),('cl_310'),('cl_311'),('cl_312'),('cl_313'),('cl_314'),('cl_315'),('cl_316'),('cl_317'),('cl_318'),('cl_319'),('cl_320'),('cl_321'),('cl_322'),('cl_323'),('cl_324'),('cl_325'),('cl_326'),('cl_327'),('cl_328'),('cl_329'),('cl_330'),('cl_331'),('cl
_332'),('cl_333'),('cl_334'),('cl_335'),('cl_336'),('cl_337'),('cl_338'),('cl_339'),('cl_340'),('cl_341'),('cl_342'),('cl_343'),('cl_344'),('cl_345'),('cl_346'),('cl_347'),('cl_348'),('cl_349'),('cl_350'),('cl_351'),('cl_352'),('cl_353'),('cl_354'),('cl_355'),('cl_356'),('cl_357'),('cl_358'),('cl_359'),('cl_360'),('cl_361'),('cl_362'),('cl_363'),('cl_364'),('cl_365'),('cl_366'),('cl_367'),('cl_368'),('cl_369'),('cl_370'),('cl_371'),('cl_372'),('cl_373'),('cl_374'),('cl_375'),('cl_376'),('cl_377'),('cl_378'),('cl_379'),('cl_380'),('cl_381'),('cl_382'),('cl_383'),('cl_384'),('cl_385'),('cl_386'),('cl_387'),('cl_388'),('cl_389'),('cl_390'),('cl_391'),('cl_392'),('cl_393'),('cl_394'),('cl_395'),('cl_396'),('cl_397'),('cl_398'),('cl_399'),('cl_400'),('cl_401'),('cl_402'),('cl_403'),('cl_404'),('cl_405'),('cl_406'),('cl_407'),('cl_408'),('cl_409'),('cl_410'),('cl_411'),('cl_412'),('cl_413'),('cl_414'),('cl_415'),('cl_416'),('cl_417'),('cl_418'),('cl_419'),('cl_420'),('cl_421'),('cl_422'),('cl_423'),('cl_424'),('cl_425'),('cl_426'),('cl_427'),('cl_428'),('cl_429'),('cl_430'),('cl_431'),('cl_432'),('cl_433'),('cl_434'),('cl_435'),('cl_436'),('cl_437'),('cl_438'),('cl_439'),('cl_440'),('cl_441'),('cl_442'),('cl_443'),('cl_444'),('cl_445'),('cl_446'),('cl_447'),('cl_448'),('cl_449'),('cl_450'),('cl_451'),('cl_452'),('cl_453'),('cl_454'),('cl_455'),('cl_456'),('cl_457'),('cl_458'),('cl_459'),('cl_460'),('cl_461'),('cl_462'),('cl_463'),('cl_464'),('cl_465'),('cl_466'),('cl_467'),('cl_468'),('cl_469'),('cl_470'),('cl_471'),('cl_472'),('cl_473'),('cl_474'),('cl_475'),('cl_476'),('cl_477'),('cl_478'),('cl_479'),('cl_480'),('cl_481'),('cl_482'),('cl_483'),('cl_484'),('cl_485'),('cl_486'),('cl_487'),('cl_488'),('cl_489'),('cl_490'),('cl_491'),('cl_492'),('cl_493'),('cl_494'),('cl_495'),('cl_496'),('cl_497'),('cl_498'),('cl_499'),('cl_500'),('cl_501'),('cl_502'),('cl_503'),('cl_504'),('cl_505'),('cl_506'),('cl_507'),('cl_508'),('cl_509'),('cl_510'),('cl_511'),('cl_512'),('cl_513'),('cl_514'),('cl_515'),('cl_516'),('cl_517'),('cl_518'),('cl_519'),('cl_520'),('cl_521'),('cl_522'),('cl_523'),('cl_524'),('cl_525'),('cl_526'),('cl_527'),('cl_528'),('cl_529'),('cl_530'),('cl_531'),('cl_532'),('cl_533'),('cl_534'),('cl_535'),('cl_536'),('cl_537'),('cl_538'),('cl_539'),('cl_540'),('cl_541'),('cl_542'),('cl_543'),('cl_544'),('cl_545'),('cl_546'),('cl_547'),('cl_548'),('cl_549'),('cl_550'),('cl_551'),('cl_552'),('cl_553'),('cl_554'),('cl_555'),('cl_556'),('cl_557'),('cl_558'),('cl_559'),('cl_560'),('cl_561'),('cl_562'),('cl_563'),('cl_564'),('cl_565'),('cl_566'),('cl_567'),('cl_568'),('cl_569'),('cl_570'),('cl_571'),('cl_572'),('cl_573'),('cl_574'),('cl_575'),('cl_576'),('cl_577'),('cl_578'),('cl_579'),('cl_580'),('cl_581'),('cl_582'),('cl_583'),('cl_584'),('cl_585'),('cl_586'),('cl_587'),('cl_588'),('cl_589'),('cl_590'),('cl_591'),('cl_592'),('cl_593'),('cl_594'),('cl_595'),('cl_596'),('cl_597'),('cl_598'),('cl_599'),('cl_600'),('cl_601'),('cl_602'),('cl_603'),('cl_604'),('cl_605'),('cl_606'),('cl_607'),('cl_608'),('cl_609'),('cl_610'),('cl_611'),('cl_612'),('cl_613'),('cl_614'),('cl_615'),('cl_616'),('cl_617'),('cl_618'),('cl_619'),('cl_620'),('cl_621'),('cl_622'),('cl_623'),('cl_624'),('cl_625'),('cl_626'),('cl_627'),('cl_628'),('cl_629'),('cl_630'),('cl_631'),('cl_632'),('cl_633'),('cl_634'),('cl_635'),('cl_636'),('cl_637'),('cl_638'),('cl_639'),('cl_640'),('cl_641'),('cl_642'),('cl_643'),('cl_644'),('cl_645'),('cl_646'),('cl_647'),('cl_648'),('cl_649'),('cl_650'),('cl_651'),('cl_652'),('cl_653'),('cl_654'),('cl_6
55'),('cl_656'),('cl_657'),('cl_658'),('cl_659'),('cl_660'),('cl_661'),('cl_662'),('cl_663'),('cl_664'),('cl_665'),('cl_666'),('cl_667'),('cl_668'),('cl_669'),('cl_670'),('cl_671'),('cl_672'),('cl_673'),('cl_674'),('cl_675'),('cl_676'),('cl_677'),('cl_678'),('cl_679'),('cl_680'),('cl_681'),('cl_682'),('cl_683'),('cl_684'),('cl_685'),('cl_686'),('cl_687'),('cl_688'),('cl_689'),('cl_690'),('cl_691'),('cl_692'),('cl_693'),('cl_694'),('cl_695'),('cl_696'),('cl_697'),('cl_698'),('cl_699'),('cl_700'),('cl_701'),('cl_702'),('cl_703'),('cl_704'),('cl_705'),('cl_706'),('cl_707'),('cl_708'),('cl_709'),('cl_710'),('cl_711'),('cl_712'),('cl_713'),('cl_714'),('cl_715'),('cl_716'),('cl_717'),('cl_718'),('cl_719'),('cl_720'),('cl_721'),('cl_722'),('cl_723'),('cl_724'),('cl_725'),('cl_726'),('cl_727'),('cl_728'),('cl_729'),('cl_730'),('cl_731'),('cl_732'),('cl_733'),('cl_734'),('cl_735'),('cl_736'),('cl_737'),('cl_738'),('cl_739'),('cl_740'),('cl_741'),('cl_742'),('cl_743'),('cl_744'),('cl_745'),('cl_746'),('cl_747'),('cl_748'),('cl_749'),('cl_750'),('cl_751'),('cl_752'),('cl_753'),('cl_754'),('cl_755'),('cl_756'),('cl_757'),('cl_758'),('cl_759'),('cl_760'),('cl_761'),('cl_762'),('cl_763'),('cl_764'),('cl_765'),('cl_766'),('cl_767'),('cl_768'),('cl_769'),('cl_770'),('cl_771'),('cl_772'),('cl_773'),('cl_774'),('cl_775'),('cl_776'),('cl_777'),('cl_778'),('cl_779'),('cl_780'),('cl_781'),('cl_782'),('cl_783'),('cl_784'),('cl_785'),('cl_786'),('cl_787'),('cl_788'),('cl_789'),('cl_790'),('cl_791'),('cl_792'),('cl_793'),('cl_794'),('cl_795'),('cl_796'),('cl_797'),('cl_798'),('cl_799'),('cl_800'),('cl_801'),('cl_802'),('cl_803'),('cl_804'),('cl_805'),('cl_806'),('cl_807'),('cl_808'),('cl_809'),('cl_810'),('cl_811'),('cl_812'),('cl_813'),('cl_814'),('cl_815'),('cl_816'),('cl_817'),('cl_818'),('cl_819'),('cl_820'),('cl_821'),('cl_822'),('cl_823'),('cl_824'),('cl_825'),('cl_826'),('cl_827'),('cl_828'),('cl_829'),('cl_830'),('cl_831'),('cl_832'),('cl_833'),('cl_834'),('cl_835'),('cl_836'),('cl_837'),('cl_838'),('cl_839'),('cl_840'),('cl_841'),('cl_842'),('cl_843'),('cl_844'),('cl_845'),('cl_846'),('cl_847'),('cl_848'),('cl_849'),('cl_850'),('cl_851'),('cl_852'),('cl_853'),('cl_854'),('cl_855'),('cl_856'),('cl_857'),('cl_858'),('cl_859'),('cl_860'),('cl_861'),('cl_862'),('cl_863'),('cl_864'),('cl_865'),('cl_866'),('cl_867'),('cl_868'),('cl_869'),('cl_870'),('cl_871'),('cl_872'),('cl_873'),('cl_874'),('cl_875'),('cl_876'),('cl_877'),('cl_878'),('cl_879'),('cl_880'),('cl_881'),('cl_882'),('cl_883'),('cl_884'),('cl_885'),('cl_886'),('cl_887'),('cl_888'),('cl_889'),('cl_890'),('cl_891'),('cl_892'),('cl_893'),('cl_894'),('cl_895'),('cl_896'),('cl_897'),('cl_898'),('cl_899'),('cl_900'),('cl_901'),('cl_902'),('cl_903'),('cl_904'),('cl_905'),('cl_906'),('cl_907'),('cl_908'),('cl_909'),('cl_910'),('cl_911'),('cl_912'),('cl_913'),('cl_914'),('cl_915'),('cl_916'),('cl_917'),('cl_918'),('cl_919'),('cl_920'),('cl_921'),('cl_922'),('cl_923'),('cl_924'),('cl_925'),('cl_926'),('cl_927'),('cl_928'),('cl_929'),('cl_930'),('cl_931'),('cl_932'),('cl_933'),('cl_934'),('cl_935'),('cl_936'),('cl_937'),('cl_938'),('cl_939'),('cl_940'),('cl_941'),('cl_942'),('cl_943'),('cl_944'),('cl_945'),('cl_946'),('cl_947'),('cl_948'),('cl_949'),('cl_950'),('cl_951'),('cl_952'),('cl_953'),('cl_954'),('cl_955'),('cl_956'),('cl_957'),('cl_958'),('cl_959'),('cl_960'),('cl_961'),('cl_962'),('cl_963'),('cl_964'),('cl_965'),('cl_966'),('cl_967'),('cl_968'),('cl_969'),('cl_970'),('cl_971'),('cl_972'),('cl_973'),('cl_974'),('cl_975'),('cl_976'),('cl_977'),('cl_978
'),('cl_979'),('cl_980'),('cl_981'),('cl_982'),('cl_983'),('cl_984'),('cl_985'),('cl_986'),('cl_987'),('cl_988'),('cl_989'),('cl_990'),('cl_991'),('cl_992'),('cl_993'),('cl_994'),('cl_995'),('cl_996'),('cl_997'),('cl_998'),('cl_999')
>
>
> # check for total count
> hive> select count(*) from bucket_x;
> OK
> 1000 // <-- returned 1000, correct.
>
> # count a specific classifier (without tablesample)
> hive> select count(*) from bucket_x where classifier='cl_900';
> Query ID = irteam_20160515201754_d381aff8-16ef-48be-b829-f1a01a530521
> Total jobs = 1
> Launching Job 1 out of 1
>
>
> Status: Running (Executing on YARN cluster with App id
> application_1462971998082_0025)
>
>
> --------------------------------------------------------------------------------
>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> --------------------------------------------------------------------------------
> Map 1 ..........   SUCCEEDED      2          2        0        0       1       0
> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
> --------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 7.63 s
> --------------------------------------------------------------------------------
> OK
> 1 // <-- returned 1, correct.
> Time taken: 8.064 seconds, Fetched: 1 row(s)
>
>
> # count a specific classifier (with tablesample)
>
> hive> select count(*) from bucket_x tablesample(bucket 1 out of 256 on
> classifier='cl_900') where classifier='cl_900';
> Query ID = irteam_20160515201913_91166686-b98c-40a4-990b-690c41c69c61
> Total jobs = 1
> Launching Job 1 out of 1
>
>
> Status: Running (Executing on YARN cluster with App id
> application_1462971998082_0025)
>
>
> --------------------------------------------------------------------------------
>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> --------------------------------------------------------------------------------
> Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
> --------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 3.81 s
> --------------------------------------------------------------------------------
> OK
> 0 // <-- it returned zero. WRONG! It should return 1 row.
> Time taken: 4.216 seconds, Fetched: 1 row(s)
>
> # so I checked the whole tablesampled data.
> hive> select * from bucket_x tablesample(bucket 1 out of 256 on
> classifier='cl_900');
> OK
> cl_974
> cl_336
> cl_457
> ...
> cl_852
> cl_698
> cl_731
> Time taken: 0.053 seconds, Fetched: 999 row(s)
> // <-- it returned 999 ROWS
> // ??? 999 ???
> // exactly everything except what I want to search.
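>
> I guess that to find the bucket that should hold 'cl_900' when the table
> is bucketed on the column itself, one would compute something like this
> (a sketch, assuming the built-in hash() UDF matches the bucketing hash):
>
> hive> SELECT (hash('cl_900') & 2147483647) % 256 + 1 AS bucket_no;
>
> and then read that bucket with
> tablesample(bucket <bucket_no> out of 256 on classifier).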
>
>
> I think I am doing tablesample totally wrong in the case of string values.
> Any idea?
>
> 2016-05-15 2:56 GMT+09:00 Mich Talebzadeh <mich.talebzadeh@gmail.com>:
>
>> This is your code
>>
>> SELECT COUNT(*) FROM X
>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>> WHERE action_id='aaa' AND classifier='bbb'
>>
>> Well, I have a table dummy with 100 million rows imported from Oracle in
>> ORC format
>>
>> hive> show create table dummy;
>> OK
>> CREATE TABLE `dummy`(
>>   `id` int,
>>   `clustered` int,
>>   `scattered` int,
>>   `randomised` int,
>>   `random_string` varchar(50),
>>   `small_vc` varchar(10),
>>   `padding` varchar(10))
>>
>>
>>
>> CLUSTERED BY (id)
>> INTO 256 BUCKETS
>> ROW FORMAT SERDE
>>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>> STORED AS INPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>> OUTPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>> LOCATION
>>   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/dummy'
>> TBLPROPERTIES (
>>   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
>>   'numFiles'='1',
>>   'numRows'='100000000',
>>   'orc.bloom.filter.columns'='ID',
>>   'orc.bloom.filter.fpp'='0.05',
>>   'orc.compress'='SNAPPY',
>>   'orc.create.index'='true',
>>   'orc.row.index.stride'='10000',
>>   'orc.stripe.size'='16777216',
>>   'rawDataSize'='0',
>>   'totalSize'='5662644579',
>>   'transient_lastDdlTime'='1463245925')
>>
>>
>> Now let us look at the plan for the following two cases. First, a simple case:
>>
>> hive> EXPLAIN SELECT COUNT(1) FROM dummy
>>     > WHERE id = 20;
>> OK
>> STAGE DEPENDENCIES:
>>   Stage-1 is a root stage
>>   Stage-0 depends on stages: Stage-1
>> STAGE PLANS:
>>   Stage: Stage-1
>>     Spark
>>       Edges:
>>         Reducer 2 <- Map 1 (GROUP, 1)
>>       DagName:
>> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:7
>>       Vertices:
>>         Map 1
>>             Map Operator Tree:
>>                 TableScan
>>                   alias: dummy
>>                   Statistics: Num rows: 100000000 Data size: 5662644736
>> Basic stats: COMPLETE Column stats: NONE
>>                   Filter Operator
>>                     predicate: (id = 20) (type: boolean)
>>                     Statistics: Num rows: 50000000 Data size: 2831322368
>>                     Basic stats: COMPLETE Column stats: NONE
>>                     Select Operator
>>                       Statistics: Num rows: 50000000 Data size:
>> 2831322368 Basic stats: COMPLETE Column stats: NONE
>>                       Group By Operator
>>                         aggregations: count(1)
>>                         mode: hash
>>                         outputColumnNames: _col0
>>                         Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                         Reduce Output Operator
>>                           sort order:
>>                           Statistics: Num rows: 1 Data size: 8 Basic
>> stats: COMPLETE Column stats: NONE
>>                           value expressions: _col0 (type: bigint)
>>         Reducer 2
>>             Reduce Operator Tree:
>>               Group By Operator
>>                 aggregations: count(VALUE._col0)
>>                 mode: mergepartial
>>                 outputColumnNames: _col0
>>                 Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                 File Output Operator
>>                   compressed: false
>>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                   table:
>>                       input format:
>> org.apache.hadoop.mapred.TextInputFormat
>>                       output format:
>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>                       serde:
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>   Stage: Stage-0
>>     Fetch Operator
>>       limit: -1
>>       Processor Tree:
>>         ListSink
>> Time taken: 0.064 seconds, Fetched: 51 row(s)
>>
>>
>> Now we try tablesample, assigning the predicate values inside the brackets
>> as below, because you are looking in bucket 1 for those values and you want
>> the optimizer to know that.
>>
>>
>>
>> hive> EXPLAIN SELECT COUNT(1) FROM dummy
>>     > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID = 10);
>> OK
>> STAGE DEPENDENCIES:
>>   Stage-1 is a root stage
>>   Stage-0 depends on stages: Stage-1
>> STAGE PLANS:
>>   Stage: Stage-1
>>     Spark
>>       Edges:
>>         Reducer 2 <- Map 1 (GROUP, 1)
>>       DagName:
>> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:10
>>       Vertices:
>>         Map 1
>>             Map Operator Tree:
>>                 TableScan
>>                   alias: dummy
>>                   Statistics: Num rows: 100000000 Data size: 5662644736
>> Basic stats: COMPLETE Column stats: NONE
>>                   Filter Operator
>>                     predicate: (((hash((id = 10)) & 2147483647) % 256) = 0) (type: boolean)
>>                     Statistics: Num rows: 50000000 Data size: 2831322368
>>                     Basic stats: COMPLETE Column stats: NONE
>>                     Select Operator
>>                       Statistics: Num rows: 50000000 Data size:
>> 2831322368 Basic stats: COMPLETE Column stats: NONE
>>                       Group By Operator
>>                         aggregations: count(1)
>>                         mode: hash
>>                         outputColumnNames: _col0
>>                         Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                         Reduce Output Operator
>>                           sort order:
>>                           Statistics: Num rows: 1 Data size: 8 Basic
>> stats: COMPLETE Column stats: NONE
>>                           value expressions: _col0 (type: bigint)
>>         Reducer 2
>>             Reduce Operator Tree:
>>               Group By Operator
>>                 aggregations: count(VALUE._col0)
>>                 mode: mergepartial
>>                 outputColumnNames: _col0
>>                 Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                 File Output Operator
>>                   compressed: false
>>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                   table:
>>                       input format:
>> org.apache.hadoop.mapred.TextInputFormat
>>                       output format:
>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>                       serde:
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>   Stage: Stage-0
>>     Fetch Operator
>>       limit: -1
>>       Processor Tree:
>>         ListSink
>>
>>
>> Otherwise I don't see much happening:
>>
>> hive>  EXPLAIN SELECT COUNT(1) FROM dummy
>>     > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID)
>>     > WHERE ID = 10;
>> OK
>> STAGE DEPENDENCIES:
>>   Stage-1 is a root stage
>>   Stage-0 depends on stages: Stage-1
>> STAGE PLANS:
>>   Stage: Stage-1
>>     Spark
>>       Edges:
>>         Reducer 2 <- Map 1 (GROUP, 1)
>>       DagName:
>> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:13
>>       Vertices:
>>         Map 1
>>             Map Operator Tree:
>>                 TableScan
>>                   alias: dummy
>>                   Statistics: Num rows: 100000000 Data size: 5662644736
>> Basic stats: COMPLETE Column stats: NONE
>>                   Filter Operator
>>                     predicate: (false and (id = 10)) (type: boolean)
>>                     Statistics: Num rows: 25000000 Data size: 1415661184
>> Basic stats: COMPLETE Column stats: NONE
>>                     Select Operator
>>                       Statistics: Num rows: 25000000 Data size:
>> 1415661184 Basic stats: COMPLETE Column stats: NONE
>>                       Group By Operator
>>                         aggregations: count(1)
>>                         mode: hash
>>                         outputColumnNames: _col0
>>                         Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                         Reduce Output Operator
>>                           sort order:
>>                           Statistics: Num rows: 1 Data size: 8 Basic
>> stats: COMPLETE Column stats: NONE
>>                           value expressions: _col0 (type: bigint)
>>         Reducer 2
>>             Reduce Operator Tree:
>>               Group By Operator
>>                 aggregations: count(VALUE._col0)
>>                 mode: mergepartial
>>                 outputColumnNames: _col0
>>                 Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                 File Output Operator
>>                   compressed: false
>>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
>> COMPLETE Column stats: NONE
>>                   table:
>>                       input format:
>> org.apache.hadoop.mapred.TextInputFormat
>>                       output format:
>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>                       serde:
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>   Stage: Stage-0
>>     Fetch Operator
>>       limit: -1
>>       Processor Tree:
>>         ListSink
>>
>> In general, in my experience, bucketing is the one area where ORC tables
>> really come in handy.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 May 2016 at 13:38, no jihun <jeesim2@gmail.com> wrote:
>>
>>> Ah, as I mentioned, the field types of both action_id and classifier are
>>> STRING, and I cannot change them.
>>>
>>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>>> STORED AS ORC
>>>
>>> I use two fields for hashing and then bucketing because neither field on
>>> its own is well distributed.
>>>
>>> My concern is not about a strong hash source, but about how I can
>>> tablesample into the right bucket by the field values provided in the
>>> WHERE clause.
>>>
>>> When I cluster by string fields, which one is right for tablesample?
>>> 1. provide the fields:
>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
>>>
>>> 2. provide the values:
>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON 'aaa', 'bbb')
>>>
>>> On 14 May 2016 at 20:48, "Mich Talebzadeh" <mich.talebzadeh@gmail.com>
>>> wrote:
>>>
>>>> Can action_id be created as a numeric column:
>>>>
>>>> CREATE TABLE X ( action_id bigint,  ..)
>>>>
>>>> Bucketing or hash partitioning best works on numeric columns with high
>>>> cardinality (say a primary key).
>>>>
>>>> From my old notes:
>>>>
>>>> Bucketing in Hive refers to hash partitioning, where a hashing function
>>>> is applied. Like an RDBMS such as Oracle, Hive will apply a linear hashing
>>>> algorithm to prevent data from clustering within specific partitions.
>>>> Hashing is very effective if the column selected for bucketing has very
>>>> high selectivity, like an ID column where selectivity (select
>>>> count(distinct(column))/count(column) ) = 1.  In this case, the
>>>> created partitions/files will be as evenly sized as possible. In a
>>>> nutshell, bucketing is a method to get data evenly distributed over many
>>>> partitions/files.  One should define the number of buckets as a power of
>>>> two -- 2^n, like 2, 4, 8, 16 etc -- to achieve best results. Again, bucketing
>>>> will help concurrency in Hive. It may even allow a partition-wise join, i.e.
>>>> a join between two tables that are bucketed on the same column with the
>>>> same number of buckets (has anyone tried this?)
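>>>>
>>>> For example, the selectivity check described above can be written as
>>>> (illustrative, using your table X):
>>>>
>>>> SELECT COUNT(DISTINCT action_id) / COUNT(action_id) AS selectivity
>>>> FROM X;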
>>>>
>>>>
>>>>
>>>> One more thing. When one defines the number of buckets at table
>>>> creation time in Hive, the number of partitions/files will be fixed. In
>>>> contrast, with partitioning you do not have this limitation.
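>>>>
>>>> For contrast, a partitioned variant might look like this sketch (the
>>>> partition column ingest_date is illustrative, not from your table):
>>>>
>>>> CREATE TABLE X_part (action_id string, classifier string)
>>>> PARTITIONED BY (ingest_date string)
>>>> STORED AS ORC;
>>>>
>>>> New partitions are added as new ingest_date values arrive, so the
>>>> partition count is not fixed at creation time.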
>>>>
>>>> Can you do
>>>>
>>>> show create table X
>>>>
>>>> and send the output, please.
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 14 May 2016 at 12:23, no jihun <jeesim2@gmail.com> wrote:
>>>>
>>>>> Hello.
>>>>>
>>>>> I want to ask the correct bucketing and tablesample way.
>>>>>
>>>>> There is a table X which I created by
>>>>>
>>>>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>>>>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>>>>> STORED AS ORC
>>>>>
>>>>> Then I inserted 500M of rows into X by
>>>>>
>>>>> set hive.enforce.bucketing=true;
>>>>> INSERT OVERWRITE TABLE X SELECT * FROM X_RAW
>>>>>
>>>>> Then I want to count or search some rows with condition. roughly,
>>>>>
>>>>> SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
>>>>>
>>>>> But I had better use tablesample, as I clustered X by (action_id,
>>>>> classifier). So the better query would be
>>>>>
>>>>> SELECT COUNT(*) FROM X
>>>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>>>> WHERE action_id='aaa' AND classifier='bbb'
>>>>>
>>>>> Is there anything wrong above? I cannot find any performance gain
>>>>> between these two queries.
>>>>>
>>>>> query 1 and RESULT (with no tablesample)
>>>>>
>>>>> SELECT COUNT(*) FROM X
>>>>> WHERE action_id='aaa' and classifier='bbb'
>>>>>
>>>>> --------------------------------------------------------------------------------
>>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>>>>> --------------------------------------------------------------------------------
>>>>> Map 1 ..........   SUCCEEDED    256        256        0        0       0       0
>>>>> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
>>>>> --------------------------------------------------------------------------------
>>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.35 s
>>>>> --------------------------------------------------------------------------------
>>>>> It scans the full data.
>>>>>
>>>>> query 2 and RESULT
>>>>>
>>>>> SELECT COUNT(*) FROM X
>>>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>>>> WHERE action_id='aaa' and classifier='bbb'
>>>>>
>>>>> --------------------------------------------------------------------------------
>>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>>>>> --------------------------------------------------------------------------------
>>>>> Map 1 ..........   SUCCEEDED    256        256        0        0       0       0
>>>>> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
>>>>> --------------------------------------------------------------------------------
>>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.82 s
>>>>> --------------------------------------------------------------------------------
>>>>> It ALSO scans the full data.
>>>>>
>>>>> query 2 RESULT THAT I EXPECTED
>>>>>
>>>>> The result I expected is something like the following
>>>>> (one map task, and relatively faster than without tablesample):
>>>>> --------------------------------------------------------------------------------
>>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>>>>> --------------------------------------------------------------------------------
>>>>> Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
>>>>> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
>>>>> --------------------------------------------------------------------------------
>>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 3.xx s
>>>>> --------------------------------------------------------------------------------
>>>>>
>>>>> The values of action_id and classifier are well distributed and there is
>>>>> no skewed data.
>>>>>
>>>>> So I want to ask you: what is the correct query to prune and target a
>>>>> specific bucket by multiple columns?
>>>>>
>>>>
>>>>
>>
>
>
> --
> ----------------------------------------------
> Jihun No ( 노지훈 )
> ----------------------------------------------
> Twitter          : @nozisim
> Facebook       : nozisim
> Website         : http://jeesim2.godohosting.com
>
> ---------------------------------------------------------------------------------
> Market Apps   : android market products
> (https://market.android.com/developer?pub=%EB%85%B8%EC%A7%80%ED%9B%88)
>
