hive-user mailing list archives

From no jihun <jees...@gmail.com>
Subject Re: clustered bucket and tablesample
Date Mon, 16 May 2016 05:12:51 GMT
Thank you so much Talebzadeh.

I have filed an issue: https://issues.apache.org/jira/browse/HIVE-13766


And I changed the clustering column to int, based on a hash of the string column.

Something like:

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Derive a stable int clustering key from the string classifier.
static int clusteringKey(String actionClassifier) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] byteData = md.digest(actionClassifier.getBytes(StandardCharsets.UTF_8));
    BigInteger bigInteger = new BigInteger(byteData);
    // % Integer.MAX_VALUE first: Math.abs(Integer.MIN_VALUE) alone would stay negative.
    return Math.abs(bigInteger.intValue() % Integer.MAX_VALUE); // <---- cluster will use this value.
}
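
For illustration, here is a hedged sketch of why an int key behaves
predictably: the EXPLAIN output further down this thread shows Hive
assigning buckets as (hash(...) & 2147483647) % numBuckets, and to my
understanding the hash of an int is the value itself, so the target bucket
can be computed client-side. The method name above and the 256-bucket count
are just this example's assumptions.

// Which of 256 buckets the derived key should land in. The index is
// 0-based, while TABLESAMPLE's BUCKET n is 1-based, hence the +1.
int key = clusteringKey("my_action,cl_900");
int bucket = ((key & Integer.MAX_VALUE) % 256) + 1;
System.out.println("TABLESAMPLE(BUCKET " + bucket + " OUT OF 256 ...)");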


Thanks again.


On May 16, 2016 at 5:55 AM, "Mich Talebzadeh" <mich.talebzadeh@gmail.com> wrote:

> Hi,
>
> OK I tried your table.
>
> 0: jdbc:hive2://rhes564:10010/default> describe formatted bucket_x;
> OK
> +-------------------------------+-----------------------------------------------------------+-----------------------------+--+
> |           col_name            |                         data_type                         |           comment           |
> +-------------------------------+-----------------------------------------------------------+-----------------------------+--+
> | # col_name                    | data_type                                                 | comment                     |
> |                               | NULL                                                      | NULL                        |
> | classifier                    | string                                                    |                             |
> |                               | NULL                                                      | NULL                        |
> | # Detailed Table Information  | NULL                                                      | NULL                        |
> | Database:                     | test                                                      | NULL                        |
> | Owner:                        | hduser                                                    | NULL                        |
> | CreateTime:                   | Sun May 15 19:10:52 BST 2016                              | NULL                        |
> | LastAccessTime:               | UNKNOWN                                                   | NULL                        |
> | Retention:                    | 0                                                         | NULL                        |
> | Location:                     | hdfs://rhes564:9000/user/hive/warehouse/test.db/bucket_x  | NULL                        |
> | Table Type:                   | MANAGED_TABLE                                             | NULL                        |
> | Table Parameters:             | NULL                                                      | NULL                        |
> |                               | COLUMN_STATS_ACCURATE                                     | {\"BASIC_STATS\":\"true\"}  |
> |                               | numFiles                                                  | 256                         |
> |                               | numRows                                                   | 1000                        |
> |                               | rawDataSize                                               | 93336                       |
> |                               | totalSize                                                 | 57025                       |
> |                               | transient_lastDdlTime                                     | 1463338829                  |
> |                               | NULL                                                      | NULL                        |
> | # Storage Information         | NULL                                                      | NULL                        |
> | SerDe Library:                | org.apache.hadoop.hive.ql.io.orc.OrcSerde                 | NULL                        |
> | InputFormat:                  | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat           | NULL                        |
> | OutputFormat:                 | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat          | NULL                        |
> | Compressed:                   | No                                                        | NULL                        |
> | Num Buckets:                  | 256                                                       | NULL                        |
> | Bucket Columns:               | [classifier]                                              | NULL                        |
> | Sort Columns:                 | []                                                        | NULL                        |
> | Storage Desc Params:          | NULL                                                      | NULL                        |
> |                               | serialization.format                                      | 1                           |
> +-------------------------------+-----------------------------------------------------------+-----------------------------+--+
>
>
> select count(1) from bucket_x;
> INFO  : OK
> +-------+--+
> |  c0   |
> +-------+--+
> | 1000  |
> +-------+--+
>
> Now let us try to iterate through all those 256 buckets with the
> following. Create the script ${IN_FILE} and run it against Hive. In my
> case it is pretty fast as I use Hive on the Spark engine.
>
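> # ksh: appends one per-bucket TABLESAMPLE count query to ${IN_FILE};
> # run the generated file against Hive (e.g. with beeline -f ${IN_FILE}).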
> function genrandomnumber {
> integer BUCKETNUMBER=1
> integer BUCKETS=256
> while ((BUCKETNUMBER <= BUCKETS))
> do
>    echo "SELECT ${BUCKETNUMBER} AS BucketNumber, COUNT(1) AS Occurance
> FROM bucket_x tablesample(BUCKET ${BUCKETNUMBER} OUT of ${BUCKETS} ON
> classifier='cl_900');" >> ${IN_FILE}
>    ((BUCKETNUMBER = BUCKETNUMBER + 1))
> done
> }
>
> And the results don't make sense!
>
> | bucketnumber  | occurance  |
> +---------------+------------+--+
> | 1             | 999        |
> +---------------+------------+--+
> | bucketnumber  | occurance  |
> +---------------+------------+--+
> | 2             | 1          |
> +---------------+------------+--+
> | bucketnumber  | occurance  |
> +---------------+------------+--+
> | 3             | 0          |
> +---------------+------------+--+
> | bucketnumber  | occurance  |
> +---------------+------------+--+
> | 4             | 0          |
>
> So apparently bucket 1 has 999 entries and bucket 2 has 1.
>
> Let us try it for classifier='cl_103'
>
> | bucketnumber  | occurance  |
> +---------------+------------+--+
> | 1             | 999        |
> +---------------+------------+--+
> | bucketnumber  | occurance  |
> +---------------+------------+--+
> | 2             | 1          |
> +---------------+------------+--+
> | bucketnumber  | occurance  |
> +---------------+------------+--+
> | 3             | 0          |
> +---------------+------------+--+
> The same crap. To me, hash partitioning on a string column is
> unpredictable. With an integer it is fine. I believe there is an underlying
> bug here. The other alternative is to use an integer as a surrogate column
> for hash partitioning, like a sequence in Oracle or an identity in
> Sybase/MSSQL.
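>
> A minimal sketch of one plausible reading of these numbers: the EXPLAIN
> plan further down this thread shows the sampling predicate as
> (((hash((id = 10)) & 2147483647) % 256) = 0), i.e. TABLESAMPLE ON a
> comparison hashes the boolean result of that comparison, not the column
> value. Assuming Hive hashes false to 0 and true to 1 (an assumption about
> its internals, not confirmed here), the observed 999/1 split falls out
> directly:
>
> public class BucketOfPredicate {
>     // 1-based bucket for a boolean sampling expression, mirroring the
>     // (hash & 2147483647) % numBuckets formula in the EXPLAIN output.
>     static int bucketFor(boolean predicate, int numBuckets) {
>         int hash = predicate ? 1 : 0; // assumed boolean hash values
>         return ((hash & Integer.MAX_VALUE) % numBuckets) + 1;
>     }
>     public static void main(String[] args) {
>         // 999 rows where classifier != 'cl_900': hash(false) = 0 -> bucket 1
>         System.out.println(bucketFor(false, 256)); // prints 1
>         // the one row where classifier = 'cl_900': hash(true) = 1 -> bucket 2
>         System.out.println(bucketFor(true, 256));  // prints 2
>     }
> }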
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 May 2016 at 12:29, no jihun <jeesim2@gmail.com> wrote:
>
>> OK Talebzadeh thanks.
>>
>> Have you ever tried tablesample with a string value hash?
>> Something like tablesample(bucket 1 out of 256 on
>> some_field='somedata....')
>>
>>
>> I wrote a full scenario.
>>
>> # table creation
>> hive> CREATE TABLE `bucket_x` (
>>     >   `classifier` string)
>>     > CLUSTERED BY ( classifier)
>>     > INTO 256 BUCKETS
>>     > STORED AS ORC;
>> OK
>>
>>
>> *# check the option before the data insert*
>> hive> set hive.enforce.bucketing;
>> hive.enforce.bucketing=true
>>
>>
>> *# insert 1,000 rows*
>> (the number of distinct values is also 1,000)
>>
>> hive > insert into bucket_x
>> values ('cl_0'),('cl_1'),('cl_2'),('cl_3'),('cl_4'),('cl_5'),('cl_6'),('cl_7'),('cl_8'),('cl_9'),('cl_10'),('cl_11'),('cl_12'),('cl_13'),('cl_14'),('cl_15'),('cl_16'),('cl_17'),('cl_18'),('cl_19'),('cl_20'),('cl_21'),('cl_22'),('cl_23'),('cl_24'),('cl_25'),('cl_26'),('cl_27'),('cl_28'),('cl_29'),('cl_30'),('cl_31'),('cl_32'),('cl_33'),('cl_34'),('cl_35'),('cl_36'),('cl_37'),('cl_38'),('cl_39'),('cl_40'),('cl_41'),('cl_42'),('cl_43'),('cl_44'),('cl_45'),('cl_46'),('cl_47'),('cl_48'),('cl_49'),('cl_50'),('cl_51'),('cl_52'),('cl_53'),('cl_54'),('cl_55'),('cl_56'),('cl_57'),('cl_58'),('cl_59'),('cl_60'),('cl_61'),('cl_62'),('cl_63'),('cl_64'),('cl_65'),('cl_66'),('cl_67'),('cl_68'),('cl_69'),('cl_70'),('cl_71'),('cl_72'),('cl_73'),('cl_74'),('cl_75'),('cl_76'),('cl_77'),('cl_78'),('cl_79'),('cl_80'),('cl_81'),('cl_82'),('cl_83'),('cl_84'),('cl_85'),('cl_86'),('cl_87'),('cl_88'),('cl_89'),('cl_90'),('cl_91'),('cl_92'),('cl_93'),('cl_94'),('cl_95'),('cl_96'),('cl_97'),('cl_98'),('cl_99'),('cl_100'),('cl_101'),('cl_102'),('cl_103'),('cl_104'),('cl_105'),('cl_106'),('cl_107'),('cl_108'),('cl_109'),('cl_110'),('cl_111'),('cl_112'),('cl_113'),('cl_114'),('cl_115'),('cl_116'),('cl_117'),('cl_118'),('cl_119'),('cl_120'),('cl_121'),('cl_122'),('cl_123'),('cl_124'),('cl_125'),('cl_126'),('cl_127'),('cl_128'),('cl_129'),('cl_130'),('cl_131'),('cl_132'),('cl_133'),('cl_134'),('cl_135'),('cl_136'),('cl_137'),('cl_138'),('cl_139'),('cl_140'),('cl_141'),('cl_142'),('cl_143'),('cl_144'),('cl_145'),('cl_146'),('cl_147'),('cl_148'),('cl_149'),('cl_150'),('cl_151'),('cl_152'),('cl_153'),('cl_154'),('cl_155'),('cl_156'),('cl_157'),('cl_158'),('cl_159'),('cl_160'),('cl_161'),('cl_162'),('cl_163'),('cl_164'),('cl_165'),('cl_166'),('cl_167'),('cl_168'),('cl_169'),('cl_170'),('cl_171'),('cl_172'),('cl_173'),('cl_174'),('cl_175'),('cl_176'),('cl_177'),('cl_178'),('cl_179'),('cl_180'),('cl_181'),('cl_182'),('cl_183'),('cl_184'),('cl_185'),('cl_186'),('cl_187'),('cl_188'),('cl_189'),('cl_190'),('cl_191'),('cl_192'),('cl_193'),('cl_194'),('cl_195'),('cl_196'),('cl_197'),('cl_198'),('cl_199'),('cl_200'),('cl_201'),('cl_202'),('cl_203'),('cl_204'),('cl_205'),('cl_206'),('cl_207'),('cl_208'),('cl_209'),('cl_210'),('cl_211'),('cl_212'),('cl_213'),('cl_214'),('cl_215'),('cl_216'),('cl_217'),('cl_218'),('cl_219'),('cl_220'),('cl_221'),('cl_222'),('cl_223'),('cl_224'),('cl_225'),('cl_226'),('cl_227'),('cl_228'),('cl_229'),('cl_230'),('cl_231'),('cl_232'),('cl_233'),('cl_234'),('cl_235'),('cl_236'),('cl_237'),('cl_238'),('cl_239'),('cl_240'),('cl_241'),('cl_242'),('cl_243'),('cl_244'),('cl_245'),('cl_246'),('cl_247'),('cl_248'),('cl_249'),('cl_250'),('cl_251'),('cl_252'),('cl_253'),('cl_254'),('cl_255'),('cl_256'),('cl_257'),('cl_258'),('cl_259'),('cl_260'),('cl_261'),('cl_262'),('cl_263'),('cl_264'),('cl_265'),('cl_266'),('cl_267'),('cl_268'),('cl_269'),('cl_270'),('cl_271'),('cl_272'),('cl_273'),('cl_274'),('cl_275'),('cl_276'),('cl_277'),('cl_278'),('cl_279'),('cl_280'),('cl_281'),('cl_282'),('cl_283'),('cl_284'),('cl_285'),('cl_286'),('cl_287'),('cl_288'),('cl_289'),('cl_290'),('cl_291'),('cl_292'),('cl_293'),('cl_294'),('cl_295'),('cl_296'),('cl_297'),('cl_298'),('cl_299'),('cl_300'),('cl_301'),('cl_302'),('cl_303'),('cl_304'),('cl_305'),('cl_306'),('cl_307'),('cl_308'),('cl_309'),('cl_310'),('cl_311'),('cl_312'),('cl_313'),('cl_314'),('cl_315'),('cl_316'),('cl_317'),('cl_318'),('cl_319'),('cl_320'),('cl_321'),('cl_322'),('cl_323'),('cl_324'),('cl_325'),('cl_326'),('cl_327'),('cl_328'),('cl_329'),('cl_330'),('cl_331'),('c
l_332'),('cl_333'),('cl_334'),('cl_335'),('cl_336'),('cl_337'),('cl_338'),('cl_339'),('cl_340'),('cl_341'),('cl_342'),('cl_343'),('cl_344'),('cl_345'),('cl_346'),('cl_347'),('cl_348'),('cl_349'),('cl_350'),('cl_351'),('cl_352'),('cl_353'),('cl_354'),('cl_355'),('cl_356'),('cl_357'),('cl_358'),('cl_359'),('cl_360'),('cl_361'),('cl_362'),('cl_363'),('cl_364'),('cl_365'),('cl_366'),('cl_367'),('cl_368'),('cl_369'),('cl_370'),('cl_371'),('cl_372'),('cl_373'),('cl_374'),('cl_375'),('cl_376'),('cl_377'),('cl_378'),('cl_379'),('cl_380'),('cl_381'),('cl_382'),('cl_383'),('cl_384'),('cl_385'),('cl_386'),('cl_387'),('cl_388'),('cl_389'),('cl_390'),('cl_391'),('cl_392'),('cl_393'),('cl_394'),('cl_395'),('cl_396'),('cl_397'),('cl_398'),('cl_399'),('cl_400'),('cl_401'),('cl_402'),('cl_403'),('cl_404'),('cl_405'),('cl_406'),('cl_407'),('cl_408'),('cl_409'),('cl_410'),('cl_411'),('cl_412'),('cl_413'),('cl_414'),('cl_415'),('cl_416'),('cl_417'),('cl_418'),('cl_419'),('cl_420'),('cl_421'),('cl_422'),('cl_423'),('cl_424'),('cl_425'),('cl_426'),('cl_427'),('cl_428'),('cl_429'),('cl_430'),('cl_431'),('cl_432'),('cl_433'),('cl_434'),('cl_435'),('cl_436'),('cl_437'),('cl_438'),('cl_439'),('cl_440'),('cl_441'),('cl_442'),('cl_443'),('cl_444'),('cl_445'),('cl_446'),('cl_447'),('cl_448'),('cl_449'),('cl_450'),('cl_451'),('cl_452'),('cl_453'),('cl_454'),('cl_455'),('cl_456'),('cl_457'),('cl_458'),('cl_459'),('cl_460'),('cl_461'),('cl_462'),('cl_463'),('cl_464'),('cl_465'),('cl_466'),('cl_467'),('cl_468'),('cl_469'),('cl_470'),('cl_471'),('cl_472'),('cl_473'),('cl_474'),('cl_475'),('cl_476'),('cl_477'),('cl_478'),('cl_479'),('cl_480'),('cl_481'),('cl_482'),('cl_483'),('cl_484'),('cl_485'),('cl_486'),('cl_487'),('cl_488'),('cl_489'),('cl_490'),('cl_491'),('cl_492'),('cl_493'),('cl_494'),('cl_495'),('cl_496'),('cl_497'),('cl_498'),('cl_499'),('cl_500'),('cl_501'),('cl_502'),('cl_503'),('cl_504'),('cl_505'),('cl_506'),('cl_507'),('cl_508'),('cl_509'),('cl_510'),('cl_511'),('cl_512'),('cl_513'),('cl_514'),('cl_515'),('cl_516'),('cl_517'),('cl_518'),('cl_519'),('cl_520'),('cl_521'),('cl_522'),('cl_523'),('cl_524'),('cl_525'),('cl_526'),('cl_527'),('cl_528'),('cl_529'),('cl_530'),('cl_531'),('cl_532'),('cl_533'),('cl_534'),('cl_535'),('cl_536'),('cl_537'),('cl_538'),('cl_539'),('cl_540'),('cl_541'),('cl_542'),('cl_543'),('cl_544'),('cl_545'),('cl_546'),('cl_547'),('cl_548'),('cl_549'),('cl_550'),('cl_551'),('cl_552'),('cl_553'),('cl_554'),('cl_555'),('cl_556'),('cl_557'),('cl_558'),('cl_559'),('cl_560'),('cl_561'),('cl_562'),('cl_563'),('cl_564'),('cl_565'),('cl_566'),('cl_567'),('cl_568'),('cl_569'),('cl_570'),('cl_571'),('cl_572'),('cl_573'),('cl_574'),('cl_575'),('cl_576'),('cl_577'),('cl_578'),('cl_579'),('cl_580'),('cl_581'),('cl_582'),('cl_583'),('cl_584'),('cl_585'),('cl_586'),('cl_587'),('cl_588'),('cl_589'),('cl_590'),('cl_591'),('cl_592'),('cl_593'),('cl_594'),('cl_595'),('cl_596'),('cl_597'),('cl_598'),('cl_599'),('cl_600'),('cl_601'),('cl_602'),('cl_603'),('cl_604'),('cl_605'),('cl_606'),('cl_607'),('cl_608'),('cl_609'),('cl_610'),('cl_611'),('cl_612'),('cl_613'),('cl_614'),('cl_615'),('cl_616'),('cl_617'),('cl_618'),('cl_619'),('cl_620'),('cl_621'),('cl_622'),('cl_623'),('cl_624'),('cl_625'),('cl_626'),('cl_627'),('cl_628'),('cl_629'),('cl_630'),('cl_631'),('cl_632'),('cl_633'),('cl_634'),('cl_635'),('cl_636'),('cl_637'),('cl_638'),('cl_639'),('cl_640'),('cl_641'),('cl_642'),('cl_643'),('cl_644'),('cl_645'),('cl_646'),('cl_647'),('cl_648'),('cl_649'),('cl_650'),('cl_651'),('cl_652'),('cl_653'),('cl_654'),('cl_
655'),('cl_656'),('cl_657'),('cl_658'),('cl_659'),('cl_660'),('cl_661'),('cl_662'),('cl_663'),('cl_664'),('cl_665'),('cl_666'),('cl_667'),('cl_668'),('cl_669'),('cl_670'),('cl_671'),('cl_672'),('cl_673'),('cl_674'),('cl_675'),('cl_676'),('cl_677'),('cl_678'),('cl_679'),('cl_680'),('cl_681'),('cl_682'),('cl_683'),('cl_684'),('cl_685'),('cl_686'),('cl_687'),('cl_688'),('cl_689'),('cl_690'),('cl_691'),('cl_692'),('cl_693'),('cl_694'),('cl_695'),('cl_696'),('cl_697'),('cl_698'),('cl_699'),('cl_700'),('cl_701'),('cl_702'),('cl_703'),('cl_704'),('cl_705'),('cl_706'),('cl_707'),('cl_708'),('cl_709'),('cl_710'),('cl_711'),('cl_712'),('cl_713'),('cl_714'),('cl_715'),('cl_716'),('cl_717'),('cl_718'),('cl_719'),('cl_720'),('cl_721'),('cl_722'),('cl_723'),('cl_724'),('cl_725'),('cl_726'),('cl_727'),('cl_728'),('cl_729'),('cl_730'),('cl_731'),('cl_732'),('cl_733'),('cl_734'),('cl_735'),('cl_736'),('cl_737'),('cl_738'),('cl_739'),('cl_740'),('cl_741'),('cl_742'),('cl_743'),('cl_744'),('cl_745'),('cl_746'),('cl_747'),('cl_748'),('cl_749'),('cl_750'),('cl_751'),('cl_752'),('cl_753'),('cl_754'),('cl_755'),('cl_756'),('cl_757'),('cl_758'),('cl_759'),('cl_760'),('cl_761'),('cl_762'),('cl_763'),('cl_764'),('cl_765'),('cl_766'),('cl_767'),('cl_768'),('cl_769'),('cl_770'),('cl_771'),('cl_772'),('cl_773'),('cl_774'),('cl_775'),('cl_776'),('cl_777'),('cl_778'),('cl_779'),('cl_780'),('cl_781'),('cl_782'),('cl_783'),('cl_784'),('cl_785'),('cl_786'),('cl_787'),('cl_788'),('cl_789'),('cl_790'),('cl_791'),('cl_792'),('cl_793'),('cl_794'),('cl_795'),('cl_796'),('cl_797'),('cl_798'),('cl_799'),('cl_800'),('cl_801'),('cl_802'),('cl_803'),('cl_804'),('cl_805'),('cl_806'),('cl_807'),('cl_808'),('cl_809'),('cl_810'),('cl_811'),('cl_812'),('cl_813'),('cl_814'),('cl_815'),('cl_816'),('cl_817'),('cl_818'),('cl_819'),('cl_820'),('cl_821'),('cl_822'),('cl_823'),('cl_824'),('cl_825'),('cl_826'),('cl_827'),('cl_828'),('cl_829'),('cl_830'),('cl_831'),('cl_832'),('cl_833'),('cl_834'),('cl_835'),('cl_836'),('cl_837'),('cl_838'),('cl_839'),('cl_840'),('cl_841'),('cl_842'),('cl_843'),('cl_844'),('cl_845'),('cl_846'),('cl_847'),('cl_848'),('cl_849'),('cl_850'),('cl_851'),('cl_852'),('cl_853'),('cl_854'),('cl_855'),('cl_856'),('cl_857'),('cl_858'),('cl_859'),('cl_860'),('cl_861'),('cl_862'),('cl_863'),('cl_864'),('cl_865'),('cl_866'),('cl_867'),('cl_868'),('cl_869'),('cl_870'),('cl_871'),('cl_872'),('cl_873'),('cl_874'),('cl_875'),('cl_876'),('cl_877'),('cl_878'),('cl_879'),('cl_880'),('cl_881'),('cl_882'),('cl_883'),('cl_884'),('cl_885'),('cl_886'),('cl_887'),('cl_888'),('cl_889'),('cl_890'),('cl_891'),('cl_892'),('cl_893'),('cl_894'),('cl_895'),('cl_896'),('cl_897'),('cl_898'),('cl_899'),('cl_900'),('cl_901'),('cl_902'),('cl_903'),('cl_904'),('cl_905'),('cl_906'),('cl_907'),('cl_908'),('cl_909'),('cl_910'),('cl_911'),('cl_912'),('cl_913'),('cl_914'),('cl_915'),('cl_916'),('cl_917'),('cl_918'),('cl_919'),('cl_920'),('cl_921'),('cl_922'),('cl_923'),('cl_924'),('cl_925'),('cl_926'),('cl_927'),('cl_928'),('cl_929'),('cl_930'),('cl_931'),('cl_932'),('cl_933'),('cl_934'),('cl_935'),('cl_936'),('cl_937'),('cl_938'),('cl_939'),('cl_940'),('cl_941'),('cl_942'),('cl_943'),('cl_944'),('cl_945'),('cl_946'),('cl_947'),('cl_948'),('cl_949'),('cl_950'),('cl_951'),('cl_952'),('cl_953'),('cl_954'),('cl_955'),('cl_956'),('cl_957'),('cl_958'),('cl_959'),('cl_960'),('cl_961'),('cl_962'),('cl_963'),('cl_964'),('cl_965'),('cl_966'),('cl_967'),('cl_968'),('cl_969'),('cl_970'),('cl_971'),('cl_972'),('cl_973'),('cl_974'),('cl_975'),('cl_976'),('cl_977'),('cl_97
8'),('cl_979'),('cl_980'),('cl_981'),('cl_982'),('cl_983'),('cl_984'),('cl_985'),('cl_986'),('cl_987'),('cl_988'),('cl_989'),('cl_990'),('cl_991'),('cl_992'),('cl_993'),('cl_994'),('cl_995'),('cl_996'),('cl_997'),('cl_998'),('cl_999')
>>
>>
>> # check for total count
>> *hive> select count(*) from bucket_x;*
>> OK
>> *1000 // <-- returned 1,000. correct.*
>>
>> # count a specific classifier (without tablesample)
>> *hive> select count(*) from bucket_x where classifier='cl_900';*
>> Query ID = irteam_20160515201754_d381aff8-16ef-48be-b829-f1a01a530521
>> Total jobs = 1
>> Launching Job 1 out of 1
>>
>>
>> Status: Running (Executing on YARN cluster with App id
>> application_1462971998082_0025)
>>
>>
>> --------------------------------------------------------------------------------
>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
>>  KILLED
>>
>> --------------------------------------------------------------------------------
>> Map 1 ..........   SUCCEEDED      2          2        0        0       1
>>       0
>> Reducer 2 ......   SUCCEEDED      1          1        0        0       0
>>       0
>>
>> --------------------------------------------------------------------------------
>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 7.63
>> s
>>
>> --------------------------------------------------------------------------------
>> OK
>> *1 // <-- returned 1. correct.*
>> Time taken: 8.064 seconds, Fetched: 1 row(s)
>>
>>
>> # count a specific classifier (with tablesample)
>>
>> *hive> select count(*) from bucket_x tablesample(bucket 1 out of 256 on
>> classifier='cl_900') where classifier='cl_900';*
>> Query ID = irteam_20160515201913_91166686-b98c-40a4-990b-690c41c69c61
>> Total jobs = 1
>> Launching Job 1 out of 1
>>
>>
>> Status: Running (Executing on YARN cluster with App id
>> application_1462971998082_0025)
>>
>>
>> --------------------------------------------------------------------------------
>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
>>  KILLED
>>
>> --------------------------------------------------------------------------------
>> Map 1 ..........   SUCCEEDED      1          1        0        0       0
>>       0
>> Reducer 2 ......   SUCCEEDED      1          1        0        0       0
>>       0
>>
>> --------------------------------------------------------------------------------
>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 3.81
>> s
>>
>> --------------------------------------------------------------------------------
>> OK
>> *0 // <--- it returns zero. WRONG! It should return 1 row.*
>> Time taken: 4.216 seconds, Fetched: 1 row(s)
>>
>> # so I checked the whole tablesampled data.
>> *hive> select * from bucket_x tablesample(bucket 1 out of 256 on
>> classifier='cl_900');*
>> OK
>> cl_974
>> cl_336
>> cl_457
>> ...
>> cl_852
>> cl_698
>> cl_731
>> Time taken: 0.053 seconds, Fetched: 999 row(s)
>> *// <-- it returned 999 ROWS *
>> *// ??? 999 ???*
>> *// exactly everything except the row I want to find.*
>>
>>
>> I think I am doing tablesample totally wrong in the case of string values.
>> Any idea?
>>
>> 2016-05-15 2:56 GMT+09:00 Mich Talebzadeh <mich.talebzadeh@gmail.com>:
>>
>>> This is your code
>>>
>>> SELECT COUNT(*) FROM X
>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>> WHERE action_id='aaa' AND classifier='bbb'
>>>
>>> Well, I have a table dummy with 100 million rows imported from Oracle in
>>> ORC format
>>>
>>> hive> show create table dummy;
>>> OK
>>> CREATE TABLE `dummy`(
>>>   `id` int,
>>>   `clustered` int,
>>>   `scattered` int,
>>>   `randomised` int,
>>>   `random_string` varchar(50),
>>>   `small_vc` varchar(10),
>>>   `padding` varchar(10))
>>> CLUSTERED BY (id)
>>> INTO 256 BUCKETS
>>> ROW FORMAT SERDE
>>>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>> STORED AS INPUTFORMAT
>>>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>> OUTPUTFORMAT
>>>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>> LOCATION
>>>   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/dummy'
>>> TBLPROPERTIES (
>>>   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
>>>   'numFiles'='1',
>>>   'numRows'='100000000',
>>>   'orc.bloom.filter.columns'='ID',
>>>   'orc.bloom.filter.fpp'='0.05',
>>>   'orc.compress'='SNAPPY',
>>>   'orc.create.index'='true',
>>>   'orc.row.index.stride'='10000',
>>>   'orc.stripe.size'='16777216',
>>>   'rawDataSize'='0',
>>>   'totalSize'='5662644579',
>>>   'transient_lastDdlTime'='1463245925')
>>>
>>>
>>> Let us look at the plan for the following two cases. First, a simple case:
>>>
>>> hive> EXPLAIN SELECT COUNT(1) FROM dummy
>>>     > where id = 20;
>>> OK
>>> STAGE DEPENDENCIES:
>>>   Stage-1 is a root stage
>>>   Stage-0 depends on stages: Stage-1
>>> STAGE PLANS:
>>>   Stage: Stage-1
>>>     Spark
>>>       Edges:
>>>         Reducer 2 <- Map 1 (GROUP, 1)
>>>       DagName:
>>> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:7
>>>       Vertices:
>>>         Map 1
>>>             Map Operator Tree:
>>>                 TableScan
>>>                   alias: dummy
>>>                   Statistics: Num rows: 100000000 Data size: 5662644736
>>> Basic stats: COMPLETE Column stats: NONE
>>>                   Filter Operator
>>>                     predicate: (id = 20) (type: boolean)
>>> Statistics: Num rows: 50000000 Data size: 2831322368 Basic stats: COMPLETE
>>> Column stats: NONE
>>>                     Select Operator
>>>                       Statistics: Num rows: 50000000 Data size:
>>> 2831322368 Basic stats: COMPLETE Column stats: NONE
>>>                       Group By Operator
>>>                         aggregations: count(1)
>>>                         mode: hash
>>>                         outputColumnNames: _col0
>>>                         Statistics: Num rows: 1 Data size: 8 Basic
>>> stats: COMPLETE Column stats: NONE
>>>                         Reduce Output Operator
>>>                           sort order:
>>>                           Statistics: Num rows: 1 Data size: 8 Basic
>>> stats: COMPLETE Column stats: NONE
>>>                           value expressions: _col0 (type: bigint)
>>>         Reducer 2
>>>             Reduce Operator Tree:
>>>               Group By Operator
>>>                 aggregations: count(VALUE._col0)
>>>                 mode: mergepartial
>>>                 outputColumnNames: _col0
>>>                 Statistics: Num rows: 1 Data size: 8 Basic stats:
>>> COMPLETE Column stats: NONE
>>>                 File Output Operator
>>>                   compressed: false
>>>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
>>> COMPLETE Column stats: NONE
>>>                   table:
>>>                       input format:
>>> org.apache.hadoop.mapred.TextInputFormat
>>>                       output format:
>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>                       serde:
>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>   Stage: Stage-0
>>>     Fetch Operator
>>>       limit: -1
>>>       Processor Tree:
>>>         ListSink
>>> Time taken: 0.064 seconds, Fetched: 51 row(s)
>>>
>>>
>>> Now we try with tablesample, but assigning the predicate values inside
>>> the brackets as below, because you are looking in bucket 1 for those
>>> values and you want the optimizer to know that.
>>>
>>>
>>>
>>> hive> EXPLAIN SELECT COUNT(1) FROM dummy
>>>     > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID = 10)
>>>     > ;
>>> OK
>>> STAGE DEPENDENCIES:
>>>   Stage-1 is a root stage
>>>   Stage-0 depends on stages: Stage-1
>>> STAGE PLANS:
>>>   Stage: Stage-1
>>>     Spark
>>>       Edges:
>>>         Reducer 2 <- Map 1 (GROUP, 1)
>>>       DagName:
>>> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:10
>>>       Vertices:
>>>         Map 1
>>>             Map Operator Tree:
>>>                 TableScan
>>>                   alias: dummy
>>>                   Statistics: Num rows: 100000000 Data size: 5662644736
>>> Basic stats: COMPLETE Column stats: NONE
>>>                   Filter Operator
>>>                     predicate: (((hash((id = 10)) & 2147483647) % 256) = 0) (type: boolean)
>>> Statistics: Num rows: 50000000 Data size: 2831322368 Basic stats: COMPLETE
>>> Column stats: NONE
>>>                     Select Operator
>>>                       Statistics: Num rows: 50000000 Data size:
>>> 2831322368 Basic stats: COMPLETE Column stats: NONE
>>>                       Group By Operator
>>>                         aggregations: count(1)
>>>                         mode: hash
>>>                         outputColumnNames: _col0
>>>                         Statistics: Num rows: 1 Data size: 8 Basic
>>> stats: COMPLETE Column stats: NONE
>>>                         Reduce Output Operator
>>>                           sort order:
>>>                           Statistics: Num rows: 1 Data size: 8 Basic
>>> stats: COMPLETE Column stats: NONE
>>>                           value expressions: _col0 (type: bigint)
>>>         Reducer 2
>>>             Reduce Operator Tree:
>>>               Group By Operator
>>>                 aggregations: count(VALUE._col0)
>>>                 mode: mergepartial
>>>                 outputColumnNames: _col0
>>>                 Statistics: Num rows: 1 Data size: 8 Basic stats:
>>> COMPLETE Column stats: NONE
>>>                 File Output Operator
>>>                   compressed: false
>>>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
>>> COMPLETE Column stats: NONE
>>>                   table:
>>>                       input format:
>>> org.apache.hadoop.mapred.TextInputFormat
>>>                       output format:
>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>                       serde:
>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>   Stage: Stage-0
>>>     Fetch Operator
>>>       limit: -1
>>>       Processor Tree:
>>>         ListSink
>>>
>>>
>>> Otherwise I don't see much happening:
>>>
>>> hive>  EXPLAIN SELECT COUNT(1) FROM dummy
>>>     > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID)
>>>     > WHERE ID = 10;
>>> OK
>>> STAGE DEPENDENCIES:
>>>   Stage-1 is a root stage
>>>   Stage-0 depends on stages: Stage-1
>>> STAGE PLANS:
>>>   Stage: Stage-1
>>>     Spark
>>>       Edges:
>>>         Reducer 2 <- Map 1 (GROUP, 1)
>>>       DagName:
>>> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:13
>>>       Vertices:
>>>         Map 1
>>>             Map Operator Tree:
>>>                 TableScan
>>>                   alias: dummy
>>>                   Statistics: Num rows: 100000000 Data size: 5662644736
>>> Basic stats: COMPLETE Column stats: NONE
>>>                   Filter Operator
>>>                     predicate: (false and (id = 10)) (type: boolean)
>>>                     Statistics: Num rows: 25000000 Data size:
>>> 1415661184 Basic stats: COMPLETE Column stats: NONE
>>>                     Select Operator
>>>                       Statistics: Num rows: 25000000 Data size:
>>> 1415661184 Basic stats: COMPLETE Column stats: NONE
>>>                       Group By Operator
>>>                         aggregations: count(1)
>>>                         mode: hash
>>>                         outputColumnNames: _col0
>>>                         Statistics: Num rows: 1 Data size: 8 Basic
>>> stats: COMPLETE Column stats: NONE
>>>                         Reduce Output Operator
>>>                           sort order:
>>>                           Statistics: Num rows: 1 Data size: 8 Basic
>>> stats: COMPLETE Column stats: NONE
>>>                           value expressions: _col0 (type: bigint)
>>>         Reducer 2
>>>             Reduce Operator Tree:
>>>               Group By Operator
>>>                 aggregations: count(VALUE._col0)
>>>                 mode: mergepartial
>>>                 outputColumnNames: _col0
>>>                 Statistics: Num rows: 1 Data size: 8 Basic stats:
>>> COMPLETE Column stats: NONE
>>>                 File Output Operator
>>>                   compressed: false
>>>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
>>> COMPLETE Column stats: NONE
>>>                   table:
>>>                       input format:
>>> org.apache.hadoop.mapred.TextInputFormat
>>>                       output format:
>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>                       serde:
>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>   Stage: Stage-0
>>>     Fetch Operator
>>>       limit: -1
>>>       Processor Tree:
>>>         ListSink
>>>
>>> In general, in my experience, bucketing is the only area where ORC
>>> tables really come in handy.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 14 May 2016 at 13:38, no jihun <jeesim2@gmail.com> wrote:
>>>
>>>> Ah, as I mentioned, the field types of both action_id and classifier
>>>> are STRING, and I cannot change the types.
>>>>
>>>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>>>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>>>> STORED AS ORC
>>>>
>>>> I use two fields for hashing and then bucketing because neither field
>>>> on its own is well distributed.
>>>>
>>>> My concern is not about a strong hash source but about how I can
>>>> tablesample down to the right bucket using the field values provided in
>>>> the WHERE clause.
>>>>
>>>> When I cluster by string fields, which one is right for tablesample?
>>>> 1. provide fields
>>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>>>
>>>> 2. provide values
>>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  'aaa', 'bbb')
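>>>>
>>>> A hedged sketch of why option 2 looks suspect: sampling ON constant
>>>> values hashes the same thing for every row, so the sample is
>>>> all-or-nothing rather than targeted at the bucket that holds those
>>>> values. The combining rule below (h1 * 31 + h2) is only a guess at
>>>> Hive's multi-column hash, for illustration:
>>>>
>>>> // Every row evaluates the same constants, so every row gets the same
>>>> // hash and the same sample bucket: the sample returns all rows or none.
>>>> int h = "aaa".hashCode() * 31 + "bbb".hashCode(); // hypothetical combiner
>>>> int bucket = ((h & Integer.MAX_VALUE) % 256) + 1; // identical for all rows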
>>>> On May 14, 2016 at 8:48 PM, "Mich Talebzadeh" <mich.talebzadeh@gmail.com>
>>>> wrote:
>>>>
>>>>> Can action_id be created as a numeric column?
>>>>>
>>>>> CREATE TABLE X ( action_id bigint,  ..)
>>>>>
>>>>> Bucketing or hash partitioning works best on numeric columns with high
>>>>> cardinality (say, a primary key).
>>>>>
>>>>> From my old notes:
>>>>>
>>>>> Bucketing in Hive refers to hash partitioning, where a hashing function
>>>>> is applied. Like an RDBMS such as Oracle, Hive will apply a linear hashing
>>>>> algorithm to prevent data from clustering within specific partitions.
>>>>> Hashing is very effective if the column selected for bucketing has very
>>>>> high selectivity, like an ID column where selectivity (select
>>>>> count(distinct(column))/count(column)) = 1 -- e.g. 1,000 distinct values
>>>>> in 1,000 rows gives a selectivity of exactly 1. In this case, the created
>>>>> partitions/files will be as evenly sized as possible. In a nutshell,
>>>>> bucketing is a method to get data evenly distributed over many
>>>>> partitions/files. One should define the number of buckets as a power of
>>>>> two -- 2^n, like 2, 4, 8, 16 etc. to achieve best results. Again,
>>>>> bucketing will help concurrency in Hive. It may even allow a
>>>>> partition-wise join, i.e. a join between two tables that are bucketed on
>>>>> the same column with the same number of buckets (has anyone tried this?)
>>>>>
>>>>>
>>>>>
>>>>> One more thing. When one defines the number of buckets at table
>>>>> creation time in Hive, the number of partitions/files will be fixed. In
>>>>> contrast, with partitioning you do not have this limitation.
>>>>>
>>>>> Can you run
>>>>>
>>>>> show create table X
>>>>>
>>>>> and send the output, please?
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 14 May 2016 at 12:23, no jihun <jeesim2@gmail.com> wrote:
>>>>>
>>>>>> Hello.
>>>>>>
>>>>>> I want to ask about the correct way to do bucketing and tablesample.
>>>>>>
>>>>>> There is a table X which I created by
>>>>>>
>>>>>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>>>>>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>>>>>> STORED AS ORC
>>>>>>
>>>>>> Then I inserted 500M rows into X by
>>>>>>
>>>>>> set hive.enforce.bucketing=true;
>>>>>> INSERT OVERWRITE TABLE X SELECT * FROM X_RAW;
>>>>>>
>>>>>> Then I want to count or search some rows with a condition. Roughly,
>>>>>>
>>>>>> SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
>>>>>>
>>>>>> But I'd better use tablesample, as I clustered X by (action_id,
>>>>>> classifier). So the better query should be
>>>>>>
>>>>>> SELECT COUNT(*) FROM X
>>>>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>>>>> WHERE action_id='aaa' AND classifier='bbb'
>>>>>>
>>>>>> Is there anything wrong above? I cannot find any performance gain
>>>>>> between these two queries.
>>>>>>
>>>>>> query 1 and RESULT (with no tablesample)
>>>>>>
>>>>>> SELECT COUNT(*) FROM X
>>>>>> WHERE action_id='aaa' and classifier='bbb'
>>>>>>
>>>>>> ------------------------------------------------------------
>>>>>> --------------------
>>>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING
>>>>>> FAILED  KILLED
>>>>>> ------------------------------------------------------------
>>>>>> --------------------
>>>>>> Map 1 ..........   SUCCEEDED    256        256        0
>>>>>> 0       0       0
>>>>>> Reducer 2 ......   SUCCEEDED      1          1        0
>>>>>> 0       0       0
>>>>>> ------------------------------------------------------------
>>>>>> --------------------
>>>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>>>>>> 15.35 s
>>>>>> ------------------------------------------------------------
>>>>>> --------------------
>>>>>> It scans the full table.
>>>>>>
>>>>>> query 2 and RESULT
>>>>>>
>>>>>> SELECT COUNT(*) FROM X
>>>>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>>>>> WHERE action_id='aaa' and classifier='bbb'
>>>>>>
>>>>>> ------------------------------------------------------------
>>>>>> --------------------
>>>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING
>>>>>> FAILED  KILLED
>>>>>> ------------------------------------------------------------
>>>>>> --------------------
>>>>>> Map 1 ..........   SUCCEEDED    256        256        0
>>>>>> 0       0       0
>>>>>> Reducer 2 ......   SUCCEEDED      1          1        0
>>>>>> 0       0       0
>>>>>> ------------------------------------------------------------
>>>>>> --------------------
>>>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>>>>>> 15.82 s
>>>>>> ------------------------------------------------------------
>>>>>> --------------------
>>>>>> It ALSO scans the full table.
>>>>>>
>>>>>> query 2 RESULT WHAT I EXPECTED.
>>>>>>
>>>>>> The result I expected is something like the following
>>>>>> (uses 1 map and is relatively faster than without tablesample):
>>>>>> ------------------------------------------------------------
>>>>>> --------------------
>>>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING
>>>>>> FAILED  KILLED
>>>>>> ------------------------------------------------------------
>>>>>> --------------------
>>>>>> Map 1 ..........   SUCCEEDED      1          1        0
>>>>>> 0       0       0
>>>>>> Reducer 2 ......   SUCCEEDED      1          1        0
>>>>>> 0       0       0
>>>>>> ------------------------------------------------------------
>>>>>> --------------------
>>>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>>>>>> 3.xx s
>>>>>> ------------------------------------------------------------
>>>>>> --------------------
>>>>>>
>>>>>> Values of action_id and classifier are well distributed and there is
>>>>>> no skewed data.
>>>>>>
>>>>>> So I want to ask you: what is the correct query to prune and target a
>>>>>> specific bucket by multiple columns?
>>>>>>
>>>>>
>>>>>
>>>
>>
>>
>> --
>> ----------------------------------------------
>> Jihun No ( 노지훈 )
>> ----------------------------------------------
>> Twitter          : @nozisim
>> Facebook       : nozisim
>> Website         : http://jeesim2.godohosting.com
>>
>> ---------------------------------------------------------------------------------
>> Market Apps   : android market products.
>> <https://market.android.com/developer?pub=%EB%85%B8%EC%A7%80%ED%9B%88>
>>
>
>
