hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "anishek (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-16865) Handle replication bootstrap of large databases
Date Wed, 14 Jun 2017 10:00:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-16865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048991#comment-16048991
] 

anishek edited comment on HIVE-16865 at 6/14/17 9:59 AM:
---------------------------------------------------------

h4. Bootstrap Replication Dump
* db metadata one at a time
* function metadata one at a time
* get all tableNames — which should not be overwhelming 	
* tables is also requested one a time. — all partition definitions are loaded in one go
for a table and we dont expect a table with more than 5-10 partition definition columns
* the partition object  themselves will be done in batches via the PartitionIteratable
* only problem seems to be when writing data to _files where in we load all the file status
objects per partition ( for partitioned tables) and per table otherwise , in memory. this
might lead  OOM cases :: decision : this is not a problem as for split computation we will
do the same, where we have not faced this issue.
* we create replCopyTask that will create the _files for all tables / partitions etc during
analysis time and then go to execution engine, this will  lead to lot of objects stored in
memory given the above scale targets. *possibly* 
** move the dump enclosed in a task itself which manage its own thread pools to subsequently
analyze/dump tables in execution phase, this will lead to possible blurring of demarcation
of execution vs analysis phase within hive. 
** Another mode might be to provide lazy incremental task, from analysis to execution phase,
such that both phases run simultaneously rather than one completing before another is started,
this will lead to significant change in code to allow the same and currently only seems to
be required only for replication.
** we might have to do the same for _incremental replication dump_ too as the _*from*_ and
_*to*_ event ids might have millions of events will all of them being inserts, though the
creation of _files is handled differently here where in we write the files along with metadata,
we should be able to do the same for bootstrap replication also rather than creating replcopy
task. this would mean the  replCopyTask should effectively be only used during load time.
The only problem using this approach is that since the process is single threaded we are going
to dump data sequentially and it might take long time, unless we do some threading in ReplicationSemanticAnalyzer
to dump tables with some parallel since there is no dependency between tables when dumping
them, a similar approach might be required for partitions also within tables. 

h4.Bootstrap Replication Load
* list all the table metadata files per db. For massive databases we will load a per above
on the order of a million filestatus objects in memory. This seems to significant higher order
of objects loaded than probably during split computation and hence might need to look at it.
 most probably move to {code}org.apache.hadoop.fs.RemoteIterator<LocatedFileStatus>
listFiles(Path f, boolean recursive){code}
* a task will be created for each type of operation, in case of bootstrap one task per table
/ partition / function /database, hence we will encounter the last problem in _*Bootstrap
Replication Dump*_ 



h4. Additional thoughts
* Since there can be multiple instance of metastores, from an integration w.r.t beacon for
replication, would it be better to have a dedicated metastore instance for replication related
workload(at least for bootstrap), since the execution of tasks will take place on the metastore
instance it might be better served for the customer to have one metastore for replication
and others to handle normal workloads. This can be achieved, I think, based on how the URL's
are configured on HS2 client/orchestration engine of replication . 
* On calling distcp in replcopytask can we log the sourcepath to destpath else if there are
problems during copying we wont know the actual paths.
* On replica warehouse since replication tasks will run alongside normal execution of other
hive tasks assuming there are multiple db's on replica, how do we constraint resource allocation
for replication vs normal task ? how do we manage this such that we dont lag behind replication
significantly ? 



was (Author: anishek):
h4. Bootstrap Replication Dump
* db metadata one at a time
* function metadata one at a time
* get all tableNames — which should not be overwhelming 	
* tables is also requested one a time. — all partition definitions are loaded in one go
for a table and we dont expect a table with more than 5-10 partition definition columns
* the partition object  themselves will be done in batches via the PartitionIteratable
* only problem seems to be when writing data to _files where in we load all the file status
objects per partition ( for partitioned tables) and per table otherwise , in memory. this
might lead  OOM cases :: decision : this is not a problem as for split computation we will
do the same, where we have not faced this issue.
* we create replCopyTask that will create the _files for all tables / partitions etc during
analysis time and then go to execution engine, this will  lead to lot of objects stored in
memory given the above scale targets. *possibly* 
** move the dump enclosed in a task itself which manage its own thread pools to subsequently
analyze/dump tables in execution phase, this will lead to possible blurring of demarcation
of execution vs analysis phase within hive. 
** Another mode might be to provide lazy incremental task, from analysis to execution phase,
such that both phases run simultaneously rather than one completing before another is started,
this will lead to significant change in code to allow the same and currently only seems to
be required only for replication.
** we might have to do the same for _incremental replication dump_ too as the _*from*_ and
_*to*_ event ids might have millions of events will all of them being inserts, though the
creation of _files is handled differently here where in we write the files along with metadata,
we should be able to do the same for bootstrap replication also rather than creating replcopy
task. this would mean the  replCopyTask should effectively be only used during load time.
The only problem using this approach is that since the process is single threaded we are going
to dump data sequentially and it might take long time, unless we do some threading in ReplicationSemanticAnalyzer
to dump tables with some parallel since there is no dependency between tables when dumping
them, a similar approach might be required for partitions also within tables. 

h4.Bootstrap Replication Load
* list all the table metadata files per db. For massive databases we will load a per above
on the order of a million filestatus objects in memory. This seems to significant higher order
of objects loaded than probably during split computation and hence might need to look at it.
 most probably move to {code}org.apache.hadoop.fs.RemoteIterator<LocatedFileStatus>
listFiles(Path f, boolean recursive){code}
* a task will be created for each type of operation, in case of bootstrap one task per table
/ partition / function /database, hence we will encounter the last problem in _*Bootstrap
Replication Dump*_ 



h4. Additional thoughts
* Since there can be multiple instance of metastores, from an integration w.r.t beacon for
replication, would it be better to have a dedicated metastore instance for replication related
workload(at least for bootstrap), since the execution of tasks will take place on the metastore
instance it might be better served for the customer to have one metastore for replication
and others to handle normal workloads. This can be achieved, I think, based on how the URL's
are configured on HS2/beacon side. 
* On calling distcp in replcopytask can we log the sourcepath to destpath else if there are
problems during copying we wont know the actual paths.
* On replica warehouse since replication tasks will run alongside normal execution of other
hive tasks assuming there are multiple db's on replica, how do we constraint resource allocation
for replication vs normal task ? how do we manage this such that we dont lag behind replication
significantly ? 


> Handle replication bootstrap of large databases
> -----------------------------------------------
>
>                 Key: HIVE-16865
>                 URL: https://issues.apache.org/jira/browse/HIVE-16865
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2
>    Affects Versions: 3.0.0
>            Reporter: anishek
>            Assignee: anishek
>             Fix For: 3.0.0
>
>
> for larger databases make sure that we can handle replication bootstrap.
> * Assuming large database can have close to million tables or a few tables with few hundred
thousand partitions. 
> *  for function replication if a primary warehouse has large number of custom functions
defined such that the same binary file in corporates most of these functions then on the replica
warehouse there might be a problem in loading all these functions as we will have the same
jar on primary copied over for each function such that each function will have a local copy
of the jar, loading all these jars might lead to excessive memory usage. 
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message