[ https://issues.apache.org/jira/browse/PHOENIX4912?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel
]
Bin Shi updated PHOENIX4912:

Description:
The current implementation of table sampling is based on the assumption "Every two consecutive
guide posts contains the equal number of rows" which isn't accurate in practice, and once
we collect multiple versions of cells and the deleted rows, the thing will become worse.
In details, the current implementation of table sampling is (see BaseResultIterators.getParallelScan()
which calls sampleScans(...) at the end of function) as described below:
# Iterate all parallel scans generated;
# For each scan, if getHashHode(start row key of the scan) MOD 100 < tableSamplingRate
(See TableSamplerPredicate.java) then pick this scan; otherwise discard this scan.
The problem can be formalized as: We have a group of scans and each scan is defined as <the
start row key denoted as Ki, the count of rows denoted as Ci>. Now we want to randomly
pick X groups so that the sum of count of rows in the selected groups is close to Y, where
Y = the total count of rows of all scans denoted as T * table sampling rate denoted as R (0
<= R <= 100) / 100.00.
To resolve the above problem, one of algorithms that we can consider are described below.
The core idea is to adjust T, R, Y after each pick, so the new problem is a child problem
of the original problem.
{code:java}
ArrayList<Scan> TableSampling(ArrayList<Scan> scans, T, R) {
ArrayList<Scan> pickedScans = new ArrayList<Scan>();
Y = T * R / 100.00;
for (scan<Ki, Ci> in scans) {
if (Y <= 0) break;
if (getHashCode(Ki) MOD 100 < R) {
// then pick this scan, and adjust T, R, Y accordingly
pickedScans.Add(scan);
T = Ci;
Y = Ci;
if (T != 0 && Y > 0) {
R = 100.00 * Y / T;
}
}
}
return pickedScans;
}
{code}
was:
The current implementation of table sampling is based on the assumption "Every two consecutive
guide posts contains the equal number of rows" which isn't accurate in practice, and once
we collect multiple versions of cells and the deleted rows, the thing will become worse.
In details, the current implementation of table sampling is (see BaseResultIterators.getParallelScan()
which calls sampleScans(...) at the end of function) as described below:
# Iterate all parallel scans generated;
# For each scan, if getHashHode(start row key of the scan) MOD 100 < tableSamplingRate
(See TableSamplerPredicate.java) then pick this scan; otherwise discard this scan.
The problem can be formalized as: We have a group of scans and each scan is defined as <the
start row key denoted as Ki, the count of rows denoted as Ci>. Now we want to randomly
pick X groups so that the sum of count of rows in the selected groups is close to Y, where
Y = the total count of rows of all scans denoted as T * table sampling rate denoted as R (0
<= R <= 100).
To resolve the above problem, one of algorithms that we can consider are described below.
The core idea is to adjust T, R, Y after each pick, so the new problem is a child problem
of the original problem.
{code:java}
ArrayList<Scan> TableSampling(ArrayList<Scan> scans, T, R) {
ArrayList<Scan> pickedScans = new ArrayList<Scan>();
Y = T * R / 100.00;
for (scan<Ki, Ci> in scans) {
if (Y <= 0) break;
if (getHashCode(Ki) MOD 100 < R) {
// then pick this scan, and adjust T, R, Y accordingly
pickedScans.Add(scan);
T = Ci;
Y = Ci;
if (T != 0 && Y > 0) {
R = 100.00 * Y / T;
}
}
}
return pickedScans;
}
{code}
> Make Table Sampling algorithm to accommodate to the imbalance row distribution across
guide posts
> 
>
> Key: PHOENIX4912
> URL: https://issues.apache.org/jira/browse/PHOENIX4912
> Project: Phoenix
> Issue Type: Improvement
> Affects Versions: 5.0.0, 4.15.0
> Reporter: Bin Shi
> Assignee: Bin Shi
> Priority: Major
>
> The current implementation of table sampling is based on the assumption "Every two consecutive
guide posts contains the equal number of rows" which isn't accurate in practice, and once
we collect multiple versions of cells and the deleted rows, the thing will become worse.
> In details, the current implementation of table sampling is (see BaseResultIterators.getParallelScan()
which calls sampleScans(...) at the end of function) as described below:
> # Iterate all parallel scans generated;
> # For each scan, if getHashHode(start row key of the scan) MOD 100 < tableSamplingRate
(See TableSamplerPredicate.java) then pick this scan; otherwise discard this scan.
> The problem can be formalized as: We have a group of scans and each scan is defined
as <the start row key denoted as Ki, the count of rows denoted as Ci>. Now we want
to randomly pick X groups so that the sum of count of rows in the selected groups is close
to Y, where Y = the total count of rows of all scans denoted as T * table sampling rate denoted
as R (0 <= R <= 100) / 100.00.
> To resolve the above problem, one of algorithms that we can consider are described below.
The core idea is to adjust T, R, Y after each pick, so the new problem is a child problem
of the original problem.
> {code:java}
> ArrayList<Scan> TableSampling(ArrayList<Scan> scans, T, R) {
> ArrayList<Scan> pickedScans = new ArrayList<Scan>();
> Y = T * R / 100.00;
> for (scan<Ki, Ci> in scans) {
> if (Y <= 0) break;
> if (getHashCode(Ki) MOD 100 < R) {
> // then pick this scan, and adjust T, R, Y accordingly
> pickedScans.Add(scan);
> T = Ci;
> Y = Ci;
> if (T != 0 && Y > 0) {
> R = 100.00 * Y / T;
> }
> }
> }
> return pickedScans;
> }
> {code}

This message was sent by Atlassian JIRA
(v7.6.3#76005)
