hbase-user mailing list archives

From Guillermo Ortiz <konstt2...@gmail.com>
Subject Re: Scan vs Parallel scan.
Date Fri, 12 Sep 2014 07:33:38 GMT
Yes, that's it. I have changed the HBase version to 0.98.

I got the start and stop keys with this method:
private List<RegionScanner> generatePartitions() {
        List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
        byte[] startKey;
        byte[] stopKey;
        HConnection connection = null;
        HBaseAdmin hbaseAdmin = null;
        try {
            connection = HConnectionManager.createConnection(HBaseConfiguration.create());
            hbaseAdmin = new HBaseAdmin(connection);
            // One scanner per region, bounded by that region's key range.
            List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
            RegionScanner regionScanner = null;
            for (HRegionInfo region : regions) {

                startKey = region.getStartKey();
                stopKey = region.getEndKey();

                regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
                // regionScanner = createRegionScanner(startKey, stopKey);
                if (regionScanner != null) {
                    regionScanners.add(regionScanner);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        } finally {
            try {
                if (hbaseAdmin != null) hbaseAdmin.close();
                if (connection != null) connection.close();
            } catch (IOException e) {
                // ignore errors on close
            }
        }
        return regionScanners;
    }

And I execute the RegionScanner with this:
public List<Result> call() throws Exception {
        HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
        HTableInterface table = connection.getTable(configuration.getTable());

        // Scan only this region's key range.
        Scan scan = new Scan(startKey, stopKey);
        scan.setBatch(configuration.getBatch());
        scan.setCaching(configuration.getCaching());
        ResultScanner resultScanner = table.getScanner(scan);

        List<Result> results = new ArrayList<Result>();
        for (Result result : resultScanner) {
            results.add(result);
        }

        // Close in reverse order of creation.
        resultScanner.close();
        table.close();
        connection.close();

        return results;
    }

They implement Callable.
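
To run them, I submit the callables to a thread pool and merge the partial
results, along these lines (a sketch of the idea rather than my exact code;
the pool size of one thread per region and the merging loop are assumptions):

// Sketch: run one RegionScanner per region and merge the partial results.
// Uses java.util.concurrent; pool size is an assumption, not my exact setup.
List<RegionScanner> scanners = generatePartitions();
ExecutorService executor = Executors.newFixedThreadPool(scanners.size());
List<Result> allResults = new ArrayList<Result>();
try {
    // invokeAll() blocks until every regional scan has finished.
    List<Future<List<Result>>> futures = executor.invokeAll(scanners);
    for (Future<List<Result>> future : futures) {
        allResults.addAll(future.get());
    }
} finally {
    executor.shutdown();
}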


2014-09-12 9:26 GMT+02:00 Michael Segel <michael_segel@hotmail.com>:

> Let's take a step back…
>
> Your parallel scan has the client create N threads, where each thread
> does a partial scan of the table, and each partial scan covers the first
> to last row of one region?
>
> Is that correct?
>
> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <konstt2000@gmail.com> wrote:
>
> > I was checking a little bit more; I checked the cluster, and the data is
> > stored in three different region servers, each one on a different node.
> > So I guess the threads go to different hard disks.
> >
> > If someone has an idea or suggestion: why is a single scan faster than
> > this implementation? I based mine on this implementation:
> > https://github.com/zygm0nt/hbase-distributed-search
> >
> > 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <konstt2000@gmail.com>:
> >
> >> I'm working with HBase 0.94 for this case; I'll try with 0.98, although
> >> there is no difference.
> >> I disabled the table, disabled the block cache for that family, and I
> >> set scan.setCacheBlocks(false) as well in both cases.
> >>
> >> I think it's not possible that I'm executing a complete scan in each
> >> thread, since my data is of the form:
> >> 000001 f:q value=1
> >> 000002 f:q value=2
> >> 000003 f:q value=3
> >> ...
> >>
> >> I add up all the values and get the same result with a single scan as
> >> with the distributed one, so I guess the DistributedScan did well.
> >> The count from the hbase shell takes about 10-15 seconds (I don't
> >> remember exactly), but around 4x the scan time.
> >> I'm not using any filter for the scans.
> >>
> >> This is the way I calculate the number of regions/scans:
> >> private List<RegionScanner> generatePartitions() {
> >>         List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
> >>         byte[] startKey;
> >>         byte[] stopKey;
> >>         HConnection connection = null;
> >>         HBaseAdmin hbaseAdmin = null;
> >>         try {
> >>             connection = HConnectionManager.createConnection(HBaseConfiguration.create());
> >>             hbaseAdmin = new HBaseAdmin(connection);
> >>             List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> >>             RegionScanner regionScanner = null;
> >>             for (HRegionInfo region : regions) {
> >>
> >>                 startKey = region.getStartKey();
> >>                 stopKey = region.getEndKey();
> >>
> >>                 regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
> >>                 // regionScanner = createRegionScanner(startKey, stopKey);
> >>                 if (regionScanner != null) {
> >>                     regionScanners.add(regionScanner);
> >>                 }
> >>             }
> >>
> >>
> >> I did some tests on a tiny table and I think the range for each scan
> >> works fine. Still, I found it interesting that the distributed scan
> >> takes about 6x the time.
> >>
> >> I'm going to check the hard disks, but I think they're fine.
> >>
> >>
> >>
> >>
> >> 2014-09-11 7:50 GMT+02:00 lars hofhansl <larsh@apache.org>:
> >>
> >>> Which version of HBase?
> >>> Can you show us the code?
> >>>
> >>>
> >>> Your parallel scan with caching 100 takes about 6x as long as the
> >>> single scan, which is suspicious because you say you have 6 regions.
> >>> Are you sure you're not accidentally scanning all the data in each of
> >>> your parallel scans?
> >>>
> >>> -- Lars
> >>>
> >>>
> >>>
> >>> ________________________________
> >>> From: Guillermo Ortiz <konstt2000@gmail.com>
> >>> To: "user@hbase.apache.org" <user@hbase.apache.org>
> >>> Sent: Wednesday, September 10, 2014 1:40 AM
> >>> Subject: Scan vs Parallel scan.
> >>>
> >>>
> >>> Hi,
> >>>
> >>> I developed a distributed scan: I create a thread for each region.
> >>> After that, I tried to compare times for a Scan vs. a DistributedScan.
> >>> I have disabled the block cache on my table. My cluster has 3 region
> >>> servers with 2 regions each; in total there are 100,000 rows, and I
> >>> execute a complete scan.
> >>>
> >>> My partitions are:
> >>> -01666 -> request 16665
> >>> 016666-033332 -> request 16666
> >>> 033332-049998 -> request 16666
> >>> 049998-066664 -> request 16666
> >>> 066664-083330 -> request 16666
> >>> 083330- -> request 16671
> >>>
> >>>
> >>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
> >>>
> >>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
> >>>
> >>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
> >>>
> >>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
> >>>
> >>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
> >>>
> >>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
> >>>
> >>> The parallel scan performs much worse than the simple scan, and I don't
> >>> know why the simple scan is so fast; it's really much faster than
> >>> executing a "count" from the hbase shell, which doesn't look normal.
> >>> The only case where the parallel version wins is against a normal scan
> >>> with caching 1.
> >>>
> >>> Any clue about it?
> >>>
> >>
> >>
>
>
