cassandra-user mailing list archives

From "Shenghua(Daniel) Wan" <wansheng...@gmail.com>
Subject Re: cqlinputformat and retired cqlpagingingputformat creates lots of connections to query the server
Date Wed, 28 Jan 2015 07:51:14 GMT
For clarification, please check out the source code I got from C* v2.0.11,

in AbstractColumnFamilyInputFormat.getSplits(JobContext context),
lines 125 and 168:

        // cannonical ranges and nodes holding replicas
        List<TokenRange> masterRangeNodes = getRangeMap(conf);

            for (TokenRange range : masterRangeNodes)
            {
                if (jobRange == null)
                {
                    // for each range, pick a live owner and ask it to compute bite-sized splits
                    splitfutures.add(executor.submit(new SplitCallable(range, conf)));
                }

My understanding of this part of the source code is that, for each token
range, it creates a connection to the server.
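
To illustrate the pattern in the excerpt above (one task submitted per token range), here is a minimal, self-contained Java sketch. It is not the Cassandra implementation: `Range`, `describeSplits`, and `computeSplits` are hypothetical names made up for this example, and the "connection" is only simulated with a short sleep. The point it tries to show is that the number of tasks equals the number of ranges, while the number that run at the same time is bounded by the size of the executor's thread pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedSplitSketch {
    // Hypothetical stand-in for a token range; NOT Cassandra's TokenRange.
    static final class Range {
        final long start, end;
        Range(long start, long end) { this.start = start; this.end = end; }
    }

    // Counters for simulated connections: currently open, and the peak seen.
    static final AtomicInteger open = new AtomicInteger();
    static final AtomicInteger peak = new AtomicInteger();

    // Simulates "connect, ask for the splits of one range, disconnect".
    static List<String> describeSplits(Range r) throws InterruptedException {
        int now = open.incrementAndGet();
        peak.accumulateAndGet(now, Math::max);
        try {
            Thread.sleep(5); // placeholder for the round trip to a node
            List<String> out = new ArrayList<>();
            out.add("(" + r.start + ", " + r.end + "]");
            return out;
        } finally {
            open.decrementAndGet();
        }
    }

    // Submits one task per range; at most maxConcurrent tasks run at once.
    static List<String> computeSplits(List<Range> ranges, int maxConcurrent) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(maxConcurrent);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (Range r : ranges) {
                futures.add(executor.submit(() -> describeSplits(r)));
            }
            List<String> splits = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                splits.addAll(f.get());
            }
            return splits;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Range> ranges = new ArrayList<>();
        for (int i = 0; i < 257; i++) {
            ranges.add(new Range(i, i + 1));
        }
        List<String> splits = computeSplits(ranges, 8);
        System.out.println(splits.size() + " splits, peak concurrent tasks: " + peak.get());
    }
}
```

With 257 ranges and a pool of 8, all 257 tasks still run, but the peak counter never exceeds 8. Whether the real getSplits uses a bounded or unbounded executor, and how the mappers later connect, is not something this sketch settles.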


On Tue, Jan 27, 2015 at 11:21 PM, Huiliang Zhang <zhlntu@gmail.com> wrote:

> In that case, each node will have 256/3 connections at most. Still 256
> mappers. Someone please correct me if I am wrong.
>
> On Tue, Jan 27, 2015 at 11:04 PM, Shenghua(Daniel) Wan <
> wanshenghua@gmail.com> wrote:
>
>> Hi, Huiliang,
>> Great to hear from you, again!
>> Imagine you have 3 nodes, replication factor=1, and the default number of
>> tokens. You will have 3*256 mappers... In that case, you will soon run out
>> of mappers or reach the limit.
>>
>>
>> On Tue, Jan 27, 2015 at 10:59 PM, Huiliang Zhang <zhlntu@gmail.com>
>> wrote:
>>
>>> Hi Shenghua, as I understand it, each range is assigned to a mapper, and
>>> mappers do not share connections. So it needs at least 256 connections to
>>> read everything. But all 256 connections should not be open at the same
>>> time unless you have 256 mappers running at the same time.
>>>
>>> On Tue, Jan 27, 2015 at 9:34 PM, Shenghua(Daniel) Wan <
>>> wanshenghua@gmail.com> wrote:
>>>
>>>> By default, each C* node is set with 256 tokens. On a local 1-node C*
>>>> server, my Hadoop job creates 256 connections to the server. Is there any
>>>> way to control this behavior, e.g. to reduce the number of connections to
>>>> a pre-configured cap?
>>>>
>>>> I debugged the C* source code and found that the client asks for the
>>>> partition ranges (virtual nodes). The server then told the client there
>>>> were 257 ranges, corresponding to 257 column family splits.
>>>>
>>>> Here is a snapshot of my logs
>>>>
>>>> 15/01/27 18:02:20 DEBUG hadoop.AbstractColumnFamilyInputFormat: adding
>>>> ColumnFamilySplit((9121856086738887846, '-9223372036854775808] @[localhost])
>>>> ...
>>>> 257 splits in total.
>>>>
>>>> The problem is the user might only want all the data via a "select *"
>>>> like statement. It seems that 257 connections to query the rows are
>>>> necessary. However, is there any way to prohibit 257 concurrent
>>>> connections?
>>>>
>>>> My C* version is 2.0.11 and I also tried CqlPagingInputFormat, which
>>>> has the same behavior.
>>>>
>>>> Thank you.
>>>>
>>>> --
>>>>
>>>> Regards,
>>>> Shenghua (Daniel) Wan
>>>>
>>>
>>>
>>
>>
>> --
>>
>> Regards,
>> Shenghua (Daniel) Wan
>>
>
>


-- 

Regards,
Shenghua (Daniel) Wan
