accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Re: how can i optimize scan speed when use batch scan ?
Date Thu, 15 Jan 2015 04:05:57 GMT
Great, glad you found your records now!

Setting tserver.metadata.readahead.concurrent.max, 
tserver.readahead.concurrent.max, and tserver.scan.files.open.max to 
65536 is excessive. 100 is probably more than this code will actually 
use.
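
As a sketch, those properties can also be dialed back from the shell 
without hand-editing every node (the value 100 here is only illustrative, 
and some tserver.* properties, such as tserver.scan.files.open.max, may 
only take effect after a tablet server restart):

 > config -s tserver.metadata.readahead.concurrent.max=100
 > config -s tserver.readahead.concurrent.max=100
 > config -s tserver.scan.files.open.max=100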

On the monitor's overview page, there are a number of graphs. The bottom 
two graphs will show the hit-rate of the index and data block caches. 
Ensuring that the hit rates are near 100% would also help with your read 
performance and tell you if you need to increase tserver.cache.data.size 
and tserver.cache.index.size more.
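
As a rough sketch (the sizes below are illustrative, not recommendations), 
a more conservative accumulo-site.xml might look like the following. Both 
caches are allocated from the tablet server's JVM heap, so their sum must 
stay well under the -Xmx you give the tserver, and a tserver restart is 
needed for changes to take effect:

 <property>
   <name>tserver.cache.data.size</name>
   <value>1G</value>
 </property>
 <property>
   <name>tserver.cache.index.size</name>
   <value>512M</value>
 </property>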

As Eric mentioned earlier, you might want to look at enabling bloom 
filters on your table, as they would help with row lookups:

In the shell:

 > config -t your_table -s table.bloom.enabled=true
 > compact -t your_table

And then rerun your queries. See the examples page on bloom filters for 
more information: http://accumulo.apache.org/1.6/examples/bloom.html

Lu.Qin wrote:
> Thanks for your help!
>
>
> I compared the speed of exact and followingKey like this; is it right?
>
> Scanner scan = conn.createScanner("", new Authorizations());
> List<Range> list = new ArrayList<Range>();
> for (Map.Entry<Key, Value> entry : scan) {
>     if (list.size() == resultNum * threadNum) {
>         break;
>     }
>     Key indexKey = entry.getKey();
>     Key rowKey = new Key(indexKey.getColumnQualifier());
>     Text followRow = rowKey.followingKey(PartialKey.ROW).getRow();
>     list.add(new Range(rowKey.getRow(), followRow));
>     // list.add(Range.exact(entry.getKey().getColumnQualifier()));
> }
>
> scan.close();
>
>
> But I find there is not a big difference. I made the list contain 5000
> ranges, and it costs about 13s with the BatchScanner either way.
>
>
> I changed my config in accumulo-site.xml, and now the #results=0 scans no longer appear.
>
>
> This is my accumulo-site.xml:
>
> <property>
>   <name>tserver.cache.data.size</name>
>   <value>4G</value>
> </property>
>
> <property>
>   <name>tserver.cache.index.size</name>
>   <value>16G</value>
> </property>
>
> <property>
>   <name>tserver.memory.maps.native.enabled</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>tserver.metadata.readahead.concurrent.max</name>
>   <value>65536</value>
> </property>
>
> <property>
>   <name>tserver.readahead.concurrent.max</name>
>   <value>65536</value>
> </property>
>
> <property>
>   <name>tserver.scan.files.open.max</name>
>   <value>65536</value>
> </property>
>
> <property>
>   <name>table.cache.block.enable</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>table.cache.index.enable</name>
>   <value>true</value>
> </property>
>
>
> Is it ok?
>
>
> Thanks
>
>
> Original Message
> *From:* Josh Elser <josh.elser@gmail.com>
> *To:* user <user@accumulo.apache.org>
> *Sent:* Wednesday, January 14, 2015 11:13
> *Subject:* Re: Re: how can i optimize scan speed when use batch scan ?
>
> Thanks! That's very helpful.
>
> You probably meant to do the following:
>
> Key indexKey = entry.getKey();
> Key rowKey = new Key(indexKey.getColumnQualifier());
> Text followingRow = rowKey.followingKey(PartialKey.ROW).getRow();
> list.add(new Range(rowKey.getRow(), followingRow));
>
> Range.exact(row) will only match a Key which has that exact row ID
> (empty column family and qualifier). The above will match all keys with
> the provided row ID (all column families and qualifiers).
>
> Does that make sense (and hopefully work)?
>
> 覃璐 wrote:
>>
>>  This is the code showing how I get the row ids that are in the column qualifier:
>>
>>
>>  Scanner scan = conn.createScanner("t1", new Authorizations());
>>  List<Range> list = new ArrayList<Range>();
>>  for (Map.Entry<Key, Value> entry : scan) {
>>      if (list.size() == resultNum * threadNum) {
>>          break;
>>      }
>>      list.add(Range.exact(entry.getKey().getColumnQualifier()));
>>  }
>>  scan.close();
>>
>>
>>  and then I use the row ids to scan data.
>>
>>  BatchScanner bs = null;
>>  try {
>>      bs = conn.createBatchScanner("test.new_index", new Authorizations(), 10);
>>  } catch (TableNotFoundException e) {
>>      e.printStackTrace();
>>  }
>>  bs.setRanges(list);
>>
>>
>>  Original Message
>>  *From:* Josh Elser <josh.elser@gmail.com>
>>  *To:* user <user@accumulo.apache.org>
>>  *Sent:* Wednesday, January 14, 2015 10:32
>>  *Subject:* Re: Re: how can i optimize scan speed when use batch scan ?
>>
>>  You might need to set tserver.cache.data.size to a larger value.
>>  Depending on the amount of data, you might just churn through the cache
>>  without getting much benefit. I think you have to restart Accumulo after
>>  changing this property.
>>
>>  Can you show us the code you used to try to scan for a row ID and the
>>  data in the table you expected to be returned that wasn't?
>>
>>  覃璐 wrote:
>>>   Yes, I received all the results I wanted when the program ended.
>>>
>>>   But I do not know why the scan received 0 results when I am sure the
>>>   row id exists.
>>>
>>>   I configured table.cache.block.enable=true, but I did not see a
>>>   distinct change.
>>>
>>>   Thanks
>>>
>>>
>>>   Original Message
>>>   *From:* Eric Newton <eric.newton@gmail.com>
>>>   *To:* user <user@accumulo.apache.org>
>>>   *Sent:* Wednesday, January 14, 2015 00:17
>>>   *Subject:* Re: Re: how can i optimize scan speed when use batch scan ?
>>>
>>>   You should have received at least 1390 Key/Value pairs (#results=1390).
>>>
>>>   If your application has many exact RowID look-ups, you may want to
>>>   investigate Bloom filters.
>>>
>>>   Consider turning on data block caching to reduce latency on future look-ups.
>>>
>>>   -Eric
>>>
>>>
>>>   On Mon, Jan 12, 2015 at 8:15 PM, 覃璐 <luq.java@gmail.com> wrote:
>>>
>>>       I am sorry, I did not know about the images.
>>>
>>>       The log is this:
>>>
>>>
>>>       [17:50:38] TRACE
>>>       [org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator]
>>>       [org.apache.accumulo.core.util.OpTimer.start(OpTimer.java:39)]
>>>       [21521] - tid=65 oid=675 Continuing multi scan,
>>>       scanid=-152589127623326551
>>>
>>>       [17:50:38] TRACE
>>>       [org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator]
>>>       [org.apache.accumulo.core.util.OpTimer.stop(OpTimer.java:49)]
>>>       [21544] - tid=65 oid=675 Got more multi scan results, #results=1390
>>>       scanID=-152589127623326551 in 0.023 secs
>>>
>>>       [17:50:38] TRACE
>>>       [org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator]
>>>       [org.apache.accumulo.core.util.OpTimer.start(OpTimer.java:39)]
>>>       [21546] - tid=65 oid=676 Continuing multi scan,
>>>       scanid=-152589127623326551
>>>
>>>       [17:50:38] TRACE
>>>       [org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator]
>>>       [org.apache.accumulo.core.util.OpTimer.stop(OpTimer.java:49)]
>>>       [21555] - tid=45 oid=644 Got more multi scan results, #results=0
>>>       scanID=-4477962012178388198 in 1.002 secs
>>>
>>>       [17:50:38] TRACE
>>>       [org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator]
>>>       [org.apache.accumulo.core.util.OpTimer.start(OpTimer.java:39)]
>>>       [21555] - tid=45 oid=677 Continuing multi scan,
>>>       scanid=-4477962012178388198
>>>
>>>       [17:50:38] TRACE
>>>       [org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator]
>>>       [org.apache.accumulo.core.util.OpTimer.stop(OpTimer.java:49)]
>>>       [21596] - tid=57 oid=645 Got more multi scan results, #results=0
>>>       scanID=-8718025066902358141 in 1.003 secs
>>>
>>>       [17:50:38] TRACE
>>>       [org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator]
>>>       [org.apache.accumulo.core.util.OpTimer.start(OpTimer.java:39)]
>>>       [21596] - tid=57 oid=678 Continuing multi scan,
>>>       scanid=-8718025066902358141
>>>
>>>
>>>       The scan spends a long time but returns no results.
>>>
>>>
>>>       I use 1.6.1, and the config output is this:
>>>
>>>
>>>       default | table.balancer ............................ | org.apache.accumulo.server.master.balancer.DefaultLoadBalancer
>>>       default | table.bloom.enabled ....................... | false
>>>       default | table.bloom.error.rate .................... | 0.5%
>>>       default | table.bloom.hash.type ..................... | murmur
>>>       default | table.bloom.key.functor ................... | org.apache.accumulo.core.file.keyfunctor.RowFunctor
>>>       default | table.bloom.load.threshold ................ | 1
>>>       default | table.bloom.size .......................... | 1048576
>>>       default | table.cache.block.enable .................. | false
>>>       default | table.cache.index.enable .................. | true
>>>       default | table.classpath.context ................... |
>>>       default | table.compaction.major.everything.idle .... | 1h
>>>       default | table.compaction.major.ratio .............. | 3
>>>       default | table.compaction.minor.idle ............... | 5m
>>>       default | table.compaction.minor.logs.threshold ..... | 3
>>>       table   | table.constraint.1 ........................ | org.apache.accumulo.core.constraints.DefaultKeySizeConstraint
>>>       default | table.failures.ignore ..................... | false
>>>       default | table.file.blocksize ...................... | 0B
>>>       default | table.file.compress.blocksize ............. | 100K
>>>       default | table.file.compress.blocksize.index ....... | 128K
>>>       default | table.file.compress.type .................. | gz
>>>       default | table.file.max ............................ | 15
>>>       default | table.file.replication .................... | 0
>>>       default | table.file.type ........................... | rf
>>>       default | table.formatter ........................... | org.apache.accumulo.core.util.format.DefaultFormatter
>>>       default | table.groups.enabled ...................... |
>>>       default | table.interepreter ........................ | org.apache.accumulo.core.util.interpret.DefaultScanInterpreter
>>>       table   | table.iterator.majc.vers .................. | 20,org.apache.accumulo.core.iterators.user.VersioningIterator
>>>       table   | table.iterator.majc.vers.opt.maxVersions .. | 1
>>>       table   | table.iterator.minc.vers .................. | 20,org.apache.accumulo.core.iterators.user.VersioningIterator
>>>       table   | table.iterator.minc.vers.opt.maxVersions .. | 1
>>>       table   | table.iterator.scan.vers .................. | 20,org.apache.accumulo.core.iterators.user.VersioningIterator
>>>       table   | table.iterator.scan.vers.opt.maxVersions .. | 1
>>>       default | table.majc.compaction.strategy ............ | org.apache.accumulo.tserver.compaction.DefaultCompactionStrategy
>>>       default | table.scan.max.memory ..................... | 512K
>>>       default | table.security.scan.visibility.default .... |
>>>       default | table.split.threshold ..................... | 1G
>>>       default | table.walog.enabled ....................... | true
>>>
>>>
>>>       And my tablet servers have 4 cores and 32G of RAM.
>>>
>>>
>>>       Thanks
>>>
>>>
>>>       Original Message
>>>       *From:* Josh Elser <josh.elser@gmail.com>
>>>       *To:* user <user@accumulo.apache.org>
>>>       *Sent:* Monday, January 12, 2015 23:52
>>>       *Subject:* Re: Re: how can i optimize scan speed when use batch scan ?
>>>
>>>       FYI, images don't (typically) come across on the mailing list. Use some
>>>       external hosting and provide the link if it's important, please.
>>>
>>>       How many tabletservers do you have? What version of Accumulo are you
>>>       running? Can you share the output of `config -t your_table_name`?
>>>
>>>       Thanks.
>>>
>>>       覃璐 wrote:
>>>       >    I looked at the trace log.
>>>       >
>>>       >
>>>       >    Why does it receive 0 results and take so long?
>>>       >
>>>       >
>>>       >    Original Message
>>>       >    *From:* 覃璐 <luq.java@gmail.com>
>>>       >    *To:* user <user@accumulo.apache.org>
>>>       >    *Sent:* Monday, January 12, 2015 17:05
>>>       >    *Subject:* how can i optimize scan speed when use batch scan ?
>>>       >
>>>       >    Hi all.
>>>       >
>>>       >    Now I have code like this:
>>>       >
>>>       >    List<Range> rangeList = …..;
>>>       >    BatchScanner bs = conn.createBatchScanner();
>>>       >    bs.setRanges(rangeList);
>>>       >
>>>       >
>>>       >    The rangeList has about 1000 ranges, and every range is a random
>>>       >    row id that I create with Range.exact(new Text(…)),
>>>       >    but the speed is very slow; it may spend 2-3s. How can I optimize it?
>>>       >
>>>       >    Thanks
>>>
>>>
