incubator-cassandra-dev mailing list archives

From dragos cernahoschi <dragos.cernahos...@gmail.com>
Subject Re: CASSANDRA-1472 (bitmap indexes)
Date Mon, 15 Nov 2010 11:57:23 GMT
I've tested the index feature on the 0.7-beta3 branch without the 1472
patch. Queries on more than one column work better than with the patched
version, but definitely not correctly.

1.
- query on 3 columns, start key 1, row count 1 => no results
- query on same columns, start key 1, row count 10 => 8 results

2.
- same query, start key 1, row count 2 => 1 result
- query again, start key = max (keys from prev query) + 1, row count 2
=> *time out, infinite cycle*

3. Is there an example of the pagination feature (for when the expected
number of rows is not known in advance)?

Will get_indexed_slices return an empty list when there are no more
results?

- query on 1 column, start key 1, row count 1000 => ok
- same query, start key = max (keys from prev query) + 1, row count 1000 =>
ok
...
- *at some point the max (keys from prev query) < startkey and my pagination
loop runs forever*

Maybe I'm missing something on this.

4.
- query on 1 column, row count 1000 => ok
- query on 3 columns, row count 100 => time out (there is no infinite loop,
the thread eventually terminates)

Dragos
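
On the pagination question in item 3, a minimal sketch of a paging loop follows. It assumes (and this should be verified against the version under test) that start_key is inclusive and that rows come back in the partitioner's order rather than in numeric key order. Under those assumptions the next page starts at the last key returned, not at max(keys) + 1, and the loop stops when a page comes back short. fake_indexed_slices and token are hypothetical stand-ins written for this sketch; they are not the Thrift API.

```python
import hashlib


def token(key):
    # Stand-in for a hash partitioner's ordering (assumption for illustration).
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def fake_indexed_slices(matching_keys, start_key, count):
    """Hypothetical stand-in for an indexed query: returns up to `count`
    matching keys in token order, starting at start_key (inclusive).
    start_key=None means "from the beginning"."""
    ordered = sorted(matching_keys, key=token)
    if start_key is not None:
        start = token(start_key)
        ordered = [k for k in ordered if token(k) >= start]
    return ordered[:count]


def fetch_all(matching_keys, page_size):
    """Page through all matches without knowing the total in advance."""
    if page_size < 2:
        # With an inclusive start key, a page of 1 only ever returns the
        # start key itself, so the loop could never advance.
        raise ValueError("page_size must be at least 2")
    results = []
    start_key = None
    while True:
        page = fake_indexed_slices(matching_keys, start_key, page_size)
        # On every page after the first, the first row repeats the last
        # row of the previous page, so drop it.
        results.extend(page if start_key is None else page[1:])
        if len(page) < page_size:
            break  # a short page means there is nothing left
        start_key = page[-1]  # resume from the last key seen, not max(keys) + 1
    return results
```

Under these assumptions the termination test is "the page is shorter than requested", not "the page is empty", since the final page may still contain the repeated start key.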

On Sun, Nov 14, 2010 at 2:34 AM, Stu Hood <stuhood@gmail.com> wrote:

> > Is it worth testing 0.7-branch-without-1472 to make sure of that?
> Dragos: if you have time, this would be helpful. If you already have a KEYS
> index created, you shouldn't need to re-load the data, as the file format
> hasn't changed.
>
> Thanks,
> Stu
>
> On Sat, Nov 13, 2010 at 4:40 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
>
> > Is it worth testing 0.7-branch-without-1472 to make sure of that?
> >
> > On Fri, Nov 12, 2010 at 10:28 AM, Stu Hood <stuhood@gmail.com> wrote:
> > > Great, thanks for the variable Dragos: I'm fairly sure I broke this in the
> > > refactoring I did in 1472 to fit in a second index type.
> > >
> > >
> > > On Fri, Nov 12, 2010 at 4:03 AM, dragos cernahoschi <
> > > dragos.cernahoschi@gmail.com> wrote:
> > >
> > >> I confirm: the KEYS indexes have the same behavior as the KEYS_BITMAP
> > >> indexes: time out/succeed on the same queries.
> > >>
> > >> By the way, the insert of my data set with KEYS_BITMAP is much faster than
> > >> KEYS (about 5.5 times) and less gc intensive.
> > >>
> > >> Dragos
> > >>
> > >> On Tue, Nov 9, 2010 at 8:05 PM, Stu Hood <stu.hood@rackspace.com> wrote:
> > >>
> > >> > Interesting, thanks for the info.
> > >> >
> > >> > Perhaps the limitation is that index queries involving multiple clauses are
> > >> > currently implemented using brute-force filtering rather than an index join?
> > >> > The bitmap indexes have native support for this type of join, but it's not
> > >> > being used yet.
> > >> >
> > >> > To confirm: have you tried the same scenario with KEYS indexes? They use
> > >> > the same codepath for multiple index expressions, and should experience the
> > >> > same timeouts. Also, can you rerun the KEYS_BITMAP test with DEBUG logging
> > >> > enabled, to ensure that we aren't going into some kind of infinite loop?
> > >> >
> > >> > Thanks for the help,
> > >> > Stu
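
The difference between brute-force filtering and an index join, as contrasted in the message above, can be sketched as follows. This is an illustration only, not Cassandra's code: the row data, the function names, and the use of Python integers as bitmaps are all assumptions made for the example.

```python
# Toy data set: each row is a dict of column -> value (hypothetical values).
ROWS = [
    {"country": "RO", "browser": "ff", "os": "linux"},
    {"country": "RO", "browser": "ie", "os": "win"},
    {"country": "US", "browser": "ff", "os": "linux"},
    {"country": "RO", "browser": "ff", "os": "win"},
    {"country": "RO", "browser": "ff", "os": "linux"},
]


def brute_force(rows, clauses):
    """Look up candidates with the first clause only, then test every
    candidate row against the remaining clauses one by one."""
    (col0, val0), rest = clauses[0], clauses[1:]
    candidates = [i for i, row in enumerate(rows) if row[col0] == val0]
    return [i for i in candidates
            if all(rows[i][col] == val for col, val in rest)]


def build_bitmaps(rows, column):
    """One bitmap per distinct value: bit i is set if row i has that value."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps


def bitmap_join(indexes, clauses, row_count):
    """AND the per-clause bitmaps together; no row is read until the end."""
    acc = (1 << row_count) - 1  # start with every row selected
    for col, val in clauses:
        acc &= indexes[col].get(val, 0)
    return [i for i in range(row_count) if (acc >> i) & 1]


INDEXES = {col: build_bitmaps(ROWS, col) for col in ("country", "browser", "os")}
CLAUSES = [("country", "RO"), ("browser", "ff"), ("os", "linux")]
```

In the brute-force version the cost grows with the number of candidate rows matched by the first clause, which may be one reason multi-clause queries behave so differently from single-clause ones.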
> > >> >
> > >> > -----Original Message-----
> > >> > From: "dragos cernahoschi" <dragos.cernahoschi@gmail.com>
> > >> > Sent: Tuesday, November 9, 2010 11:50am
> > >> > To: dev@cassandra.apache.org
> > >> > Subject: Re: CASSANDRA-1472 (bitmap indexes)
> > >> >
> > >> > I'm running the query on three columns with cardinalities: 22, 17 and 10.
> > >> > Interestingly, when combining columns with cardinalities:
> > >> >
> > >> > 22 + 17 => no exception
> > >> > 22 + 10 => no exception
> > >> > 10 + 17 => timed out exception
> > >> > 22 + 17 + 10 => timed out exception
> > >> >
> > >> >
> > >> > On Tue, Nov 9, 2010 at 6:29 PM, Stu Hood <stu.hood@rackspace.com> wrote:
> > >> >
> > >> > > Can you tell me a little bit about your key distribution? How many unique
> > >> > > values are indexed (the cardinality)?
> > >> > >
> > >> > > Until the OrBiC projection I mention on 1472 is implemented, the bitmap
> > >> > > secondary indexes will perform terribly for high cardinality datasets.
> > >> > >
> > >> > > Thanks!
> > >> > >
> > >> > >
> > >> > > -----Original Message-----
> > >> > > From: "dragos cernahoschi" <dragos.cernahoschi@gmail.com>
> > >> > > Sent: Tuesday, November 9, 2010 10:14am
> > >> > > To: dev@cassandra.apache.org
> > >> > > Subject: Re: CASSANDRA-1472 (bitmap indexes)
> > >> > >
> > >> > > Meanwhile, the number of SSTables has dropped to just 7. Initially the
> > >> > > compaction thread suffered from the same "too many open files" problem and
> > >> > > couldn't do any compaction.
> > >> > >
> > >> > > But I'm still not able to run my tests: TimedOutException :(
> > >> > >
> > >> > > On Tue, Nov 9, 2010 at 5:51 PM, Stu Hood <stu.hood@rackspace.com> wrote:
> > >> > >
> > >> > > > Hmm, 500 sstables is definitely a degenerate case: did you disable
> > >> > > > compaction? By default, Cassandra strives to keep the sstable count below
> > >> > > > ~32, since accesses to separate sstables require seeks.
> > >> > > >
> > >> > > > In this case, the query will seek 500 times to check the secondary index
> > >> > > > for each sstable: if it finds matches it will need to seek to find them in
> > >> > > > the primary index, and seek again for the data file.
> > >> > > >
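
The seek count described above can be put into a rough lower-bound model: one secondary-index seek per SSTable, plus one primary-index seek and one data-file seek for each SSTable that has matches. The function names and the 10 ms per-seek figure are assumptions for illustration, not measurements from this thread.

```python
def min_seeks(sstables, sstables_with_matches):
    """Lower bound on disk seeks for one indexed query.

    One seek per SSTable to check its secondary index, then two more
    (primary index + data file) for each SSTable that has matches.
    """
    if not 0 <= sstables_with_matches <= sstables:
        raise ValueError("sstables_with_matches must be between 0 and sstables")
    return sstables + 2 * sstables_with_matches


def min_latency_seconds(seeks, seek_ms=10.0):
    """Time spent seeking, assuming a fixed (hypothetical) cost per seek."""
    return seeks * seek_ms / 1000.0


# 500 SSTables, all containing matches, versus 7 after compaction.
before = min_seeks(500, 500)
after = min_seeks(7, 7)
```

With these assumed numbers, 500 SSTables cost at least 1500 seeks (about 15 seconds of seeking), while 7 SSTables cost at least 21. The real figure is higher whenever an SSTable holds more than one match.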
> > >> > > > -----Original Message-----
> > >> > > > From: "dragos cernahoschi" <dragos.cernahoschi@gmail.com>
> > >> > > > Sent: Tuesday, November 9, 2010 5:33am
> > >> > > > To: dev@cassandra.apache.org
> > >> > > > Subject: Re: CASSANDRA-1472 (bitmap indexes)
> > >> > > >
> > >> > > > There are about 500 SSTables (12GB of data including index data,
> > >> > > > statistics...) The source data file had about 3GB/26 million rows.
> > >> > > >
> > >> > > > I only test with EQ expressions for now.
> > >> > > >
> > >> > > > Increasing the file limit resolved the problem, but now I'm getting
> > >> > > > TimedOutException(s) from thrift when "querying" even with slice size of 1.
> > >> > > > Is my machine too small (core 2 duo 2.93 2GB RAM Ubuntu 10.04) for such a
> > >> > > > test?
> > >> > > >
> > >> > > > I really have some interesting sets of data to test indexes with and I want
> > >> > > > to make a comparison between ordinary indexes and bitmap indexes.
> > >> > > >
> > >> > > > Thank you,
> > >> > > > Dragos
> > >> > > >
> > >> > > > On Mon, Nov 8, 2010 at 6:42 PM, Stu Hood <stu.hood@rackspace.com> wrote:
> > >> > > >
> > >> > > > > Dragos,
> > >> > > > >
> > >> > > > > How many SSTables did you have on disk, and were any of your index
> > >> > > > > expressions GT(E)/LT(E)?
> > >> > > > >
> > >> > > > > I expect that you are bumping into a limitation of the current
> > >> > > > > implementation: it opens up to 128 file-handles per SSTable in the worst
> > >> > > > > case for a GT/LT query (one per index bucket).
> > >> > > > >
> > >> > > > > A future version might remove that requirement, but for now, you should
> > >> > > > > probably bump the file handle limit on your machine to at least 2^16.
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > > Stu
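
A quick check of why 2^16 is a reasonable floor, assuming the worst case quoted above (128 handles per SSTable) and the roughly 500 SSTables reported elsewhere in this thread. The helper names are hypothetical; resource.getrlimit is the standard-library call for reading the current per-process limit on Unix-like systems.

```python
import resource

# Worst case quoted in the thread: one handle per index bucket.
HANDLES_PER_SSTABLE_WORST_CASE = 128


def worst_case_handles(sstable_count):
    """File handles a single GT/LT index query could open in the worst case."""
    return sstable_count * HANDLES_PER_SSTABLE_WORST_CASE


def limit_is_sufficient(sstable_count, limit):
    """True if `limit` covers the worst case for this many SSTables.

    This ignores every other descriptor the process holds (sockets,
    commit log, data files), so treat a bare pass as marginal.
    """
    return worst_case_handles(sstable_count) <= limit


def current_soft_limit():
    """Soft limit on open files for the current process (Unix only)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft
```

For 500 SSTables the worst case is 500 * 128 = 64,000 handles, which fits just under 2^16 = 65,536, while a limit as small as 1024 would be exhausted by 8 SSTables.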
> > >> > > > >
> > >> > > > >
> > >> > > > > -----Original Message-----
> > >> > > > > From: "dragos cernahoschi" <dragos.cernahoschi@gmail.com>
> > >> > > > > Sent: Monday, November 8, 2010 10:05am
> > >> > > > > To: dev@cassandra.apache.org
> > >> > > > > Subject: CASSANDRA-1472 (bitmap indexes)
> > >> > > > >
> > >> > > > > Hi,
> > >> > > > >
> > >> > > > > I've got an exception during the following test:
> > >> > > > >
> > >> > > > > test machine: core 2 duo 2.93 2GB RAM Ubuntu 10.04
> > >> > > > >
> > >> > > > > test scenario:
> > >> > > > > - 1 column family
> > >> > > > > - about 15 columns
> > >> > > > > - 7 indexed columns (bitmap)
> > >> > > > > - 26 million rows (insert operation went fine)
> > >> > > > > - thrift "query" on 3 of the indexed columns with get_indexed_slices
> > >> > > > >   (count: 100)
> > >> > > > > - got the following exception:
> > >> > > > >
> > >> > > > > 10/11/08 17:52:40 ERROR service.AbstractCassandraDaemon: Fatal exception in thread Thread[ReadStage:3,5,main]
> > >> > > > > java.io.IOError: java.io.FileNotFoundException: /home/dragos/cassandra/data/keyspace/visit-e-814-4-Bitidx.db (Too many open files)
> > >> > > > >    at org.apache.cassandra.io.sstable.bitidx.SegmentIterator.open(SegmentIterator.java:78)
> > >> > > > >    at org.apache.cassandra.io.sstable.bitidx.BitmapIndexReader.openBin(BitmapIndexReader.java:226)
> > >> > > > >    at org.apache.cassandra.io.sstable.bitidx.BitmapIndexReader.iterator(BitmapIndexReader.java:214)
> > >> > > > >    at org.apache.cassandra.io.sstable.SSTableReader.scan(SSTableReader.java:523)
> > >> > > > >    at org.apache.cassandra.db.secindex.KeysBitmapIndex.iterator(KeysBitmapIndex.java:103)
> > >> > > > >    at org.apache.cassandra.db.ColumnFamilyStore.scan(ColumnFamilyStore.java:1371)
> > >> > > > >    at org.apache.cassandra.service.IndexScanVerbHandler.doVerb(IndexScanVerbHandler.java:41)
> > >> > > > >    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:51)
> > >> > > > >    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> > >> > > > >    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> > >> > > > >    at java.lang.Thread.run(Thread.java:662)
> > >> > > > > Caused by: java.io.FileNotFoundException: /home/dragos/cassandra/data/keyspace/visit-e-814-4-Bitidx.db (Too many open files)
> > >> > > > >    at java.io.FileInputStream.open(Native Method)
> > >> > > > >    at java.io.FileInputStream.<init>(FileInputStream.java:106)
> > >> > > > >    at org.apache.avro.file.SeekableFileInput.<init>(SeekableFileInput.java:29)
> > >> > > > >    at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:38)
> > >> > > > >    at org.apache.cassandra.io.sstable.bitidx.SegmentIterator.open(SegmentIterator.java:72)
> > >> > > > >    ... 10 more
> > >> > > > > 10/11/08 17:52:40 ERROR service.AbstractCassandraDaemon: Fatal exception in thread Thread[ReadStage:2,5,main]
> > >> > > > > java.io.IOError: java.io.FileNotFoundException: /home/dragos/cassandra/data/keyspace/visit-e-1018-Index.db (Too many open files)
> > >> > > > >    at org.apache.cassandra.io.util.BufferedSegmentedFile.getSegment(BufferedSegmentedFile.java:68)
> > >> > > > >    at org.apache.cassandra.io.util.SegmentedFile$SegmentIterator.next(SegmentedFile.java:129)
> > >> > > > >    at org.apache.cassandra.io.util.SegmentedFile$SegmentIterator.next(SegmentedFile.java:1)
> > >> > > > >    at org.apache.cassandra.io.sstable.SSTableReader.getPosition(SSTableReader.java:455)
> > >> > > > >    at org.apache.cassandra.io.sstable.SSTableReader.getFileDataInput(SSTableReader.java:572)
> > >> > > > >    at org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:49)
> > >> > > > >    at org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:72)
> > >> > > > >    at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:84)
> > >> > > > >    at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1190)
> > >> > > > >    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1082)
> > >> > > > >    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1052)
> > >> > > > >    at org.apache.cassandra.db.ColumnFamilyStore.scan(ColumnFamilyStore.java:1378)
> > >> > > > >    at org.apache.cassandra.service.IndexScanVerbHandler.doVerb(IndexScanVerbHandler.java:41)
> > >> > > > >    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:51)
> > >> > > > >    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> > >> > > > >    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> > >> > > > >    at java.lang.Thread.run(Thread.java:662)
> > >> > > > > Caused by: java.io.FileNotFoundException: /home/dragos/cassandra/data/keyspace/visit-e-1018-Index.db (Too many open files)
> > >> > > > >    at java.io.RandomAccessFile.open(Native Method)
> > >> > > > >    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
> > >> > > > >    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:98)
> > >> > > > >    at org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:142)
> > >> > > > >    at org.apache.cassandra.io.util.BufferedSegmentedFile.getSegment(BufferedSegmentedFile.java:62)
> > >> > > > >    ... 16 more
> > >> > > > >
> > >> > > > > The same test worked fine with 1 million rows.
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > >
> > >> > >
> > >> > >
> > >> >
> > >> >
> > >> >
> > >>
> > >
> >
> >
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder of Riptano, the source for professional Cassandra support
> > http://riptano.com
> >
>
