cassandra-user mailing list archives

From Nate Sammons <NSamm...@ften.com>
Subject RE: Secondary index issue, unable to query for records that should be there
Date Tue, 08 Nov 2011 16:53:23 GMT
Here is a simple test that shows the problem.  My setup is:


- DSE 1.0.3 on Ubuntu 11.04, JDK 1.6.0_29 on x86_64, installed from the DataStax Debian repo (yesterday)

- Hector 1.0-1 (from Maven)

Attached is a CLI file to create the keyspace and CF, and a Java file to insert data and do some queries.


This creates the following CF:

create column family IndexTest with
  key_validation_class = UTF8Type
  and comparator = UTF8Type
  and column_metadata = [
      {column_name:year, validation_class:IntegerType, index_type: KEYS},
      {column_name:month, validation_class:IntegerType, index_type: KEYS},
      {column_name:day, validation_class:IntegerType, index_type: KEYS},
      {column_name:hour, validation_class:IntegerType, index_type: KEYS},
      {column_name:minute, validation_class:IntegerType, index_type: KEYS},
      {column_name:data, validation_class:UTF8Type}
  ];


The test then inserts 5 rows per minute value, with the following values for year/month/day/hour/minute:

                Year: 2011
                Month: 1, 2
                Day: 1-15
                Hour: 1-23
                Minute: 1-59

For a total of 203,550 rows.  For queries, it picks some known values for year/month/day/hour/minute at random and looks for matching rows; there should be 5 rows per combination.

Row keys are of the form YEAR-MONTH-DAY-HOUR-MINUTE-NUM (where NUM is 1-5).
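
As a rough illustration of the data set described above, here is a minimal Python sketch (not the attached Java file, which is not reproduced here). The helper name generate_rows and the constant names are hypothetical, and the key layout follows the stated YEAR-MONTH-DAY-HOUR-MINUTE-NUM form; it only checks that the stated ranges give the stated row count.

```python
from itertools import product

# Value ranges as stated in the message (all inclusive).
YEAR = 2011
MONTHS = (1, 2)
DAYS = range(1, 16)      # 1-15
HOURS = range(1, 24)     # 1-23
MINUTES = range(1, 60)   # 1-59
NUMS = range(1, 6)       # 5 rows per minute value


def generate_rows():
    """Yield (row_key, columns) for every combination (hypothetical helper)."""
    for month, day, hour, minute, num in product(MONTHS, DAYS, HOURS, MINUTES, NUMS):
        # Key layout follows the stated YEAR-MONTH-DAY-HOUR-MINUTE-NUM form.
        # The CLI output in this message shows keys like 2011-1-8-18-30--1,
        # so the real test may join the last field differently.
        key = f"{YEAR}-{month}-{day}-{hour}-{minute}-{num}"
        yield key, {"year": YEAR, "month": month, "day": day,
                    "hour": hour, "minute": minute}


total_rows = sum(1 for _ in generate_rows())
print(total_rows)  # 2 * 15 * 23 * 59 * 5 = 203550
```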


Now once that data is inserted, using the CLI I can find records such as the following:


[default@Test] get IndexTest[2011-1-8-18-30--1];
=> (column=data, value=xvktwirapi0qs0ta29w9rchbdc2omsuv0k2chjqp9pmaodlj9ngecllaa8eq3nnx66p591b2a06mry4rpsvkd54ji5pbxikpc6mxj4czi4nuuxgoasibjd5yk65hdtqe8a0uq3yxnw81dgq6hkx8wnbs177rwo51xtkwuhwizoc0gul92pvo6tfivjgdschd9fjzfu4v1d1uxhih3argr1mp4i1h6fqybfv2utlzdzzqczq3ruu90647prrnqwdw1zqmd46ia175a929ltx2hoz8sv6rs817zm2myhp3wekfk3flnuniqgtpth7g5fns8q3oc8qde5btivt1j99gc1h2kxjbek1p448t1hs91lh9r6yrg1douj53sn7d81bnwp4nnbmz01dbr46fae1b9ter0zljet2nl1x751no6pdt64k2mdh0un01gerfihak6vn0wdvgzuv9soji3pwgnffkw2zvm5q0jlp1uf9nmy7gzswydpxwtvc35c6jw64d,
timestamp=1320769482652005)
=> (column=day, value=8, timestamp=1320769482652002)
=> (column=hour, value=18, timestamp=1320769482652003)
=> (column=minute, value=30, timestamp=1320769482652004)
=> (column=month, value=1, timestamp=1320769482652001)
=> (column=year, value=2011, timestamp=1320769482652000)
Returned 6 results.


However, a query on the indexed columns for that same record returns nothing:

[default@Test] get IndexTest where year=2011 and month=1 and day=8 and hour=18 and minute=30;

0 Row Returned.
[default@Test] get IndexTest where year=2011 and month=1 and day=8 and hour=18;

0 Row Returned.
[default@Test] get IndexTest where year=2011 and month=1 and day=8;

0 Row Returned.
[default@Test] get IndexTest where year=2011 and month=1;


Similar results using CQLSH:

cqlsh> select * from IndexTest where year=2011 and month=1 and day=8 and hour=18 and minute=30;
cqlsh> select * from IndexTest where year=2011 and month=1 and day=8 and hour=18;
cqlsh> select * from IndexTest where year=2011 and month=1 and day=8;

(no results in any of those cases).




However, some data does show up through CQL (I omitted the column data for brevity):

[default@Test] get IndexTest where year=2011 and month=2 and day=8 and hour=18 and minute=30;
-------------------
RowKey: 2011-2-8-18-30--1
-------------------
RowKey: 2011-2-8-18-30--4
-------------------
RowKey: 2011-2-8-18-30--5
-------------------
RowKey: 2011-2-8-18-30--2
-------------------
RowKey: 2011-2-8-18-30--3

5 Rows Returned.


So it seems that (in this case) month=1 is not working, but month=2 does work (along with the other parts of the expression).  I haven't run this repeatedly to see whether this is always the case, but it seems to be.
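
For reference, the expected behaviour can be modelled in memory. This is a hedged Python sketch, not Cassandra or Hector code: build_rows and query are hypothetical helpers, and query simply applies every equality clause to every row. It assumes the value ranges and key form stated earlier in this message. With that data, both the month=1 and the month=2 query should match 5 rows.

```python
from itertools import product


def build_rows():
    """Build the test data in memory as {row_key: columns} (hypothetical helper)."""
    rows = {}
    for month, day, hour, minute, num in product(
            (1, 2), range(1, 16), range(1, 24), range(1, 60), range(1, 6)):
        key = f"2011-{month}-{day}-{hour}-{minute}-{num}"
        rows[key] = {"year": 2011, "month": month, "day": day,
                     "hour": hour, "minute": minute}
    return rows


def query(rows, **clauses):
    """Return sorted keys of rows whose columns equal every given clause."""
    return sorted(key for key, cols in rows.items()
                  if all(cols.get(name) == value for name, value in clauses.items()))


rows = build_rows()
jan = query(rows, year=2011, month=1, day=8, hour=18, minute=30)
feb = query(rows, year=2011, month=2, day=8, hour=18, minute=30)
print(len(jan), len(feb))  # prints "5 5"
```

A result of 0 rows for the month=1 query therefore does not match the inserted data, whichever way the clauses are combined.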


When running those queries using Hector, in the debugger the QueryResult's get() method returns null where it should return rows.



Thanks,

-nate



From: Jake Luciani [mailto:jakers@gmail.com]
Sent: Tuesday, November 08, 2011 8:56 AM
To: user@cassandra.apache.org
Subject: Re: Secondary index issue, unable to query for records that should be there

Hi Nate,

Could you try running it with debug logging enabled? It will give more insight into what's going on.

-Jake

On Tue, Nov 8, 2011 at 3:45 PM, Nate Sammons <NSammons@ften.com> wrote:
This is against a single server, not a cluster.  Replication factor for the keyspace is set
to 1, CL is the default for Hector, which I think is QUORUM.

I'm trying to get a simple test together that shows this.  Does anyone know if multiple indexes
like this are efficient?

Thanks,

-nate


From: Riyad Kalla [mailto:rkalla@gmail.com]
Sent: Monday, November 07, 2011 4:31 PM
To: user@cassandra.apache.org
Subject: Re: Secondary index issue, unable to query for records that should be there

Nate, is this all against a single Cassandra server, or do you have a ring set up? If you do have a ring set up, what is your replication factor set to? Also, what ConsistencyLevel are you writing with when storing the values?

-R
On Mon, Nov 7, 2011 at 2:43 PM, Nate Sammons <NSammons@ften.com> wrote:
Hello,

I'm experimenting with Cassandra (DataStax Enterprise 1.0.3), and I've got a CF with several
secondary indexes to try out some options.  Right now I have the following to create my CF
using the CLI:

create column family MyTest with
  key_validation_class = UTF8Type
  and comparator = UTF8Type
  and column_metadata = [
      -- absolute timestamp for this message, also indexed year/month/day/hour/minute
      -- index these as they are low cardinality
      {column_name:messageTimestamp, validation_class:LongType},
      {column_name:messageYear, validation_class:IntegerType, index_type: KEYS},
      {column_name:messageMonth, validation_class:IntegerType, index_type: KEYS},
      {column_name:messageDay, validation_class:IntegerType, index_type: KEYS},
      {column_name:messageHour, validation_class:IntegerType, index_type: KEYS},
      {column_name:messageMinute, validation_class:IntegerType, index_type: KEYS},

                ... other non-indexed columns defined

  ];


So when I insert data, I calculate a year/month/day/hour/minute and set these values on a
Hector ColumnFamilyUpdater instance and update that way.  Then later I can query from the
command line with CQL such as:

                get MyTest where messageYear=2011 and messageMonth=6 and messageDay=1 and messageHour=13 and messageMinute=44;

etc.  This generally works, but at some point queries that I know should return data no longer return any rows.

So for instance, part way through my test (inserting 250K rows), I can run the query above for data that should be there and get rows back, but later that same query returns 0 rows.  Similarly, a query with fewer clauses in the expression, like this:

                get MyTest where messageYear=2011 and messageMonth=6;

also returns 0 rows.


Any idea what could be going wrong?  I'm not getting any exceptions in my client during the
write, and I don't see anything in the logs (no errors anyway).



A second question - is what I'm doing insane?  I'm not sure that performance on CQL queries
with multiple indexed columns is good (does Cassandra intelligently use all available indexes
on these queries?)



Thanks,

-nate




--
http://twitter.com/tjake
