incubator-blur-dev mailing list archives

From Aaron McCurry <amccu...@gmail.com>
Subject Re: Future of Blur Query Language
Date Sun, 26 Aug 2012 14:01:03 GMT
On Sat, Aug 25, 2012 at 4:48 PM, Tim Tutt <tim.tutt@gmail.com> wrote:
> Aaron,
>
> Just for a little clarification on your example, when you say JOIN, are you
> actually just talking about a union of two sets or are you actually
> referring to the relational type of join where the intent is to merge them
> into a single record? If it's the former, wouldn't a simple OR suffice?

Well it's a little different in the Lucene world, but in essence it
would be the latter.  However the result is not a single Record but
rather a Row that contains the 2 Records.

Take a look at this link:
http://lucene.apache.org/core/3_6_1/api/contrib-join/org/apache/lucene/search/join/package-summary.html

Blur uses the Index-time joins, but it's an internal piece of code.
Blur doesn't actually use this contrib although maybe it should.
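
To make the Row/Record distinction concrete, here is a minimal sketch of the
two matching behaviours, using the field names from the example in the
original message quoted below. It is a toy in-memory model, not Blur's or
Lucene's implementation: the class and method names (RowJoinSketch, record,
matchesSameRecord, matchesAcrossRecords) are made up for illustration, and
exact string equality stands in for a real term match.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (an assumption, not Blur's API): a Row is a list of Records,
// and each Record maps "family.column" names to a single value.
class RowJoinSketch {

    // Builds a Record from alternating name/value arguments.
    static Map<String, String> record(String... namesAndValues) {
        Map<String, String> fields = new HashMap<>();
        for (int i = 0; i + 1 < namesAndValues.length; i += 2) {
            fields.put(namesAndValues[i], namesAndValues[i + 1]);
        }
        return fields;
    }

    // Exact equality stands in for "field contains term".
    static boolean hasValue(Map<String, String> rec, String field, String value) {
        return value.equals(rec.get(field));
    }

    // No join: a single Record must satisfy both clauses.
    static boolean matchesSameRecord(List<Map<String, String>> row) {
        for (Map<String, String> rec : row) {
            if (hasValue(rec, "cf1.field1", "value1")
                    && hasValue(rec, "cf1.field2", "value2")) {
                return true;
            }
        }
        return false;
    }

    // Join: each clause may be satisfied by a different Record in the Row.
    static boolean matchesAcrossRecords(List<Map<String, String>> row) {
        boolean first = false;
        boolean second = false;
        for (Map<String, String> rec : row) {
            first |= hasValue(rec, "cf1.field1", "value1");
            second |= hasValue(rec, "cf1.field2", "value2");
        }
        return first && second;
    }

    public static void main(String[] args) {
        // One Row whose two Records each satisfy only one clause.
        List<Map<String, String>> row = Arrays.asList(
                record("cf1.field1", "value1"),
                record("cf1.field2", "value2"));
        System.out.println(matchesSameRecord(row));    // false
        System.out.println(matchesAcrossRecords(row)); // true
    }
}
```

In both cases the hit that comes back is the Row as a whole; the only
difference is whether the clauses must be satisfied by the same Record.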

>
> Provided that I am in fact missing something, here are my thoughts on the
> query language:
>
> A common theme that I have seen across the board with commercial
> search/discovery products is the creation of a query language modeled after
> SQL with varying limitations. This tends to be fairly effective as the
> learning curve is not too steep for users who have experience writing SQL
> queries and dealing with relational databases. Additionally, these users
> normally find a way to live with the limitations of the language and find
> ways around the problems they are trying to solve as the language is
> typically advanced enough to be creative.
>
> Such a language, however, does not lend itself well to the less advanced
> end users of your product. Perhaps in certain cases this is acceptable as
> you will always have some advanced user available, but in the cases where
> these advanced users are in limited supply the learning curve becomes
> steeper as the technical ability and know-how decreases.

I agree with your assessment of a SQL-like language.  My fear in making
this the standard for all queries in Blur is the extra syntax the
language would require.  For example:

"select * from test_table where super = 'test';"

But this really isn't correct, because in SQL this would mean an exact
match and you would have to index the data in several different ways
to make super = 'test' work.  Instead it should be something like:

"select * from test_table where super like 'test';"

However in Lucene syntax and CQL it's just:

"test"

Also, I like keeping what to return separate from the query itself, as
well as where to start, how many to fetch, etc.

Blur has a JDBC project; perhaps both can be used.  We could use SQL
as a control language for passing what to select, sort by, etc. and let
CQL be the query language.  So from Thrift you would simply run
queries the same way we do now except that the simpleQuery would be
CQL instead of Lucene syntax.  And from the JDBC project you could
control the fetches, selectors, etc from SQL.

In Thrift:
BlurQuery bq = new BlurQuery();
bq.simpleQuery = new SimpleQuery();
bq.simpleQuery.query = "test";
bq.superOn = true;
bq.selector = new Selector();

client.query("test",bq);

In SQL/JDBC:

select * from test where q("test");

or something like that...  Just throwing out ideas.
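
As a rough illustration of that split, the sketch below pulls the table name
and the embedded query string out of the one statement shape shown above.
Everything here is hypothetical: q("...") is only the idea floated in this
message, SqlControlSketch and parse are invented names, and a real JDBC
driver would use a proper SQL parser rather than a regular expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: SQL carries the control parts (here just the table),
// while the string inside q("...") is handed on untouched as the query.
class SqlControlSketch {

    // Matches only the single statement shape from the message:
    //   select * from <table> where q("<query>");
    private static final Pattern STATEMENT = Pattern.compile(
            "\\s*select\\s+\\*\\s+from\\s+(\\w+)\\s+where\\s+q\\(\"([^\"]*)\"\\)\\s*;?\\s*",
            Pattern.CASE_INSENSITIVE);

    // Returns {table, query}; rejects any statement of another shape.
    static String[] parse(String sql) {
        Matcher m = STATEMENT.matcher(sql);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported statement: " + sql);
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("select * from test where q(\"test\");");
        System.out.println("table = " + parts[0]);
        System.out.println("query = " + parts[1]);
    }
}
```

The two extracted strings correspond to the table argument and the query
string in the Thrift example above.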

>
> In taking a brief look at the spec for CQL, I tend to agree with your
> assessment that it is the best option as it looks like it has the ability
> to be flexible enough to fit both cases. It is possible that you will run
> into limitations with the queries that your more advanced users are
> interested in, but perhaps those are the cases where Blur is not a fit.
>
>
> Tim
>
> On Sat, Aug 25, 2012 at 2:49 PM, Aaron McCurry <amccurry@gmail.com> wrote:
>
>> I would like to start a thread on the topic of the future of Blur's query
>> language.  Currently the "simpleQuery" is just a normal Lucene based
>> syntax with a little magic to figure out the joins (via the
>> SuperQuery) that the user probably intended.  Of course this guess
>> work gets it wrong sometimes.  Let me explain with an example:
>>
>> Given the query with superOn:
>>
>> +cf1.field1:value1 +cf1.field2:value2
>>
>> The current implementation will ASSUME that you want to find where
>> "cf1.field1" contains "value1" and where "cf1.field2" contains
>> "value2" in the same Record because the column family is the same.
>> i.e. NO JOIN across records
>>
>> But perhaps the user really does want a join, meaning that the user
>> wants to find any Row that contains one or more Records that have a
>> field "cf1.field1" that contains "value1" and one or more Records in
>> the same Row (but not necessarily in the same Record) that contains a
>> field "cf1.field2" that contains "value2".  i.e. JOIN
>>
>> Given the current implementation, the only way to force the JOIN is
>> to do something like:
>>
>> +(+cf1.field1:value1 nocf.nofield:somevalue) +(+cf1.field2:value2
>> nocf.nofield:somevalue)
>>
>> This will trick the parser into creating 2 separate join query
>> (SuperQuery) objects and performing the JOIN.
>>
>>
>> THIS IS UGLY.
>>
>> Here are the current criteria for a query language:
>> - The ability to support any Lucene query type (Boolean, Term, Fuzzy,
>> Span, etc.)
>> - User-defined query types should be supported (extensibility)
>> - The query language should be compatible with any programming
>> language so that the current thrift RPC can continue to be utilized
>>
>> Here are options that I have been thinking about:
>>
>> Option 1:
>> Somehow extend the current Lucene Query syntax to support these "new"
>> features.  The biggest issue I have with this is that we would be
>> creating yet another query language that users would have to learn.
>> Also I think that allowing users to extend the query language by
>> adding their own types would require a rewrite of the Lucene
>> implemented query parser.  So even starting with the Lucene query
>> language would be a lot of work.
>>
>> Option 2:
>> Some limited version of SQL or SQL-like syntax, basically supporting
>> normal SQL with limited join support (probably only natural joins).
>> This would be nice, because most users understand SQL.  But because
>> Blur cannot support all the various operations that SQL can provide,
>> this will probably be frustrating to users.  And they will need to
>> learn what Blur SQL will provide and any special Blur-only syntax.  So
>> this would again be like inventing another query language.
>>
>> Option 3:
>> CQL (http://en.wikipedia.org/wiki/Contextual_Query_Language) not to be
>> confused with Cassandra Query Language.  Currently I like this option
>> the best, because it has built-in extensibility as well as the normal
>> options needed for a search engine.  Boolean, fuzzy, wildcard, etc.
>>
>> I really would like to get others' opinions here and any other options.
>>  Thanks!
>>
>> Aaron
>>
