incubator-blur-dev mailing list archives

From Aaron McCurry <amccu...@gmail.com>
Subject Re: Future of Blur Query Language
Date Sun, 26 Aug 2012 23:43:51 GMT
I like this option a lot.  The only drawback I see is that use of the
API will get more complicated: callers would no longer form a single
string in one field.  Instead the query would have to be carried as
multiple arguments.  Maybe we should do something as simple as adding a
JOIN keyword to the Lucene syntax?

+JOIN(+test-family.testcol1:value1) +JOIN(+test-family.testcol3:value234123)

Or perhaps join is the wrong word.

+ROW(+test-family.testcol1:value1) +ROW(+test-family.testcol3:value234123)

ROW because that's what's being emitted from that part of the query?  Not sure.

Thoughts?

Aaron
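
For illustration, here is a minimal Java sketch of how the multi-argument
form quoted below and the single-string ROW(...) form above could map onto
each other. Everything in it is an assumption: the class RowSyntaxSketch and
the methods toRowSyntax and parseRowGroups are hypothetical names, not part
of Blur or Lucene, and the scanner only balances parentheses (it ignores
quoted phrases and escapes).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names exist in Blur or Lucene.
final class RowSyntaxSketch {

    /** Wraps each clause as +ROW(clause) and joins the groups with spaces. */
    static String toRowSyntax(List<String> clauses) {
        StringBuilder sb = new StringBuilder();
        for (String clause : clauses) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("+ROW(").append(clause).append(')');
        }
        return sb.toString();
    }

    /**
     * Extracts the body of each top-level ROW(...) group.
     * Simplification: only parentheses are tracked, so quoted phrases,
     * escaped parentheses, and terms ending in "ROW(" are not handled.
     */
    static List<String> parseRowGroups(String query) {
        List<String> groups = new ArrayList<String>();
        int i = 0;
        while ((i = query.indexOf("ROW(", i)) >= 0) {
            int start = i + 4; // first character after "ROW("
            int depth = 1;
            int j = start;
            while (j < query.length() && depth > 0) {
                char c = query.charAt(j);
                if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                }
                j++;
            }
            if (depth != 0) {
                throw new IllegalArgumentException("Unbalanced parentheses in: " + query);
            }
            groups.add(query.substring(start, j - 1)); // drop the closing ')'
            i = j;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> clauses = Arrays.asList(
                "+test-family.testcol1:value1",
                "+test-family.testcol3:value234123");
        String query = toRowSyntax(clauses);
        System.out.println(query);
        System.out.println(parseRowGroups(query));
    }
}
```

If something like this held up, the Thrift API could keep a single query
string while each ROW(...) group is still handed to its own join query.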

On Sun, Aug 26, 2012 at 2:28 PM, Garrett Barton
<garrett.barton@gmail.com> wrote:
> Ahh, the Lucene search-time joins from the contrib module
> (http://lucene.apache.org/core/3_6_1/api/contrib-join/org/apache/lucene/search/join/package-summary.html)
> are what I was thinking of. I like that style as a good interim upgrade
> to support these record joins. How hard would it be to do something
> like:
>
> BlurQuery blurQuery = new BlurQuery();
> blurQuery.simpleQuery = new SimpleQuery();
> //blank since there's nothing other than the joins
> blurQuery.simpleQuery.queryStr = "";
> blurQuery.addJoin("+test-family.testcol1:value1");
> blurQuery.addJoin("+test-family.testcol3:value234123");
> blurQuery.simpleQuery.superQueryOn = true;
> blurQuery.simpleQuery.type = ScoreType.SUPER;
> blurQuery.fetch = 10;
> blurQuery.minimumNumberOfResults = Long.MAX_VALUE;
> blurQuery.maxQueryTime = Long.MAX_VALUE;
> blurQuery.uuid = 1;
>
> This way one would not have to guess at the user's intent.  I think this is
> a cleaner workaround than the ugly query until something nicer comes
> along.
>
> If one were to go down either of the ?QL approaches, I think the
> default BlurResult should be the one from blur-jdbc and actually be
> what one would expect from a database returning the same query.  Not
> sold on it, though; I always find SQL very limiting.
>
> On Sun, Aug 26, 2012 at 10:28 AM, Aaron McCurry <amccurry@gmail.com> wrote:
>> Look at IndexManagerTest.testQueryWithJoin
>>
>> I have attached an easier to read version of the data.
>>
>> And here is the query to "trick" it into doing the join:
>>
>> +(+test-family.testcol1:value1 nojoin) +(+test-family.testcol3:value234123)
>>
>> On Sun, Aug 26, 2012 at 9:40 AM, Aaron McCurry <amccurry@gmail.com> wrote:
>>> I'm trying to create a real working example of the problem, and I have
>>> found a bug.  Basically the query never finishes, very strange.  Not
>>> sure if it's Lucene or my query at this point.  Once I have an example
>>> I will post it and add it as a unit test in the IndexManagerTest.
>>>
>>> Aaron
>>>
>>> On Sat, Aug 25, 2012 at 7:04 PM, Garrett Barton
>>> <garrett.barton@gmail.com> wrote:
>>>> Can we get this test case working to show the problem?
>>>>
>>>>         private static void testJoin(Iface client, String table) throws BlurException, TException {
>>>>                 RowMutation mutation = new RowMutation();
>>>>                 mutation.table = table;
>>>>                 mutation.waitToBeVisible = true;
>>>>                 mutation.rowId = "row1";
>>>>                 mutation.addToRecordMutations(newRecordMutation("cf1",
>>>>                                 "recordid1", newColumn("col1","value1")));
>>>>                 mutation.addToRecordMutations(newRecordMutation("cf1",
>>>>                                 "recordid2", newColumn("col2","value2")));
>>>>                 mutation.rowMutationType = RowMutationType.REPLACE_ROW;
>>>>                 client.mutate(mutation);
>>>>
>>>>                 List<String> joinTest = new ArrayList<String>();
>>>>                 joinTest.add("+cf1.col1:value1");
>>>>                 joinTest.add("+cf1.col2:value2");
>>>>                 joinTest.add("+cf1.col1:value1 +cf1.col2:value2");
>>>>                 joinTest.add("+(+cf1.col1:value1 nocf.nofield:somevalue) +(+cf1.col2.value2 nocf.nofield:somevalue)");
>>>>                 joinTest.add("+(+cf1.col1:value1) +(cf1.bla:bla +cf1.col2.value2)");
>>>>
>>>>                 for (String q : joinTest)
>>>>                         System.out.println(q + " hits: " + hits(client, table, q, true));
>>>>         }
>>>>
>>>>         private static long hits(Iface client, String table, String queryStr, boolean superQuery) throws BlurException, TException {
>>>>                 BlurQuery bq = new BlurQuery();
>>>>                 SimpleQuery sq = new SimpleQuery();
>>>>                 sq.queryStr = queryStr;
>>>>                 sq.superQueryOn = superQuery;
>>>>                 bq.simpleQuery = sq;
>>>>                 BlurResults query = client.query(table, bq);
>>>>                 return query.totalResults;
>>>>         }
>>>>
>>>>
>>>> Running I get:
>>>> +cf1.col1:value1 hits: 1
>>>> +cf1.col2:value2 hits: 1
>>>> +cf1.col1:value1 +cf1.col2:value2 hits: 0
>>>> +(+cf1.col1:value1 nocf.nofield:somevalue) +(+cf1.col2.value2 nocf.nofield:somevalue) hits: 0
>>>> +(+cf1.col1:value1) +(cf1.bla:bla +cf1.col2.value2) hits: 0
>>>>
>>>> What's the trick to get the join to work?
>>>>
>>>> Honestly my first instinct is to turn the record joins into a list
>>>> passed in to the simple query if one wants to move into record joining
>>>> vs default inter record joining of the same cf.  Will ponder the other
>>>> options some more. :)
>>>>
>>>> ~Garrett
>>>>
>>>> On Sat, Aug 25, 2012 at 4:48 PM, Tim Tutt <tim.tutt@gmail.com> wrote:
>>>>> Aaron,
>>>>>
>>>>> Just for a little clarification on your example, when you say JOIN, are
>>>>> you actually just talking about a union of two sets or are you actually
>>>>> referring to the relational type of join where the intent is to merge
>>>>> them into a single record? If it's the former, wouldn't a simple OR
>>>>> suffice?
>>>>>
>>>>> Provided that I am in fact missing something, here are my thoughts on
>>>>> the query language:
>>>>>
>>>>> A common theme that I have seen across the board with commercial
>>>>> search/discovery products is the creation of a query language modeled
>>>>> after SQL with varying limitations. This tends to be fairly effective as
>>>>> the learning curve is not too steep for users who have experience
>>>>> writing SQL queries and dealing with relational databases. Additionally,
>>>>> these users normally find a way to live with the limitations of the
>>>>> language and find ways around the problems they are trying to solve as
>>>>> the language is typically advanced enough to be creative.
>>>>>
>>>>> Such a language, however, does not lend itself well to the less advanced
>>>>> end users of your product. Perhaps in certain cases this is acceptable
>>>>> as you will always have some advanced user available, but in the cases
>>>>> where these advanced users are in limited supply the learning curve
>>>>> becomes steeper as the technical ability and know-how decreases.
>>>>>
>>>>> In taking a brief look at the spec for CQL, I tend to agree with your
>>>>> assessment that it is the best option as it looks like it has the ability
>>>>> to be flexible enough to fit both cases. It is possible that you will
>>>>> run into limitations with the queries that your more advanced users are
>>>>> interested in, but perhaps those are the cases where Blur is not a fit.
>>>>>
>>>>>
>>>>> Tim
>>>>>
>>>>> On Sat, Aug 25, 2012 at 2:49 PM, Aaron McCurry <amccurry@gmail.com> wrote:
>>>>>
>>>>>> I would like to start a thread on the topic of the future of Blur's
>>>>>> query language.  Currently the "simpleQuery" is just a normal
>>>>>> Lucene-based syntax with a little magic to figure out the joins (via
>>>>>> the SuperQuery) that the user probably intended.  Of course this
>>>>>> guesswork gets it wrong sometimes.  Let me explain with an example:
>>>>>>
>>>>>> Given the query with superOn:
>>>>>>
>>>>>> +cf1.field1:value1 +cf1.field2:value2
>>>>>>
>>>>>> The current implementation will ASSUME that you want to find where
>>>>>> "cf1.field1" contains "value1" and where "cf1.field2" contains
>>>>>> "value2" in the same Record because the column family is the same.
>>>>>> i.e. NO JOIN across records
>>>>>>
>>>>>> But perhaps the user really does want a join, meaning that the user
>>>>>> wants to find any Row that contains one or more Records that have a
>>>>>> field "cf1.field1" that contains "value1" and one or more Records in
>>>>>> the same Row (but not necessarily in the same Record) that contain a
>>>>>> field "cf1.field2" that contains "value2".  i.e. JOIN
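
To make the two interpretations concrete, here is a small in-memory Java
sketch. It is only a model under stated assumptions: a Row is a list of
Records, a Record is a map from "family.column" to a single value, and
"contains" is simplified to exact equality. The class JoinSemanticsSketch and
its methods are hypothetical and do not reflect how SuperQuery is
implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the two query interpretations; not Blur code.
final class JoinSemanticsSketch {

    /** NO JOIN: true only if one single Record holds every required pair. */
    static boolean matchesSameRecord(List<Map<String, String>> row,
                                     Map<String, String> required) {
        for (Map<String, String> record : row) {
            if (record.entrySet().containsAll(required.entrySet())) {
                return true;
            }
        }
        return false;
    }

    /** JOIN: true if every required pair is held by some Record in the Row. */
    static boolean matchesAcrossRecords(List<Map<String, String>> row,
                                        Map<String, String> required) {
        for (Map.Entry<String, String> want : required.entrySet()) {
            boolean found = false;
            for (Map<String, String> record : row) {
                if (want.getValue().equals(record.get(want.getKey()))) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    /** Builds a Record with a single field, for brevity in the example. */
    static Map<String, String> record(String field, String value) {
        Map<String, String> r = new HashMap<String, String>();
        r.put(field, value);
        return r;
    }

    public static void main(String[] args) {
        // One Row whose two values live in two different Records.
        List<Map<String, String>> row = new ArrayList<Map<String, String>>();
        row.add(record("cf1.field1", "value1"));
        row.add(record("cf1.field2", "value2"));

        Map<String, String> required = new HashMap<String, String>();
        required.put("cf1.field1", "value1");
        required.put("cf1.field2", "value2");

        System.out.println(matchesSameRecord(row, required));    // prints false
        System.out.println(matchesAcrossRecords(row, required)); // prints true
    }
}
```

In this model a Row whose two values sit in different Records is a miss
under the same-Record reading and a hit under the JOIN reading.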
>>>>>>
>>>>>> Given the current implementation, the only way to force the JOIN is
>>>>>> to do something like:
>>>>>>
>>>>>> +(+cf1.field1:value1 nocf.nofield:somevalue) +(+cf1.field2:value2
>>>>>> nocf.nofield:somevalue)
>>>>>>
>>>>>> This will trick the parser into creating 2 separate join query
>>>>>> (SuperQuery) objects and performing the JOIN.
>>>>>>
>>>>>>
>>>>>> THIS IS UGLY.
>>>>>>
>>>>>> Here are the current criteria for a query language:
>>>>>> - The ability to support any Lucene query type (Boolean, Term, Fuzzy,
>>>>>> Span, etc.)
>>>>>> - User-defined query types should be supported (extensibility)
>>>>>> - The query language should be compatible with any programming
>>>>>> language so that the current thrift RPC can continue to be utilized
>>>>>>
>>>>>> Here are options that I have been thinking about:
>>>>>>
>>>>>> Option 1:
>>>>>> Somehow extend the current Lucene Query syntax to support these "new"
>>>>>> features.  The biggest issue I have with this is that we would be
>>>>>> creating yet another query language that users would have to learn.
>>>>>> Also I think that allowing users to extend the query language by
>>>>>> adding their own types would require a rewrite of the Lucene
>>>>>> implemented query parser.  So even starting with the Lucene query
>>>>>> language would be a lot of work.
>>>>>>
>>>>>> Option 2:
>>>>>> Some limited version of SQL or SQL-like syntax, basically supporting
>>>>>> normal SQL with limited join support (probably only natural joins).
>>>>>> This would be nice, because most users understand SQL.  But because
>>>>>> Blur cannot support all the various operations that SQL can provide,
>>>>>> this will probably be frustrating to users.  And they will need to
>>>>>> learn what Blur SQL will provide and any special Blur-only syntax.
>>>>>> So this would again be like inventing another query language.
>>>>>>
>>>>>> Option 3:
>>>>>> CQL (http://en.wikipedia.org/wiki/Contextual_Query_Language), not to
>>>>>> be confused with Cassandra Query Language.  Currently I like this
>>>>>> option the best, because it has built-in extensibility as well as the
>>>>>> normal options needed for a search engine: Boolean, fuzzy, wildcard,
>>>>>> etc.
>>>>>>
>>>>>> I really would like to get others' opinions here and any other options.
>>>>>> Thanks!
>>>>>>
>>>>>> Aaron
>>>>>>
