lucene-dev mailing list archives

From "Joel Bernstein (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-8925) Add gatherNodes Streaming Expression to support breadth first traversals
Date Tue, 12 Apr 2016 19:11:25 GMT

     [ https://issues.apache.org/jira/browse/SOLR-8925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Bernstein updated SOLR-8925:
---------------------------------
    Attachment: SOLR-8925.patch

Patch with a first, very simple test case. It shows the basic machinery working.

> Add gatherNodes Streaming Expression to support breadth first traversals
> ------------------------------------------------------------------------
>
>                 Key: SOLR-8925
>                 URL: https://issues.apache.org/jira/browse/SOLR-8925
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>             Fix For: 6.1
>
>         Attachments: SOLR-8925.patch, SOLR-8925.patch
>
>
> The gatherNodes Streaming Expression is a flexible, general-purpose breadth-first graph traversal. It uses the same parallel join under the covers as SOLR-8888 but is much more generalized and can be used for a wide range of use cases.
> Sample syntax:
> {code}
>  gatherNodes(friends,
>              gatherNodes(friends,
>                          search(articles, q="body:(queryA)", fl="author"),
>                          walk="author->user",
>                          gather="friend"),
>              walk="friend->user",
>              gather="friend",
>              scatter="roots, branches, leaves")
> {code}
> The expression above is evaluated as follows:
> 1) The inner search() expression is evaluated on the *articles* collection, emitting a Stream of Tuples with the author field populated.
> 2) The inner gatherNodes() expression reads the Tuples from the search() stream and traverses to the *friends* collection by performing a distributed join between the articles.author and friends.user fields. It gathers the value from the *friend* field during the join.
> 3) The inner gatherNodes() expression then emits the *friend* Tuples. By default the gatherNodes function emits only the leaves, which in this case are the *friend* Tuples.
> 4) The outer gatherNodes() expression reads the *friend* Tuples and traverses the *friends* collection again, this time joining the *friend* Tuples emitted in step 3 to the friends.user field. This collects the friends of friends.
> 5) The outer gatherNodes() expression emits the entire graph that was collected. This is controlled by the "scatter" parameter. In the example the *roots* are the authors, the *branches* are the authors' friends and the *leaves* are the friends of friends.
> This traversal is fully distributed and cross collection.
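The five steps above can be sketched in plain Python over in-memory lists. This is only an illustration of the traversal semantics, not the patch's implementation: the `gather_nodes` helper, the sample documents, and the de-duplication of gathered values are all assumptions made for the example, and the dictionary index stands in for the distributed join.

```python
from collections import defaultdict

def gather_nodes(collection, tuples, walk, gather):
    """One breadth-first hop (hypothetical helper, not Solr code).

    walk is "from_field->to_field": each incoming tuple's from_field is
    joined to to_field in `collection`, and the `gather` field of every
    matching document is collected. Returns only the gathered leaves.
    """
    from_field, to_field = walk.split("->")
    # Index the target collection on the join field; this stands in
    # for the distributed join described in step 2.
    index = defaultdict(list)
    for doc in collection:
        index[doc[to_field]].append(doc)

    seen = set()   # assumption: each gathered value is emitted once
    leaves = []
    for t in tuples:
        for doc in index.get(t[from_field], []):
            value = doc[gather]
            if value not in seen:
                seen.add(value)
                leaves.append({gather: value})
    return leaves

# Made-up sample data for the two collections.
articles = [
    {"author": "alice", "body": "queryA"},
    {"author": "bob", "body": "something else"},
]
friends = [
    {"user": "alice", "friend": "carol"},
    {"user": "alice", "friend": "dave"},
    {"user": "carol", "friend": "erin"},
    {"user": "dave", "friend": "erin"},
    {"user": "dave", "friend": "frank"},
]

# Step 1: search() emits tuples with the author field populated.
roots = [{"author": a["author"]} for a in articles if "queryA" in a["body"]]
# Steps 2-3: the inner hop gathers the authors' friends.
branches = gather_nodes(friends, roots, walk="author->user", gather="friend")
# Step 4: the outer hop gathers the friends of friends.
leaves = gather_nodes(friends, branches, walk="friend->user", gather="friend")
# Step 5: scatter="roots, branches, leaves" emits every level.
graph = {"roots": roots, "branches": branches, "leaves": leaves}
```

With this sample data the roots are alice, the branches are carol and dave, and the leaves are erin and frank.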
> *Aggregations* are also supported during the traversal. This can be useful for making recommendations based on co-occurrence counts. Sample syntax:
> {code}
> top(
>       gatherNodes(baskets,
>                   search(baskets, q="prodid:X", fl="basketid", rows="500", sort="random_7897987 asc"),
>                   walk="basketid->basketid",
>                   gather="prodid",
>                   fl="prodid, price",
>                   count(*),
>                   avg(price)),
>       n=4,
>       sort="count(*) desc, avg(price) asc")
> {code}
> In the expression above, the inner search() function searches the *baskets* collection for 500 random basketids that contain prodid X.
> gatherNodes then traverses the *baskets* collection and gathers all the prodids for the selected basketids.
> It also aggregates the count and average price for each prodid collected. The count reflects the co-occurrence count between each gathered prodid and prodid X. The outer *top* expression selects the top 4 prodids emitted from gatherNodes, based on the co-occurrence count and average price.
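A rough Python sketch of the same aggregation, assuming an in-memory list of basket rows. The `top_co_occurring` function and the sample rows are hypothetical, and the random sampling of 500 baskets is omitted. Note that in this sketch the seed product itself shows up in the output because its rows sit in the selected baskets; the description does not say whether the Solr expression behaves the same way.

```python
from collections import defaultdict

def top_co_occurring(baskets, prodid, n):
    """Hypothetical stand-in for top(gatherNodes(baskets, search(...), ...))."""
    # search(): the ids of baskets that contain the seed product.
    basket_ids = {row["basketid"] for row in baskets if row["prodid"] == prodid}

    # gatherNodes(): walk basketid->basketid, gather prodid, and keep
    # a running count and price total per gathered prodid.
    stats = defaultdict(lambda: {"count": 0, "total": 0.0})
    for row in baskets:
        if row["basketid"] in basket_ids:
            s = stats[row["prodid"]]
            s["count"] += 1
            s["total"] += row["price"]

    rows = [
        {"prodid": p, "count(*)": s["count"], "avg(price)": s["total"] / s["count"]}
        for p, s in stats.items()
    ]
    # top(): sort by count(*) desc, then avg(price) asc, and keep n rows.
    rows.sort(key=lambda r: (-r["count(*)"], r["avg(price)"]))
    return rows[:n]

# Made-up sample rows: one row per (basket, product) pair.
baskets = [
    {"basketid": "b1", "prodid": "X", "price": 10.0},
    {"basketid": "b1", "prodid": "A", "price": 4.0},
    {"basketid": "b1", "prodid": "B", "price": 6.0},
    {"basketid": "b2", "prodid": "X", "price": 10.0},
    {"basketid": "b2", "prodid": "A", "price": 6.0},
    {"basketid": "b3", "prodid": "C", "price": 1.0},
    {"basketid": "b3", "prodid": "A", "price": 2.0},
]

result = top_co_occurring(baskets, "X", n=3)
```

Basket b3 does not contain X, so product C and the b3 row for A never enter the counts.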
> Like all streaming expressions, the gatherNodes expression can be combined with other streaming expressions. For example, the following expression uses a hashInnerJoin to intersect the networks of friends rooted at authors found with different queries:
> {code}
> hashInnerJoin(
>                       gatherNodes(friends,
>                                   gatherNodes(friends,
>                                               search(articles, q="body:(queryA)", fl="author"),
>                                               walk="author->user",
>                                               gather="friend"),
>                                   walk="friend->user",
>                                   gather="friend",
>                                   scatter="branches, leaves"),
>                       gatherNodes(friends,
>                                   gatherNodes(friends,
>                                               search(articles, q="body:(queryB)", fl="author"),
>                                               walk="author->user",
>                                               gather="friend"),
>                                   walk="friend->user",
>                                   gather="friend",
>                                   scatter="branches, leaves"),
>                       on="friend"
>          )
> {code}
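The intersection step can be pictured as an ordinary hash join on the *friend* field. The sketch below is a generic in-memory hash inner join, not Solr's implementation; `hash_inner_join` and the two sample networks are made up for illustration, standing in for the tuples emitted by the two gatherNodes expressions.

```python
from collections import defaultdict

def hash_inner_join(left, right, on):
    """Generic hash inner join on a single field (illustrative only)."""
    # Build phase: hash the right-hand stream on the join field.
    table = defaultdict(list)
    for t in right:
        table[t[on]].append(t)
    # Probe phase: emit a merged tuple for every left/right match.
    joined = []
    for t in left:
        for match in table.get(t[on], []):
            joined.append({**match, **t})
    return joined

# Made-up outputs of the two traversals (branches and leaves only).
network_a = [{"friend": "carol"}, {"friend": "dave"}, {"friend": "erin"}]
network_b = [{"friend": "dave"}, {"friend": "erin"}, {"friend": "zed"}]

common = hash_inner_join(network_a, network_b, on="friend")
```

Only dave and erin appear in both networks, so they are the only tuples that survive the join.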
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

