lucene-dev mailing list archives

From "Ashish Datta (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3759) Support joining in a distributed environment.
Date Thu, 10 Dec 2015 16:31:11 GMT


Ashish Datta commented on LUCENE-3759:

Hello Erick,
I would be glad to present a case for this if it helps. If it does not sound like a useful
use case, perhaps I could use some other tool.
Here's a quick overview of the use-case: 
The requirement I have is in analytics. Search results need to be exact: we are basically
'counting' things precisely, not approximating. The number of facets is not large, but the number
of their combinations is (hence the strong case for Solr).
The number of distinct data containers (collections) is small, but they are large, and neither
denormalizing nor keeping the data on a single server is a feasible option.
Joins are therefore becoming inevitable as the data grows and needs many servers to store
it, both because of size constraints and for computing efficiency.
Right now, the only option I have is to use a glue language to collect the 'from' terms from
the many 'shards' across servers, send queries with these terms to the 'to' collection shards
on several servers again, apply rules to aggregate the results centrally, manage timeouts and other
artificial issues created by this data division, and send the aggregated data on for visualisation
or other processing.
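To make that workaround concrete, here is a minimal sketch of such glue code in Python. It is only an illustration under assumptions: the shards are hypothetical in-memory lists standing in for remote Solr shards, the field names (`segment`, `user_id`, `product`) and the helper names are made up, and no Solr or Lucene API is used.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for shards; real glue code would query servers.
FROM_SHARDS = [
    [{"id": "u1", "segment": "gold"}, {"id": "u2", "segment": "silver"}],
    [{"id": "u3", "segment": "gold"}],
]
TO_SHARDS = [
    [{"user_id": "u1", "product": "book"}, {"user_id": "u2", "product": "book"}],
    [{"user_id": "u3", "product": "pen"}, {"user_id": "u1", "product": "pen"}],
]

def collect_from_terms(shard, segment):
    """Return the 'from' join keys of one shard that match the filter."""
    return {doc["id"] for doc in shard if doc["segment"] == segment}

def count_by_product(shard, user_ids):
    """Count 'to' documents of one shard whose join key is in user_ids."""
    return Counter(doc["product"] for doc in shard if doc["user_id"] in user_ids)

def fan_out(func, shards, arg, timeout_s=5.0):
    """Run func on every shard in parallel and return all partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        futures = [pool.submit(func, shard, arg) for shard in shards]
        # result() raises TimeoutError if a shard has not answered in time.
        # Exact counts cannot tolerate a missing shard, so the error propagates.
        return [f.result(timeout=timeout_s) for f in futures]

def joined_counts(segment):
    """Collect 'from' terms, query the 'to' shards, aggregate centrally."""
    terms = set().union(*fan_out(collect_from_terms, FROM_SHARDS, segment))
    totals = Counter()
    for partial in fan_out(count_by_product, TO_SHARDS, terms):
        totals.update(partial)
    return totals

if __name__ == "__main__":
    print(joined_counts("gold"))
```

Even in this toy form, the caller has to know every shard of both collections and decide what a timeout means for an exact count.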
As you can see, the charm and pull of Lucene's speed is dampened by the unnecessary
data complexity and the dependence on programming in a glue language: recording the number and
types of shards on each server and sending queries to the right targets. Redundancy/failover
is another pain to handle, besides managing a growing number of servers.

Everything I have described is already possible and available in Solr, except that it does not
work in a distributed manner.

Solr is a beautiful tool that could easily do everything I need if my data did not have to
be distributed across machines, as it does in my case!
If I denormalize this kind of data, I might end up making it 3-4x its size, which obviously
I don't want to do.
If Solr managed to take away this pain, it would be the ideal scalable solution for all search
applications and analytic applications that have multiple large data sets with limits to
In my case, I know the data very well and have a good grip on the combinations of facets needed
to configure a distributed system if it just allowed joins with true sharding.

I really think that adding this will bring lots of distributed computing use cases into
the ambit of Solr. There's no telling the amount of effort it would save for people like me,
and it would spare everybody from devising their own distributed computing management scheme when
a common one could solve it for all.

Let me know if this sounds like a reasonable use case. Besides my own use case, I'm sure there
are a lot of people who probably don't use Solr because of this missing feature.

PS: Sorry for getting carried away and for the long mail ;-(

> Support joining in a distributed environment.
> ---------------------------------------------
>                 Key: LUCENE-3759
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/join
>            Reporter: Martijn van Groningen
> Add two more methods in JoinUtil to support joining in a distributed manner.
> * Method to retrieve all from values.
> * Method to create a TermsQuery based on a set of from terms.
> With these two methods distributed joining can be supported following these steps:
> # Retrieve from values from each shard
> # Merge the retrieved from values. 
> # Create a TermsQuery based on the merged from terms and send this query to all shards.
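
The three steps quoted above can be sketched roughly as follows. This is a plain-Python illustration under assumptions, not the proposed JoinUtil methods: shards are hypothetical in-memory lists of documents, and `retrieve_from_values`, `make_terms_query` and `distributed_join` are made-up names that only mirror the steps.

```python
def retrieve_from_values(shard, from_field, matches):
    """Step 1: collect the 'from' field values of matching documents in one shard."""
    return {doc[from_field] for doc in shard if matches(doc)}

def make_terms_query(to_field, terms):
    """Build a predicate that plays the role of a terms query on the 'to' field."""
    frozen = frozenset(terms)
    return lambda doc: doc.get(to_field) in frozen

def distributed_join(from_shards, to_shards, from_field, to_field, matches):
    # Step 2: merge the 'from' values retrieved from every shard.
    merged = set()
    for shard in from_shards:
        merged |= retrieve_from_values(shard, from_field, matches)
    # Step 3: build one query from the merged terms and run it on all shards.
    query = make_terms_query(to_field, merged)
    return [doc for shard in to_shards for doc in shard if query(doc)]

# Hypothetical data: authors and books, each split over two shards.
AUTHOR_SHARDS = [
    [{"author_id": "a1", "country": "NL"}],
    [{"author_id": "a2", "country": "US"}, {"author_id": "a3", "country": "NL"}],
]
BOOK_SHARDS = [
    [{"book_id": "b1", "author": "a1"}, {"book_id": "b2", "author": "a2"}],
    [{"book_id": "b3", "author": "a3"}],
]

if __name__ == "__main__":
    hits = distributed_join(
        AUTHOR_SHARDS, BOOK_SHARDS, "author_id", "author",
        lambda doc: doc["country"] == "NL",
    )
    print([doc["book_id"] for doc in hits])
```

The merged term set has to be shipped to every 'to' shard, so in a sketch like this the cost of the join grows with the number of distinct 'from' values.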
