lucene-solr-user mailing list archives

From Michael Della Bitta
Subject Re: Storing tweets For WC2014
Date Fri, 16 May 2014 17:13:58 GMT
Some of the data providers for Twitter offer a search API. Depending on
what you're doing, you might not even need to host this yourself.

My company does search and analytics over tweets, but by the time we index
them, we've winnowed the set down to 10% of what we initially ingested. That
ingested set is itself only a fraction of all tweets, since our data provider
lets us filter for the ones that contain the keywords we want.

Our news index comes within an order of magnitude of the size you're talking
about (where 'news' is really an index of sentences taken from news reports,
along with metadata about the document each sentence came from). Overall,
we're hosting about 310 million records (give or take, depending on where we
are in the sharding cycle) in a cluster of 5 AWS i2.xlarge boxes.
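
As a rough sanity check on those numbers, here is a back-of-envelope sketch in
Python. It uses only figures from this thread (100 million tweets/day for 60
days, 310 million records on 5 boxes, a 10% keep rate). The linear scaling
rule, and the idea that your data would filter down the way ours does, are
assumptions for illustration only, and the variable names are made up.

```python
import math

# Figures quoted in this thread.
TWEETS_PER_DAY = 100_000_000   # from the original question
EVENT_DAYS = 60                # from the original question
OUR_RECORDS = 310_000_000      # our news index, approximate
OUR_NODES = 5                  # AWS i2.xlarge boxes
KEEP_FRACTION = 0.10           # share of ingested tweets we end up indexing

total_tweets = TWEETS_PER_DAY * EVENT_DAYS
records_per_node = OUR_RECORDS / OUR_NODES
size_ratio = total_tweets / OUR_RECORDS

# ASSUMPTION: node count grows linearly with record count, and a tweet
# costs about as much to host as one of our news sentences. Neither is
# measured here; real sizing depends on schema, facets, and query load.
naive_nodes = math.ceil(total_tweets / records_per_node)

# ASSUMPTION: the same 10% keep rate applies to the World Cup data.
filtered_tweets = total_tweets * KEEP_FRACTION
filtered_nodes = math.ceil(filtered_tweets / records_per_node)

print(f"total tweets:           {total_tweets:,}")
print(f"ratio to our index:     {size_ratio:.1f}x")
print(f"nodes, everything kept: {naive_nodes}")
print(f"nodes, 10% kept:        {filtered_nodes}")
```

Under those assumptions the full set is 6 billion tweets, which works out to
roughly 97 boxes if everything is indexed and about 10 if only a tenth is
kept. Treat both as starting points for a load test, not as a hardware
recommendation.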

This setup indexes from our feeds in real time, which means there's no mass
loading. Additionally, we generally do bulk data collection across only 3
days of data, so if you're looking to do a mess of reporting against your
full set, take that into consideration.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions

On Fri, May 9, 2014 at 1:39 PM, Cool Techi wrote:

> Hi,
> We have a requirement from one of our customers to provide search and
> analytics on the upcoming Soccer World Cup. Given the sheer volume of
> tweets that would be generated at such an event, I cannot imagine what
> would be required to store this in Solr.
> It would be great if there could be some pointers on the scale or hardware
> required, the number of shards that should be created, etc. Some requirements:
> All the tweets should be searchable (approximately 100 million tweets/day
> * 60 days of event). All fields on tweets should be searchable, with
> faceting on numeric and date fields. Facets would be run on Twitter IDs
> (unique users), tweet created-on date, location, and sentiment (some fields
> which we generate).
> If anyone has attempted anything like this it would be helpful.
> Regards, Rohit
