lucene-java-user mailing list archives

From Yuval Feinstein <>
Subject RE: A question bout google search index?
Date Thu, 10 Jun 2010 08:51:56 GMT
Most of the implementation of Google's search index is kept secret by Google.
Based on publicly available information, the two indexes are quite different:
Google uses its BigTable and MapReduce technologies to distribute the index efficiently.
There are similar efforts in the Lucene ecosystem; Solr Cloud is an advanced one
that is currently in development.
As Google's scoring algorithm uses hundreds of signals, I guess they store data pertinent
to these signals in the index.
Lucene's index holds relatively few pieces of information about each document (posting lists,
term vectors, and sometimes norms and payloads).
I believe there are other differences as well,
but one can only guess what they are...
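
To make the "posting lists" mentioned above a bit more concrete, here is a minimal toy sketch in Python. It is not Lucene's API or on-disk format; the class name TinyInvertedIndex and its methods are made up for illustration. It only shows the basic idea of mapping each term to the documents (and positions) where it occurs, plus a per-document length, which is roughly the kind of input a length norm is derived from.

```python
from collections import defaultdict


class TinyInvertedIndex:
    """Toy in-memory index (hypothetical, not Lucene): term -> {doc_id: [positions]}."""

    def __init__(self):
        # Posting lists: for each term, the documents containing it
        # and the token positions at which it occurs.
        self.postings = defaultdict(dict)
        # Token count per document; a crude stand-in for the data
        # behind a length norm.
        self.doc_lengths = {}

    def add_document(self, doc_id, text):
        # Naive analysis: lowercase and split on whitespace.
        tokens = text.lower().split()
        self.doc_lengths[doc_id] = len(tokens)
        for position, term in enumerate(tokens):
            self.postings[term].setdefault(doc_id, []).append(position)

    def search(self, term):
        """Return the sorted ids of documents containing the term."""
        return sorted(self.postings.get(term.lower(), {}))


index = TinyInvertedIndex()
index.add_document(1, "Lucene builds an inverted index")
index.add_document(2, "Caffeine updates the index continuously")
hits = index.search("index")
print(hits)
```

In this toy version a document is searchable as soon as add_document returns. A real index has to balance that freshness against segment writes, merges and reader reopening, which is where much of the engineering effort goes.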

-----Original Message-----
From: luocanrao [] 
Sent: Wednesday, June 09, 2010 5:18 PM
Subject: A question bout google search index?

Here is a news item about the Google search index. Lucene's index system can also
support real-time search.

Is there some difference between them?


With Caffeine, we analyze the web in small portions and update our search
index on a continuous basis, globally. As we find new pages, or new
information on existing pages, we can add these straight to the index. That
means you can find fresher information than ever before-no matter when or
where it was published.


Caffeine lets us index web pages on an enormous scale. In fact, every second
Caffeine processes hundreds of thousands of pages in parallel. If this were
a pile of paper it would grow three miles taller every second. Caffeine
takes up nearly 100 million gigabytes of storage in one database and adds
new information at a rate of hundreds of thousands of gigabytes per day. You
would need 625,000 of the largest iPods to store that much information; if
these were stacked end-to-end they would go for more than 40 miles.
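
As a quick sanity check on the figures quoted above, the arithmetic can be worked through in a few lines of Python. This is only a back-of-the-envelope sketch: it treats "nearly 100 million gigabytes" and "more than 40 miles" as exact values, and the variable names are made up for illustration.

```python
# Figures taken from the quoted text, treated as exact for the estimate.
total_storage_gb = 100_000_000   # "nearly 100 million gigabytes"
device_count = 625_000           # "625,000 of the largest iPods"
stack_miles = 40                 # "more than 40 miles"

METERS_PER_MILE = 1609.344

# Storage implied per device.
gb_per_device = total_storage_gb / device_count

# Length implied per device when stacked end-to-end, in centimeters.
cm_per_device = stack_miles * METERS_PER_MILE / device_count * 100

print(gb_per_device)
print(round(cm_per_device, 1))
```

Under these assumptions the numbers work out to 160 GB and roughly 10 cm per device, so the comparison in the quoted text is internally consistent.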
