lucene-java-user mailing list archives

From Ian Lea
Subject Re: Is Lucene a good choice for PB scale mailbox search?
Date Thu, 26 Nov 2009 09:57:18 GMT
If you are planning on using Lucene only for searching then you don't
need to store much data at all - just the message id or whatever you
use to identify messages.  And there won't be much point in
compressing that.

If on the other hand you plan on storing data in Lucene, perhaps for
displaying hits on a web page, you might want to compress it.  That
will save some space but at the cost of some performance at indexing
and retrieval time.  If you are storing, say, From:, To: and Subject:
for display in search results, and the message body is only displayed
when the user opens the message, you could leave the first three
uncompressed and compress the message body.
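
To illustrate that split, here is a minimal sketch of compressing only the
message body before it is stored, using the JDK's java.util.zip rather than
any Lucene API.  The class and method names (BodyCompression, compress,
decompress) are hypothetical and not from this thread; the short header
fields would simply be stored as plain strings, and the byte[] returned by
compress() is what you would put in a stored binary field.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper: only the large message body goes through here.
// From:, To: and Subject: are short and would be stored uncompressed.
final class BodyCompression {

    /** Deflate-compress a message body for storage as a binary field. */
    static byte[] compress(String body) {
        byte[] input = body.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inflate a stored body again, e.g. when the user opens the message. */
    static String decompress(byte[] stored) {
        Inflater inflater = new Inflater();
        inflater.setInput(stored);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and no more input means the data is damaged.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or corrupt compressed body");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed body", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The CPU cost of compress() is paid once at indexing time and the cost of
decompress() only when a message is actually opened, which is why this
suits large fields with a low retrieval rate.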

Personally, I only use compression in indexes that store large fields
but have a low search/retrieval rate.  But my indexes are only a few GB in
size.

Lucene's handling of compressed fields is changing in 3.0 - see the
release notes or the 2.9 javadocs for Field.Store.COMPRESS


On Thu, Nov 26, 2009 at 1:34 AM, fulin tang wrote:
> Thanks all for the good suggestions!
> But any ideas about storage? How can we make the indexes as small as possible?
> We know compression is the only way, but when and where is it best to
> compress for search?
> Thanks all again!
> 2009/11/24 Kay Kay:
>> fulin tang wrote:
>>> We are going to add full-text search to our mailbox service.
>>> The problem is that we have more than 1 PB of mail there, and obviously
>>> we don't want to add another PB of storage for the search service, so we
>>> hope the index data will be small enough to store while search stays
>>> fast.
>>> Luckily, every user searches only their own mail, so
>>> we can split the data into a lot of indexes instead of keeping it in
>>> one big one.
>> If it is going to be sharded by the 'To' or 'Cc' list, then potentially the
>> mail information is going to be duplicated in proportion to the number of
>> people in an email thread. Selecting some other dimension, like time, for
>> sharding might be useful to begin with.
>>> So, after all these concerns, the question is: is Lucene a good
>>> choice for this? Or what is the right way to do this? Has anyone
>>> done this before?
>> With a PB of storage, check out Solr sharding / Katta for prior work in this
>> arena.
>>> All opinions and comments are welcome!
>>> fulin
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> --
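> The beginning of a dream struggles at the edge of the city
> The heart's far horizon persists in the instant of each step
> My fate has buried loneliness's forever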
