lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: strange problem of PForDelta decoder
Date Mon, 03 Jan 2011 20:03:50 GMT
Here's the paper:

I haven't read it yet...

In general I don't like tying concurrency w/in a single search to
index segments; I'd rather they be (relatively?) independent.  EG an
optimized index would then force single thread queries, which is

In Lucene it should be easy to chop up the maxDoc(), ie if I have 10M
docs and I want to use 4 threads I just give each thread a chunk of
2.5M docs.  Each thread would first skip to its start doc, and then go
from there.

I'm not sure we'd want to share a single collector vs private
collector per thread and then merge in the end (this is a similar
tradeoff as we've discussed on the per-segment collectors).


2011/1/1 Li Li <>:
>   I sent a mail to MG4J group and Sebastiano Vigna recommended the paper
> Reducing query latencies in web search using fine-grained parallelism.
> World Wide Web, 12(4):441-460, 2009.
>   I read it roughly. But there are some questions
>     it says:
>    it first coalesces all disk reads in a single process, and then
> distributes the index data among
> the parallel threads. So when the index server receives a query, it
> loads from storage (disk or cache)
> the required posting lists, initializes the query execution, and then
> spawns lightweight threads, one per core.
> Each thread receives an equal-sized subset of document IDs to scan,
> together covering the entire index
> partition. All threads execute the same code on the same query, but
> with private data structures.
> The only writable shared data structure is a heap of the top-scoring
> hits, protected by a lock. At the end of the threads'
> execution, the heap contains the highest-scoring hits in the entire
> index partition, which is then transmitted to the query
> integrator as usual. Since the index contains skip-lists that permit
> near-random-access to any document ID, and since
> hits are generally distributed evenly in the index, our measurements
> show that all threads complete their work with
> little variability, within 1-2% of each other.
>  I have some questions
>  1.Since the index contains skip-lists that permit
> near-random-access to any document ID
>   skip list can be near random access?(especially when it's in hard disk)
>  2.For "or query", it's easy. e.g. we search "search" and "engine",
> we using one main thread to get postings the
> the 2 terms and divide their postings into 2 groups(e.g. even docIds
> and odd docIds) and using 2 threads to score
> them and finally merge it(or using locked priority queue)
>     but for "and query", we usally don't decode all the docIDs
> because we can skip many documents. especially when
> searching low frequent terms with high frequent terms. only a small
> number of docIDs of high frequent term is decoded.
>   btw. I think or query is often much slower than and query. If we can
> parallel or query well, it's also very useful.
> On Dec 31, 2010, at 7:25 AM, Li Li wrote:
>>    Which one is used in MG4J to support multithreads searching? Are
> 2010/12/31 Li Li <>:
>> is there anyone familiar with MG4J(
>> it says Multithreading. Indices can be queried and scored concurrently.
>> maybe we can learn something from it.
>> 2010/12/31 Li Li <>:
>>> plus
>>> 2 means search a term need seek many times for tis(if it's not cached in tii)
>>> 2010/12/31 Li Li <>:
>>>> searching multi segments is a alternative solution but it has some
>>>> disadvantages.
>>>> 1. idf is not global?(I am not familiar with its implementation) maybe
>>>> it's easy to solve it by share global idf
>>>> 2. each segments will has it's own tii and tis files, which may make
>>>> search slower(that's why optimization of
>>>> index is neccessary)
>>>> 3. one term's  docList is distributed in many files rather than one.
>>>> more than one frq files means
>>>> hard disk must seek different tracks, it's time consuming. if there is
>>>> only one segment, the are likely
>>>> stored in a single track.
>>>> 2010/12/31 Earwin Burrfoot <>:
>>>>>>>>until we fix Lucene to run a single search concurrently (which
>>>>>>>>badly need to do).
>>>>>> I am interested in this idea.(I have posted it before) do you have
>>>>>> resources such as papers or tech articles about it?
>>>>>> I have tried but it need to modify index format dramatically and
we use
>>>>>> solr distributed search to relieve the problem of response time.
so finally
>>>>>> give it up.
>>>>>> lucene4's index format is more flexible that it supports customed
>>>>>> and it's now on development, I think it's good time to take it into
>>>>>> consideration
>>>>>> that let it support multithread searching for a single query.
>>>>>> I have a naive solution. dividing docList into many groups
>>>>>> e.g grouping docIds by it's even or odd
>>>>>> term1 df1=4  docList =  0  4  8  10
>>>>>> term1 df2=4  docList = 1  3  9  11
>>>>>> term2 df1=4  docList = 0  6  8  12
>>>>>> term2 df2=4  docList = 3  9  11 15
>>>>>>   then we can use 2 threads to search topN docs on even group and
odd group
>>>>>> and finally merge their results into a single on just like solr
>>>>>> distributed search.
>>>>>> But it's better than solr distributed search.
>>>>>>   First, it's in a single process and data communication between
>>>>>> threads is much
>>>>>> faster than network.
>>>>>>   Second, each threads process the same number of documents.For
>>>>>> distributed
>>>>>> search, one shard may process 7 documents and another shard may 1
>>>>>> Even if we can make each shard have the same document number. we
can not
>>>>>> make it uniformly for each term.
>>>>>>    e.g. shard1 has doc1 doc2
>>>>>>           shard2 has doc3 doc4
>>>>>>    but term1 may only occur in doc1 and doc2
>>>>>>    while term2 may only occur in doc3 and doc4
>>>>>>    we may modify it
>>>>>>           shard1 doc1 doc3
>>>>>>           shard2 doc2 doc4
>>>>>>    it's good for term1 and term2
>>>>>>    but term3 may occur in doc1 and doc3...
>>>>>>    So I think it's fine-grained distributed in index while solr
>>>>>> distributed search is coarse-
>>>>>> grained.
>>>>> This is just crazy :)
>>>>> The simple way is just to search different segments in parallel.
>>>>> BalancedSegmentMergePolicy makes sure you have roughly even-sized
>>>>> large segments (and small ones don't count, they're small!).
>>>>> If you're bound on squeezing out that extra millisecond (and making
>>>>> your life miserable along the way), you can search a single segment
>>>>> with multiple threads (by dividing it in even chunks, and then doing
>>>>> skipTo to position your iterators to the beginning of each chunk).
>>>>> First approach is really easy to implement. Second one is harder, but
>>>>> still doesn't require you to cook the number of CPU cores available
>>>>> into your index!
>>>>> It's the law of diminishing returns at play here. You're most likely
>>>>> to search in parallel over mostly memory-resident index
>>>>> (RAMDir/mmap/filesys cache - doesn't matter), as most of IO subsystems
>>>>> tend to slow down considerably on parallel sequential reads, so you
>>>>> already have pretty decent speed.
>>>>> Searching different segments in parallel (with BSMP) makes you several
>>>>> times faster.
>>>>> Searching in parallel within a segment requires some weird hacks, but
>>>>> has maybe a few percent advantage over previous solution.
>>>>> Sharding posting lists requires a great deal of weird hacks, makes
>>>>> index machine-bound, and boosts speed by another couple of percent.
>>>>> Sounds worthless.
>>>>> --
>>>>> Kirill Zakharenko/Кирилл Захаренко (
>>>>> Phone: +7 (495) 683-567-4
>>>>> ICQ: 104465785
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail:
>>>>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message