lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: strange problem of PForDelta decoder
Date Mon, 03 Jan 2011 20:03:50 GMT
Here's the paper:

    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.156.8091

I haven't read it yet...

In general I don't like tying concurrency w/in a single search to
index segments; I'd rather they be (relatively?) independent.  EG an
optimized index would then force single thread queries, which is
crazy...

In Lucene it should be easy to chop up the maxDoc(), ie if I have 10M
docs and I want to use 4 threads I just give each thread a chunk of
2.5M docs.  Each thread would first skip to its start doc, and then go
from there.

I'm not sure we'd want to share a single collector vs private
collector per thread and then merge in the end (this is a similar
tradeoff as we've discussed on the per-segment collectors).

Mike

2011/1/1 Li Li <fancyerii@gmail.com>:
>   I sent a mail to MG4J group and Sebastiano Vigna recommended the paper
> Reducing query latencies in web search using fine-grained parallelism.
> World Wide Web, 12(4):441-460, 2009.
>   I read it roughly. But there are some questions
>     it says:
>    it first coalesces all disk reads in a single process, and then
> distributes the index data among
> the parallel threads. So when the index server receives a query, it
> loads from storage (disk or cache)
> the required posting lists, initializes the query execution, and then
> spawns lightweight threads, one per core.
> Each thread receives an equal-sized subset of document IDs to scan,
> together covering the entire index
> partition. All threads execute the same code on the same query, but
> with private data structures.
> The only writable shared data structure is a heap of the top-scoring
> hits, protected by a lock. At the end of the threads'
> execution, the heap contains the highest-scoring hits in the entire
> index partition, which is then transmitted to the query
> integrator as usual. Since the index contains skip-lists that permit
> near-random-access to any document ID, and since
> hits are generally distributed evenly in the index, our measurements
> show that all threads complete their work with
> little variability, within 1-2% of each other.
>
>  I have some questions
>  1.Since the index contains skip-lists that permit
> near-random-access to any document ID
>   skip list can be near random access?(especially when it's in hard disk)
>
>  2.For "or query", it's easy. e.g. we search "search" and "engine",
> we using one main thread to get postings the
> the 2 terms and divide their postings into 2 groups(e.g. even docIds
> and odd docIds) and using 2 threads to score
> them and finally merge it(or using locked priority queue)
>     but for "and query", we usally don't decode all the docIDs
> because we can skip many documents. especially when
> searching low frequent terms with high frequent terms. only a small
> number of docIDs of high frequent term is decoded.
>
>   btw. I think or query is often much slower than and query. If we can
> parallel or query well, it's also very useful.
>
> On Dec 31, 2010, at 7:25 AM, Li Li wrote:
>
>>    Which one is used in MG4J to support multithreads searching? Are
>
> 2010/12/31 Li Li <fancyerii@gmail.com>:
>> is there anyone familiar with MG4J(http://mg4j.dsi.unimi.it/)
>> it says Multithreading. Indices can be queried and scored concurrently.
>> maybe we can learn something from it.
>>
>> 2010/12/31 Li Li <fancyerii@gmail.com>:
>>> plus
>>> 2 means search a term need seek many times for tis(if it's not cached in tii)
>>>
>>> 2010/12/31 Li Li <fancyerii@gmail.com>:
>>>> searching multi segments is a alternative solution but it has some
>>>> disadvantages.
>>>> 1. idf is not global?(I am not familiar with its implementation) maybe
>>>> it's easy to solve it by share global idf
>>>> 2. each segments will has it's own tii and tis files, which may make
>>>> search slower(that's why optimization of
>>>> index is neccessary)
>>>> 3. one term's  docList is distributed in many files rather than one.
>>>> more than one frq files means
>>>> hard disk must seek different tracks, it's time consuming. if there is
>>>> only one segment, the are likely
>>>> stored in a single track.
>>>>
>>>>
>>>> 2010/12/31 Earwin Burrfoot <earwin@gmail.com>:
>>>>>>>>until we fix Lucene to run a single search concurrently (which
we
>>>>>>>>badly need to do).
>>>>>> I am interested in this idea.(I have posted it before) do you have
some
>>>>>> resources such as papers or tech articles about it?
>>>>>> I have tried but it need to modify index format dramatically and
we use
>>>>>> solr distributed search to relieve the problem of response time.
so finally
>>>>>> give it up.
>>>>>> lucene4's index format is more flexible that it supports customed
codecs
>>>>>> and it's now on development, I think it's good time to take it into
>>>>>> consideration
>>>>>> that let it support multithread searching for a single query.
>>>>>> I have a naive solution. dividing docList into many groups
>>>>>> e.g grouping docIds by it's even or odd
>>>>>> term1 df1=4  docList =  0  4  8  10
>>>>>> term1 df2=4  docList = 1  3  9  11
>>>>>>
>>>>>> term2 df1=4  docList = 0  6  8  12
>>>>>> term2 df2=4  docList = 3  9  11 15
>>>>>>   then we can use 2 threads to search topN docs on even group and
odd group
>>>>>> and finally merge their results into a single on just like solr
>>>>>> distributed search.
>>>>>> But it's better than solr distributed search.
>>>>>>   First, it's in a single process and data communication between
>>>>>> threads is much
>>>>>> faster than network.
>>>>>>   Second, each threads process the same number of documents.For
solr
>>>>>> distributed
>>>>>> search, one shard may process 7 documents and another shard may 1
document
>>>>>> Even if we can make each shard have the same document number. we
can not
>>>>>> make it uniformly for each term.
>>>>>>    e.g. shard1 has doc1 doc2
>>>>>>           shard2 has doc3 doc4
>>>>>>    but term1 may only occur in doc1 and doc2
>>>>>>    while term2 may only occur in doc3 and doc4
>>>>>>    we may modify it
>>>>>>           shard1 doc1 doc3
>>>>>>           shard2 doc2 doc4
>>>>>>    it's good for term1 and term2
>>>>>>    but term3 may occur in doc1 and doc3...
>>>>>>    So I think it's fine-grained distributed in index while solr
>>>>>> distributed search is coarse-
>>>>>> grained.
>>>>> This is just crazy :)
>>>>>
>>>>> The simple way is just to search different segments in parallel.
>>>>> BalancedSegmentMergePolicy makes sure you have roughly even-sized
>>>>> large segments (and small ones don't count, they're small!).
>>>>> If you're bound on squeezing out that extra millisecond (and making
>>>>> your life miserable along the way), you can search a single segment
>>>>> with multiple threads (by dividing it in even chunks, and then doing
>>>>> skipTo to position your iterators to the beginning of each chunk).
>>>>>
>>>>> First approach is really easy to implement. Second one is harder, but
>>>>> still doesn't require you to cook the number of CPU cores available
>>>>> into your index!
>>>>>
>>>>> It's the law of diminishing returns at play here. You're most likely
>>>>> to search in parallel over mostly memory-resident index
>>>>> (RAMDir/mmap/filesys cache - doesn't matter), as most of IO subsystems
>>>>> tend to slow down considerably on parallel sequential reads, so you
>>>>> already have pretty decent speed.
>>>>> Searching different segments in parallel (with BSMP) makes you several
>>>>> times faster.
>>>>> Searching in parallel within a segment requires some weird hacks, but
>>>>> has maybe a few percent advantage over previous solution.
>>>>> Sharding posting lists requires a great deal of weird hacks, makes
>>>>> index machine-bound, and boosts speed by another couple of percent.
>>>>> Sounds worthless.
>>>>>
>>>>> --
>>>>> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
>>>>> Phone: +7 (495) 683-567-4
>>>>> ICQ: 104465785
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>
>>>>>
>>>>
>>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message