hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: distributed search
Date Tue, 05 Dec 2006 16:49:21 GMT


Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> All,
>>
>> We have two version of a type of index splitter.  The first version 
>> would run an indexing job and then using the completed index as input 
>> would read the number of documents in the index and take a requested 
>> split size.  From this it used a custom index input format to create 
>> splits according to document id.   We would run a job that would map 
>> out index urls as keys and documents with their ids wrapped in a 
>> SerializableWritable object as the values.  Then inside of a second 
>> job using the index as input we would have a MapRunner that would 
>> read the other supporting databases (linkdb, segments) and map all 
>> objects as ObjectWritables.  Then on the reduce we had a custom 
>> Output and OutputFormat that took all of the objects and wrote out 
>> the databases and indexes into each split.
>>
>> There was a problem with this first approach though in that writing 
>> out an index from a previously serialized document would lose any 
>> fields that are not stored (which is most).  So we went with a second 
>> approach.
>
> Yes, that's the nature of the beast - I sort of hoped that you 
> implemented a true splitter, which directly splits term posting lists 
> according to doc id. This is possible, and doesn't require using 
> stored fields - the only problem being that someone well acquainted 
> with Lucene internals needs to write it .. ;)
If you can explain more about this we can start taking a look in this 
direction.  I would love to get this working by a type of doc id.
>
>>
>> The second approach takes a number of splits and runs through an 
>> indexing job on the fly.  It calls the indexing and scoring filters.  
>> It uses the linkdb, crawldb, and segments as input.  As it indexes is 
>> also splits the databases and indexes into the number of reduce tasks 
>> so that the final output is multiple splits each hold a part of the 
>> index and its supporting databases.  Each of  the databases holds 
>> only the information for the urls that are in its part of the index.  
>> These parts can then be pushed to separate search servers.  This type 
>> of splitting works well but you can NOT define a specific number of 
>> documents or urls per split and sometimes one split will have alot 
>> more urls than another if you
>
> Why? it depends on the partitioner, or the output format that you are 
> using. E.g. in SegmentMerger I implemented an OutputFormat that 
> produces several segments simultaneously, writing to their respective 
> parts depending on a value of metadata. This value may be used to 
> switch sequentially between output "slices" so that urls are spread 
> evenly across the "slices".
We were using the default HashPartitioner using urls as keys but yes we 
are currently looking into custom Partioners to solve the problem of too 
uneven distribution.  Correct me if I am wrong but I though two machines 
couldn't write to the same file at the same time.  For example if 
machine X is writing part-00001 then machine Y can't also write to 
part-00001 in the same job.
>
>> are indexing some sites that have alot of pages (i.e. wikipedia or 
>> cnn archives).  This is currently how our system works.  We fetch, 
>> invert links, run through some other processes, and then index and 
>> split on the fly.  Then we use python scripts to pull each split 
>> directly from the DFS to each search server and then start the search 
>> servers.
>>
>> We are still working on the splitter because the ideal approach would 
>> be to be able to specify a number of documents per split as well as 
>> to group by different keys, not just url.  I would be happy to share 
>> the current code but it is highly integrated so I would need to pull 
>> it out of our code base first.  It would be best if I could send it 
>> to someone, say Andrzej, to take a look at first.
>
> Unless I'm missing something, SegmentMerger.SegmentOutputFormat should 
> satisfy these requirements, you would just need to modify the job 
> implementation ...
 From what I am seeing the SegmentMerger only handles segments.  The 
splitter handle segments, linkdb, and does indexing on the fly.  The 
idea being that I didn't want to have to split and then index segments 
individually and I only wanted the linkdb and segments to hold the 
information they needed for each split.

Mime
View raw message