hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrzej Bialecki ...@getopt.org>
Subject Re: distributed search
Date Tue, 05 Dec 2006 09:29:25 GMT
Dennis Kubes wrote:
> All,
> We have two version of a type of index splitter.  The first version 
> would run an indexing job and then using the completed index as input 
> would read the number of documents in the index and take a requested 
> split size.  From this it used a custom index input format to create 
> splits according to document id.   We would run a job that would map 
> out index urls as keys and documents with their ids wrapped in a 
> SerializableWritable object as the values.  Then inside of a second 
> job using the index as input we would have a MapRunner that would read 
> the other supporting databases (linkdb, segments) and map all objects 
> as ObjectWritables.  Then on the reduce we had a custom Output and 
> OutputFormat that took all of the objects and wrote out the databases 
> and indexes into each split.
> There was a problem with this first approach though in that writing 
> out an index from a previously serialized document would lose any 
> fields that are not stored (which is most).  So we went with a second 
> approach.

Yes, that's the nature of the beast - I sort of hoped that you 
implemented a true splitter, which directly splits term posting lists 
according to doc id. This is possible, and doesn't require using stored 
fields - the only problem being that someone well acquainted with Lucene 
internals needs to write it .. ;)

> The second approach takes a number of splits and runs through an 
> indexing job on the fly.  It calls the indexing and scoring filters.  
> It uses the linkdb, crawldb, and segments as input.  As it indexes is 
> also splits the databases and indexes into the number of reduce tasks 
> so that the final output is multiple splits each hold a part of the 
> index and its supporting databases.  Each of  the databases holds only 
> the information for the urls that are in its part of the index.  These 
> parts can then be pushed to separate search servers.  This type of 
> splitting works well but you can NOT define a specific number of 
> documents or urls per split and sometimes one split will have alot 
> more urls than another if you

Why? it depends on the partitioner, or the output format that you are 
using. E.g. in SegmentMerger I implemented an OutputFormat that produces 
several segments simultaneously, writing to their respective parts 
depending on a value of metadata. This value may be used to switch 
sequentially between output "slices" so that urls are spread evenly 
across the "slices".

> are indexing some sites that have alot of pages (i.e. wikipedia or cnn 
> archives).  This is currently how our system works.  We fetch, invert 
> links, run through some other processes, and then index and split on 
> the fly.  Then we use python scripts to pull each split directly from 
> the DFS to each search server and then start the search servers.
> We are still working on the splitter because the ideal approach would 
> be to be able to specify a number of documents per split as well as to 
> group by different keys, not just url.  I would be happy to share the 
> current code but it is highly integrated so I would need to pull it 
> out of our code base first.  It would be best if I could send it to 
> someone, say Andrzej, to take a look at first.

Unless I'm missing something, SegmentMerger.SegmentOutputFormat should 
satisfy these requirements, you would just need to modify the job 
implementation ...

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

View raw message