hadoop-common-dev mailing list archives

From Chad Walters <c...@powerset.com>
Subject Re: distributed search
Date Tue, 05 Dec 2006 01:46:43 GMT

You should consider splitting based on a function that will give you a more
uniform distribution (e.g.: MD5(url)). That way, you should see much less of
a variation in the number of documents per partition.
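[Editor's note: a minimal sketch of the MD5-based splitting Chad suggests, in Python rather than the thread's Java/Hadoop code; the function name, partition count, and example urls are invented for illustration.]

```python
import hashlib

def md5_partition(url: str, num_partitions: int) -> int:
    # Hash the url and map the digest onto a partition index.
    # MD5 spreads urls nearly uniformly, so pages from one large
    # site do not all clump into the same partition the way they
    # would if you split on the raw url or hostname.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# 1000 pages from a single large site still spread across partitions:
urls = ["http://example.com/page/%d" % i for i in range(1000)]
counts = [0] * 4
for u in urls:
    counts[md5_partition(u, 4)] += 1
# every partition receives a share of the urls
```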


On 12/4/06 3:29 PM, "Dennis Kubes" <nutch-dev@dragonflymc.com> wrote:

> All,
> We have two versions of an index splitter.  The first version
> would run an indexing job and then using the completed index as input
> would read the number of documents in the index and take a requested
> split size.  From this it used a custom index input format to create
> splits according to document id.  We would run a job that would map out
> index urls as keys and documents with their ids wrapped in a
> SerializableWritable object as the values.  Then inside of a second job
> using the index as input we would have a MapRunner that would read the
> other supporting databases (linkdb, segments) and map all objects as
> ObjectWritables.  Then on the reduce we had a custom Output and
> OutputFormat that took all of the objects and wrote out the databases
> and indexes into each split.
> There was a problem with this first approach though in that writing out
> an index from a previously serialized document would lose any fields
> that are not stored (which is most).  So we went with a second approach.
> The second approach takes a number of splits and runs through an
> indexing job on the fly.  It calls the indexing and scoring filters.  It
> uses the linkdb, crawldb, and segments as input.  As it indexes, it also
> splits the databases and indexes across the number of reduce tasks, so that
> the final output is multiple splits, each holding a part of the index and
> its supporting databases.  Each of the databases holds only the
> information for the urls that are in its part of the index.  These parts
> can then be pushed to separate search servers.  This type of splitting
> works well, but you can NOT define a specific number of documents or urls
> per split, and sometimes one split will have a lot more urls than another
> if you are indexing sites that have a lot of pages (e.g. wikipedia
> or cnn archives).  This is currently how our system works.  We fetch,
> invert links, run through some other processes, and then index and split
> on the fly.  Then we use python scripts to pull each split directly from
> the DFS to each search server and then start the search servers.
> We are still working on the splitter because the ideal approach would
> let us specify a number of documents per split as well as group by
> different keys, not just url.  I would be happy to share the
> current code but it is highly integrated so I would need to pull it out
> of our code base first.  It would be best if I could send it to someone,
> say Andrzej, to take a look at first.
> Dennis
> Andrzej Bialecki wrote:
>> Dennis Kubes wrote:
>> [...]
>>> Having a new index on each machine and having to create separate
>>> indexes is not the most elegant way to accomplish this architecture.
>>> The best way that we have found is to have a splitter job that
>>> indexes and splits the index and
>> Have you implemented a Lucene index splitter, i.e. a tool that takes
>> an existing Lucene index and splits it into parts by document id? This
>> sounds very interesting - could you tell us a bit about this?
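[Editor's note: a hedged sketch of the co-partitioning idea Dennis describes, not his actual code (which is Java/Hadoop based). The point it illustrates: key every record — index document, linkdb entry, segment data — by url and route them all through the same hash, so each split ends up with an index slice plus exactly the supporting-database entries for its own urls. All names here are invented for illustration.]

```python
import hashlib
from collections import defaultdict

def reduce_task_for(url, num_reducers):
    # The same url must always land on the same reduce task, no matter
    # which database the record came from, so one shared hash decides.
    return int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16) % num_reducers

def split_records(records, num_reducers):
    # records: iterable of (url, payload) pairs drawn from the index,
    # linkdb, and segments.  Returns one bucket per reduce task; each
    # bucket holds a part of the index plus its supporting data, ready
    # to be pushed to a separate search server.
    buckets = defaultdict(list)
    for url, payload in records:
        buckets[reduce_task_for(url, num_reducers)].append((url, payload))
    return buckets

records = [("http://a/1", "doc"), ("http://a/1", "inlinks"),
           ("http://b/2", "doc"), ("http://b/2", "inlinks")]
splits = split_records(records, 2)
# all records for a given url end up in the same split
```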
