nutch-dev mailing list archives

From Gal Nitzan <gnit...@usa.net>
Subject Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db
Date Tue, 11 Oct 2005 23:36:32 GMT
Hi Otis,

I have only a few thousand URLs in my db at the moment. However, for
100K entries the cache should be about 600-800KB. I do not cache the URL
itself, only a hash string. So the next time a URL is looked up in the
cache, if the hash exists then it is allowed.
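
The lookup described above and in the quoted issue text below might be sketched roughly as follows. This is only an illustration under stated assumptions, not the plugin's code: the class name DomainFilterSketch and its members are hypothetical, a LinkedHashMap-based LRU stands in for SwarmCache, an in-memory Set stands in for the JDBC-backed table, and the URL's host is used as the "domain".

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a hash-only domain cache in front of a database. */
class DomainFilterSketch {
    // Stand-in for the database table of allowed domains (assumption).
    private final Set<String> allowedDomains;
    // LRU cache keyed by the host's hash; the host string itself is not stored.
    private final Map<Integer, Boolean> cache;
    // Counts how often the "database" is consulted, for illustration only.
    int dbLookups = 0;

    DomainFilterSketch(Set<String> allowedDomains, final int cacheSize) {
        this.allowedDomains = allowedDomains;
        // Access-ordered map that evicts the least recently used entry.
        this.cache = new LinkedHashMap<Integer, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Returns the URL if its host is allowed, otherwise null. */
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // unparsable URL: reject
        }
        if (host == null) {
            return null;
        }
        // A 32-bit hash can collide, so a colliding host could be let through;
        // a real implementation would likely use a stronger hash.
        Integer key = host.toLowerCase(Locale.ROOT).hashCode();
        if (cache.containsKey(key)) {
            return url; // hash is in the cache: allowed
        }
        dbLookups++; // cache miss: try the database
        if (allowedDomains.contains(host.toLowerCase(Locale.ROOT))) {
            cache.put(key, Boolean.TRUE); // found: cache the hash and allow
            return url;
        }
        return null; // not in the database: filtered out
    }
}
```

In this sketch, hosts that are not in the database are never cached, so each URL from a disallowed host triggers another database lookup. As plain arithmetic, 600-800KB for 100K entries works out to roughly 6-8 bytes per cached entry.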

Regards,

Gal

ogjunk-nutch@yahoo.com wrote:
> Hi Gal,
>
> I'm curious about the memory consumption of the cache and the speed of
> retrieval of an item from the cache, when the cache has 100k domains in
> it.
>
> Thanks,
> Otis
>
>
> --- Gal Nitzan <gnitzan@usa.net> wrote:
>
>   
>> Hi Michael,
>>
>> At the moment I have about 3000 domains in my db. I didn't time the
>> performance; however, even 100k domains shouldn't have an impact,
>> since the data is fetched only once from the database into the cache.
>> A small performance hit should appear only above 100k (depending on
>> the number of elements defined in the xml file).
>>
>> After a few teething problems, the plugin works nicely and I do not
>> notice any impact.
>>
>> Regards,
>>
>> Gal
>>
>>
>> Michael Ji wrote:
>>     
>>> Hi,
>>>
>>> Is performance a concern if the domain list
>>> reaches 10,000 entries?
>>>
>>> Michael Ji
>>>
>>> --- "Gal Nitzan (JIRA)" <jira@apache.org> wrote:
>>>
>>>>      [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
>>>>
>>>> Gal Nitzan updated NUTCH-100:
>>>> -----------------------------
>>>>
>>>>            type: Improvement  (was: New Feature)
>>>>     Description: 
>>>> Hi,
>>>>
>>>> I have written a new plugin, based on the URLFilter
>>>> interface: urlfilter-db .
>>>>
>>>> The purpose of this plugin is to filter domains,
>>>> i.e. I would like to crawl the world but to fetch
>>>> only certain domains.
>>>>
>>>> The plugin uses a caching system (SwarmCache, easier
>>>> to deploy than JCS) and on the back-end a database.
>>>>
>>>> For each url
>>>>    filter is called
>>>> end for
>>>>
>>>> filter
>>>>  get the domain name from url
>>>>   call cache.get domain
>>>>   if not in cache try the database
>>>>   if in database cache it and return it
>>>>   return null
>>>> end filter
>>>>
>>>>
>>>> The plugin reads the cache size, jdbc driver,
>>>> connection string, table to use and domain field
>>>> from nutch-site.xml
>>>>
>>>>
>>>>   was:
>>>> Hi,
>>>>
>>>> I have written (not much) a new plugin, based on the
>>>> URLFilter interface: urlfilter-db .
>>>>
>>>> The purpose of this plugin is to filter domains,
>>>> i.e. I would like to crawl the world but to fetch
>>>> only certain domains.
>>>>
>>>> The plugin uses a caching system (SwarmCache, easier
>>>> to deploy than JCS) and on the back-end a database.
>>>>
>>>> For each url
>>>>    filter is called
>>>> end for
>>>>
>>>> filter
>>>>  get the domain name from url
>>>>   call cache.get domain
>>>>   if not in cache try the database
>>>>   if in database cache it and return it
>>>>   return null
>>>> end filter
>>>>
>>>>
>>>> The plugin reads the cache size, jdbc driver,
>>>> connection string, table to use and domain field
>>>> from nutch-site.xml
>>>>
>>>>
>>>>     Environment: All Nutch versions  (was: MapRed)
>>>>
>>>> Fixed some issues
>>>> clean up
>>>> Added a patch for Subversion
>>>>
>>>>> New plugin urlfilter-db
>>>>> -----------------------
>>>>>
>>>>>          Key: NUTCH-100
>>>>>          URL: http://issues.apache.org/jira/browse/NUTCH-100
>>>>>      Project: Nutch
>>>>>         Type: Improvement
>>>>>   Components: fetcher
>>>>>     Versions: 0.8-dev
>>>>>  Environment: All Nutch versions
>>>>>     Reporter: Gal Nitzan
>>>>>     Priority: Trivial
>>>>>  Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz
>>>>>
>>>>> Hi,
>>>>> I have written a new plugin, based on the URLFilter interface:
>>>>> urlfilter-db .
>>>>> The purpose of this plugin is to filter domains, i.e. I would like to
>>>>> crawl the world but to fetch only certain domains.
>>>>> The plugin uses a caching system (SwarmCache, easier to deploy than
>>>>> JCS) and on the back-end a database.
>>>>> For each url
>>>>>    filter is called
>>>>> end for
>>>>> filter
>>>>>  get the domain name from url
>>>>>   call cache.get domain
>>>>>   if not in cache try the database
>>>>>   if in database cache it and return it
>>>>>   return null
>>>>> end filter
>>>>> The plugin reads the cache size, jdbc driver, connection string, table
>>>>> to use and domain field from nutch-site.xml
>>>>
>>>> -- 
>>>> This message is automatically generated by JIRA.
>>>> -
>>>> If you think it was sent incorrectly contact one of
>>>> the administrators:
>>>>    http://issues.apache.org/jira/secure/Administrators.jspa
>>>> -
>>>> For more information on JIRA, see:
>>>>    http://www.atlassian.com/software/jira
>>>>
>>>>
>>>>     
>>>>         
>>>
>>>
>>>   
>>>       
>>
>>     
>
>
> .
>
>   


