hc-dev mailing list archives

From Li Li <fancye...@gmail.com>
Subject HttpAsyncClient as a spider
Date Tue, 22 Jul 2014 02:42:21 GMT
hi all,
    I have been using HttpComponents Client to crawl web pages. I need to
improve it by using the async client. What I want is something like:
   Queue<URL> needCrawlQueue;
   Queue<String[]> htmlQueue;

   HttpAsyncClient client;
   int maxConcurrent = 500;

   // when a URL finishes, get notified and run this callback:
   if (client.currentCrawlingCount < maxConcurrent) {
       URL url = needCrawlQueue.take();
       // request this URL
   }

   // when a URL finishes, get notified and run this callback;
   // String url, String html are the callback arguments:
   htmlQueue.put(new String[]{url, html});

    I mean I have an async client class which takes two queues.
    If the number of unfinished URLs is less than maxConcurrent, it takes
a URL from one queue and requests it. When a URL succeeds (or fails),
it adds the result to the other queue.
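The gating described above can be sketched independently of the HTTP library: a Semaphore caps the number of in-flight requests, and the client's completion callback releases a permit and enqueues the result. A minimal sketch under those assumptions, with a stand-in Fetcher interface in place of the real async client (the names CrawlDispatcher and Fetcher are illustrative, not part of HttpComponents):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.function.BiConsumer;

// Illustrative sketch: one dispatcher thread feeds URLs to an async
// fetcher, never allowing more than maxConcurrent requests in flight.
public class CrawlDispatcher {

    // Stand-in for the real async HTTP client: the callback fires
    // when a URL finishes (success or failure).
    interface Fetcher {
        void fetch(String url, BiConsumer<String, String> onDone);
    }

    private final BlockingQueue<String> needCrawlQueue;
    private final BlockingQueue<String[]> htmlQueue;
    private final Semaphore inFlight;   // permits = maxConcurrent
    private final Fetcher fetcher;

    CrawlDispatcher(BlockingQueue<String> in, BlockingQueue<String[]> out,
                    int maxConcurrent, Fetcher fetcher) {
        this.needCrawlQueue = in;
        this.htmlQueue = out;
        this.inFlight = new Semaphore(maxConcurrent);
        this.fetcher = fetcher;
    }

    // Dispatch up to maxUrls URLs; a real crawler would loop forever.
    void run(int maxUrls) throws InterruptedException {
        for (int i = 0; i < maxUrls; i++) {
            String url = needCrawlQueue.take();
            inFlight.acquire();              // block while 500 are pending
            fetcher.fetch(url, (u, html) -> {
                htmlQueue.add(new String[]{u, html});
                inFlight.release();          // slot freed: next URL may start
            });
        }
    }
}
```

Because the callback (not a blocking get on the Future) releases the permit, the single dispatcher thread is the only thread the caller owns; all fetching happens on the client's I/O reactor threads.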

------------------------------------

   I currently use 500 threads on a 4-CPU virtual machine. The load
average is about 7 and the context-switch count (from vmstat) is larger
than 4,000, so I want to give the async client a try. Can anyone help me?
I don't know how to use the async client; it only returns a Future, and
I am not familiar with that.
What I want is a class that takes URLs from one queue, fetches their
content, and puts the results on another queue.
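On the "it only returns a future" point: HttpAsyncClient also has an execute overload that takes a FutureCallback<HttpResponse>, whose completed/failed methods fire when the request finishes, so the Future itself never has to be blocked on. The same completion-callback idea can be shown with only the JDK, using CompletableFuture as a stand-in for the client call (the URL and HTML values here are stubs for illustration):

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

public class CallbackDemo {
    public static void main(String[] args) {
        Queue<String[]> htmlQueue = new ConcurrentLinkedQueue<>();

        // Stand-in for client.execute(...): completes on another thread.
        CompletableFuture<String> pageFuture =
            CompletableFuture.supplyAsync(() -> "<html>stub</html>");

        // Instead of blocking on pageFuture.get(), attach a callback
        // that enqueues the result when the fetch completes.
        pageFuture
            .thenAccept(html -> htmlQueue.add(
                new String[]{"http://example.com", html}))
            .join();   // join only so this demo exits after the callback ran

        System.out.println(htmlQueue.peek()[1]);
    }
}
```

In the real client the equivalent hook is the callback's completed(HttpResponse) method, which is where new String[]{url, html} would be put on htmlQueue.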

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@hc.apache.org
For additional commands, e-mail: dev-help@hc.apache.org

