lucene-java-user mailing list archives

From Phan The Dai <thienthanhom...@gmail.com>
Subject Re: A way to download URLs and index better ?
Date Sat, 16 Jan 2010 08:06:39 GMT
Many thanks to Ahmet Arslan.
I learned about Nutch and Heritrix when I read the Lucene in Action book,
but I am new to using them, and I am not sure they support my situation.

I need an API that lets my application automatically fetch URLs and
download them into a Directory (a Lucene Directory).
In detail: I need two processes, written in Java, that run sequentially.
         Process 1 generates a list of URLs.
         Process 2 should be a module that downloads the URLs' content and
indexes it directly.
Do Nutch and Heritrix provide libraries for this? An example or any other
comment would be welcome!
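
To make the two steps above concrete, here is a minimal sketch of the
download step using only the JDK, assuming the slow part is fetching the
pages one after another. The names ParallelFetcher, fetchAll and httpGet are
hypothetical (they are not from Nutch, Heritrix or Lucene), and the thread
count and timeouts are arbitrary guesses, not tuned values.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: downloads a list of URLs in parallel.
final class ParallelFetcher {

    // Downloads one URL and returns its body as text.
    // Assumption: pages are decoded as UTF-8; timeouts are arbitrary.
    static String httpGet(String url) {
        try {
            URLConnection conn = new URL(url).openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(10000);
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                return new String(buf.toByteArray(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Applies 'fetcher' to every URL on a fixed-size thread pool and returns
    // url -> content in input order. URLs whose fetch fails are skipped.
    static Map<String, String> fetchAll(List<String> urls,
                                        Function<String, String> fetcher,
                                        int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<String, String> pages = new LinkedHashMap<>();
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task));
            }
            for (int i = 0; i < urls.size(); i++) {
                try {
                    pages.put(urls.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    // This URL could not be downloaded; leave it out.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break; // stop waiting if the caller was interrupted
                }
            }
        } finally {
            pool.shutdown();
        }
        return pages;
    }
}
```

Process 1 would pass its URL list to something like
ParallelFetcher.fetchAll(urls, ParallelFetcher::httpGet, 10). Process 2 could
then loop over the returned map and add one Lucene Document per page through
an IndexWriter opened on the Directory. That indexing part is not shown here,
and whether a full crawler such as Nutch or Heritrix is a better fit for
about 200 URLs is a separate question.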

Thank you again for answering this question.


On Sat, Jan 16, 2010 at 4:26 PM, Ahmet Arslan <iorixxx@yahoo.com> wrote:

> > Hi everyone, please help me with this question:
> > I need to download some web pages from a list of URLs (about 200 links)
> > and then index them with Lucene.
> > This list is not fixed, because it depends on the definition of my
> > process.
> > Currently, in my web application, I wrote a class for downloading, but
> > the download time is too long.
> >
> > Please recommend a Java library suited to my situation to optimize
> > downloading.
> > Examples would be wonderful (INPUT: list of URLs; OUTPUT: web page
> > content, or an indexed repository).
> > Thank you very much.
>
> Probably the most famous ones:
>
> http://lucene.apache.org/nutch/
> http://crawler.archive.org/
