lucene-java-user mailing list archives

From SearchTech <searc...@gmail.com>
Subject Paid Job: Looking for a developer to create a small java application to extract url's from .fdt files
Date Mon, 13 Feb 2012 14:31:34 GMT
Hi there,

I am currently working on a search engine based on Lucene and have run into
some issues because Java is not my regular programming language, which makes
things a bit hard.
I was wondering whether you would be available for a small custom (paid) job
to solve one of these issues.

I am basically looking for a way to extract and save all links from a .fdt
file to a text file.

The reason for this is simple: The engine I am building indexes remote
sites based on the dmoz dump. The issue is that my MySQL database, where all
URLs are stored, contains 2 million entries, but when I have indexed
everything, I get about 1.8 million documents because some sites time out,
some redirect to another domain and some just fail. So my mission is to
extract all URLs from the final .fdt files and then load them into my database
again to have a "fresh" set of URLs to index, without having to run the
crawler on all domains again and waste bandwidth.

That said, I was wondering if you would possibly be available for a quick
project to write me a Java tool which works like:

java -jar tool.jar index.fdt links.txt

which would basically export all links found in the .fdt file and save them,
one per line, to links.txt.
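
To make the request concrete, here is a rough sketch of what such a tool might look like. It is only an illustration under several assumptions: that the URLs were indexed as stored fields, that the stored field data in the .fdt file is written as uncompressed text, and that a simple pattern match on "http://" or "https://" is good enough. The class name FdtLinkExtractor and the pattern are made up for this example. It scans the raw bytes and does not parse the Lucene file format, so it may occasionally pick up stray bytes around a match; reading the stored fields through Lucene's own IndexReader would likely be the more reliable route.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FdtLinkExtractor {
    // Hypothetical pattern: "http://" or "https://" followed by printable
    // ASCII characters, stopping at quotes, angle brackets, whitespace or
    // any control/binary byte.
    private static final Pattern URL =
            Pattern.compile("https?://[\\x21-\\x7E&&[^\"'<>]]+");

    // Scans raw file bytes for URL-like strings, keeping first-seen order
    // and dropping duplicates.
    static Set<String> extractLinks(byte[] raw) {
        // ISO-8859-1 maps each byte to exactly one char, so arbitrary
        // binary content decodes without errors.
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        Set<String> links = new LinkedHashSet<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: FdtLinkExtractor <input.fdt> <output.txt>");
            System.exit(1);
        }
        // Assumption: the file fits in memory. A very large .fdt file would
        // need to be read in chunks instead.
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        // Writes one link per line.
        Files.write(Paths.get(args[1]), extractLinks(raw), StandardCharsets.UTF_8);
    }
}
```

Compiled in the current directory, this could be run as `java -cp . FdtLinkExtractor index.fdt links.txt`.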

This would be really wonderful and would enable me to finalize my project :)

If you are up for this, please do let me know and also let me know how much
you would charge for this.

Thank you for your time reading this.

Juergen
