lucene-java-user mailing list archives

From Li Li <>
Subject Re: Paid Job: Looking for a developer to create a small java application to extract url's from .fdt files
Date Mon, 13 Feb 2012 14:59:05 GMT
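For 2.x and 3.x you can simply use code like this (close the reader when
you are done):
    Directory dir = FSDirectory.open(new File("./testindex"));
    IndexReader reader = IndexReader.open(dir);
    List<String> urls = new ArrayList<String>(reader.numDocs());
    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) continue;  // skip deleted documents
      Document doc = reader.document(i);
      urls.add(doc.get("url"));  // the stored field name "url" is an assumption
    }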
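
To get the collected urls list into links.txt, one URL per line as asked in the original request, something like the following might do. It is only a sketch using the plain JDK (Java 7 or later): the class UrlDump and its methods are made-up names for illustration and are not part of Lucene, and trimming and de-duplicating the values is an assumption about what is wanted.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of Lucene: saves collected URLs to a text file.
class UrlDump {

    // Drops nulls (documents without the field), blank values and duplicates,
    // keeping the order in which URLs were first seen.
    static List<String> clean(List<String> urls) {
        LinkedHashSet<String> seen = new LinkedHashSet<String>();
        for (String u : urls) {
            if (u != null && !u.trim().isEmpty()) {
                seen.add(u.trim());
            }
        }
        return new ArrayList<String>(seen);
    }

    // Writes one URL per line as UTF-8, replacing the file if it already exists.
    static void write(List<String> urls, Path out) throws IOException {
        Files.write(out, clean(urls), StandardCharsets.UTF_8);
    }
}
```

After the loop has filled urls, a call such as UrlDump.write(urls, Paths.get("links.txt")) would produce the file.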

If the url field is indexed, you can use FieldCache.StringIndex to speed this up.

As for trunk (4.x), I can't find the isDeleted(int) method. Could anyone
tell me why this method was removed?
On Mon, Feb 13, 2012 at 10:31 PM, SearchTech <> wrote:

> Hi there,
> I am currently working on a search engine based on lucene and have some
> issues because java is not my regular programming language, which makes
> things a bit hard.
> What I was wondering about is if you would be available for a small custom
> (paid) job to solve one of my issues.
> I am basically looking for a way to extract and save all links from a .fdt
> file to a text file.
> The reason for this is simple: The engine I am building is indexing remote
> sites based on the dmoz dump. The issue is that my mysql database where all
> urls are stored contains 2 million entries, but when I have indexed
> everything, I get about 1.8 million documents because some timeout, some
> redirect to another domain or some just fail. So my mission is extracting
> all URLs from the final fdt files and then entering them into my database again
> to have a "fresh" set of URLs to index without the need to run the crawler
> on all domains again just to waste bandwidth.
> That said, I was wondering if you would possibly be available for a quick
> project to write me some java tool which works like:
> java tool.jar index.fdt links.txt
> which would basically export all found links from the fdt file and save them
> line by line to links.txt
> This would be really wonderful and would enable me to finalize my project
> :)
> If you are up for this, please do let me know and also let me know how much
> you would charge for this.
> Thank you for your time reading this.
> Juergen
