lucene-java-user mailing list archives

From Li Li <fancye...@gmail.com>
Subject Re: Paid Job: Looking for a developer to create a small java application to extract url's from .fdt files
Date Mon, 13 Feb 2012 14:59:05 GMT
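For 2.x and 3.x you can simply use this code:
    // needs java.io.File, java.util.*, org.apache.lucene.document.Document,
    // org.apache.lucene.index.IndexReader, org.apache.lucene.store.*
    Directory dir = FSDirectory.open(new File("./testindex"));
    IndexReader reader = IndexReader.open(dir);
    List<String> urls = new ArrayList<String>(reader.numDocs());
    // doc ids run from 0 to maxDoc()-1 and still include deleted docs
    for (int i = 0; i < reader.maxDoc(); i++) {
        if (!reader.isDeleted(i)) {
            Document doc = reader.document(i);
            urls.add(doc.get("url")); // null if the doc has no stored "url" field
        }
    }
    reader.close();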
If the url field is indexed (as a single untokenized term), you can use FieldCache.StringIndex to read the values faster than loading each stored document.
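
To get the links.txt file asked for in the original mail, the collected list still has to be written out one URL per line. A rough sketch of that step, using only the JDK (Java 7 or later); UrlExporter and writeUrls are made-up names for illustration, not part of Lucene, and skipping null or empty values is my assumption about what is wanted:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper (not a Lucene class): dumps collected URLs to a text file.
final class UrlExporter {
    private UrlExporter() {}

    /**
     * Writes every non-null, non-empty URL on its own line, encoded as UTF-8,
     * and returns how many lines were written.
     */
    static int writeUrls(List<String> urls, Path out) throws IOException {
        int written = 0;
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (String url : urls) {
                if (url == null || url.isEmpty()) {
                    continue; // e.g. a document without a stored "url" value
                }
                w.write(url);
                w.newLine();
                written++;
            }
        }
        return written;
    }
}
```

It would be called as UrlExporter.writeUrls(urls, Paths.get("links.txt")) once the list has been filled from the index. The returned count can be compared with reader.numDocs() to see how many documents had no url value.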

As for trunk (4.x), I can't find the isDeleted(int) method. Could anyone
tell me why it was removed?
On Mon, Feb 13, 2012 at 10:31 PM, SearchTech <searchqt@gmail.com> wrote:

> Hi there,
>
> I am currently working on a search engine based on Lucene and have some
> issues because Java is not my regular programming language, which makes
> things a bit hard.
> What I was wondering about is if you would be available for a small custom
> (paid) job to solve one of my issues.
>
> I am basically looking for a way to extract and save all links from a .fdt
> file to a text file.
>
> The reason for this is simple: The engine I am building is indexing remote
> sites based on the dmoz dump. The issue is that my mysql database where all
> URLs are stored contains 2 million entries, but when I have indexed
> everything, I get about 1.8 million documents because some timeout, some
> redirect to another domain or some just fail. So my mission is extracting
> all URLs from the final fdt files and then enter them to my database again
> to have a "fresh" set of URLs to index without the need to run the crawler
> on all domains again just to waste bandwidth.
>
> That said, I was wondering if you would possibly be available for a quick
> project to write me some Java tool which works like:
>
> java -jar tool.jar index.fdt links.txt
>
> which would basically export all links found in the fdt file and save them
> line by line to links.txt
>
> This would be really wonderful and would enable me to finalize my project
> :)
>
> If you are up for this, please do let me know and also let me know how much
> you would charge for this.
>
> Thank you for your time reading this.
>
> Juergen
>
