lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: Indexing with foreign key
Date Sun, 31 Oct 2010 20:47:00 GMT
Hmmmm. Are you too memory limited to do a first pass through the file and
save
the key/download links part in a map, then make another pass through the
file
indexing the data and grabbing the link from your map? I'm assuming that
there's a lot less than 200M in just the key/link part.

Alternatively (and this would probably be kinda slow, but...) still do a 2
pass
process, but instead of making a map, put the data in a Lucene index on
disk.
Then the second pass searches that index for the data to add to the docs
in your "real" index.

HTH
Erick

On Sun, Oct 31, 2010 at 12:17 PM, Paulo Levi <i30817@gmail.com> wrote:

> I'm stepping tru a rdf file (the project gutenberg catalog) and sending
> data
> to a lucene index to allow searches of titles authors and such. However the
> gutenberg rdf is a little bit "special". It has two sections, one for
> title,
> authors, collaborators and such, and (after all the books) starts the other
> section that has the download links. The connection is a kind of foreign
> key
> that exists on both tags (a unique number id). While i don't need to search
> the download link, i do need to save it.
>
> I'm memory limited and can't put in memory the 200 mb file that the catalog
> is. I'm wondering if there is some way for me to use the number id to
> connect both kinds of information without having to keep things in memory.
> A
> first search for the things i want, and a second using the number id?
> It seems very clumsy. I'm not actually using a database, and i don't want
> to
> use very large libraries. Compass is 60 mbs (!). I tried lucenesail for a
> while, but it has stopped working and the code is a mess (it is not adapted
> to the filtering of the gutemberg rdf that i'm doing).
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message