lucene-java-user mailing list archives

From Paulo Levi <>
Subject Re: Indexing with foreign key
Date Sun, 31 Oct 2010 21:17:31 GMT
Yes, that's what I ended up doing. I will probably "fork" a new Java VM
instead of doing it in the same one. That way I can control the memory
requirements, though it hasn't given me any problems (it even worked
with -Xmx set, though it probably wouldn't if I did something else in the
program at the same time). I'm not indexing the book subjects yet either;
I need to do some sort of string caching for that and for authors.
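A minimal sketch of forking a separate indexer JVM with its own heap cap, as described above. The class name `Indexer`, the `256m` heap size, and the file paths are hypothetical placeholders, not from the original thread:

```java
import java.util.Arrays;
import java.util.List;

// Build the command line for a forked indexer JVM, so the memory ceiling
// of the indexing pass is isolated from the rest of the program.
public class ForkIndexer {
    static List<String> indexerCommand(String heap, String classpath, String catalogPath) {
        // "Indexer" is a placeholder main class; -Xmx caps the child's heap only.
        return Arrays.asList("java", "-Xmx" + heap, "-cp", classpath,
                "Indexer", catalogPath);
    }

    public static void main(String[] args) {
        List<String> cmd = indexerCommand("256m", "app.jar", "catalog.rdf");
        System.out.println(cmd);
        // To actually fork and wait for the child:
        // Process p = new ProcessBuilder(cmd).inheritIO().start();
        // int exit = p.waitFor();
    }
}
```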

On Sun, Oct 31, 2010 at 8:47 PM, Erick Erickson <> wrote:

> Hmmmm. Are you too memory-limited to do a first pass through the file,
> saving the key/download-link pairs in a map, then make a second pass
> through the file indexing the data and grabbing the link from your map?
> I'm assuming there's a lot less than 200M in just the key/link part.
>
> Alternatively (and this would probably be kind of slow, but...) still do
> a two-pass process, but instead of building a map, put the data in a
> Lucene index on disk. Then the second pass searches that index for the
> data to add to the docs in your "real" index.
>
> Erick
> On Sun, Oct 31, 2010 at 12:17 PM, Paulo Levi <> wrote:
> > I'm stepping through an RDF file (the Project Gutenberg catalog) and
> > sending data to a Lucene index to allow searches on titles, authors
> > and such. However, the Gutenberg RDF is a little bit "special". It has
> > two sections: one for titles, authors, collaborators and such, and
> > (after all the books) a second section that has the download links.
> > The connection is a kind of foreign key that exists in both tags (a
> > unique number id). While I don't need to search the download link, I
> > do need to save it.
> >
> > I'm memory-limited and can't hold the 200 MB catalog file in memory.
> > I'm wondering if there is some way for me to use the number id to
> > connect both kinds of information without having to keep things in
> > memory. A first search for the things I want, and a second using the
> > number id? It seems very clumsy. I'm not actually using a database,
> > and I don't want to use very large libraries. Compass is 60 MB (!). I
> > tried LuceneSail for a while, but it has stopped working and the code
> > is a mess (it is not adapted to the filtering of the Gutenberg RDF
> > that I'm doing).
> >
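The first approach Erick suggests, a two-pass join on the foreign key, can be sketched in plain Java. The toy record format below is invented for illustration; the real catalog is RDF and would need a streaming XML/RDF parser, with pass two feeding Lucene documents instead of a map:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Two-pass join on a shared numeric id: pass one collects only the small
// id -> download-link pairs, pass two walks the book records and looks
// each link up by id, so the whole file is never held in memory at once.
public class TwoPassJoin {
    public static Map<String, String> join(String catalog) {
        // Pass 1: remember only the tiny id/link pairs, not whole records.
        Map<String, String> idToLink = new HashMap<>();
        Matcher links = Pattern.compile("LINK id=(\\d+) url=(\\S+)").matcher(catalog);
        while (links.find()) {
            idToLink.put(links.group(1), links.group(2));
        }
        // Pass 2: join each book with its link via the shared id.
        Map<String, String> titleToLink = new HashMap<>();
        Matcher books = Pattern.compile("BOOK id=(\\d+) title=(\\S+)").matcher(catalog);
        while (books.find()) {
            titleToLink.put(books.group(2), idToLink.get(books.group(1)));
        }
        return titleToLink;
    }

    public static void main(String[] args) {
        // Books first, links after, mirroring the catalog's two sections.
        String catalog =
            "BOOK id=1 title=Moby-Dick\n" +
            "BOOK id=2 title=Frankenstein\n" +
            "LINK id=1 url=http://example.org/1.txt\n" +
            "LINK id=2 url=http://example.org/2.txt\n";
        System.out.println(join(catalog));
    }
}
```

In the real setting, each pass would stream the RDF file rather than take a `String`, and pass two would add the link as a stored, non-indexed field on each Lucene document.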
