lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: lucene suiteable ? 6 mio recods / day 1k
Date Fri, 19 Dec 2008 13:13:14 GMT
Well, I'm reasonably sure you could make this work, although it'll
take some effort.

The 3,000,000 records/day should be pretty easy.

As for parsing the URLs: if none of the supplied tokenizers does exactly
what you want, you can always write your own. Or you can pre-process the
input if that's easier, e.g. replaceAll("[/:]", " "), and then just use
one of the regular analyzers.
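
A minimal sketch of the pre-processing route, using only the JDK. The class
and method names are made up for illustration, and the whitespace split only
approximates what a whitespace-based analyzer would do afterwards:

```java
import java.util.Arrays;
import java.util.List;

final class UrlPreprocessor {
    // Replace every '/' and ':' with a space, as in the replaceAll call above.
    static String normalize(String url) {
        return url.replaceAll("[/:]", " ");
    }

    // Split on runs of whitespace; the "://" leaves several spaces in a row,
    // so a single-space split would produce empty tokens.
    static List<String> tokens(String url) {
        return Arrays.asList(normalize(url).trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokens("http://example.com/somedir/some.html"));
        // prints [http, example.com, somedir, some.html]
    }
}
```

Note that "example.com" and "some.html" stay whole because '.' is not in the
character class; whether the analyzer used afterwards splits on dots is a
separate choice.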

You can easily delete records. Assume one of your fields is the date
to day resolution. Your daemon could delete by term all records from
90 days ago. Take care to store the date in a convenient form.
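
One "convenient form" would be a yyyyMMdd string, so that each day is a
single term. The sketch below only computes that key with the JDK; the class
name and the field name "day" in the comment are assumptions, not anything
from the original question:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class ExpiryKey {
    // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd, e.g. 20081219.
    static String dayKey(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // Key of the day that has just fallen out of the retention window.
    static String expiredDayKey(LocalDate today, int retentionDays) {
        return dayKey(today.minusDays(retentionDays));
    }

    public static void main(String[] args) {
        String key = expiredDayKey(LocalDate.of(2008, 12, 19), 90);
        System.out.println(key); // prints 20080920
        // With Lucene on the classpath, the daemon would then do something like
        //   writer.deleteDocuments(new Term("day", key));
        // where "day" is a hypothetical field indexed as a single untokenized term.
    }
}
```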

Optimizing your index will reclaim all the space from deleted records. It
may take a while to accomplish.


Or you could create a new index every day and search across all of them
with one of the MultiSearcher variants. Then you would simply delete the
oldest index every day. How performant this solution would be is
something I don't have a good feel for; maybe someone else will
chime in.
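
The bookkeeping for the index-per-day approach could look roughly like this.
It is a sketch only: the "index-yyyyMMdd" naming scheme and the class are
invented for illustration, and opening the listed indexes for searching is
left out:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

final class DailyIndexes {
    // Hypothetical naming scheme: one index directory per day.
    static String indexName(LocalDate day) {
        return "index-" + day.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // Indexes that should be searched: today plus the previous
    // retentionDays - 1 days, newest first.
    static List<String> liveIndexes(LocalDate today, int retentionDays) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < retentionDays; i++) {
            names.add(indexName(today.minusDays(i)));
        }
        return names;
    }

    // The index that has just expired and can be deleted as a whole.
    static String expiredIndex(LocalDate today, int retentionDays) {
        return indexName(today.minusDays(retentionDays));
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2008, 12, 19);
        System.out.println(liveIndexes(today, 90).size()); // prints 90
        System.out.println(expiredIndex(today, 90));       // prints index-20080920
    }
}
```

With a 90-day window this means 90 indexes open for searching at once, which
is exactly the part whose performance would need testing.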

All in all, Lucene (or maybe Solr) could work in this scenario. But
this is a significant amount of data, and you'd have to do some testing
to see whether you get acceptable performance.
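
To put rough numbers on "significant", here is the arithmetic from the
figures in the question (3,000,000 records/day, 90 days, up to 2 k per
record). Treating "2 k" as 2048 bytes is an assumption, and the byte figure
is an upper bound on raw input, not a prediction of index size:

```java
final class Sizing {
    static final double SECONDS_PER_DAY = 86_400.0;

    // Average indexing rate needed to keep up with the daily volume.
    static double recordsPerSecond(long recordsPerDay) {
        return recordsPerDay / SECONDS_PER_DAY;
    }

    // Number of records held once the retention window is full.
    static long steadyStateRecords(long recordsPerDay, int retentionDays) {
        return recordsPerDay * retentionDays;
    }

    // Upper bound on raw input held, if every record had the maximum size.
    static long maxRawBytes(long records, long maxBytesPerRecord) {
        return records * maxBytesPerRecord;
    }

    public static void main(String[] args) {
        long perDay = 3_000_000L;
        long held = steadyStateRecords(perDay, 90);
        System.out.println(recordsPerSecond(perDay)); // about 34.7
        System.out.println(held);                     // prints 270000000
        System.out.println(maxRawBytes(held, 2048L)); // prints 552960000000
    }
}
```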

Best
Erick


On Fri, Dec 19, 2008 at 6:22 AM, Christian Brennsteiner
<eingfoan@yahoo.de> wrote:

> hi *,
>
> i am searching for a fulltext index capable of the following requirements:
>
> index 3 000 000 new records every day, with a validity of N days (e.g.
> 90 days expiration)
> == 34.7 records / s
> one record is e.g. a url and can be up to 2 k big
>
> http://example.com/somedir/some.html
>
> lucene should use "/" as a word separator and should e.g. eliminate all ":"
>
> so the following "sentence" should be indexed:
>
> http example.com somedir some.html when having the url
> http://example.com/somedir/some.html
>
> my main concern about this requirement is that the index should not
> grow over time, as it always holds
> NR OF DAYS * RECORDS PER DAY and expires the records after a given
> time. in my opinion there must be some background thread always
> throwing away expired hits.
>
> is this easily possible with lucene?
>
> regards chris