lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex vB <>
Subject Indexing large XML dumps
Date Mon, 03 Jan 2011 16:35:36 GMT

Hello everybody,

I am currently indexing wikipedia dumps and create an index for versioned
document collections. As far everything is working fine but I have never
thought that single articles of wikipedia would reach a size of around 2 GB!
One article for example has 20000 versions with an average length of 60000
characters  for each (HUGE in memory!). This means I need a heap space
around 4 GB to perform indexing and I would like to decrease my memory
consumption ;).

At the moment I load every wikipedia article completely into memory
containing all versions. Then I collect some statistical data about the
article to store extra information about term occurences which are written
into the index as payloads. The statistic is created during an own
tokenization run which happens before the document is written into index.
This means I am analyzing my documents twice! :( I know there is a
CachingTokenFilter but I haven't found how and where to implement it exactly
(I tried it in my Analyzer but stream.reset() seems not to work). Does
somebody have a nice example?

1) Can I somehow avoid loading one complete article to get my statistics? 
2) Is it possible to index large files without completely loading it into a
3) How can I avoid to parse an article twice? 

Best regards 

View this message in context:
Sent from the Lucene - Java Users mailing list archive at

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message