lucene-java-user mailing list archives

From "Witdouck, Xavier" <>
Subject Building term frequency matrix over 6 million documents...
Date Fri, 24 Jan 2014 16:23:23 GMT
Hi all,

We have over 6 million documents in our index and would like to build a term frequency
matrix over all of them as quickly as possible.  Each document has a numeric date field,
so for each term we want a time series whose value on a given date is the sum of that
term's frequencies over all documents with that date.  For example, for the term "iPhone"
the series would contain the total number of iPhone mentions across all documents, broken
down into date buckets.
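
To make the target output concrete, here is a minimal plain-Java sketch of the aggregation
described above.  It does not use Lucene at all: the class name TermFrequencySeries, the
method sumByDate, and the yyyyMMdd encoding of the date field are assumptions made purely
for illustration, and the two input arrays stand in for whatever per-document dates and
term frequencies the index provides.

```java
import java.util.Map;
import java.util.TreeMap;

public class TermFrequencySeries {

    /**
     * Sums per-document frequencies of a single term into date buckets.
     *
     * dates[i] is the numeric date of document i (assumed here to be yyyyMMdd),
     * freqs[i] is the number of times the term occurs in document i.
     * Documents in which the term does not occur contribute no bucket.
     */
    public static Map<Integer, Long> sumByDate(int[] dates, int[] freqs) {
        if (dates.length != freqs.length) {
            throw new IllegalArgumentException("dates and freqs must have the same length");
        }
        // TreeMap keeps the buckets ordered by date, which is convenient for a time series.
        Map<Integer, Long> series = new TreeMap<>();
        for (int doc = 0; doc < dates.length; doc++) {
            if (freqs[doc] == 0) {
                continue; // term absent from this document
            }
            series.merge(dates[doc], (long) freqs[doc], Long::sum);
        }
        return series;
    }

    public static void main(String[] args) {
        int[] dates = {20140122, 20140122, 20140123, 20140124};
        int[] freqs = {3, 1, 0, 2};
        // Prints {20140122=4, 20140124=2}
        System.out.println(sumByDate(dates, freqs));
    }
}
```

One such series per term corresponds to one column of the matrix, with the dates as rows.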

The approach we have tried is to write a custom Collector, shown below, but it is very
slow.  Is there a different way to approach this that would perform much better?

public void collect(int docId) throws IOException {
    try {
      if (reader != null) {
          // Load this document's term vector and position on the term of interest.
          final Terms terms = reader.getTermVector(docId, field);
          termsEnum = terms.iterator(termsEnum);
          final int colIndex = matrix.columns().add(term);
          if (termsEnum.seekExact(termRef)) {
            final DocsAndPositionsEnum docsAndPositionsEnum =
                termsEnum.docsAndPositions(null, null, DocsAndPositionsEnum.FLAG_FREQS);
            while (docsAndPositionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                // Add this document's frequency to the bucket for its date.
                final int date = dates.get(docId);
                final int freq = docsAndPositionsEnum.freq();
                final int rowIndex = matrix.rows().add(date);
                final double value = matrix.getDouble(rowIndex, colIndex);
                matrix.setDouble(rowIndex, colIndex, Double.isNaN(value) ? freq : value + freq);
                if (++docCount % 1000 == 0) {
                    // The logging call was truncated in the archive; only its message survives:
                    // "Processed " + docCount + " / " + collectCount + " documents in term frequency analysis..."
                }
            }
          }
      }
    } catch (Throwable t) {
      throw new RuntimeException("Failed to collect document " + docId, t);
    }
}

public void setNextReader(AtomicReaderContext atomicReaderContext) throws IOException {
    this.reader = atomicReaderContext.reader();
    this.dates = FieldCache.DEFAULT.getInts(reader, "date", false);
}

Any help would be much appreciated...

