lucene-java-user mailing list archives

From Georger Araujo <georger.ara...@gmail.com>
Subject Re: Iterating over all documents in an index
Date Sat, 12 Feb 2011 17:54:57 GMT
Hi Erick,
Thanks for the input. I didn't really do the doc.get() in my code, because
what I'm really looking for are the stemmed terms in the index. I only
borrowed the idea from the snippet. Here's what my code looks like now:

Map<String,Integer> termMap = new TreeMap<String,Integer>();
List<String> filenames = new ArrayList<String>( reader.maxDoc() );

for ( int i = 0; i < reader.maxDoc(); i++ ) {
    if ( reader.isDeleted( i ) )
        continue;

    Document doc = reader.document( i );
    // Append rather than add( i, ... ): once a deleted document has been
    // skipped, i exceeds filenames.size() and add( i, ... ) would throw.
    filenames.add( doc.get( "filename" ) );

    TermFreqVector tfVector = reader.getTermFreqVector( i, "filecontents" );
    String[] terms = tfVector.getTerms();
    for ( int j = 0; j < terms.length; j++ ) {
        // Add terms to the map...
    }
}

The point is that I need to read the term frequency vector for each
document. I actually want the stemmed terms. From that and the filenames
(stored in the "filename" field), I was able to build a matrix that looks
like this:
        term1   term2   ...   termn
doc1    1       1             0
doc2    0       1             0
...
docn    1       0             1
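
A minimal sketch of how a matrix like the one above could be assembled,
assuming the terms of each live document have already been read into a
String[] (for example from tfVector.getTerms()). The class name
TermDocMatrix and the method build are hypothetical, and the sketch uses
plain Java collections with no Lucene calls:

```java
import java.util.Arrays;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene.
final class TermDocMatrix {

    // docTerms holds one String[] per document (a stand-in for the output of
    // tfVector.getTerms()). termColumns is filled with term -> column index.
    static int[][] build(List<String[]> docTerms, SortedMap<String, Integer> termColumns) {
        // First pass: collect the vocabulary in sorted order.
        SortedSet<String> vocabulary = new TreeSet<>();
        for (String[] terms : docTerms) {
            vocabulary.addAll(Arrays.asList(terms));
        }
        int column = 0;
        for (String term : vocabulary) {
            termColumns.put(term, column++);
        }

        // Second pass: mark each term that occurs in each document with a 1.
        int[][] matrix = new int[docTerms.size()][vocabulary.size()];
        for (int row = 0; row < docTerms.size(); row++) {
            for (String term : docTerms.get(row)) {
                matrix[row][termColumns.get(term)] = 1;
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<String[]> docTerms = new ArrayList<>();
        docTerms.add(new String[] {"term1", "term2"});
        docTerms.add(new String[] {"term2"});
        docTerms.add(new String[] {"term1", "termn"});

        SortedMap<String, Integer> termColumns = new TreeMap<>();
        for (int[] row : build(docTerms, termColumns)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Two passes are needed because the number of columns is only known once
every document's terms have been seen.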

I already have this working. I just don't know if I'm doing it the
simplest/quickest way, though - and that's going to be an issue because I
intend to run the clustering algorithm on matrices as big as 5000 x 100000.
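
On the size question: reading 5000 x 100000 as documents by terms (an
assumption), a dense int matrix has 500,000,000 cells, about 2 GB at 4 bytes
each. A minimal sketch of a sparse alternative follows, where each row keeps
only the columns that are non-zero. The names SparseRows, Row, toRow, get and
denseBytes are hypothetical; terms and freqs stand in for the arrays returned
by tfVector.getTerms() and tfVector.getTermFrequencies():

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene.
final class SparseRows {

    // One document: ascending column indices and the matching term counts.
    static final class Row {
        final int[] columns;
        final int[] counts;

        Row(int[] columns, int[] counts) {
            this.columns = columns;
            this.counts = counts;
        }
    }

    // Converts one document's parallel term/frequency arrays into a sparse row.
    // Terms missing from termColumns are skipped.
    static Row toRow(String[] terms, int[] freqs, Map<String, Integer> termColumns) {
        TreeMap<Integer, Integer> byColumn = new TreeMap<>();
        for (int j = 0; j < terms.length; j++) {
            Integer column = termColumns.get(terms[j]);
            if (column != null) {
                byColumn.put(column, freqs[j]);
            }
        }
        int[] columns = new int[byColumn.size()];
        int[] counts = new int[byColumn.size()];
        int k = 0;
        for (Map.Entry<Integer, Integer> entry : byColumn.entrySet()) {
            columns[k] = entry.getKey();
            counts[k] = entry.getValue();
            k++;
        }
        return new Row(columns, counts);
    }

    // Returns the count stored at the given column, or 0 if the term is absent.
    static int get(Row row, int column) {
        int position = Arrays.binarySearch(row.columns, column);
        return position >= 0 ? row.counts[position] : 0;
    }

    // Bytes needed for the cell data of a dense int matrix of the given shape.
    static long denseBytes(long docs, long terms) {
        return docs * terms * 4L;
    }

    public static void main(String[] args) {
        Map<String, Integer> termColumns = new HashMap<>();
        termColumns.put("a", 0);
        termColumns.put("b", 5);

        Row row = toRow(new String[] {"b", "a"}, new int[] {3, 1}, termColumns);
        System.out.println(Arrays.toString(row.columns) + " " + Arrays.toString(row.counts));
        System.out.println(denseBytes(5000, 100000));
    }
}
```

Whether this pays off depends on how many distinct terms a typical document
contains, and on whether the clustering code can consume rows in this form.
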
Regards,

Georger

2011/2/12 Erick Erickson <erickerickson@gmail.com>

> Be aware that when you do a doc.get(), the fields are the
> *stored* fields in their original, unanalyzed form. Is that really
> what you want? Or do you want the tokenized form of the fields?
>
> If the latter, you might get the Luke code, it reconstructs all the fields
> in the document from the terms that are actually indexed. Note two
> things: 1> it's slow. You're really undoing all the work that went into
> inverting the index in the first place.
> 2> it's lossy. For instance, a term that's been stemmed will only have
> the stemmed version in the index. Is that OK?
>
> Best
> Erick
>
> On Sat, Feb 12, 2011 at 9:07 AM, Georger Araujo
> <georger.araujo@gmail.com> wrote:
> > Hi,
> > I want to iterate over all documents in a given index. I've found the
> > following piece of code [1]:
> >
> > IndexReader reader = // create IndexReader
> > for (int i=0; i<reader.maxDoc(); i++) {
> >    if (reader.isDeleted(i))
> >        continue;
> >
> >    Document doc = reader.document(i);
> >    String docId = doc.get("docId");
> >
> >    // do something with docId here...
> > }
> >
> > I implemented it in my code and it worked fine. After that, I found out
> > about MatchAllDocsQuery.
> > I am not concerned with scoring nor sorting - all I want to do is iterate
> > over all documents in the index and collect their terms. My ultimate goal is
> > to build a bag-of-words of all documents and their terms so that I can run a
> > clustering algorithm on it. I've also found out about Mahout's built-in
> > vector creation utility [2], but I need to do this task from my own code.
> >
> > I ask, what is the recommended approach?
> >
> > [1] http://stackoverflow.com/questions/2311845/is-it-possible-to-iterate-through-documents-stored-in-lucene-index
> > [2] https://cwiki.apache.org/confluence/display/MAHOUT/Creating%20Vectors%20from%20Text
> >
> > Regards,
> >
> > Georger
> >
>
