lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Iterating over all documents in an index
Date Sat, 12 Feb 2011 17:30:27 GMT
Be aware that doc.get() returns the *stored* field values in their
original, unanalyzed form. Is that really what you want? Or do you
want the tokenized form of the fields?
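
To make the stored-versus-tokenized distinction concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: analyze() is a hypothetical stand-in for whatever analyzer was used at index time, and its rules (lowercasing, splitting on non-letters, stripping a trailing "s") are assumptions chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy illustration only; this is not Lucene's analysis API. */
final class StoredVersusIndexedSketch {

    /** Hypothetical analysis chain: lowercase, split on non-letters, strip a trailing "s". */
    static List<String> analyze(String storedValue) {
        List<String> terms = new ArrayList<>();
        for (String token : storedValue.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when the text starts with a delimiter
            }
            // Crude stand-in for stemming: "dogs" becomes "dog"
            boolean plural = token.length() > 3 && token.endsWith("s");
            terms.add(plural ? token.substring(0, token.length() - 1) : token);
        }
        return terms;
    }

    public static void main(String[] args) {
        String stored = "Fast Dogs, running";
        System.out.println("stored : " + stored);          // what a stored field hands back
        System.out.println("indexed: " + analyze(stored)); // what the index would hold as terms
    }
}
```

A bag-of-words built from the stored value would count "Dogs", while one built from the indexed terms would count "dog"; which of the two is right depends on the clustering you have in mind.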

If the latter, you might look at the Luke code; it reconstructs all the
fields in a document from the terms that are actually indexed. Note two
things:
1> It's slow. You're really undoing all the work that went into
inverting the index in the first place.
2> It's lossy. For instance, a term that's been stemmed will only have
the stemmed version in the index. Is that OK?
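
The two caveats above can be seen in a rough sketch of "un-inverting". This is plain Java over a toy term-to-docIds map, not the Lucene or Luke API; the postings layout (a docId repeated once per occurrence of the term) and the name uninvert() are assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of rebuilding documents from an inverted index; not Luke's actual code. */
final class UninvertSketch {

    /**
     * Rebuilds a per-document bag of words from term -> docIds postings.
     * A docId is assumed to appear once per occurrence of the term.
     */
    static Map<Integer, Map<String, Integer>> uninvert(Map<String, List<Integer>> postings) {
        Map<Integer, Map<String, Integer>> bags = new TreeMap<>();
        // Slow: every postings list of every term has to be walked.
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int docId : entry.getValue()) {
                bags.computeIfAbsent(docId, id -> new TreeMap<>())
                    .merge(entry.getKey(), 1, Integer::sum);
            }
        }
        return bags;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        // Lossy: only the stemmed form "run" survives; "running" or "runs" cannot be recovered.
        postings.put("run", List.of(0, 0, 1));
        postings.put("fast", List.of(1));
        System.out.println(uninvert(postings)); // prints {0={run=2}, 1={fast=1, run=1}}
    }
}
```

The result is already the bag-of-words shape (document to term counts), but over the indexed terms rather than the original words.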

Best
Erick

On Sat, Feb 12, 2011 at 9:07 AM, Georger Araujo
<georger.araujo@gmail.com> wrote:
> Hi,
> I want to iterate over all documents in a given index. I've found the
> following piece of code [1]:
>
> IndexReader reader = // create IndexReader
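> for (int i = 0; i < reader.maxDoc(); i++) {
>    // maxDoc() also counts deleted documents, so skip those explicitly
>    if (reader.isDeleted(i))
>        continue;
>
>    // loads only the stored fields of document i
>    Document doc = reader.document(i);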
>    String docId = doc.get("docId");
>
>    // do something with docId here...
> }
>
> I implemented it in my code and it worked fine. After that, I found out
> about MatchAllDocsQuery.
> I am not concerned with scoring or sorting; all I want to do is iterate
> over all documents in the index and collect their terms. My ultimate goal is
> to build a bag-of-words of all documents and their terms so that I can run a
> clustering algorithm on it. I've also found out about Mahout's built-in
> vector creation utility [2], but I need to do this task from my own code.
>
> I ask, what is the recommended approach?
>
> [1]
> http://stackoverflow.com/questions/2311845/is-it-possible-to-iterate-through-documents-stored-in-lucene-index
> [2]
> https://cwiki.apache.org/confluence/display/MAHOUT/Creating%20Vectors%20from%20Text
>
> Regards,
>
> Georger
>
