lucene-java-user mailing list archives

From "A. L. Benhenni " <albenhe...@gmail.com>
Subject Indexing directly from stdin in lucene 3.5
Date Tue, 19 Feb 2013 10:04:04 GMT
I am currently writing an indexer class to index text from stdin. I also
need the text to be tokenized and stored so that I can access the term
vector of the document.

I tweaked the lucene indexer from the demo file (I have to use lucene 3.5
for compatibility reasons), and the process of indexing itself works very
well.

I do, however, have a problem with the final index. I tried to index a
document of approximately 20,000 words, and it took almost 20 s, ending
with an index of 93 MB.

I know where the problem lies, but I don't know the right way to handle
it. I am currently using the following code for indexing:

        while((read_line = content.readLine()) != null){
            read_line = read_line.trim();
            doc.add(new Field("contents", read_line, Field.Store.YES, Field.Index.ANALYZED));
            writer.updateDocument(new Term("path", path_field_name), doc);
        }

"content" here is a BufferedReader instance, so I'm basically looping over
each line and reindexing the document each time. I do it this way because I
don't want to use an intermediate structure such as a StringBuilder or some
other buffer. I first implemented it with a Field(String, Reader) before I
realized that such a field can't be stored (!!!). The drawback is that I end
up with as many docIDs as there are lines for the same document. This led me
to two questions:

1/ Is there a more appropriate way to handle the indexing of an incoming
stream?
2/ Is there an easy way to clean the index? Or should I delete every
previous docID associated with a path by writing my own code (with an
IndexSearcher and an IndexReader)?

And a subsidiary 3/ Why can't a field store a Reader?
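
To make the growth concrete, here is a minimal sketch (plain Java, no Lucene
involved) of the arithmetic behind the loop above. It assumes that every
updateDocument call writes a full copy of the growing document and that the
older copies have not yet been merged away; the class and method names are
hypothetical and only serve the illustration.

```java
import java.util.Arrays;
import java.util.List;

final class ReindexGrowth {
    // Characters written when the growing document is rewritten after every
    // line, as in the per-line loop: copy k holds lines 1..k.
    // Assumption: superseded copies still occupy space in the index.
    static long storedPerLineUpdate(List<String> lines) {
        long total = 0;
        long docSize = 0;
        for (String line : lines) {
            docSize += line.trim().length(); // document grows by one line
            total += docSize;                // and is written out again in full
        }
        return total;
    }

    // Characters written when the document is indexed once, after the loop.
    static long storedSingleUpdate(List<String> lines) {
        long total = 0;
        for (String line : lines) {
            total += line.trim().length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("aaaa", "bbbb", "cccc", "dddd");
        System.out.println(storedPerLineUpdate(lines)); // prints 40 (4 + 8 + 12 + 16)
        System.out.println(storedSingleUpdate(lines));  // prints 16
    }
}
```

Under these assumptions the per-line variant grows roughly with the square of
the number of lines, while a single update grows linearly.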

By the way, I also tried indexing the same document as a single line, and
the index was only 236 KB. But converting my stream to a single line each
time also implies using an intermediate structure.
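
For reference, the intermediate structure mentioned above could look like
the following minimal sketch. It only uses the JDK: the stream is collected
into one String, which could then be passed to a single
new Field("contents", text, Field.Store.YES, Field.Index.ANALYZED) followed
by one writer.updateDocument(...) call after the loop. The class and method
names are hypothetical, and joining lines with a space and skipping empty
lines are assumptions of this sketch, not requirements.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class StreamToSingleValue {
    // Reads the whole stream and returns its trimmed lines joined by a
    // single space. Empty lines are skipped (an assumption of this sketch).
    static String readAll(Reader in) throws IOException {
        BufferedReader content = new BufferedReader(in);
        StringBuilder text = new StringBuilder();
        String line;
        while ((line = content.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(line);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for stdin here; for real input use
        // new InputStreamReader(System.in) instead.
        String text = readAll(new StringReader("  first line \n\nsecond line\n"));
        System.out.println(text); // prints "first line second line"
    }
}
```

The cost is that the whole text sits in memory once before it is indexed.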

---
Amine Lies BENHENNI
---
