lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Question from a new user : IndexSearcher.doc
Date Mon, 21 Jun 2010 01:45:43 GMT
By and large, you won't ever actually be interested in very many documents.
What's returned in the TopDocs structure is the internal document ID and
score of each hit, in score order. Retrieval by document ID is quite
efficient because it's a direct lookup, not a search. I'm quite sure this
won't be a problem.

Adding 10,000 documents a day means that it would take about 588 years to
exceed a 31-bit number. I don't think you really need to worry about that
either. Lucene document IDs are Java ints, which are always signed, so 31
bits is the real ceiling for a single index.
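
As a rough check of that arithmetic, here is a small sketch in plain Java.
The class and method names are made up for illustration, nothing here is
Lucene API, and it assumes a constant daily rate and 365-day years.

```java
// Back-of-the-envelope check of the doc ID headroom (illustrative only).
final class DocIdBudget {

    /** Whole years until a signed 32-bit counter runs out at a constant daily rate. */
    static long yearsUntilExhausted(long docsPerDay) {
        long maxDocs = Integer.MAX_VALUE;   // 2^31 - 1 = 2,147,483,647
        long days = maxDocs / docsPerDay;   // 214,748 days at 10,000 docs/day
        return days / 365;                  // leap years ignored
    }

    public static void main(String[] args) {
        System.out.println(yearsUntilExhausted(10_000)); // prints 588
    }
}
```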

What you will have to worry about is the time to get the top N
highest-scoring documents. That is, IndexSearcher.search() will be your
limiting factor long before you reach these numbers. By that time, though,
you'll have moved to Solr or some other distributed search mechanism.

Performance is influenced by the complexity of the queries and the structure
and size of your index. The time spent retrieving the top few matches is
completely dwarfed by the search time for an index of any size.
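
To make the cost difference concrete, here is a toy model in plain Java. It
is not how Lucene is implemented; the Hit class, the topN method and the
score array are all assumptions made for the sketch. Finding the best N hits
has to look at every matching document, while fetching a hit afterwards is a
single lookup by ID.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy model: the search visits every candidate, the fetch is one array access.
final class TopNSketch {

    /** A (doc ID, score) pair, loosely modelled on the entries in TopDocs. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /** Scans all scores once, keeping only the n best in a bounded min-heap. */
    static List<Hit> topN(float[] scores, int n) {
        Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (int doc = 0; doc < scores.length; doc++) {
            if (scores[doc] <= 0f) continue;      // treat non-positive as "no match"
            heap.add(new Hit(doc, scores[doc]));
            if (heap.size() > n) heap.poll();     // evict the current weakest hit
        }
        List<Hit> best = new ArrayList<>(heap);
        best.sort(byScore.reversed());            // highest score first
        return best;
    }

    public static void main(String[] args) {
        float[] scores = {0.2f, 0f, 0.9f, 0.5f};           // index = doc ID
        String[] stored = {"doc A", "doc B", "doc C", "doc D"};
        for (Hit hit : topN(scores, 2)) {
            // Retrieval by ID: a direct lookup, no searching involved.
            System.out.println(hit.doc + " " + hit.score + " " + stored[hit.doc]);
        }
    }
}
```

The heap never holds more than n + 1 entries, so asking for the top 100
costs little extra memory, but every matching document still has to be
scored once.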

All this may be irrelevant if you really want to retrieve a very large
number of documents rather than, say, the top 100. But the use case would
have to be very interesting for it to be a requirement to return, say,
100,000 documents to a user.

But do be aware that you're not necessarily retrieving the *original* text
with IndexSearcher. Typically, the relevant data is indexed but not stored.
These two concepts are confusing when you start using Lucene, especially
since they're specified in the same call. Indexing a field splits it up into
tokens and normalizes them (e.g. lowercases, stems, puts in synonyms, etc).
The indexed data is the part that's searched. You can also store the input
verbatim, but the stored part is just a copy that's never searched; it is
only available for retrieval.
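
The distinction can be sketched with a toy index in plain Java. This is not
Lucene code: the IndexedVsStored class, its add/search/doc methods and the
crude lowercase-and-split "analysis" are all assumptions made for
illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy index: "indexed" tokens are searchable, "stored" text is only retrievable.
final class IndexedVsStored {
    private final Map<String, Set<Integer>> postings = new HashMap<>(); // searched
    private final Map<Integer, String> stored = new HashMap<>();        // never searched
    private int nextDoc = 0;

    /** Indexes the text; keeps a verbatim copy only if store is true. */
    int add(String text, boolean store) {
        int doc = nextDoc++;
        // Crude stand-in for analysis: lowercase, split on non-word characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(doc);
            }
        }
        if (store) stored.put(doc, text);
        return doc;
    }

    /** Returns the IDs of documents containing the (normalized) term. */
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    /** Returns the stored original, or null if the text was indexed only. */
    String doc(int id) {
        return stored.get(id);
    }

    public static void main(String[] args) {
        IndexedVsStored index = new IndexedVsStored();
        int a = index.add("This is a Test", true);
        int b = index.add("Another test, not stored", false);
        System.out.println(index.search("test")); // both documents match
        System.out.println(index.doc(a));         // original text comes back
        System.out.println(index.doc(b));         // null: searchable, not retrievable
    }
}
```

A document that is indexed but not stored still matches queries; you just
have to fetch its text from somewhere else using whatever locator you kept.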

Which brings up one of the central decisions you need to make. Are you,
indeed, going to store all the data for retrieval in your index or just
index the relevant text to be searched along with some locator information
to the original document? You mention Cassandra, which leads me to speculate
that it's the latter.

HTH
Erick


On Sun, Jun 20, 2010 at 4:04 PM, Victor Kabdebon
<victor.kabdebon@gmail.com> wrote:

> Hello Simon,
>
> As I told you, I am quite new with Lucene, so there are many things that
> might be wrong.
> I'm using Lucene to build a search service for a website that receives a
> large amount of information daily. This information is directly available
> as text in a Cassandra database.
> There might be as many as 10,000 new documents added daily, and yes, my
> concern is whether it is possible to retrieve more documents than the
> integer max value.
> I also don't really see how IndexSearcher.doc() works, because it seems
> like we give this method an ID and it is going to search in the indexed
> documents. So what exactly does IndexSearcher.doc(int) do?
>
> *Or are you concerned about retrieving all documents
> containing term "XY" if the number of documents matching is large?*
>
> I'm also concerned about this problem, yes.
>
> Could you explain to me a little how it works, and how Lucene lets one
> retrieve a very large number of documents even though it uses int?
>
> Thank you for your answers,
> Victor
>
> 2010/6/20 Simon Willnauer <simon.willnauer@googlemail.com>
>
> > Hi, maybe I don't understand your question correctly. Are you asking
> > if you could run into problems if you retrieve more documents than
> > integer max value? Or are you concerned about retrieving all documents
> > containing term "XY" if the number of documents matching is large? If
> > you are afraid of loading all documents matched from a stored field I
> > guess you are doing something wrong.
> > What are you using lucene for?
> >
> > simon
> >
> > On Sun, Jun 20, 2010 at 8:00 PM, Victor Kabdebon
> > <victor.kabdebon@gmail.com> wrote:
> > > Hello everybody,
> > >
> > > I am new to Apache Lucene and it seems to fit the needs of my
> > > application perfectly.
> > > However, I'm a little concerned about something (pardon me if it's a
> > > recurring question; I've searched the archives but didn't find anything
> > > about it).
> > >
> > > So here is my case :
> > >
> > > I have indexed a few files (about 10) and I'm trying to search for
> > > something stupid in them: the word "test". So after opening everything
> > > etc. (assuming that works too) I do this:
> > >
> > > Term test = new Term("text_comment", "test");
> > > Query query = new TermQuery(test);
> > > TopDocs top = searcher.search(query, 10);
> > >
> > > I want to recover the first document (I have 2 documents in TopDocs), so
> > > I do:
> > >
> > > searcher.doc(top.scoreDocs[0].doc)
> > >
> > > I searched a little bit in the javadoc and saw that this method takes
> > > an "int" as a parameter.
> > > I'm a little bit concerned about this... At the moment I have 10
> > > documents, so that's OK, but if I want to index let's say 20 files
> > > documents, how will IndexSearcher.doc(int) be able to retrieve
> > > documents?
> > > Same problem if 100.000 files have the word "test" in "text_comment":
> > > will I still be able to get these 100.000 documents, or is it going to
> > > be a problem?
> > >
> > > Thank you very much.
> > >
> >
