lucene-java-user mailing list archives

From Victor Kabdebon <victor.kabde...@gmail.com>
Subject Re: Question from a new user : IndexSearcher.doc
Date Mon, 21 Jun 2010 07:29:32 GMT
Hi Erick,

Thank you very much for your explanations. 588 years is a rather long way off,
so you're right, maybe I don't need to worry about that problem at the moment.
To answer your final question: no, indeed, I won't need to store a lot of
data. Just some keys in order to find the data in Cassandra later on.

If you don't mind, please let me ask you another question :

Is it really worthwhile to begin with Lucene rather than directly with Solr
(or Nutch)? What I mean is: is it about as difficult to implement search with
Solr and stay with it as it is to first implement search with Lucene and then,
when the project becomes very big, move to a new system?
My goal is to have a system that can evolve with time, even if 1 million
documents are added daily.

Thank you,
Victor
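
For reference, the 588-year figure and the 1-million-a-day case can be checked with a few lines of Java. This is only a sketch under simple assumptions that are not from the thread: a single index, a constant daily rate, no IDs ever freed, and 365.25 days per year. The class and method names are made up for illustration.

```java
// Rough headroom estimate for a signed 32-bit document counter.
// Assumptions (illustrative): constant daily rate, one index, no ID reuse.
class DocIdHeadroom {
    // Whole days until Integer.MAX_VALUE documents have been added.
    static long daysUntilExhausted(long docsPerDay) {
        return Integer.MAX_VALUE / docsPerDay;
    }

    // The same figure expressed in years (365.25 days per year).
    static double yearsUntilExhausted(long docsPerDay) {
        return daysUntilExhausted(docsPerDay) / 365.25;
    }

    public static void main(String[] args) {
        System.out.println("10,000/day:    " + yearsUntilExhausted(10_000) + " years");
        System.out.println("1,000,000/day: " + yearsUntilExhausted(1_000_000) + " years");
    }
}
```

At 10,000 documents a day this comes out at roughly 588 years, as in Erick's reply. At 1,000,000 a day it drops to a little under 6 years, so at that rate the limit of a single index is no longer purely theoretical.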

2010/6/21 Erick Erickson <erickerickson@gmail.com>

> By and large, you won't ever actually be interested in very many documents.
> What's returned in the TopDocs structure is the internal document ID and
> score, in score order. But retrieval by document ID is quite efficient; it's
> not a search. I'm quite sure this won't be a problem.
>
> Adding 10,000 documents a day means that in about 588 years you'll exceed a
> 31-bit number. I don't think you really need to worry about that either.
> And that figure already assumes signed ints, which is what Java's int is.
>
> What you will have to worry about is the time to get the top N
> highest-scoring documents. That is, IndexSearcher.search() will be your
> limiting factor long before you reach these numbers. By that time, though,
> you'll have moved to Solr or some other distributed search mechanism.
>
> Performance is influenced by the complexity of the queries and the
> structure and size of your index. The time spent retrieving the top few
> matches is completely dwarfed by the search time for an index of any size.
>
> All this may be irrelevant if you really want to retrieve a very large
> number of documents rather than, say, the top 100. But the use case would
> have to be very interesting for it to be a requirement to return, say,
> 100,000 documents to a user.
>
> But do be aware that you're not retrieving the *original* text with
> IndexSearcher. Typically, the relevant data is indexed but not stored. These
> two concepts are confusing when you start using Lucene, especially since
> they're specified in the same call. Indexing a field splits it up into
> tokens and normalizes them (e.g. lowercases, stems, puts in synonyms, etc.).
> The indexed data is the part that's searched. You can also store the input
> verbatim, but the stored part is just a copy that's never searched but is
> available for retrieval.
>
> Which brings up one of the central decisions you need to make. Are you,
> indeed, going to store all the data for retrieval in your index or just
> index the relevant text to be searched along with some locator information
> to the original document? You mention Cassandra, which leads me to
> speculate that it's the latter.
>
> HTH
> Erick
>
>
> On Sun, Jun 20, 2010 at 4:04 PM, Victor Kabdebon
> <victor.kabdebon@gmail.com> wrote:
>
> > Hello Simon,
> >
> > As I told you, I am quite new to Lucene, so there are many things that
> > might be wrong.
> > I'm using Lucene to build a search service for a website that receives a
> > large amount of information daily. This information is directly available
> > as text in a Cassandra database.
> > There might be as many as 10,000 new documents added daily, and yes, my
> > concern is whether it is possible to retrieve more documents than the
> > integer max value.
> > I also don't really see how IndexSearcher.doc() works, because it seems
> > like we give this method an ID and it is going to search in the indexed
> > documents. So what exactly is IndexSearcher.doc(int) going to do?
> >
> > *Or are you concerned about retrieving all documents
> > containing term "XY" if the number of documents matching is large?*
> >
> > I'm also concerned by this problem, yes
> >
> > Could you explain to me a little how it works, and how Lucene enables
> > one to retrieve a very large number of documents even though it uses int?
> >
> > Thank you for your answers,
> > Victor
> >
> > 2010/6/20 Simon Willnauer <simon.willnauer@googlemail.com>
> >
> > > Hi, maybe I don't understand your question correctly. Are you asking
> > > if you could run into problems if you retrieve more documents than
> > > integer max value? Or are you concerned about retrieving all documents
> > > containing term "XY" if the number of documents matching is large? If
> > > you are afraid of loading all documents matched from a stored field I
> > > guess you are doing something wrong.
> > > What are you using lucene for?
> > >
> > > simon
> > >
> > > On Sun, Jun 20, 2010 at 8:00 PM, Victor Kabdebon
> > > <victor.kabdebon@gmail.com> wrote:
> > > > Hello everybody,
> > > >
> > > > I am new to Apache Lucene and it seems to fit the needs of my
> > > > application perfectly.
> > > > However, I'm a little concerned about something (pardon me if it's a
> > > > recurrent question; I've searched the archives but didn't find
> > > > anything about it).
> > > >
> > > > So here is my case :
> > > >
> > > > I have indexed a few files (about 10) and I'm trying to search for
> > > > something trivial in them: the word "test". So after opening everything
> > > > etc. (assuming that works), I do this:
> > > >
> > > > *Term test = new Term("text_comment", "test");*
> > > > *Query query = new TermQuery(test);*
> > > > *TopDocs top = searcher.search(query, 10);*
> > > >
> > > > I want to recover the first document (I have 2 documents in TopDocs), so
> > > > I do:
> > > >
> > > > *searcher.doc(top.scoreDocs[0].doc)*
> > > >
> > > > I searched a little in the javadoc and saw that this method takes an
> > > > "int" as a parameter.
> > > > I'm a little concerned about this... At the moment I have 10 documents,
> > > > so that's ok, but if I want to index, let's say, 20 files documents, how
> > > > will IndexSearcher.doc(int) be able to retrieve documents?
> > > > Same problem if 100,000 files have the word "test" in "text_comment":
> > > > will I still be able to get these 100,000 documents, or is it going to
> > > > be a problem?
> > > >
> > > > Thank you very much.
> > > >
> > >
> >
>
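
Two points from Erick's reply, that indexed data and stored data are separate things and that fetching a document by ID is a lookup rather than a search, can be illustrated with a toy model. The sketch below is deliberately NOT Lucene's API: `ToyIndex`, its methods, and the sample keys are all hypothetical, there is no scoring, and the "analysis" is just lowercasing and splitting on non-word characters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model only: not Lucene, just the indexed/stored split side by side.
class ToyIndex {
    // "Stored" side: verbatim values, addressed by document ID (list position).
    private final List<String> stored = new ArrayList<>();
    // "Indexed" side: normalized token -> IDs of the documents containing it.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Indexes `text` (tokenized, lowercased) and stores `key` verbatim.
    int add(String text, String key) {
        int docId = stored.size();
        stored.add(key);
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) continue;
            List<Integer> ids = postings.computeIfAbsent(token, t -> new ArrayList<>());
            // Record each document at most once per token.
            if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) ids.add(docId);
        }
        return docId;
    }

    // Search: consults only the indexed side and returns at most n document IDs.
    List<Integer> search(String term, int n) {
        List<Integer> ids = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<Integer>emptyList());
        return new ArrayList<>(ids.subList(0, Math.min(n, ids.size())));
    }

    // Retrieval by ID: a direct lookup on the stored side, no searching.
    String doc(int docId) {
        return stored.get(docId);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("This is a Test comment", "cassandra-key-1"); // hypothetical keys
        index.add("Nothing relevant here", "cassandra-key-2");
        for (int id : index.search("test", 10)) {
            System.out.println(id + " -> " + index.doc(id));
        }
    }
}
```

In this model the original text is never kept, only its tokens, so a hit can be mapped back to the source system solely through the stored key. That matches the "index the text, store a locator" setup discussed above for data that lives in Cassandra.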
