lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: newbie seeking explanation of semantics of "Field" class
Date Tue, 17 Feb 2009 20:08:14 GMT
This confused me on my first encounter, but it all makes
sense after a while....

The first thing to understand is that Store and Index are
orthogonal. That is, when you index a field, that data
is placed in the inverted index and is searchable, whether
or not you store it. But the original text is not easily
retrievable from the index alone.

Conversely, when you store the data, a literal copy is
stored; no analysis is done. This preserves case,
stopwords, the original word if stemming is used,
punctuation, etc.
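
To make the store/index split concrete, here is a minimal toy model in plain Java. It is not Lucene's implementation and uses no Lucene classes; ToyIndex, its methods, and the crude lowercase-and-split "analysis" are all hypothetical names and simplifications chosen for illustration. It only shows that searching goes through analyzed terms while retrieval returns the literal stored copy.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only: illustrates "indexed" vs. "stored"; not how Lucene is built. */
final class ToyIndex {
    // term -> ids of documents containing that term (a tiny inverted index)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // doc id -> literal, unanalyzed copy of the field value
    private final Map<Integer, String> stored = new HashMap<>();

    /** The two flags are independent, mirroring the Store/Index orthogonality. */
    void add(int docId, String text, boolean store, boolean index) {
        if (store) {
            stored.put(docId, text);
        }
        if (index) {
            // Hypothetical stand-in for an analyzer: lowercase, split on non-alphanumerics.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
                }
            }
        }
    }

    /** Returns ids of documents whose indexed text contains the term. */
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    /** Returns the literal stored value, or null if the field was not stored. */
    String retrieve(int docId) {
        return stored.get(docId);
    }
}
```

In this sketch, a document added with store=false and index=true can be found by search() but retrieve() returns null for it, and the reverse holds for store=true and index=false.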

See below for more details...

On Tue, Feb 17, 2009 at 2:19 PM, <rolarenfan@earthlink.net> wrote:

> R2.4
>
> I have been looking through the soon-to-be-superseded (by its 2nd ed.) book
> "Lucene In Action" (hope it's ok on this newsgroup to say I like that book);
> also at these two tutorials: http://darksleep.com/lucene/ and
> http://www.informit.com/articles/article.aspx?p=461633&seqNum=3 and also
> at the Lucene online docco (http://lucene.apache.org/java/2_4_0/index.html)
> the last of which has nothing on the topic at all! I've also tried to search
> http://www.nabble.com/Lucene---Java-Users-f45.html -- but there are almost
> 10,000 docs there on "Field." so that is too much data.
>
> The book is consistent with the two tutorials, but all three seem to be out
> of date (and the design less clear) compared to the code:
> http://lucene.apache.org/java/2_4_0/api/index.html
>
> I have copied some code and it is working for me, but I am a little
> uncertain how to decide what value of Field.Index and Field.Store to choose
> in order to get the behavior I'd like. If I read the javadocs, and decide to
> ignore all the "expert" items, it looks like this:
>
> Field.Store.NO = I'll never see that data again; I wonder why I'd do this?


Mostly for space reasons. Say you have a version of a document that you
really want to show the user, for instance PDFs or images of pages. We have
cases where we have images of pages in books, and OCRd data of those images.
We'll never want to show the OCR to the user, but that's all we have to
search. What we do show the user is the image of the page from the image
vault. So we don't store the OCR text, just index it.

>
>
> Field.Store.YES = good, the data will be stored


Yes, but this bloats the index. If you're not going to show the user the
field exactly as it exists, there's no reason to store it. See above.


>
>
> Field.Store.COMPRESS = even better, stored and compressed; why would anyone
> do anything else?


Because of the cost involved in decompressing it, assuming you want to store
it in the first place (see above). Say you want to show 500 documents
at a time: the decompression time can add up.
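
As a rough illustration of that trade-off, the sketch below compresses and decompresses a field value with the JDK's java.util.zip classes. It assumes a deflate-style codec purely for illustration and makes no claim about how Lucene handles compressed fields internally; StoredFieldCompression and its method names are hypothetical. The point is only that every read of a compressed value pays for an inflate pass.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Illustrative only: a deflate round trip for one stored field value. */
final class StoredFieldCompression {

    /** Compresses the UTF-8 bytes of the value. */
    static byte[] compress(String value) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(value.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inflates the bytes back into the original string; this is the per-read cost. */
    static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and nothing left to read means the input is incomplete.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or invalid compressed data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Timing decompress() over a page's worth of realistic field values is a reasonable way to judge whether the space savings are worth it for a given application.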

>
>
> ========
>
> Field.Index.NO = I cannot search that data, but if I need its value for a
> given document (e.g., to decorate a result), I can retrieve it (use-case:
> maybe, the date the document was created -- but why not just make that
> searchable? I am having a hard time thinking of an actually useful piece of
> data that could go here and would not want to be one of ANALYZED or
> NOT_ANALYZED)
>

Think harder <G>.... We have an application where we store meta-data in the
document for page navigation. There's no reason to index this data because
we never search on it. We don't want to provide the users with an interface
like "search for Erick on pages 21-32", we don't think it's valuable. One
can solve this kind of problem with an external data store (say a database),
but the added complexity of a second storage mechanism is worth avoiding if
possible.


>
> Field.Index.ANALYZED = the normal value, I would guess, except in the
> special case of stuff not searchable but used to decorate results (
> Field.Index.NO)


Yep, this is the most common case in my experience.


>
>
> Field.Index.NOT_ANALYZED = I can search for this value, but it won't get
> analyzed, so it is searched for as the very same value I put in (the docco
> suggests product numbers: any other interesting use-cases anyone can
> suggest?)


Most analyzers break the input stream into tokens. For instance, it would be very
common to break up 1234-5678 into 1234 and 5678. Assume this is a part
number: matching 1234 alone is useless. And it's even worse if your analyzer
strips out the numbers entirely. The same goes for SSNs, telephone numbers, or
any case where you don't want to break on whitespace. In truth I don't use
this very often, but it saves a world of hurt when needed.
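
The part-number example can be sketched in a few lines of plain Java. This is a hypothetical stand-in for an analyzer (lowercase, then split on anything that is not a letter or digit), not any real Lucene analyzer; PartNumberTokens and its methods are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Illustrative only: what "analyzed" vs. "not analyzed" means for the indexed terms. */
final class PartNumberTokens {

    /** Hypothetical analyzer: lowercases and splits on non-alphanumeric characters. */
    static List<String> analyzed(String value) {
        List<String> tokens = new ArrayList<>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Not analyzed: the whole value is the single indexed term, exactly as given. */
    static List<String> notAnalyzed(String value) {
        return Collections.singletonList(value);
    }
}
```

With this toy analyzer, the analyzed form of 1234-5678 yields two separate terms, so a search for 1234 would match every part number containing that fragment, while the not-analyzed form matches only the exact value.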

Best
Erick

>
>
> =========
>
> thanks in advance for helping me get clearer on this!
>
> -Paul
