Subject: svn commit: r479465 [2/4] - in /lucene/java/trunk: docs/ docs/images/ docs/lucene-sandbox/ docs/styles/ src/site/ src/site/src/ src/site/src/documentation/ src/site/src/documentation/classes/ src/site/src/documentation/conf/ src/site/src/documentation/...
Date: Mon, 27 Nov 2006 00:00:49 -0000
To: java-commits@lucene.apache.org
From: gsingers@apache.org
X-Mailer: svnmailer-1.1.0
Message-Id: <20061127000050.5FD531A984D@eris.apache.org>
X-Virus-Checked: Checked by ClamAV on apache.org

Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/demo4.xml
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/demo4.xml?view=auto&rev=479465
==============================================================================
--- lucene/java/trunk/src/site/src/documentation/content/xdocs/demo4.xml (added)
+++ lucene/java/trunk/src/site/src/documentation/content/xdocs/demo4.xml Sun Nov 26 16:00:46 2006
@@ -0,0 +1,160 @@
+ + Apache Lucene - Basic Demo Sources Walkthrough + +
+ +Andrew C. Oliver + + + +
About the Code +

+In this section we walk through the sources behind the basic Lucene Web Application demo: where to +find them, their parts and their function. This section is intended for Java developers wishing to +understand how to use Lucene in their applications or for those involved in deploying web +applications based on Lucene. +

+
+ + +
Location of the source (developers/deployers) +

+Relative to the directory created when you extracted Lucene or retrieved it from Subversion, you +should see a directory called src which in turn contains a directory called +jsp. This is the root for all of the Lucene web demo. +

+

+Within this directory you should see index.jsp. Bring this up in vi or your editor of +choice. +

+
+ +
index.jsp (developers/deployers) +

+This jsp page is pretty boring by itself. All it does is include a header, display a form and +include a footer. If you look at the form, it has two fields: query (where you enter +your search criteria) and maxresults where you specify the number of results per page. +By the structure of this JSP it should be easy to customize it without even editing this particular +file. You could simply change the header and footer. Let's look at the header.jsp +(located in the same directory) next. +

+
+ +
header.jsp (developers/deployers) +

+The header is also very simple by itself. The only thing it does is include the +configuration.jsp (which you looked at in the last section of this guide) and set the +title and a brief header. This would be a good place to put your own custom HTML to "pretty" things +up a bit. We won't cover the footer because all it does is display the footer and close your tags. +Let's look at the results.jsp, the meat of this application, next. +

+
+ +
results.jsp (developers) +

+Most of the functionality lies in results.jsp. Much of it is for paging the search +results, which we'll not cover here as it's commented well enough. The first thing in this page is +the actual imports for the Lucene classes and Lucene demo classes. These classes are loaded from +the jars included in the WEB-INF/lib directory in the luceneweb.war file. +

+

+You'll notice that this file includes the same header and footer as index.jsp. From +there it constructs an IndexSearcher with the +indexLocation that was specified in configuration.jsp. If there is an +error of any kind in opening the index, it is displayed to the user and the boolean flag +error is set to tell the rest of the sections of the jsp not to continue. +

+

+From there, this jsp attempts to get the search criteria, the start index (used for paging) and the +maximum number of results per page. If the maximum results per page is not set or not valid then it +and the start index are set to default values. If only the start index is invalid it is set to a +default value. If the criteria isn't provided then a servlet error is thrown (it is assumed that +this is the result of url tampering or some form of browser malfunction). +

+

+The jsp moves on to construct a StandardAnalyzer to +analyze the search text. This matches the analyzer used during indexing (IndexHTML), which is generally +recommended. This is passed to the QueryParser along with the +criteria to construct a Query +object. You'll also notice the string literal "contents" included. This specifies +that the search should cover the contents field and not the title, +url or some other field in the indexed documents. If there is any error in +constructing a Query object an +error is displayed to the user. +

+

+In the next section of the jsp the IndexSearcher is asked to search +given the query object. The results are returned in a collection called hits. If the +length property of the hits collection is 0 (meaning there were no results) then an +error is displayed to the user and the error flag is set. +

+

+Finally the jsp iterates through the hits collection, taking the current page into +account, and displays properties of the Document objects we talked about in +the first walkthrough. These objects contain "known" fields specific to their indexer (in this case +IndexHTML constructs a document +with "url", "title" and "contents"). +

+

+Please note that in a real deployment of Lucene, it's best to instantiate IndexSearcher and QueryParser once, and then +share them across search requests, instead of re-instantiating per search request. +

+
+ +
More sources (developers) +

+There are additional sources used by the web app that were not specifically covered by either +walkthrough. For example the HTML parser, the IndexHTML class and HTMLDocument class. These are very +similar to the classes covered in the first example, with properties specific to parsing and +indexing HTML. This is beyond our scope; however, by now you should feel like you're "getting +started" with Lucene. +

+
+ +
Where to go from here? (everyone!) +

+There are a number of things this demo doesn't do or doesn't do quite right. For instance, you may have noticed that documents in the root context are unreachable (unless you reconfigure Tomcat to support that context or redirect to it). Anywhere the directory doesn't quite match the context mapping, you'll have a broken link in your results. Indexing non-local files, or other special needs, isn't supported, and there may be security issues with running the indexing application from your webapps directory. These are among the things left for you, the developer, to do.

+

+In time some of these things may be added to Lucene as features (if you've got a good idea we'd love +to hear it!), but for now: this is where you begin and the search engine/indexer ends. Lastly, one +would assume you'd want to follow the above advice and customize the application to look a little +more fancy than black on white with "Lucene Template" at the top. We'll see you on the Lucene +Users' or Developers' mailing lists! +

+
+ +
When to contact the Author +

+Please resist the urge to contact the authors of this document (without bribes of fame and fortune +attached). First contact the mailing lists, taking care to Ask Questions The Smart Way. +Certainly you'll get the most help that way as well. That being said, feedback, and modifications +to this document and samples are ever so greatly appreciated. They are just best sent to the lists +or posted as patches, so that +everyone can share in them. Thanks for understanding! +

+
+ + +

Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/features.xml
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/features.xml?view=auto&rev=479465
==============================================================================
--- lucene/java/trunk/src/site/src/documentation/content/xdocs/features.xml (added)
+++ lucene/java/trunk/src/site/src/documentation/content/xdocs/features.xml Sun Nov 26 16:00:46 2006
@@ -0,0 +1,47 @@
+Apache Lucene - Features +
+ + +
Features +

Lucene offers powerful features through a simple API:

+
+ +
Scalable, High-Performance Indexing +
  • over 20MB/minute on Pentium M 1.5GHz
  • small RAM requirements -- only 1MB heap
  • incremental indexing as fast as batch indexing
  • index size roughly 20-30% the size of text indexed
+
+ + + +
Cross-Platform Solution + +
+ + +

Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml?view=auto&rev=479465
==============================================================================
--- lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml (added)
+++ lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml Sun Nov 26 16:00:46 2006
@@ -0,0 +1,1377 @@
+ +Apache Lucene - Index File Formats + +
+ + + + + + + + +
+ Index File Formats +

+ This document defines the index file formats used + in Lucene version 2.0. If you are using a different + version of Lucene, please consult the copy of + docs/fileformats.html that was distributed + with the version you are using. +

+ +

+ Apache Lucene is written in Java, but several + efforts are underway to write + versions + of Lucene in other programming + languages. If these versions are to remain compatible with Apache + Lucene, then a language-independent definition of the Lucene index + format is required. This document thus attempts to provide a + complete and independent definition of the Apache Lucene 1.4 file + formats. +

+ +

+ As Lucene evolves, this document should evolve. + Versions of Lucene in different programming languages should endeavor + to agree on file formats, and generate new versions of this document. +

+ +

+ Compatibility notes are provided in this document, + describing how file formats have changed from prior versions. +

+ +
+ +
+ Definitions +

+ The fundamental concepts in Lucene are index, + document, field and term. +

+ + +

+ An index contains a sequence of documents. +

+ +
  • A document is a sequence of fields.
  • A field is a named sequence of terms.
  • A term is a string.
+ +

+ The same string in two different fields is + considered a different term. Thus terms are represented as a pair of + strings, the first naming the field, and the second naming text + within the field. +

+ +
+ Inverted Indexing +

+ The index stores statistics about terms in order + to make term-based search more efficient. Lucene's + index falls into the family of indexes known as an inverted + index. This is because it can list, for a term, the documents that contain + it. This is the inverse of the natural relationship, in which + documents list terms. +

+
+
+ Types of Fields +

+ In Lucene, fields may be stored, in which + case their text is stored in the index literally, in a non-inverted + manner. Fields that are inverted are called indexed. A field + may be both stored and indexed.

+ +

The text of a field may be tokenized into terms to be + indexed, or the text of a field may be used literally as a term to be indexed. + Most fields are + tokenized, but sometimes it is useful for certain identifier fields + to be indexed literally. +

+

See the Field java docs for more information on Fields.

+
+ +
+ Segments +

+ Lucene indexes may be composed of multiple sub-indexes, or + segments. Each segment is a fully independent index, which could be searched + separately. Indexes evolve by: +

+ +
  1. Creating new segments for newly added documents.
  2. Merging existing segments.
+ +

+ Searches may involve multiple segments and/or multiple indexes, each + index potentially composed of a set of segments. +

+
+ +
+ Document Numbers +

+ Internally, Lucene refers to documents by an integer document + number. The first document added to an index is numbered zero, and each + subsequent document added gets a number one greater than the previous. +

+ +

+
+

+ +

+ Note that a document's number may change, so caution should be taken + when storing these numbers outside of Lucene. In particular, numbers may + change in the following situations: +

+ + +
  • The numbers stored in each segment are unique only within the segment, and must be converted before they can be used in a larger context. The standard technique is to allocate each segment a range of values, based on the range of numbers used in that segment. To convert a document number from a segment to an external value, the segment's base document number is added. To convert an external value back to a segment-specific value, the segment is identified by the range that the external value is in, and the segment's base value is subtracted. For example, two five-document segments might be combined, so that the first segment has a base value of zero, and the second of five. Document three from the second segment would have an external value of eight.

  • When documents are deleted, gaps are created in the numbering. These are eventually removed as the index evolves through merging. Deleted documents are dropped when segments are merged. A freshly-merged segment thus has no gaps in its numbering.
+ +
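The base-number conversion described in the first item above can be sketched as follows. This is a minimal illustration, not Lucene code: the function names (`segment_bases`, `to_external`, `to_segment`) are hypothetical, and it assumes segments are simply laid end to end in order.

```python
from bisect import bisect_right


def segment_bases(segment_sizes):
    """Return the base document number of each segment, in segment order."""
    bases, total = [], 0
    for size in segment_sizes:
        bases.append(total)
        total += size
    return bases


def to_external(bases, segment_index, local_doc):
    """Segment-relative number -> external number: add the segment's base."""
    return bases[segment_index] + local_doc


def to_segment(bases, external_doc):
    """External number -> (segment index, segment-relative number)."""
    # The owning segment is the last one whose base is <= external_doc.
    segment_index = bisect_right(bases, external_doc) - 1
    return segment_index, external_doc - bases[segment_index]


# Two five-document segments, as in the example in the text.
bases = segment_bases([5, 5])
print(to_external(bases, 1, 3))  # document three of the second segment
print(to_segment(bases, 8))
```

With two five-document segments the bases are 0 and 5, so document three of the second segment maps to external value eight, and eight maps back to that same document.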
+ +
+ +
+ Overview +

+ Each segment index maintains the following: +

+
  • Field names. This contains the set of field names used in the index.

  • Stored Field values. This contains, for each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, url, or an identifier to access a database. The set of stored fields is what is returned for each hit when searching. This is keyed by document number.

  • Term dictionary. A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents which contain the term, and pointers to the term's frequency and proximity data.

  • Term Frequency data. For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in that document.

  • Term Proximity data. For each term in the dictionary, the positions at which the term occurs in each document.

  • Normalization factors. For each field in each document, a value is stored that is multiplied into the score for hits on that field.

  • Term Vectors. For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add Term Vectors to your index see the Field constructors.

  • Deleted documents. An optional file indicating which documents are deleted.
+ +

Details on each of these are provided in subsequent sections. +

+
+ +
+ File Naming +

+ All files belonging to a segment have the same name with varying + extensions. The extensions correspond to the different file formats + described below. When using the Compound File format (default in 1.4 and greater) these files are + collapsed into a single .cfs file (see below for details) +

+ +

+ Typically, all segments + in an index are stored in a single directory, although this is not + required. +

+ +
+ +
+ Primitive Types +
+ Byte +

+ The most primitive type + is an eight-bit byte. Files are accessed as sequences of bytes. All + other data types are defined as sequences + of bytes, so file formats are byte-order independent. +

+ +
+ +
+ UInt32 +

+ 32-bit unsigned integers are written as four + bytes, high-order bytes first. +

+

+ UInt32 --> <Byte>^4

+ +
+ +
+ UInt64 +

+ 64-bit unsigned integers are written as eight + bytes, high-order bytes first. +

+ +

UInt64 --> <Byte>^8
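As a small illustration of "high-order bytes first", the two fixed-width types can be written and read with Python's `struct` module using big-endian format codes. The helper names are hypothetical and are not part of any Lucene API.

```python
import struct


def write_uint32(value):
    """Four bytes, high-order byte first (big-endian)."""
    return struct.pack(">I", value)


def write_uint64(value):
    """Eight bytes, high-order byte first (big-endian)."""
    return struct.pack(">Q", value)


def read_uint32(data, offset=0):
    """Read a UInt32 starting at the given byte offset."""
    return struct.unpack_from(">I", data, offset)[0]


def read_uint64(data, offset=0):
    """Read a UInt64 starting at the given byte offset."""
    return struct.unpack_from(">Q", data, offset)[0]


print(write_uint32(1).hex())
print(write_uint64(258).hex())
```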

+ +
+ +
+ VInt +

+ A variable-length format for non-negative integers is defined where the high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits are appended as increasingly more significant bits in the resulting integer value. Thus values from zero to 127 may be stored in a single byte, values from 128 to 16,383 may be stored in two bytes, and so on.

+ +

VInt Encoding Example

+ Value    First byte  Second byte  Third byte
+ 0        00000000
+ 1        00000001
+ 2        00000010
+ ...
+ 127      01111111
+ 128      10000000    00000001
+ 129      10000001    00000001
+ 130      10000010    00000001
+ ...
+ 16,383   11111111    01111111
+ 16,384   10000000    10000000     00000001
+ 16,385   10000001    10000000     00000001
+ ...

+ This provides compression while still being + efficient to decode. +
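The VInt scheme described above can be sketched in a few lines. This is an illustrative reading of the description, not Lucene's own code; `encode_vint` and `decode_vint` are hypothetical names.

```python
def encode_vint(value):
    """Encode a non-negative integer, low-order seven bits first."""
    if value < 0:
        raise ValueError("VInt cannot represent negative values")
    out = bytearray()
    while value > 0x7F:
        # Set the high-order bit: more bytes remain to be read.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # final byte, high-order bit clear
    return bytes(out)


def decode_vint(data, offset=0):
    """Return (value, offset of the byte after the VInt)."""
    result, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result, offset
        shift += 7


for n in (0, 127, 128, 16383, 16384):
    print(n, " ".join(format(b, "08b") for b in encode_vint(n)))
```

Running the loop prints one bit pattern per byte, which can be compared row by row with the encoding example table.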

+ +
+ +
+ Chars +

+ Lucene writes Unicode character sequences using Java's "modified UTF-8 encoding".

+ + +
+ +
+ String +

+ Lucene writes strings as a VInt representing the length, followed by + the character data. +

+ +

+ String --> VInt, Chars +

+ +
+ +
+ +
+ Per-Index Files +

+ The files in this section exist one-per-index. +

+ +
+ Segments File +

+ The active segments in the index are stored in the + segment info file. An index only has + a single file in this format, and it is named "segments". + This lists each segment by name, and also contains the size of each + segment. +

+ +

+ Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize>^SegCount

+ +

+ Format, NameCounter, SegCount, SegSize --> UInt32 +

+ +

+ Version --> UInt64 +

+ +

+ SegName --> String +

+ +

+ Format is -1 in Lucene 1.4. +

+ +

+ Version counts how often the index has been + changed by adding or deleting documents. +

+ +

+ NameCounter is used to generate names for new segment files. +

+ +

+ SegName is the name of the segment, and is used as the file name prefix + for all of the files that compose the segment's index. +

+ +

+ SegSize is the number of documents contained in the segment index. +

+ + +
+ +
+ Lock Files +

+ Several files are used to indicate that another + process is using an index. Note that these files are not + stored in the index directory itself, but rather in the + system's temporary directory, as indicated in the Java + system property "java.io.tmpdir". +

+ +
  • When a file named "commit.lock" is present, a process is currently re-writing the "segments" file and deleting outdated segment index files, or a process is reading the "segments" file and opening the files of the segments it names. This lock file prevents files from being deleted by another process after a process has read the "segments" file but before it has managed to open all of the files of the segments named therein.

  • When a file named "write.lock" is present, a process is currently adding documents to an index, or removing files from that index. This lock file prevents several processes from attempting to modify an index at the same time.
+
+ +
+ Deletable File +

+ A file named "deletable" + contains the names of files that are no longer used by the index, but + which could not be deleted. This is only used on Win32, where a + file may not be deleted while it is still open. On other platforms + the file contains only null bytes. +

+ +

+ Deletable --> DeletableCount, <DeletableName>^DeletableCount

+ +

DeletableCount --> UInt32 +

+

DeletableName --> + String +

+
+ +
+ Compound Files +

Starting with Lucene 1.4 the compound file format became default. This + is simply a container for all files described in the next section.

+ +

Compound (.cfs) --> FileCount, <DataOffset, FileName>^FileCount, FileData^FileCount

+ +

FileCount --> VInt

+ +

DataOffset --> Long

+ +

FileName --> String

+ +

FileData --> raw file data

+

The raw file data is the data from the individual files named above.

+ +
+ +
+ +
+ Per-Segment Files +

+ The remaining files are all per-segment, and are + thus defined by suffix. +

+
+ Fields +


Field Info

+ +

+ Field names are + stored in the field info file, with suffix .fnm. +

+

+ FieldInfos (.fnm) --> FieldsCount, <FieldName, FieldBits>^FieldsCount

+ +

+ FieldsCount --> VInt +

+ +

+ FieldName --> String +

+ +

+ FieldBits --> Byte +

+ +

+

  • The low-order bit is one for indexed fields, and zero for non-indexed fields.
  • The second lowest-order bit is one for fields that have term vectors stored, and zero for fields without term vectors.

    Lucene >= 1.9:

  • If the third lowest-order bit is set (0x04), term positions are stored with the term vectors.
  • If the fourth lowest-order bit is set (0x08), term offsets are stored with the term vectors.
  • If the fifth lowest-order bit is set (0x10), norms are omitted for the indexed field.
+
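The FieldBits flags listed above can be decoded with simple bit masks. The constant and key names below are hypothetical labels chosen for this sketch; only the bit values (0x01 through 0x10) come from the text.

```python
# Bit masks taken from the list above; the names are illustrative only.
IS_INDEXED = 0x01
HAS_TERM_VECTOR = 0x02
VECTOR_HAS_POSITIONS = 0x04  # Lucene >= 1.9
VECTOR_HAS_OFFSETS = 0x08    # Lucene >= 1.9
OMITS_NORMS = 0x10           # Lucene >= 1.9


def decode_field_bits(bits):
    """Expand a FieldBits byte into a dict of named boolean flags."""
    return {
        "indexed": bool(bits & IS_INDEXED),
        "term_vector": bool(bits & HAS_TERM_VECTOR),
        "vector_positions": bool(bits & VECTOR_HAS_POSITIONS),
        "vector_offsets": bool(bits & VECTOR_HAS_OFFSETS),
        "omit_norms": bool(bits & OMITS_NORMS),
    }


# An indexed field with term vectors and omitted norms: 0x01 | 0x02 | 0x10.
print(decode_field_bits(0x13))
```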

+ +

+ Fields are numbered by their order in this file. Thus field zero is + the + first field in the file, field one the next, and so on. Note that, + like document numbers, field numbers are segment relative. +

+ +


Stored Fields

+ +

+ Stored fields are represented by two files: +

+ +
    +
  1. +

    + The field index, or .fdx file. +

    + +

    + This contains, for each document, a pointer to + its field data, as follows: +

    + +

    FieldIndex (.fdx) --> <FieldValuesPosition>^SegSize

    FieldValuesPosition --> UInt64

    This is used to find the location within the field data file of the fields of a particular document. Because it contains fixed-length data, this file may be easily randomly accessed. The position of document n's field data is the UInt64 at n*8 in this file.

    +
  2. +

    + The field data, or .fdt file. + +

    + +

    + This contains the stored fields of each document, + as follows: +

    + +

    FieldData (.fdt) --> <DocFieldData>^SegSize

    DocFieldData --> FieldCount, <FieldNum, Bits, Value>^FieldCount

    +

    FieldCount --> + VInt +

    +

    FieldNum --> + VInt +

    + +

    Lucene <= 1.4:

    +

    Bits --> + Byte +

    +

    Value --> + String +

    +

    Only the low-order bit of Bits is used. It is one for + tokenized fields, and zero for non-tokenized fields. +

    +

    Lucene >= 1.9:

    +

    Bits --> + Byte +

    +

    +

    • low order bit is one for tokenized fields
    • second bit is one for fields containing binary data
    • third bit is one for fields with compression option enabled (if compression is enabled, the algorithm used is ZLIB)

    +

    Value --> + String | BinaryValue (depending on Bits) +

    +

    BinaryValue --> + ValueSize, <Byte>^ValueSize +

    +

    ValueSize --> + VInt +

    + +
+ +
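Because the field index holds one fixed-width UInt64 per document, looking up a document's position in the field data file is a single offset computation. The sketch below works on an in-memory byte string standing in for the contents of an .fdx file. The name `field_data_position` and the sample positions are made up for illustration.

```python
import struct


def field_data_position(fdx_bytes, doc_number):
    """Return the .fdt position of a document's stored fields.

    The entry for document n is the big-endian UInt64 at byte n*8.
    """
    return struct.unpack_from(">Q", fdx_bytes, doc_number * 8)[0]


# A made-up field index for three documents.
fdx = struct.pack(">3Q", 0, 42, 97)
print(field_data_position(fdx, 1))
```

With a real index the same computation would be a seek to `doc_number * 8` followed by an eight-byte read.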
+
+ Term Dictionary +

+ The term dictionary is represented as two files: +

+
    +
  1. +

    + The term infos, or .tis file. +

    + +

    + TermInfoFile (.tis)--> + TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos +

    +

    TIVersion --> + UInt32 +

    +

    TermCount --> + UInt64 +

    +

    IndexInterval --> + UInt32 +

    +

    SkipInterval --> + UInt32 +

    +

    TermInfos --> <TermInfo>^TermCount

    +

    TermInfo --> + <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta> +

    +

    Term --> + <PrefixLength, Suffix, FieldNum> +

    +

    Suffix --> + String +

    +

    PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt

    +

    This + file is sorted by Term. Terms are ordered first lexicographically + by the term's field name, and within that lexicographically by the + term's text. +

    +

    TIVersion names the version of the format + of this file and is -2 in Lucene 1.4. +

    +

    Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be prepended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".

    +

    FieldNum determines the term's field, whose name is stored in the .fnm file.

    +

    DocFreq + is the count of documents which contain the term. +

    +

    FreqDelta + determines the position of this term's TermFreqs within the .frq + file. In particular, it is the difference between the position of + this term's data in that file and the position of the previous + term's data (or zero, for the first term in the file). +

    +

    ProxDelta determines the position of this term's TermPositions within the .prx file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).

    +

    SkipDelta determines the position of this + term's SkipData within the .frq file. In + particular, it is the number of bytes + after TermFreqs that the SkipData starts. + In other words, it is the length of the + TermFreq data. +

    +
  2. +

    + The term info index, or .tii file. +

    + +

    + This contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. This is designed to be read entirely into memory and used to provide random access to the .tis file.

    + +

    + The structure of this file is very similar to the + .tis file, with the addition of one item per record, the IndexDelta. +

    + +

    + TermInfoIndex (.tii)--> + TIVersion, IndexTermCount, IndexInterval, SkipInterval, TermIndices +

    +

    TIVersion --> + UInt32 +

    +

    IndexTermCount --> + UInt64 +

    +

    IndexInterval --> + UInt32 +

    +

    SkipInterval --> + UInt32 +

    +

    TermIndices --> <TermInfo, IndexDelta>^IndexTermCount

    +

    IndexDelta --> + VLong +

    +

    IndexDelta + determines the position of this term's TermInfo within the .tis file. In + particular, it is the difference between the position of this term's + entry in that file and the position of the previous term's entry. +

    +

    SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int). + Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while + smaller values result in bigger indexes, less acceleration and more + accelerable cases.

    +
+
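The prefix sharing used for term text in the .tis file can be illustrated with a small round-trip sketch. It works on term text only and ignores FieldNum and the other TermInfo members, so it is a simplification of the real record layout. The function names are hypothetical.

```python
def shared_prefix_length(previous, current):
    """Number of leading characters the two strings have in common."""
    length = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        length += 1
    return length


def compress_terms(sorted_terms):
    """Turn sorted term texts into (PrefixLength, Suffix) pairs."""
    entries, previous = [], ""
    for term in sorted_terms:
        prefix_length = shared_prefix_length(previous, term)
        entries.append((prefix_length, term[prefix_length:]))
        previous = term
    return entries


def expand_terms(entries):
    """Rebuild the term texts from (PrefixLength, Suffix) pairs."""
    terms, previous = [], ""
    for prefix_length, suffix in entries:
        term = previous[:prefix_length] + suffix
        terms.append(term)
        previous = term
    return terms


print(compress_terms(["bone", "boy"]))
```

For the example in the text, "boy" following "bone" is stored as PrefixLength two and suffix "y".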
+ +
+ Frequencies +

+ The .frq file contains the lists of documents + which contain each term, along with the frequency of the term in that + document. +

+

FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount

TermFreqs --> <TermFreq>^DocFreq

TermFreq --> DocDelta, Freq?

SkipData --> <SkipDatum>^(DocFreq/SkipInterval)

SkipDatum --> DocSkip, FreqSkip, ProxSkip

DocDelta, Freq, DocSkip, FreqSkip, ProxSkip --> VInt

+

TermFreqs + are ordered by term (the term is implicit, from the .tis file). +

+

TermFreq + entries are ordered by increasing document number. +

+

DocDelta + determines both the document number and the frequency. In + particular, DocDelta/2 is the difference between this document number + and the previous document number (or zero when this is the first + document in a TermFreqs). When DocDelta is odd, the frequency is + one. When DocDelta is even, the frequency is read as another VInt. +

+

For + example, the TermFreqs for a term which occurs once in document seven + and three times in document eleven would be the following sequence of + VInts: +

+

15, 8, 3
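The DocDelta rule can be checked with a short sketch that works on already-decoded VInt values rather than on raw bytes. The function names are hypothetical, and skip data is not modelled.

```python
def encode_term_freqs(postings):
    """postings: (document number, frequency) pairs in increasing doc order."""
    values, previous_doc = [], 0
    for doc, freq in postings:
        delta = doc - previous_doc
        if freq == 1:
            values.append(delta * 2 + 1)  # odd DocDelta: frequency is one
        else:
            values.append(delta * 2)      # even DocDelta: Freq follows
            values.append(freq)
        previous_doc = doc
    return values


def decode_term_freqs(values):
    """Inverse of encode_term_freqs."""
    postings, doc = [], 0
    stream = iter(values)
    for doc_delta in stream:
        doc += doc_delta >> 1  # DocDelta/2 is the gap to the previous doc
        freq = 1 if doc_delta & 1 else next(stream)
        postings.append((doc, freq))
    return postings


print(encode_term_freqs([(7, 1), (11, 3)]))
```

For a term occurring once in document seven and three times in document eleven, this yields the same three values as the example above.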

+

DocSkip records the document number before every SkipInterval-th document in TermFreqs. Document numbers are represented as differences from the previous value in the sequence. FreqSkip and ProxSkip record the position of every SkipInterval-th entry in FreqFile and ProxFile, respectively. File positions are relative to the start of TermFreqs and Positions for the first SkipDatum, and to the previous SkipDatum in the sequence thereafter.

+

For example, if DocFreq=35 and SkipInterval=16, + then there are two SkipData entries, containing + the 15th and 31st document + numbers in TermFreqs. The first FreqSkip names + the number of bytes after the beginning of + TermFreqs that the 16th SkipDatum + starts, and the second the number of bytes after + that that the 32nd starts. The first + ProxSkip names the number of bytes after the + beginning of Positions that the 16th + SkipDatum starts, and the second the number of + bytes after that that the 32nd starts. +

+ +
+
+ Positions +

+ The .prx file contains the lists of positions that + each term occurs at within documents. +

+

ProxFile (.prx) --> <TermPositions>^TermCount

TermPositions --> <Positions>^DocFreq

Positions --> <PositionDelta>^Freq

PositionDelta --> VInt

+

TermPositions + are ordered by term (the term is implicit, from the .tis file). +

+

Positions + entries are ordered by increasing document number (the document + number is implicit from the .frq file). +

+

PositionDelta + is the difference between the position of the current occurrence in + the document and the previous occurrence (or zero, if this is the + first occurrence in this document). +

+

+ For example, the TermPositions for a + term which occurs as the fourth term in one document, and as the + fifth and ninth term in a subsequent document, would be the following + sequence of VInts: +

+

4, 5, 4
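The delta encoding of positions can be sketched the same way, again on decoded VInt values. The function names are hypothetical, and each call handles the positions of one term within a single document.

```python
from itertools import accumulate


def encode_positions(positions):
    """Positions of one term in one document -> PositionDelta values."""
    deltas, previous = [], 0
    for position in positions:
        deltas.append(position - previous)
        previous = position
    return deltas


def decode_positions(deltas):
    """PositionDelta values -> absolute positions (running sum)."""
    return list(accumulate(deltas))


# Fourth term in one document; fifth and ninth term in the next.
print(encode_positions([4]) + encode_positions([5, 9]))
```

The running total restarts at zero for each document, which is why the second document contributes 5 and 4 rather than 1 and 4.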

+
+
+ Normalization Factors +

There's a norm file for each indexed field with a byte for + each document. The .f[0-9]* file contains, + for each document, a byte that encodes a value that is multiplied + into the score for hits on that field: +

+

Norms (.f[0-9]*) --> <Byte>^SegSize

+

Each + byte encodes a floating point value. Bits 0-2 contain the 3-bit + mantissa, and bits 3-7 contain the 5-bit exponent. +

+

These + are converted to an IEEE single float value as follows: +

+
    +
  1. If + the byte is zero, use a zero float. +
  2. Otherwise, + set the sign bit of the float to zero; +
  3. add + 48 to the exponent and use this as the float's exponent; +
  4. map + the mantissa to the high-order 3 bits of the float's mantissa; and +
  5. set + the low-order 21 bits of the float's mantissa to zero. +
    +
+ +
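The conversion steps above can be sketched in Python as follows. This is an illustration, not Lucene's code: the function name `norm_byte_to_float` is hypothetical, and the exact bit placement is an assumption drawn from the text. The sketch puts the 3-bit mantissa in bits 21-23 of the 32-bit pattern, directly above the 21 zeroed low-order bits, and puts the exponent plus 48 above that, starting at bit 24.

```python
import struct


def norm_byte_to_float(b: int) -> float:
    """Decode one norm byte into a Python float via a 32-bit IEEE pattern."""
    if not 0 <= b <= 0xFF:
        raise ValueError("expected a value in the range 0..255")
    if b == 0:
        return 0.0  # step 1: a zero byte means a zero float
    mantissa = b & 0x07         # bits 0-2 of the byte
    exponent = (b >> 3) & 0x1F  # bits 3-7 of the byte
    # Assumed layout: sign bit stays zero, low-order 21 bits stay zero.
    bits = ((exponent + 48) << 24) | (mantissa << 21)
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(norm_byte_to_float(0x7C))  # prints 1.0
```

Under these assumptions the byte 0x7C (exponent 15, mantissa 4) produces the bit pattern 0x3F800000, which is 1.0.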
+
+ Term Vectors + Term Vector support is optional on a field-by-field basis. It consists of three + files. +
    +
  1. +

    The Document Index or .tvx file.

    +

    This contains, for each document, a pointer to the document data in the Document + (.tvd) file. +

    +

    DocumentIndex (.tvx) --> TVXVersion<DocumentPosition>NumDocs

    +

    TVXVersion --> Int

    +

    DocumentPosition --> UInt64

    +

    This is used to find the position of the Document in the .tvd file.

    +
    +
  2. +

    The Document or .tvd file.

    +

    This contains, for each document, the number of fields, a list of the fields with + term vector info and finally a list of pointers to the field information in the .tvf + (Term Vector Fields) file.

    +

    + Document (.tvd) --> TVDVersion<NumFields, FieldNums, FieldPositions>NumDocs +

    +

    TVDVersion --> Int

    +

    NumFields --> VInt

    +

    FieldNums --> <FieldNumDelta>NumFields

    +

    FieldNumDelta --> VInt

    +

    FieldPositions --> <FieldPosition>NumFields

    +

    FieldPosition --> VLong

    +

    The .tvd file is used to map out the fields that have term vectors stored and + where the field information is in the .tvf file.

    +
    +
  3. +

    The Field or .tvf file.

    +

    This file contains, for each field that has a term vector stored, a list of + the terms and their frequencies.

    +

    Field (.tvf) --> TVFVersion<NumTerms, NumDistinct, TermFreqs>NumFields

    +

    TVFVersion --> Int

    +

    NumTerms --> VInt

    +

    NumDistinct --> VInt -- Future Use

    +

    TermFreqs --> <TermText, TermFreq>NumTerms

    +

    TermText --> <PrefixLength, Suffix>

    +

    PrefixLength --> VInt

    +

    Suffix --> String

    +

    TermFreq --> VInt

    +

    Term + text prefixes are shared. The PrefixLength is the number of initial + characters from the previous term which must be prepended to a + term's suffix in order to form the term's text. Thus, if the + previous term's text was "bone" and the term is "boy", + the PrefixLength is two and the suffix is "y". +

    +
    +
+
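The shared-prefix encoding of TermText in the .tvf file can be sketched in Python as follows. The function names `encode_term` and `decode_term` are hypothetical, and the sketch works on strings rather than on the VInt and String encodings used on disk.

```python
def encode_term(previous: str, term: str) -> tuple[int, str]:
    """Return (PrefixLength, Suffix) for term relative to the previous term."""
    prefix_length = 0
    limit = min(len(previous), len(term))
    while prefix_length < limit and previous[prefix_length] == term[prefix_length]:
        prefix_length += 1
    return prefix_length, term[prefix_length:]


def decode_term(previous: str, prefix_length: int, suffix: str) -> str:
    """Rebuild the term text from the previous term and the stored pair."""
    return previous[:prefix_length] + suffix


print(encode_term("bone", "boy"))  # prints (2, 'y')
```

Decoding reverses the step: `decode_term("bone", 2, "y")` gives back "boy".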
+ +
+ Deleted Documents + +

The .del file is + optional, and only exists when a segment contains deletions: +

+ +

Deletions + (.del) --> ByteCount,BitCount,Bits +

+ +

ByteCount,BitCount --> + UInt32 +

+ +

Bits --> + <Byte>ByteCount +

+ +

ByteCount + indicates the number of bytes in Bits. It is typically + (SegSize/8)+1. +

+ +

+ BitCount + indicates the number of bits that are currently set in Bits. +

+ +

Bits + contains one bit for each document indexed. When the bit + corresponding to a document number is set, that document is marked as + deleted. Bit ordering is from least to most significant. Thus, if + Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as + deleted. +
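The bit lookup described above can be sketched in Python as follows. The function names `is_deleted` and `deleted_docs` are hypothetical, and the sketch takes the Bits bytes as already read from the file, without the ByteCount and BitCount header.

```python
def is_deleted(bits: bytes, doc: int) -> bool:
    """Test the bit for document number doc, least significant bit first."""
    byte_index = doc >> 3  # eight documents per byte
    bit_index = doc & 7    # bit 0 is the least significant bit of the byte
    return bool((bits[byte_index] >> bit_index) & 1)


def deleted_docs(bits: bytes) -> list[int]:
    """List every document number whose bit is set."""
    return [doc for doc in range(len(bits) * 8) if is_deleted(bits, doc)]


print(deleted_docs(bytes([0x00, 0x02])))  # prints [9]
```

For the two bytes 0x00 and 0x02 only bit 1 of the second byte is set, so the sketch reports document 9, and the length of the returned list corresponds to BitCount.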

+
+
+ +
+ Limitations +

There + are a few places where these file formats limit the maximum number of + terms and documents to a 32-bit quantity, or to approximately 4 + billion. This is not a problem today but probably will be in the long term. + These fields should therefore be replaced with either + UInt64 values or, better yet, VInt values, which have no limit. +

+ +
+ + + +
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/gettingstarted.xml URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/gettingstarted.xml?view=auto&rev=479465 ============================================================================== --- lucene/java/trunk/src/site/src/documentation/content/xdocs/gettingstarted.xml (added) +++ lucene/java/trunk/src/site/src/documentation/content/xdocs/gettingstarted.xml Sun Nov 26 16:00:46 2006 @@ -0,0 +1,55 @@ + + +
+ + Apache Lucene - Getting Started Guide + +
+ +Andrew C. Oliver + + + +
+ Getting Started +

+This document is intended as a "getting started" guide. It has three audiences: first-time users +looking to install Apache Lucene in their application or web server; developers looking to modify or base +the applications they develop on Lucene; and developers looking to become involved in and contribute +to the development of Lucene. This document is written in tutorial and walk-through format. The +goal is to help you "get started". It does not go into great depth on some of the conceptual or +inner details of Lucene. +

+ +

+Each section listed below builds on the previous one. More advanced users +may wish to skip sections. +

+ + +
+ + +
+ Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/asf-logo.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/asf-logo.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/asf-logo.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/favicon.ico URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/favicon.ico?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/favicon.ico ------------------------------------------------------------------------------ svn:executable = * Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/favicon.ico ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_architecture.jpg URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_architecture.jpg?view=auto&rev=479465 ============================================================================== Binary file - no diff available. 
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_architecture.jpg ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_crawling-process.jpg URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_crawling-process.jpg?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_crawling-process.jpg ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lia_3d.jpg URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lia_3d.jpg?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lia_3d.jpg ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_100.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_100.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. 
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_100.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_150.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_150.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_150.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_200.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_200.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_200.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_250.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_250.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. 
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_250.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_300.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_300.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_300.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_100.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_100.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_100.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_150.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_150.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. 
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_150.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_200.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_200.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_200.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_250.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_250.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_250.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_300.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_300.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_300.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream