+ Apache Lucene - Basic Demo Sources Walkthrough +

Subject: svn commit: r479465 [2/4] - in /lucene/java/trunk: docs/ docs/images/ docs/lucene-sandbox/ docs/styles/ src/site/ src/site/src/ src/site/src/documentation/ src/site/src/documentation/classes/ src/site/src/documentation/conf/ src/site/src/documentation/...
Date: Mon, 27 Nov 2006 00:00:49 -0000
To: java-commits@lucene.apache.org
From: gsingers@apache.org
Reply-To: java-dev@lucene.apache.org
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/demo4.xml URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/demo4.xml?view=auto&rev=479465 ============================================================================== --- lucene/java/trunk/src/site/src/documentation/content/xdocs/demo4.xml (added) +++ lucene/java/trunk/src/site/src/documentation/content/xdocs/demo4.xml Sun Nov 26 16:00:46 2006 @@ -0,0 +1,160 @@ + + +


+ +Andrew C. Oliver + + + +

About the Code +

+In this section we walk through the sources behind the basic Lucene Web Application demo: where to +find them, their parts and their function. This section is intended for Java developers wishing to +understand how to use Lucene in their applications or for those involved in deploying web +applications based on Lucene. +

+ + +

Location of the source (developers/deployers) +

+Relative to the directory created when you extracted Lucene or retrieved it from Subversion, you +should see a directory called src which in turn contains a directory called +jsp. This is the root for all of the Lucene web demo. +

+Within this directory you should see index.jsp. Bring this up in vi or your editor of +choice. +

+ +

index.jsp (developers/deployers) +

+This JSP page is pretty boring by itself. All it does is include a header, display a form and +include a footer. If you look at the form, it has two fields: query (where you enter +your search criteria) and maxresults (where you specify the number of results per page). +Because of the way this JSP is structured, it should be easy to customize without even editing this particular +file: you could simply change the header and footer. Let's look at the header.jsp +(located in the same directory) next. +

+ + + +

results.jsp (developers) +

+Most of the functionality lies in results.jsp. Much of it is for paging the search +results, which we'll not cover here as it's commented well enough. The first thing in this page is +the actual imports for the Lucene classes and Lucene demo classes. These classes are loaded from +the jars included in the WEB-INF/lib directory in the luceneweb.war file. +

+You'll notice that this file includes the same header and footer as index.jsp. From +there it constructs an IndexSearcher with the +indexLocation that was specified in configuration.jsp. If there is an +error of any kind in opening the index, it is displayed to the user and the boolean flag +error is set to tell the rest of the sections of the jsp not to continue. +

+From there, this jsp attempts to get the search criteria, the start index (used for paging) and the +maximum number of results per page. If the maximum results per page is not set or not valid then it +and the start index are set to default values. If only the start index is invalid it is set to a +default value. If the criteria isn't provided then a servlet error is thrown (it is assumed that +this is the result of url tampering or some form of browser malfunction). +
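The defaulting rules in the paragraph above can be sketched in plain Java. This is only an illustration, not the code in results.jsp: the class name `PagingParams`, the default values, and the use of `IllegalArgumentException` in place of a servlet error are all assumptions made for the example.

```java
// Hypothetical helper; names and default values are illustrative, not taken from results.jsp.
final class PagingParams {
    static final int DEFAULT_MAX_RESULTS = 50; // assumed default
    static final int DEFAULT_START_INDEX = 0;  // assumed default

    final String query;
    final int startIndex;
    final int maxResults;

    private PagingParams(String query, int startIndex, int maxResults) {
        this.query = query;
        this.startIndex = startIndex;
        this.maxResults = maxResults;
    }

    /** Applies the defaulting rules described above to raw request parameters. */
    static PagingParams parse(String query, String startRaw, String maxRaw) {
        if (query == null) {
            // The JSP raises a servlet error here; a plain exception stands in for it.
            throw new IllegalArgumentException("no query specified");
        }
        int maxResults = parseOrMinusOne(maxRaw);
        if (maxResults <= 0) {
            // Missing or invalid page size: reset both values.
            return new PagingParams(query, DEFAULT_START_INDEX, DEFAULT_MAX_RESULTS);
        }
        int startIndex = parseOrMinusOne(startRaw);
        if (startIndex < 0) {
            // Only the start index is invalid: reset just that value.
            startIndex = DEFAULT_START_INDEX;
        }
        return new PagingParams(query, startIndex, maxResults);
    }

    /** Returns the parsed value, or -1 when the text is null or not a number. */
    private static int parseOrMinusOne(String raw) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

For instance, `PagingParams.parse("lucene", "20", "abc")` falls back to both defaults, while `PagingParams.parse("lucene", null, "10")` keeps the page size and resets only the start index.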

+The jsp moves on to construct a StandardAnalyzer to +analyze the search text. This matches the analyzer used during indexing (IndexHTML), which is generally +recommended. This is passed to the QueryParser along with the +criteria to construct a Query +object. You'll also notice the string literal "contents" included. This specifies +that the search should cover the contents field and not the title, +url or some other field in the indexed documents. If there is any error in +constructing a Query object an +error is displayed to the user. +

+In the next section of the jsp the IndexSearcher is asked to search +given the query object. The results are returned in a collection called hits. If the +length property of the hits collection is 0 (meaning there were no results) then an +error is displayed to the user and the error flag is set. +

+Finally the jsp iterates through the hits collection, taking the current page into +account, and displays properties of the Document objects we talked about in +the first walkthrough. These objects contain "known" fields specific to their indexer (in this case +IndexHTML constructs a document +with "url", "title" and "contents"). +

+Please note that in a real deployment of Lucene, it's best to instantiate IndexSearcher and QueryParser once, and then +share them across search requests, instead of re-instantiating per search request. +

+ +

More sources (developers) +

+There are additional sources used by the web app that were not specifically covered by either +walkthrough. For example the HTML parser, the IndexHTML class and HTMLDocument class. These are very +similar to the classes covered in the first example, with properties specific to parsing and +indexing HTML. This is beyond our scope; however, by now you should feel like you're "getting +started" with Lucene. +

+ +

Where to go from here? (everyone!) +

+There are a number of things this demo doesn't do or doesn't do quite right. For instance, you may +have noticed that documents in the root context are unreachable (unless you reconfigure Tomcat to +support that context or redirect to it). Anywhere the directory doesn't quite match the +context mapping, you'll have a broken link in your results. Indexing non-local files, or +meeting other needs of that kind, isn't supported, and there may be security issues with running the +indexing application from your webapps directory. There are a number of things left for you, the +developer, to do. +

+In time some of these things may be added to Lucene as features (if you've got a good idea we'd love +to hear it!), but for now: this is where you begin and the search engine/indexer ends. Lastly, one +would assume you'd want to follow the above advice and customize the application to look a little +more fancy than black on white with "Lucene Template" at the top. We'll see you on the Lucene +Users' or Developers' mailing lists! +

+ +

When to contact the Author +

+Please resist the urge to contact the authors of this document (without bribes of fame and fortune +attached). First contact the mailing lists, taking care to Ask Questions The Smart Way. +Certainly you'll get the most help that way as well. That being said, feedback, and modifications +to this document and samples are ever so greatly appreciated. They are just best sent to the lists +or posted as patches, so that +everyone can share in them. Thanks for understanding! +

+ + + + Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/features.xml URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/features.xml?view=auto&rev=479465 ============================================================================== --- lucene/java/trunk/src/site/src/documentation/content/xdocs/features.xml (added) +++ lucene/java/trunk/src/site/src/documentation/content/xdocs/features.xml Sun Nov 26 16:00:46 2006 @@ -0,0 +1,47 @@ + + +

+Apache Lucene - Features +

+ + +

Features +

Lucene offers powerful features through a simple API:

+ +

Scalable, High-Performance Indexing +

over 20MB/minute on Pentium M 1.5GHz
small RAM requirements -- only 1MB heap
incremental indexing as fast as batch indexing
index size roughly 20-30% the size of text indexed

+ +

Powerful, Accurate and Efficient Search Algorithms +

ranked searching -- best results returned first
many powerful query types: phrase queries, wildcard queries, proximity + queries, range queries and more
fielded searching (e.g., title, author, contents)
date-range searching
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching

+ +

Cross-Platform Solution +

Available as Open Source software under the + Apache License + which lets you use Lucene in both commercial and Open Source programs
100%-pure Java
Implementations in other + programming languages available that are index-compatible

+ + + + Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml?view=auto&rev=479465 ============================================================================== --- lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml (added) +++ lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml Sun Nov 26 16:00:46 2006 @@ -0,0 +1,1377 @@ + + + +

+ +Apache Lucene - Index File Formats + +

+ + + + + + + + +

+ Index File Formats +

+ This document defines the index file formats used + in Lucene version 2.0. If you are using a different + version of Lucene, please consult the copy of + docs/fileformats.html that was distributed + with the version you are using. +

+ +

+ Apache Lucene is written in Java, but several + efforts are underway to write + versions + of Lucene in other programming + languages. If these versions are to remain compatible with Apache + Lucene, then a language-independent definition of the Lucene index + format is required. This document thus attempts to provide a + complete and independent definition of the Apache Lucene 1.4 file + formats. +

+ +

+ As Lucene evolves, this document should evolve. + Versions of Lucene in different programming languages should endeavor + to agree on file formats, and generate new versions of this document. +

+ +

+ Compatibility notes are provided in this document, + describing how file formats have changed from prior versions. +

+ +

+ Definitions +

+ The fundamental concepts in Lucene are index, + document, field and term. +

+ + +

+ An index contains a sequence of documents. +

+ +

+
+ A document is a sequence of fields. +
+
+
+ A field is a named sequence of terms. +
+
+ A term is a string. +

+ +

+ The same string in two different fields is + considered a different term. Thus terms are represented as a pair of + strings, the first naming the field, and the second naming text + within the field. +

+ +

+ Inverted Indexing +

+ The index stores statistics about terms in order + to make term-based search more efficient. Lucene's + index falls into the family of indexes known as an inverted + index. This is because it can list, for a term, the documents that contain + it. This is the inverse of the natural relationship, in which + documents list terms. +
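To make the "inverse relationship" concrete, here is a toy in-memory model. It is only a sketch: the class `ToyInvertedIndex`, the whitespace tokenizer, and the one-field-per-document simplification are assumptions for illustration and say nothing about how Lucene lays the data out on disk. A term is keyed by field name and text together, so the same string in two fields is a different term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only; not Lucene's implementation.
final class ToyInvertedIndex {
    // field name -> term text -> numbers of the documents containing that term
    private final Map<String, Map<String, List<Integer>>> postings = new TreeMap<>();
    private int nextDoc = 0;

    /** Adds a one-field document, split on whitespace, and returns its document number. */
    int addDocument(String field, String text) {
        int doc = nextDoc++;
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings
                    .computeIfAbsent(field, f -> new TreeMap<>())
                    .computeIfAbsent(token, t -> new ArrayList<>());
            // Record each document at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != doc) {
                docs.add(doc);
            }
        }
        return doc;
    }

    /** Lists, for a (field, text) term, the documents that contain it. */
    List<Integer> documentsContaining(String field, String termText) {
        Map<String, List<Integer>> byText = postings.get(field);
        if (byText == null) {
            return Collections.<Integer>emptyList();
        }
        List<Integer> docs = byText.get(termText);
        return docs == null ? Collections.<Integer>emptyList() : docs;
    }
}
```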

+ Types of Fields +

+ In Lucene, fields may be stored, in which + case their text is stored in the index literally, in a non-inverted + manner. Fields that are inverted are called indexed. A field + may be both stored and indexed.

+ +

The text of a field may be tokenized into terms to be + indexed, or the text of a field may be used literally as a term to be indexed. + Most fields are + tokenized, but sometimes it is useful for certain identifier fields + to be indexed literally. +

See the Field java docs for more information on Fields.

+ +

+ Segments +

+ Lucene indexes may be composed of multiple sub-indexes, or + segments. Each segment is a fully independent index, which could be searched + separately. Indexes evolve by: +

+ +

Creating new segments for newly added documents.
+
Merging existing segments.
+

+ +

+ Searches may involve multiple segments and/or multiple indexes, each + index potentially composed of a set of segments. +

+ +

+ Document Numbers +

+ Internally, Lucene refers to documents by an integer document + number. The first document added to an index is numbered zero, and each + subsequent document added gets a number one greater than the previous. +

+ +

+
+

+ +

+ Note that a document's number may change, so caution should be taken + when storing these numbers outside of Lucene. In particular, numbers may + change in the following situations: +

+ + +

+
+ The + numbers stored in each segment are unique only within the segment, + and must be converted before they can be used in a larger context. + The standard technique is to allocate each segment a range of + values, based on the range of numbers used in that segment. To + convert a document number from a segment to an external value, the + segment's base document + number is added. To convert an external value back to a + segment-specific value, the segment is identified by the range that + the external value is in, and the segment's base value is + subtracted. For example two five document segments might be + combined, so that the first segment has a base value of zero, and + the second of five. Document three from the second segment would + have an external value of eight. +
+
+
+ When documents are deleted, gaps are created + in the numbering. These are eventually removed as the index evolves + through merging. Deleted documents are dropped when segments are + merged. A freshly-merged segment thus has no gaps in its numbering. +
+
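The base-value technique described above can be sketched as follows. The class `SegmentDocNumbers` and its method names are hypothetical; only the arithmetic (add the base to go outward, find the range and subtract the base to come back) comes from the text.

```java
// Hypothetical helper illustrating per-segment base document numbers.
final class SegmentDocNumbers {
    private final int[] bases; // bases[i] = external number of segment i's document zero
    private final int total;   // total number of documents across all segments

    SegmentDocNumbers(int... segmentSizes) {
        bases = new int[segmentSizes.length];
        int next = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            bases[i] = next;
            next += segmentSizes[i];
        }
        total = next;
    }

    /** Segment-relative number to external number: add the segment's base. */
    int toExternal(int segment, int docInSegment) {
        return bases[segment] + docInSegment;
    }

    /** Finds the segment whose range contains the external number. */
    int segmentOf(int external) {
        if (external < 0 || external >= total) {
            throw new IndexOutOfBoundsException("no such document: " + external);
        }
        int segment = bases.length - 1;
        while (bases[segment] > external) {
            segment--;
        }
        return segment;
    }

    /** External number back to segment-relative number: subtract the segment's base. */
    int toSegmentLocal(int external) {
        return external - bases[segmentOf(external)];
    }
}
```

With two five-document segments, `new SegmentDocNumbers(5, 5).toExternal(1, 3)` gives eight, matching the example in the text.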

+ +

+ Overview +

+ Each segment index maintains the following: +

Field names. This + contains the set of field names used in the index. + +
+
Stored Field + values. This contains, for each document, a list of attribute-value + pairs, where the attributes are field names. These are used to + store auxiliary information about the document, such as its title, + url, or an identifier to access a + database. The set of stored fields is what is returned for each hit + when searching. This is keyed by document number. +
+
Term dictionary. + A dictionary containing all of the terms used in all of the indexed + fields of all of the documents. The dictionary also contains the + number of documents which contain the term, and pointers to the + term's frequency and proximity data. +
+
Term Frequency + data. For each term in the dictionary, the numbers of all the + documents that contain that term, and the frequency of the term in + that document. +
+
Term Proximity + data. For each term in the dictionary, the positions that the term + occurs in each document. +
+
Normalization + factors. For each field in each document, a value is stored that is + multiplied into the score for hits on that field. +
+
Term Vectors. For each field in each document, the term vector + (sometimes called document vector) may be stored. A term vector consists + of term text and term frequency. To add Term Vectors to your index see the + Field constructors +
+
Deleted documents. + An optional file indicating which documents are deleted. +
+

+ +

Details on each of these are provided in subsequent sections. +

+ +

+ File Naming +

+ All files belonging to a segment have the same name with varying + extensions. The extensions correspond to the different file formats + described below. When using the Compound File format (default in 1.4 and greater) these files are + collapsed into a single .cfs file (see below for details). +

+ +

+ Typically, all segments + in an index are stored in a single directory, although this is not + required. +

+ +

+ Primitive Types +

+ Byte +

+ The most primitive type + is an eight-bit byte. Files are accessed as sequences of bytes. All + other data types are defined as sequences + of bytes, so file formats are byte-order independent. +

+ +

+ UInt32 +

+ 32-bit unsigned integers are written as four + bytes, high-order bytes first. +

+ UInt32 --> <Byte>⁴ +

+ +

+ UInt64 +

+ 64-bit unsigned integers are written as eight + bytes, high-order bytes first. +

+ +

UInt64 --> <Byte>⁸ +
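As a small illustration of "high-order bytes first", the sketch below writes and reads both fixed-width types by hand. The class `BigEndian` and its method names are made up for this example. Java has no unsigned integer types, so a UInt32 is carried in a `long`, and a UInt64 above 2^63-1 would appear negative.

```java
// Hypothetical helper; shows the byte order only.
final class BigEndian {
    /** Writes the low 32 bits of value as four bytes, high-order byte first. */
    static byte[] writeUInt32(long value) {
        return new byte[] {
            (byte) (value >>> 24), (byte) (value >>> 16), (byte) (value >>> 8), (byte) value
        };
    }

    /** Reads four bytes starting at offset as an unsigned 32-bit value. */
    static long readUInt32(byte[] b, int offset) {
        return ((b[offset] & 0xFFL) << 24)
             | ((b[offset + 1] & 0xFFL) << 16)
             | ((b[offset + 2] & 0xFFL) << 8)
             |  (b[offset + 3] & 0xFFL);
    }

    /** Writes value as eight bytes, high-order byte first. */
    static byte[] writeUInt64(long value) {
        byte[] out = new byte[8];
        for (int i = 0; i < 8; i++) {
            out[i] = (byte) (value >>> (56 - 8 * i));
        }
        return out;
    }

    /** Reads eight bytes starting at offset; the result is the raw 64-bit pattern. */
    static long readUInt64(byte[] b, int offset) {
        long value = 0;
        for (int i = 0; i < 8; i++) {
            value = (value << 8) | (b[offset + i] & 0xFFL);
        }
        return value;
    }
}
```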

+ +

+ VInt +

+ A variable-length format for non-negative integers is + defined where the high-order bit of each byte indicates whether more + bytes remain to be read. The low-order seven bits are appended as + increasingly more significant bits in the resulting integer value. + Thus values from zero to 127 may be stored in a single byte, values + from 128 to 16,383 may be stored in two bytes, and so on. +

+ +

VInt Encoding Example

Value     First byte   Second byte   Third byte
0         00000000
1         00000001
2         00000010
...
127       01111111
128       10000000     00000001
129       10000001     00000001
130       10000010     00000001
...
16,383    11111111     01111111
16,384    10000000     10000000      00000001
16,385    10000001     10000000      00000001
...

+ +

+ This provides compression while still being + efficient to decode. +
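The VInt rules above translate into a few lines of Java. The sketch below is an illustration written from the description and the example table, with hypothetical names (`VIntCodec`, `encode`, `decode`); it is not presented as Lucene's own reader or writer.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical codec following the VInt description above.
final class VIntCodec {
    /** Encodes a non-negative int, low-order seven bits first. */
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("VInt sketch handles non-negative values only");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value); // last byte: high bit clear
        return out.toByteArray();
    }

    /** Decodes one VInt starting at offset. */
    static int decode(byte[] bytes, int offset) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = bytes[offset++];
            value |= (b & 0x7F) << shift; // each byte adds seven more significant bits
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```

Encoding 16,384 with this sketch yields the three bytes 10000000, 10000000, 00000001, as in the table.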

+ +

+ Chars +

+ Lucene writes unicode + character sequences using Java's + "modified + UTF-8 encoding". +

+ + +

+ +

+ String +

+ Lucene writes strings as a VInt representing the length, followed by + the character data. +

+ +

+ String --> VInt, Chars +

+ +

+ Per-Index Files +

+ The files in this section exist one-per-index. +

+ +

+ Segments File +

+ The active segments in the index are stored in the + segment info file. An index only has + a single file in this format, and it is named "segments". + This lists each segment by name, and also contains the size of each + segment. +

+ +

+ Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize>^SegCount +

+ +

+ Format, NameCounter, SegCount, SegSize --> UInt32 +

+ +

+ Version --> UInt64 +

+ +

+ SegName --> String +

+ +

+ Format is -1 in Lucene 1.4. +

+ +

+ Version counts how often the index has been + changed by adding or deleting documents. +

+ +

+ NameCounter is used to generate names for new segment files. +

+ +

+ SegName is the name of the segment, and is used as the file name prefix + for all of the files that compose the segment's index. +

+ +

+ SegSize is the number of documents contained in the segment index. +

+ + +

+ +

+ Lock Files +

+ Several files are used to indicate that another + process is using an index. Note that these files are not + stored in the index directory itself, but rather in the + system's temporary directory, as indicated in the Java + system property "java.io.tmpdir". +

+ +

+
+ When a file named "commit.lock" + is present, a process is currently re-writing the "segments" + file and deleting outdated segment index files, or a process is + reading the "segments" + file and opening the files of the segments it names. This lock file + prevents files from being deleted by another process after a process + has read the "segments" + file but before it has managed to open all of the files of the + segments named therein. +
+
+
+ When a file named "write.lock" + is present, a process is currently adding documents to an index, or + removing files from that index. This lock file prevents several + processes from attempting to modify an index at the same time. +
+

+ +

+ Deletable File +

+ A file named "deletable" + contains the names of files that are no longer used by the index, but + which could not be deleted. This is only used on Win32, where a + file may not be deleted while it is still open. On other platforms + the file contains only null bytes. +

+ +

+ Deletable --> DeletableCount, + <DeletableName>^{DeletableCount} +

+ +

DeletableCount --> UInt32 +

DeletableName --> + String +

+ +

+ Compound Files +

Starting with Lucene 1.4 the compound file format became default. This + is simply a container for all files described in the next section.

+ +

Compound (.cfs) --> FileCount, <DataOffset, FileName>^FileCount, + FileData^FileCount

+ +

FileCount --> VInt

+ +

DataOffset --> Long

+ +

FileName --> String

+ +

FileData --> raw file data

The raw file data is the data from the individual files named above.

+ +

+ Per-Segment Files +

+ The remaining files are all per-segment, and are + thus defined by suffix. +

+ Fields +

Field Info

+ +

+ Field names are + stored in the field info file, with suffix .fnm. +

+ FieldInfos + (.fnm) --> FieldsCount, <FieldName, + FieldBits>^FieldsCount +

+ +

+ FieldsCount --> VInt +

+ +

+ FieldName --> String +

+ +

+ FieldBits --> Byte +

+ +

+ The low-order bit is one for + indexed fields, and zero for non-indexed fields. +
+ The second lowest-order + bit is one for fields that have term vectors stored, and zero for fields + without term vectors. +

Lucene >= 1.9:

If the third lowest-order bit is set (0x04), term positions are stored with the term vectors.
If the fourth lowest-order bit is set (0x08), term offsets are stored with the term vectors.
If the fifth lowest-order bit is set (0x10), norms are omitted for the indexed field.
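The FieldBits byte can be decoded with simple masks. In the sketch below the mask values come from the text; the class name, constant names and the `describe` helper are hypothetical.

```java
// Hypothetical decoder for the FieldBits byte of the .fnm file.
final class FieldBits {
    static final int INDEXED = 0x01;           // low-order bit
    static final int STORE_TERM_VECTOR = 0x02; // second lowest-order bit
    static final int STORE_POSITIONS = 0x04;   // Lucene >= 1.9
    static final int STORE_OFFSETS = 0x08;     // Lucene >= 1.9
    static final int OMIT_NORMS = 0x10;        // Lucene >= 1.9

    static boolean isSet(byte fieldBits, int flag) {
        return (fieldBits & flag) != 0;
    }

    /** Returns a readable, comma-separated list of the flags that are set. */
    static String describe(byte fieldBits) {
        StringBuilder sb = new StringBuilder();
        append(sb, fieldBits, INDEXED, "indexed");
        append(sb, fieldBits, STORE_TERM_VECTOR, "term vectors");
        append(sb, fieldBits, STORE_POSITIONS, "positions");
        append(sb, fieldBits, STORE_OFFSETS, "offsets");
        append(sb, fieldBits, OMIT_NORMS, "norms omitted");
        return sb.length() == 0 ? "none" : sb.toString();
    }

    private static void append(StringBuilder sb, byte bits, int flag, String name) {
        if ((bits & flag) != 0) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(name);
        }
    }
}
```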

+ +

+ Fields are numbered by their order in this file. Thus field zero is + the + first field in the file, field one the next, and so on. Note that, + like document numbers, field numbers are segment relative. +

+ +

Stored Fields

+ +

+ Stored fields are represented by two files: +

+ +

+
+ The field index, or .fdx file. +
+ +
+ This contains, for each document, a pointer to + its field data, as follows: +
+ +
+ FieldIndex + (.fdx) --> + <FieldValuesPosition>^SegSize +
+
FieldValuesPosition + --> UInt64 +
+
This + is used to find the location within the field data file of the + fields of a particular document. Because it contains fixed-length + data, this file may be easily randomly accessed. The position of + document n's field data is the UInt64 at n*8 in + this file. +
+
+
+ The field data, or .fdt file. + +
+ +
+ This contains the stored fields of each document, + as follows: +
+ +
+ FieldData (.fdt) --> + <DocFieldData>^SegSize +
+
DocFieldData --> + FieldCount, <FieldNum, Bits, Value>^FieldCount +
+
FieldCount --> + VInt +
+
FieldNum --> + VInt +
+ +
Lucene <= 1.4:
+
Bits --> + Byte +
+
Value --> + String +
+
Only the low-order bit of Bits is used. It is one for + tokenized fields, and zero for non-tokenized fields. +
+
Lucene >= 1.9:
+
Bits --> + Byte +
+
+
- low order bit is one for tokenized fields
- second bit is one for fields containing binary data
- third bit is one for fields with compression option enabled + (if compression is enabled, the algorithm used is ZLIB)
+
+
Value --> + String | BinaryValue (depending on Bits) +
+
BinaryValue --> + ValueSize, <Byte>^ValueSize +
+
ValueSize --> + VInt +
+ +
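Because every .fdx entry is eight bytes wide, finding a document's stored fields is a single indexed read. The sketch below works on the raw bytes of an .fdx file held in memory; the class and method names are hypothetical, and it relies on `java.nio.ByteBuffer` being big-endian by default, which matches "high-order bytes first".

```java
import java.nio.ByteBuffer;

// Hypothetical lookup over the raw contents of a .fdx file.
final class FieldIndexLookup {
    /** Returns the position in the .fdt file of document n's stored fields. */
    static long fieldValuesPosition(byte[] fdx, int n) {
        // Absolute read of the eight bytes starting at n*8.
        return ByteBuffer.wrap(fdx).getLong(n * 8);
    }

    /** Number of documents covered by the file: one eight-byte entry each. */
    static int documentCount(byte[] fdx) {
        return fdx.length / 8;
    }
}
```

A real reader would seek within the file rather than load it whole; the in-memory array is used here only to keep the example self-contained.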

+ +

+ Term Dictionary +

+ The term dictionary is represented as two files: +

+
+ The term infos, or .tis file. +
+ +
+ TermInfoFile (.tis)--> + TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos +
+
TIVersion --> + UInt32 +
+
TermCount --> + UInt64 +
+
IndexInterval --> + UInt32 +
+
SkipInterval --> + UInt32 +
+
TermInfos --> + <TermInfo>^TermCount +
+
TermInfo --> + <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta> +
+
Term --> + <PrefixLength, Suffix, FieldNum> +
+
Suffix --> + String +
+
PrefixLength, + DocFreq, FreqDelta, ProxDelta, SkipDelta
--> VInt +
+
This + file is sorted by Term. Terms are ordered first lexicographically + by the term's field name, and within that lexicographically by the + term's text. +
+
TIVersion names the version of the format + of this file and is -2 in Lucene 1.4. +
+
Term + text prefixes are shared. The PrefixLength is the number of initial + characters from the previous term which must be pre-pended to a + term's suffix in order to form the term's text. Thus, if the + previous term's text was "bone" and the term is "boy", + the PrefixLength is two and the suffix is "y". +
+
FieldNum + determines the term's field, whose name is stored in the .fnm file. +
+
DocFreq + is the count of documents which contain the term. +
+
FreqDelta + determines the position of this term's TermFreqs within the .frq + file. In particular, it is the difference between the position of + this term's data in that file and the position of the previous + term's data (or zero, for the first term in the file). +
+
ProxDelta + determines the position of this term's TermPositions within the .prx + file. In particular, it is the difference between the position of + this term's data in that file and the position of the previous + term's data (or zero, for the first term in the file). +
+
SkipDelta determines the position of this + term's SkipData within the .frq file. In + particular, it is the number of bytes + after TermFreqs that the SkipData starts. + In other words, it is the length of the + TermFreq data. +
+
+
+ The term info index, or .tii file. +
+ +
+ This contains every IndexInterval^th entry from the .tis + file, along with its location in the .tis file. This is + designed to be read entirely into memory and used to provide random + access to the .tis file. +
+ +
+ The structure of this file is very similar to the + .tis file, with the addition of one item per record, the IndexDelta. +
+ +
+ TermInfoIndex (.tii)--> + TIVersion, IndexTermCount, IndexInterval, SkipInterval, TermIndices +
+
TIVersion --> + UInt32 +
+
IndexTermCount --> + UInt64 +
+
IndexInterval --> + UInt32 +
+
SkipInterval --> + UInt32 +
+
TermIndices --> + <TermInfo, IndexDelta>^{IndexTermCount} +
+
IndexDelta --> + VLong +
+
IndexDelta + determines the position of this term's TermInfo within the .tis file. In + particular, it is the difference between the position of this term's + entry in that file and the position of the previous term's entry. +
+
SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int). + Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while + smaller values result in bigger indexes, less acceleration and more + accelerable cases.
+
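The prefix sharing used for term text in the .tis file can be illustrated with two small helpers. The names below are hypothetical and the sketch works on Java strings rather than on the encoded bytes.

```java
// Hypothetical helpers illustrating shared term-text prefixes.
final class PrefixSharing {
    /** Number of initial characters that current shares with previous. */
    static int sharedPrefixLength(String previous, String current) {
        int max = Math.min(previous.length(), current.length());
        int i = 0;
        while (i < max && previous.charAt(i) == current.charAt(i)) {
            i++;
        }
        return i;
    }

    /** The part of current that must be written after the shared prefix. */
    static String suffix(String previous, String current) {
        return current.substring(sharedPrefixLength(previous, current));
    }

    /** Rebuilds a term's text from the previous term, a prefix length and a suffix. */
    static String rebuild(String previous, int prefixLength, String suffix) {
        return previous.substring(0, prefixLength) + suffix;
    }
}
```

For the terms "bone" and "boy" this gives a PrefixLength of two and the suffix "y", and `rebuild("bone", 2, "y")` returns "boy".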

+ +

+ Frequencies +

+ The .frq file contains the lists of documents + which contain each term, along with the frequency of the term in that + document. +

FreqFile (.frq) --> + <TermFreqs, SkipData>^TermCount +

TermFreqs --> + <TermFreq>^DocFreq +

TermFreq --> + DocDelta, Freq? +

SkipData --> + <SkipDatum>^{DocFreq/SkipInterval} +

SkipDatum --> + DocSkip,FreqSkip,ProxSkip +

DocDelta,Freq,DocSkip,FreqSkip,ProxSkip --> + VInt +

TermFreqs + are ordered by term (the term is implicit, from the .tis file). +

TermFreq + entries are ordered by increasing document number. +

DocDelta + determines both the document number and the frequency. In + particular, DocDelta/2 is the difference between this document number + and the previous document number (or zero when this is the first + document in a TermFreqs). When DocDelta is odd, the frequency is + one. When DocDelta is even, the frequency is read as another VInt. +

For + example, the TermFreqs for a term which occurs once in document seven + and three times in document eleven would be the following sequence of + VInts: +

15, + 8, 3 +
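The DocDelta rule above can be checked with a short encoder. The class `TermFreqsEncoder` is a hypothetical name; it produces the sequence of VInt values (as plain ints) rather than bytes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical encoder for one term's TermFreqs, following the DocDelta rule above.
final class TermFreqsEncoder {
    /**
     * docs must be in increasing order; freqs[i] is the term's frequency in docs[i].
     * Returns the VInt values in the order they would be written.
     */
    static List<Integer> encode(int[] docs, int[] freqs) {
        List<Integer> out = new ArrayList<>();
        int previousDoc = 0; // zero for the first document in a TermFreqs
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - previousDoc;
            if (freqs[i] == 1) {
                out.add(delta * 2 + 1); // odd DocDelta: frequency is one
            } else {
                out.add(delta * 2);     // even DocDelta: frequency follows as its own VInt
                out.add(freqs[i]);
            }
            previousDoc = docs[i];
        }
        return out;
    }
}
```

Calling `encode(new int[] {7, 11}, new int[] {1, 3})` reproduces the sequence 15, 8, 3 from the example.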

DocSkip records the document number before every + SkipInterval^th document in TermFreqs. + Document numbers are represented as differences + from the previous value in the sequence. FreqSkip + and ProxSkip record the position of every + SkipInterval^th entry in FreqFile and + ProxFile, respectively. File positions are + relative to the start of TermFreqs and Positions for the first SkipDatum, + and to the previous SkipDatum in the sequence thereafter. +

For example, if DocFreq=35 and SkipInterval=16, + then there are two SkipData entries, containing + the 15^th and 31^st document + numbers in TermFreqs. The first FreqSkip names + the number of bytes after the beginning of + TermFreqs that the 16^th SkipDatum + starts, and the second the number of bytes after + that that the 32^nd starts. The first + ProxSkip names the number of bytes after the + beginning of Positions that the 16^th + SkipDatum starts, and the second the number of + bytes after that that the 32^nd starts. +

+ +

+ Positions +

+ The .prx file contains the lists of positions that + each term occurs at within documents. +

ProxFile (.prx) --> + <TermPositions>^TermCount +

TermPositions --> + <Positions>^DocFreq +

Positions --> + <PositionDelta>^Freq +

PositionDelta --> + VInt +

TermPositions + are ordered by term (the term is implicit, from the .tis file). +

Positions + entries are ordered by increasing document number (the document + number is implicit from the .frq file). +

PositionDelta + is the difference between the position of the current occurrence in + the document and the previous occurrence (or zero, if this is the + first occurrence in this document). +

+ For example, the TermPositions for a + term which occurs as the fourth term in one document, and as the + fifth and ninth term in a subsequent document, would be the following + sequence of VInts: +

4, + 5, 4 +
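The position deltas for a single document can be computed and reversed as sketched below. `PositionDeltas` is a hypothetical name, and each call handles the positions of one term within one document.

```java
// Hypothetical helper for the PositionDelta rule above (one document at a time).
final class PositionDeltas {
    /** Converts increasing positions into differences from the previous occurrence. */
    static int[] encode(int[] positions) {
        int[] deltas = new int[positions.length];
        int previous = 0; // zero for the first occurrence in the document
        for (int i = 0; i < positions.length; i++) {
            deltas[i] = positions[i] - previous;
            previous = positions[i];
        }
        return deltas;
    }

    /** Rebuilds the positions by summing the deltas. */
    static int[] decode(int[] deltas) {
        int[] positions = new int[deltas.length];
        int previous = 0;
        for (int i = 0; i < deltas.length; i++) {
            previous += deltas[i];
            positions[i] = previous;
        }
        return positions;
    }
}
```

The first document in the example encodes as 4, and the second (positions five and nine) as 5, 4, which together give the sequence shown above.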

+ Normalization Factors +

There's a norm file for each indexed field with a byte for + each document. The .f[0-9]* file contains, + for each document, a byte that encodes a value that is multiplied + into the score for hits on that field: +

Norms + (.f[0-9]*) --> <Byte>^SegSize +

Each + byte encodes a floating point value. Bits 0-2 contain the 3-bit + mantissa, and bits 3-7 contain the 5-bit exponent. +

These + are converted to an IEEE single float value as follows: +

If + the byte is zero, use a zero float. +
+
Otherwise, + set the sign bit of the float to zero; +
+
add + 48 to the exponent and use this as the float's exponent; +
+
map + the mantissa to the high-order 3 bits of the float's mantissa; and + +
+
set + the low-order 21 bits of the float's mantissa to zero. +
+
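One possible reading of the conversion steps above is sketched below. The text does not spell out exact bit positions within the 32-bit float pattern, so the shift amounts (24 for the adjusted exponent, 21 for the mantissa) are an assumption, chosen so that the low-order 21 bits end up zero as the last step requires. The class name `NormDecoder` is hypothetical.

```java
// Hypothetical decoder; the shift amounts are an assumption, see the note above.
final class NormDecoder {
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f; // a zero byte means a zero float
        }
        int mantissa = b & 0x07;        // bits 0-2
        int exponent = (b >> 3) & 0x1F; // bits 3-7
        // Sign bit stays zero; everything below bit 21 stays zero.
        int bits = ((exponent + 48) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }
}
```

Under this reading the byte 0x7C (exponent 15, mantissa 4) decodes to 1.0.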

+ +

+ Term Vectors +

+ Term Vector support is optional on a field-by-field basis. It consists of the three + files described below. +

+
The Document Index or .tvx file.
+
This contains, for each document, a pointer to the document data in the Document + (.tvd) file. +
+
DocumentIndex (.tvx) --> TVXVersion, <DocumentPosition>^NumDocs
+
TVXVersion --> Int
+
DocumentPosition --> UInt64
+
This is used to find the position of the Document in the .tvd file.
+
+
The Document or .tvd file.
+
This contains, for each document, the number of fields, a list of the fields with + term vector info and finally a list of pointers to the field information in the .tvf + (Term Vector Fields) file.
+
+ Document (.tvd) --> TVDVersion, <NumFields, FieldNums, FieldPositions>^NumDocs +
+
TVDVersion --> Int
+
NumFields --> VInt
+
FieldNums --> <FieldNumDelta>^NumFields
+
FieldNumDelta --> VInt
+
FieldPositions --> <FieldPosition>^NumFields
+
FieldPosition --> VLong
+
The .tvd file is used to map out the fields that have term vectors stored and + where the field information is in the .tvf file.
+
+
The Field or .tvf file.
+
This file contains, for each field that has a term vector stored, a list of + the terms and their frequencies.
+
Field (.tvf) --> TVFVersion, <NumTerms, NumDistinct, TermFreqs>^NumFields
+
TVFVersion --> Int
+
NumTerms --> VInt
+
NumDistinct --> VInt -- Future Use
+
TermFreqs --> <TermText, TermFreq>^NumTerms
+
TermText --> <PrefixLength, Suffix>
+
PrefixLength --> VInt
+
Suffix --> String
+
TermFreq --> VInt
+
Term + text prefixes are shared. The PrefixLength is the number of initial + characters from the previous term which must be prepended to a + term's suffix in order to form the term's text. Thus, if the + previous term's text was "bone" and the term is "boy", + the PrefixLength is two and the suffix is "y". +
+
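The prefix sharing described above can be illustrated with a small Java sketch. The class and method names are hypothetical and exist only for this example; the sketch works on in-memory strings and does not read an actual .tvf file.

```java
final class TermPrefixSketch {
    /** Rebuilds a term from the previous term, a PrefixLength and a Suffix. */
    static String rebuildTerm(String previousTerm, int prefixLength, String suffix) {
        // Take the shared leading characters from the previous term,
        // then append the stored suffix.
        return previousTerm.substring(0, prefixLength) + suffix;
    }

    /** Computes the PrefixLength a writer would store for a term. */
    static int sharedPrefixLength(String previousTerm, String term) {
        int limit = Math.min(previousTerm.length(), term.length());
        int i = 0;
        while (i < limit && previousTerm.charAt(i) == term.charAt(i)) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(sharedPrefixLength("bone", "boy"));  // prints 2
        System.out.println(rebuildTerm("bone", 2, "y"));        // prints boy
    }
}
```

Because each term is rebuilt from its predecessor, a reader of this kind has to walk the terms of a field in order rather than jump to an arbitrary one.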

+ +

+ Deleted Documents + +

The .del file is + optional, and only exists when a segment contains deletions: +

+ +

Deletions + (.del) --> ByteCount,BitCount,Bits +

+ +

ByteCount,BitCount --> + UInt32 +

+ +

Bits --> + <Byte>^ByteCount +

+ +

ByteCount + indicates the number of bytes in Bits. It is typically + (SegSize/8)+1. +

+ +

+ BitCount + indicates the number of bits that are currently set in Bits. +

+ +

Bits + contains one bit for each document indexed. When the bit + corresponding to a document number is set, that document is marked as + deleted. Bit ordering is from least to most significant. Thus, if + Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as + deleted. +
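As a rough illustration of the bit layout just described, the following Java sketch tests whether a document is marked as deleted and recomputes the number of set bits. The class and method names are hypothetical, and the sketch operates on a byte array already held in memory rather than on a real .del file.

```java
final class DeletedBitsSketch {
    /** Returns true if the bit for docNum is set, i.e. the document is marked deleted. */
    static boolean isDeleted(byte[] bits, int docNum) {
        int byteIndex = docNum >> 3;      // eight documents per byte
        int bitMask = 1 << (docNum & 7);  // least significant bit comes first
        return (bits[byteIndex] & bitMask) != 0;
    }

    /** Counts the set bits, which is the value BitCount is described as holding. */
    static int countSetBits(byte[] bits) {
        int count = 0;
        for (byte b : bits) {
            count += Integer.bitCount(b & 0xFF);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] bits = {0x00, 0x02};               // the two-byte example from the text
        System.out.println(isDeleted(bits, 9));   // prints true
        System.out.println(isDeleted(bits, 8));   // prints false
        System.out.println(countSetBits(bits));   // prints 1
    }
}
```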

+ +

+ Limitations +

There + are a few places where these file formats limit the maximum number of + terms and documents to a 32-bit quantity, or approximately 4 + billion. This is not a problem today but, in the long term, + probably will be. These fields should therefore be replaced with either + UInt64 values or, better yet, VInt values, which have no limit. +

+ +

+ + + + Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/gettingstarted.xml URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/gettingstarted.xml?view=auto&rev=479465 ============================================================================== --- lucene/java/trunk/src/site/src/documentation/content/xdocs/gettingstarted.xml (added) +++ lucene/java/trunk/src/site/src/documentation/content/xdocs/gettingstarted.xml Sun Nov 26 16:00:46 2006 @@ -0,0 +1,55 @@ + + +

+ + Apache Lucene - Getting Started Guide + +

+ +Andrew C. Oliver + + + +

+ Getting Started +

+This document is intended as a "getting started" guide. It has three audiences: first-time users +looking to install Apache Lucene in their application or web server; developers looking to modify or base +the applications they develop on Lucene; and developers looking to become involved in and contribute +to the development of Lucene. This document is written in tutorial and walk-through format. The +goal is to help you "get started". It does not go into great depth on some of the conceptual or +inner details of Lucene. +

+ +

+The sections listed below build on one another. More advanced users +may wish to skip sections. +

+ +

About the command-line Lucene demo and its usage. This section + is intended for anyone who wants to use the command-line Lucene demo.

+ +

About the sources and implementation for the command-line Lucene + demo. This section walks through the implementation details (sources) of the + command-line Lucene demo. This section is intended for developers.

+ +

About installing and configuring the demo template web + application. While this walk-through assumes Tomcat as your container of choice, + there is no reason you can't (provided you have the requisite knowledge) adapt the + instructions to your container. This section is intended for those responsible for the + development or deployment of Lucene-based web applications.

+ +

About the sources used to construct the demo template web + application. Please note the template application is designed to highlight features of + Lucene and is not an example of best practices. (One would hopefully use MVC + architecture such as provided by Jakarta Struts and taglibs, but showing you how to do that + would be WAY beyond the scope of this guide.) This section is intended for developers and + those wishing to customize the demo template web application to their needs.

+ + + + Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/asf-logo.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/asf-logo.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/asf-logo.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/favicon.ico URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/favicon.ico?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/favicon.ico ------------------------------------------------------------------------------ svn:executable = * Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/favicon.ico ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_architecture.jpg URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_architecture.jpg?view=auto&rev=479465 ============================================================================== Binary file - no diff available. 
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_architecture.jpg ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_crawling-process.jpg URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_crawling-process.jpg?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_crawling-process.jpg ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lia_3d.jpg URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lia_3d.jpg?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lia_3d.jpg ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_100.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_100.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. 
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_100.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_150.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_150.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_150.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_200.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_200.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_200.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_250.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_250.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. 
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_250.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_300.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_300.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_300.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_100.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_100.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_100.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_150.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_150.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. 
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_150.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_200.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_200.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_200.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_250.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_250.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_250.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_300.gif URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_300.gif?view=auto&rev=479465 ============================================================================== Binary file - no diff available. Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_300.gif ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream