lucene-java-user mailing list archives

From Armins Stepanjans <armins.bagr...@gmail.com>
Subject Format of Wikipedia Index
Date Tue, 23 Jan 2018 03:27:13 GMT
Hi,

I have a question about the format of the index that DocMaker creates
from EnWikiContentSource.

After creating the index from a dump of all Wikipedia articles
(https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2),
I'm having trouble understanding the format of the documents it contains:
when I retrieve a document from the index, its only field is docid.
Does this indicate that the indexing went wrong? If not, how should I use
the index to search for occurrences of a term within a particular article?
(I was thinking of a boolean query with one sub-query matching the
article's name and the other matching the term I'm searching for within
the article.)
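
To make the two-clause idea concrete, here is a minimal sketch that only builds the query string, using the `+` (required clause) notation of Lucene's classic query-parser syntax. The field names `doctitle` and `body` are assumptions about what the index contains, and the class and method names are hypothetical; nothing here touches the index itself.

```java
public class ArticleTermQuery {
    // Assumed field names; verify them against the fields actually present in the index.
    static final String TITLE_FIELD = "doctitle";
    static final String BODY_FIELD = "body";

    // Wrap text in double quotes, escaping backslashes and quotes inside it.
    static String quote(String text) {
        return "\"" + text.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Both clauses carry a '+' prefix, so both are required to match.
    static String build(String articleTitle, String term) {
        return "+" + TITLE_FIELD + ":" + quote(articleTitle)
             + " +" + BODY_FIELD + ":" + quote(term);
    }

    public static void main(String[] args) {
        // prints: +doctitle:"Albert Einstein" +body:"relativity"
        System.out.println(build("Albert Einstein", "relativity"));
    }
}
```

A string like this could then be handed to a query parser, provided both fields were actually indexed.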

Regards,
Armīns
