lucene-dev mailing list archives

From Paolo Boldi <bo...@dsi.unimi.it>
Subject mg4j - Managing Gigabyte for Java
Date Fri, 17 Sep 2004 11:12:20 GMT

Hi everybody!

When we started the MG4J project, we did not set out to build anything like Lucene but rather,
as the project name suggests, to produce a Java version of the MG (Managing Gigabytes) project
of Moffat et al.

In the first stage, we decided to focus on indexing rather than document compression, because
that was our primary need: we wanted to use MG4J in the context of our other projects on Web
crawling, querying and compression (http://webgraph.dsi.unimi.it/ and
http://ubi.imc.pi.cnr.it/projects/ubicrawler/).

The idea was to have a Java library to create and access inverted indices. To do that, we
also needed efficient bit-level manipulation classes and raw variable-length encodings
of integers.
The library was aimed at developers rather than end users, although we had (and still have)
in mind that some tools to make it easy to use should eventually be provided.
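
To illustrate what "variable-length encoding of integers" at the bit level means for an
inverted index, here is a minimal sketch that stores a posting list as Elias gamma-coded gaps.
It is only an illustration under stated assumptions: the class and method names
(GammaSketch, writeGamma, readGamma, encodePostings, decodePostings) are hypothetical and are
not MG4J's API, and a List<Boolean> stands in for a real bit stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not MG4J code: gamma-coded gaps of a posting list.
class GammaSketch {

    /** Appends the Elias gamma code of x (x >= 1) to the bit list. */
    static void writeGamma(List<Boolean> bits, int x) {
        if (x < 1) throw new IllegalArgumentException("gamma coding needs x >= 1");
        int msb = 31 - Integer.numberOfLeadingZeros(x);   // floor(log2(x))
        for (int i = 0; i < msb; i++) bits.add(false);    // unary prefix: msb zeros
        for (int i = msb; i >= 0; i--) {                  // binary digits, most significant first
            bits.add(((x >>> i) & 1) == 1);
        }
    }

    /** Reads one gamma-coded integer starting at pos[0] and advances pos[0]. */
    static int readGamma(List<Boolean> bits, int[] pos) {
        int msb = 0;
        while (!bits.get(pos[0])) { msb++; pos[0]++; }    // count the zeros of the prefix
        int x = 0;
        for (int i = 0; i <= msb; i++) {                  // read msb + 1 binary digits
            x = (x << 1) | (bits.get(pos[0]++) ? 1 : 0);
        }
        return x;
    }

    /** Encodes strictly increasing, non-negative document numbers as gaps. */
    static List<Boolean> encodePostings(int[] docs) {
        List<Boolean> bits = new ArrayList<>();
        int prev = -1;                                    // so that the first gap is docs[0] + 1 >= 1
        for (int d : docs) {
            writeGamma(bits, d - prev);
            prev = d;
        }
        return bits;
    }

    /** Decodes count document numbers written by encodePostings. */
    static int[] decodePostings(List<Boolean> bits, int count) {
        int[] docs = new int[count];
        int[] pos = {0};
        int prev = -1;
        for (int i = 0; i < count; i++) {
            prev += readGamma(bits, pos);
            docs[i] = prev;
        }
        return docs;
    }

    public static void main(String[] args) {
        int[] docs = {3, 4, 9, 20};
        List<Boolean> bits = encodePostings(docs);
        System.out.println(bits.size() + " bits: "
                + Arrays.toString(decodePostings(bits, docs.length)));
    }
}
```

Small gaps get short codes (a gap of 1 costs a single bit), which is why dense posting lists
compress well under this kind of scheme.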

Otis is only partially right: the old version of MG4J did not contain any way to search the
index, but the current release (http://mg4j.dsi.unimi.it/) has some basic search capabilities,
and you can query the index with general boolean expressions (OR, AND, NOT, full phrases, etc.).
We also made the overall structure a lot more flexible and easier to use. Most of the new
features are still experimental and only partially documented: we plan to give a full account
of them in the next few weeks or months, both by providing more documentation and by writing
research papers about the features that are new in the field.
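
As a rough picture of how boolean expressions can be evaluated over an inverted index, the
sketch below implements AND, OR and NOT on sorted arrays of document numbers. This is a
generic textbook technique, not a description of how MG4J evaluates queries; the class
BooleanQuerySketch and its methods are hypothetical, and phrase queries (which need word
positions) are left out.

```java
import java.util.Arrays;

// Hypothetical illustration, not MG4J code: boolean operators on sorted posting lists.
class BooleanQuerySketch {

    /** AND: documents present in both sorted lists. */
    static int[] and(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[n++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, n);
    }

    /** OR: documents present in at least one sorted list, without duplicates. */
    static int[] or(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, n = 0;
        while (i < a.length || j < b.length) {
            if (j == b.length || (i < a.length && a[i] < b[j])) out[n++] = a[i++];
            else if (i == a.length || b[j] < a[i]) out[n++] = b[j++];
            else { out[n++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, n);
    }

    /** NOT: documents in 0..numDocs-1 that are absent from a (a must lie in that range). */
    static int[] not(int[] a, int numDocs) {
        int[] out = new int[numDocs - a.length];
        int j = 0, n = 0;
        for (int d = 0; d < numDocs; d++) {
            if (j < a.length && a[j] == d) j++;
            else out[n++] = d;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] java = {1, 3, 5, 7};      // documents containing "java" (made-up data)
        int[] index = {3, 4, 5, 8};     // documents containing "index" (made-up data)
        System.out.println(Arrays.toString(and(java, index)));
        System.out.println(Arrays.toString(or(java, not(index, 10))));
    }
}
```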

Still, MG4J has different aims from Lucene, so the two projects are not directly comparable:

- MG4J assumes that you provide documents in the very rough form of word sequences: you have
  to do the tokenization/parsing yourself.
- MG4J has no concept of "field", but the new version introduces a (much more rudimentary)
  notion of different indices built over the same document collection (for example, a mailbox
  indexed by subject, author, content, etc.).
- On the other hand, MG4J puts much emphasis on the use of state-of-the-art compression and
  querying techniques (the new version contains experimental classes to produce indices with
  multilevel skip lists, lazy search and semantically sound multi-index queries), so you can
  usually expect smaller indices and faster searches.
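
To show why skip information speeds up searches, here is a minimal sketch of intersecting two
posting lists with a skipTo operation. It deliberately uses a single level of skips over plain
int arrays, not the multilevel skip lists over compressed data mentioned above, and the names
(SkipPostings, skipTo, intersect) are hypothetical rather than MG4J's API.

```java
import java.util.Arrays;

// Hypothetical illustration, not MG4J code: single-level skips over a sorted posting list.
class SkipPostings {
    private final int[] docs;   // sorted document numbers, all < Integer.MAX_VALUE
    private final int step;     // distance between skip points, about sqrt(length)
    private int pos = 0;        // current position in docs

    SkipPostings(int[] docs) {
        this.docs = docs;
        this.step = Math.max(1, (int) Math.sqrt(docs.length));
    }

    /** Returns the first document >= target from the current position on,
     *  or Integer.MAX_VALUE if the list is exhausted. */
    int skipTo(int target) {
        // Jump a whole block whenever the entry one block ahead is still too small.
        while (pos + step < docs.length && docs[pos + step] < target) pos += step;
        // Finish with a short linear scan inside the block.
        while (pos < docs.length && docs[pos] < target) pos++;
        return pos < docs.length ? docs[pos] : Integer.MAX_VALUE;
    }

    /** Intersects two sorted lists by letting each one skip to the other's candidate. */
    static int[] intersect(int[] a, int[] b) {
        SkipPostings x = new SkipPostings(a), y = new SkipPostings(b);
        int[] out = new int[Math.min(a.length, b.length)];
        int n = 0;
        int d = x.skipTo(0);
        while (d != Integer.MAX_VALUE) {
            int e = y.skipTo(d);
            if (e == Integer.MAX_VALUE) break;
            if (e == d) { out[n++] = d; d = x.skipTo(d + 1); }
            else d = x.skipTo(e);
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] a = {1, 4, 9, 16, 25, 36, 49, 64, 81, 100};
        int[] b = {4, 5, 36, 50, 100, 200};
        System.out.println(Arrays.toString(intersect(a, b)));
    }
}
```

In a compressed index the benefit is larger than it looks here, because a skip also avoids
decoding the postings that are jumped over.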

Bye

				Paolo Boldi

> Hi Anson,
> 
> It's not quite correct to compare MG4J and Lucene directly.  Lucene
> is a toolkit whose primary goal is to let you create an index and
> search it, while MG4J is really a library of Java classes that people
> implementing an IR library (such as Lucene, for example) may find
> useful.  You cannot create a searchable index with MG4J alone.
> 
> Otis
> 
> 
> --- Anson Lau <alau@fulfil-net.com> wrote:
> 
> > Hi All,
> > 
> > Has anyone seen the project MG4J (Managing Gigabyte for Java)
> > http://mg4j.dsi.unimi.it/ ?  Does anybody know enough about both Lucene
> > and MG4J to comment on how the two compare?
> > 
> > Thanks,
> > 
> > Anson
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> > 
> > 
> 
> 
