lucene-dev mailing list archives

From "Renaud Delbru (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference
Date Fri, 04 Feb 2011 12:53:28 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990548#comment-12990548 ]

Renaud Delbru commented on LUCENE-2886:
---------------------------------------

{quote}
So if we can pack long streams of 1s with
freqs and positions I think this is probably useful for a lot of people.
{quote}
Yes, if the overhead is minimal, it might not be an issue in certain cases.

{quote}
Additionally for the .doc, i see its smaller in the AFOR-3 case too. Is
your "Ent" basically a measure of doc deltas? I'm confused exactly
what it is 
{quote}

Yes, Ent is just a delta representation of the entity id (which can be considered
the document id). I have only changed the name of the concept, since SIREn mainly manipulates
entities rather than documents. In my case, an entity is just a set of attribute-value
pairs, similar to a document in Lucene.

{quote}
Because I would think if you take e.g. Geonames, the place
names in the dataset are not in random order but actually "batched" by
country for example, so you would have long streams of docdelta=1 for
country=Germany's postings. 
{quote}
I checked, and the Geonames dataset is sorted alphabetically by URL:
http://sws.geonames.org/1/
http://sws.geonames.org/10/
...
as are the DBpedia and Sindice datasets.

So, yes, this might have (good) consequences on the docdelta list for certain datasets such
as Geonames, and especially when indexing semi-structured data, as the schema of the data
in one dataset is generally identical across entities/documents. Therefore, we are likely to
see long runs of 1 for certain terms or schema terms.
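
To make the effect concrete, here is a minimal sketch in Python (not the AFOR code from the attached tarball). The function names and the two posting lists are hypothetical, and the fixed-width frame is only a simplification of frame-of-reference coding. It assumes entity ids are strictly increasing within a posting list.

```python
def doc_deltas(doc_ids):
    """Delta-encode a strictly increasing id list: first id as-is, then the gaps."""
    deltas = [doc_ids[0]]
    for prev, cur in zip(doc_ids, doc_ids[1:]):
        deltas.append(cur - prev)
    return deltas


def longest_run_of_ones(deltas):
    """Length of the longest consecutive run of delta == 1."""
    best = current = 0
    for d in deltas:
        current = current + 1 if d == 1 else 0
        best = max(best, current)
    return best


def bits_per_value(frame):
    """Bits per value if the whole frame is stored with one fixed bit width."""
    return max(frame).bit_length()


# Hypothetical postings: a schema term that occurs in every entity of a
# dataset indexed as one batch, versus a term scattered over the index.
clustered = list(range(1000, 1032))
scattered = [7, 90, 340, 341, 1200, 5000, 5003, 9000]

for name, postings in (("clustered", clustered), ("scattered", scattered)):
    gaps = doc_deltas(postings)[1:]  # ignore the first (absolute) id
    print(name, "longest run of 1s:", longest_run_of_ones(gaps),
          "bits per gap:", bits_per_value(gaps))
```

With these made-up lists, the clustered postings produce gaps that are all 1 and fit in a single bit each, while the scattered postings need a much wider frame.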

> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation works against the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (a Simple-family codec working on 64-bit words) that were used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org