lucene-java-user mailing list archives

From Jürgen Jakobitsch <juergen.jakobit...@semantic-web.com>
Subject codec: accessing term dictionary
Date Thu, 09 Mar 2017 20:22:46 GMT
hi,

i'd like to ask users for their experiences with the fastest way to access
the term dictionary.

what i want to do is implement some algorithms to find phrases (e.g.
mutual rank ratio [1]) and compute other statistics on term distribution
(generally: corpus-related stuff).

the idea would be to do the statistics on numbers (i.e. a long per term
from the term dictionary) rather than on the terms themselves, to minimize
memory usage. i did try this with TermsEnum + the ordinal numbers of terms,
which are easily retrievable, but getting a term by ord then throws
UnsupportedOperationException [2]. i see there's also a codec,
blocktreeords [3].
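
A minimal sketch of the "statistics on numbers" idea, under stated assumptions: it does not use Lucene at all. The sorted term list stands in for what iterating a TermsEnum would yield, and the class and method names (OrdinalBigramCounts, buildOrdinals, pack, countBigrams) are hypothetical, chosen only for illustration. Each term gets its position in the sorted dictionary as an ordinal, and adjacent term pairs are counted under a single long key built from the two ordinals.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OrdinalBigramCounts {

    // Assign each term its position in the sorted dictionary.
    // Assumption: in Lucene this sequence would come from walking a TermsEnum;
    // here it is a plain list so the sketch stays self-contained.
    public static Map<String, Integer> buildOrdinals(List<String> sortedTerms) {
        Map<String, Integer> ords = new HashMap<>();
        for (int i = 0; i < sortedTerms.size(); i++) {
            ords.put(sortedTerms.get(i), i);
        }
        return ords;
    }

    // Pack two int ordinals into one long: left in the high 32 bits, right in the low 32.
    public static long pack(int left, int right) {
        return ((long) left << 32) | (right & 0xFFFFFFFFL);
    }

    public static int left(long key) {
        return (int) (key >>> 32);
    }

    public static int right(long key) {
        return (int) key;
    }

    // Count adjacent ordinal pairs in a token stream, keyed by the packed long.
    public static Map<Long, Integer> countBigrams(int[] tokenOrds) {
        Map<Long, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokenOrds.length; i++) {
            counts.merge(pack(tokenOrds[i], tokenOrds[i + 1]), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical three-term dictionary, already in sorted order.
        List<String> dictionary = Arrays.asList("dictionary", "lucene", "term");
        Map<String, Integer> ords = buildOrdinals(dictionary);

        String[] tokens = {"lucene", "term", "dictionary", "lucene", "term"};
        int[] tokenOrds = Arrays.stream(tokens).mapToInt(ords::get).toArray();

        // Strings are only needed again when reporting the results.
        countBigrams(tokenOrds).forEach((key, n) ->
            System.out.println(dictionary.get(left(key)) + " " + dictionary.get(right(key)) + " = " + n));
    }
}
```

The boxed HashMap<Long, Integer> is used only to keep the sketch dependency-free; a primitive long-keyed map would be needed to actually realise the memory savings. Ordinals assigned this way are also only stable for the term sequence they were built from.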

now before diving deeper into this (i.e. changing codecs for my indexes),
i'd like to ask whether a workflow like the one described above is
considered at least semi-smart, or whether i'm on the wrong track and
there's a smarter way to avoid calculating collocations based on actual
strings or BytesRefs.

any pointers are really appreciated.

kind regards, jürgen

[1] http://www.google.ch/patents/US20100250238
[2] https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/lucene/core/src/java/org/apache/lucene/codecs/blocktree/SegmentTermsEnum.java
[3] https://github.com/apache/lucene-solr/blob/master/lucene/codecs/src/java/org/apache/lucene/codecs/blocktreeords/OrdsSegmentTermsEnum.java

*Jürgen Jakobitsch*
Innovation Director
Semantic Web Company GmbH
EU: +43-1-4021235-0
Mobile: +43-676-6212710
http://www.semantic-web.at
http://www.poolparty.biz



PERSONAL INFORMATION
| web       : http://www.turnguard.com
| foaf      : http://www.turnguard.com/turnguard
| g+        : https://plus.google.com/111233759991616358206/posts
| skype     : jakobitsch-punkt
| xmlns:tg  = "http://www.turnguard.com/turnguard#"
| blockchain : https://onename.com/turnguard
