lucene-java-user mailing list archives

From Jürgen Jakobitsch <>
Subject codec: accessing term dictionary
Date Thu, 09 Mar 2017 20:22:46 GMT

i'd like to ask users for their experiences with the fastest way to access
the term dictionary.

what i want to do is implement some algorithms for finding phrases (e.g.
mutual rank ratio [1]) and other statistics on term distribution
(generally: corpus-related stuff).

the idea would be to do the statistics on numbers (i.e. longs from the term
dictionary) to minimize memory usage. i did try this with TermsEnum plus the
ordinal numbers of terms, which are easily retrievable, but looking up a term
by ord then throws UnsupportedOperationException [2]. i see there's also a
codec, blocktreeord [3].
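
to make the idea concrete, here is a minimal sketch of counting adjacent term pairs on ordinals instead of strings. it deliberately does not touch lucene at all: the class TermOrdSketch and all of its methods are hypothetical, a plain in-memory list stands in for the term dictionary, and ordinals are assumed to fit into an int so that two of them can be packed into one long key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// hypothetical helper, NOT a lucene api: an in-memory stand-in for a term
// dictionary, used to show pair statistics computed on numbers only.
class TermOrdSketch {

    // sorted distinct terms; the list index serves as the term's ordinal
    static List<String> buildDictionary(List<String> tokens) {
        return new ArrayList<>(new TreeSet<>(tokens));
    }

    // term -> ordinal lookup, built once from the sorted dictionary
    static Map<String, Integer> ordLookup(List<String> dictionary) {
        Map<String, Integer> ords = new HashMap<>();
        for (int i = 0; i < dictionary.size(); i++) {
            ords.put(dictionary.get(i), i);
        }
        return ords;
    }

    // pack two int ordinals into a single long key (assumes ords fit in an int)
    static long pairKey(int first, int second) {
        return ((long) first << 32) | (second & 0xFFFFFFFFL);
    }

    static int firstOrd(long key) {
        return (int) (key >>> 32);
    }

    static int secondOrd(long key) {
        return (int) key;
    }

    // count adjacent term pairs; strings are only touched for the ord lookup
    static Map<Long, Integer> countBigrams(List<String> tokens, Map<String, Integer> ords) {
        Map<Long, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            long key = pairKey(ords.get(tokens.get(i)), ords.get(tokens.get(i + 1)));
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("semantic", "web", "company", "semantic", "web");
        List<String> dict = buildDictionary(tokens);
        Map<Long, Integer> counts = countBigrams(tokens, ordLookup(dict));
        // ordinals are mapped back to strings only when reporting
        for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            String first = dict.get(firstOrd(e.getKey()));
            String second = dict.get(secondOrd(e.getKey()));
            System.out.println(first + " " + second + " -> " + e.getValue());
        }
    }
}
```

the point of the sketch is only that the strings are needed twice, once to obtain the ordinals and once to report results; the reporting step is exactly the ord-to-term lookup that fails for me in [2].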

now, before diving deeper into this (i.e. changing codecs for my indexes),
i'd like to ask whether a workflow like the one described above is considered
at least semi-smart, or whether i'm on the wrong track and there's a smarter
way to avoid having to calculate collocations based on actual strings or
BytesRefs.

any pointer really appreciated.

kind regards, jürgen


*Jürgen Jakobitsch*
Innovation Director
Semantic Web Company GmbH
EU: +43-1-4021235-0
Mobile: +43-676-6212710

| skype     : jakobitsch-punkt