lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: codec: accessing term dictionary
Date Fri, 10 Mar 2017 10:41:08 GMT
Yes, this is a reasonable way to use Lucene (to see term statistics across
the corpus), but it may not be fast enough for your needs.

E.g. spending the memory to build a giant hash table for one-time or
periodic corpus analysis may be faster.
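
A minimal sketch of what such an in-memory table could look like, in plain Java with no Lucene involved. The class name BigramTable, the whitespace tokenization, and the packing of two int term ids into one long key are assumptions made for illustration (the packing picks up the original question's wish to do the statistics on numbers rather than strings); this is not an API from Lucene or from this thread.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: counts adjacent word pairs in memory, keyed by packed term ids. */
final class BigramTable {
    // term -> dense int id, assigned in order of first appearance
    private final Map<String, Integer> termIds = new HashMap<>();
    // packed (firstId, secondId) -> number of times the pair was seen
    private final Map<Long, Long> pairCounts = new HashMap<>();

    /** Returns the existing id for a term, or assigns the next free one. */
    private int idOf(String term) {
        Integer id = termIds.get(term);
        if (id == null) {
            id = termIds.size();
            termIds.put(term, id);
        }
        return id;
    }

    /** Packs two non-negative int ids into a single long key. */
    static long pack(int first, int second) {
        return ((long) first << 32) | (second & 0xFFFFFFFFL);
    }

    /** Lowercases, splits on whitespace (a simplification) and counts adjacent pairs. */
    void addDocument(String text) {
        int previous = -1; // no pair is counted across document boundaries
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            int current = idOf(token);
            if (previous >= 0) {
                pairCounts.merge(pack(previous, current), 1L, Long::sum);
            }
            previous = current;
        }
    }

    /** How often `first` was immediately followed by `second`; 0 if either term is unknown. */
    long count(String first, String second) {
        Integer a = termIds.get(first);
        Integer b = termIds.get(second);
        if (a == null || b == null) {
            return 0L;
        }
        return pairCounts.getOrDefault(pack(a, b), 0L);
    }
}
```

Feeding each document's text through addDocument once and then calling count("new", "york") gives the pair frequency without touching the index. The boxed Long keys are still fairly memory-hungry; a primitive long-to-long map would be the obvious next step if that matters.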

If you are looking for word n-gram stats, you could index your text with
ShingleFilter to make it faster to get n-gram counts.
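
To illustrate what shingling produces, here is a plain-Java stand-in. It is not Lucene's ShingleFilter itself, and the class name Shingles and the single-space separator are assumptions for this sketch. The point is that each run of n adjacent tokens becomes one term, so once such terms are in the index an n-gram count is an ordinary term-statistics lookup instead of a positional computation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in that mimics the kind of tokens a shingle filter emits. */
final class Shingles {
    /** Returns every run of exactly n adjacent tokens, each joined by a single space. */
    static List<String> of(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<String> out = new ArrayList<>();
        // Slide a window of n tokens over the input, one position at a time.
        for (int start = 0; start + n <= tokens.size(); start++) {
            out.add(String.join(" ", tokens.subList(start, start + n)));
        }
        return out;
    }
}
```

For the tokens "please", "divide", "this", "sentence" and n = 2, this yields "please divide", "divide this" and "this sentence". In a real index the shingled field would be produced by the analyzer chain at index time, at the cost of a larger term dictionary.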

Mike McCandless

http://blog.mikemccandless.com

On Thu, Mar 9, 2017 at 3:22 PM, Jürgen Jakobitsch <
juergen.jakobitsch@semantic-web.com> wrote:

> hi,
>
> i'd like to ask users for their experiences with the fastest way to access
> the term dictionary.
>
> what i want to do is to implement some algorithms to find phrases (e.g.
> mutual rank ratio [1])
> (and other statistics on term distribution, generally: corpus related
> stuff)
>
> the idea would be to do statistics on numbers (i.e. long from the term
> dictionary) to minimize memory usage. i did try this with termsEnum +
> ordinal number of terms, which are easily retrievable, but getting a term
> by ord then throws UnsupportedOperationException [2]. i see there's also a
> codec blocktreeord [3].
>
> now before diving deeper into this (i.e. changing codecs for my indexes),
> i'd like to ask if a workflow like the one described above is considered
> at least semi smart, or if i'm on the wrong track with this and there's a
> smarter way to avoid having to calculate collocations based on actual
> strings or BytesRefs?
>
> any pointer really appreciated.
>
> kind regards, jürgen
>
> [1] http://www.google.ch/patents/US20100250238
> [2] https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/lucene/core/src/java/org/apache/lucene/codecs/blocktree/SegmentTermsEnum.java
> [3] https://github.com/apache/lucene-solr/blob/master/lucene/codecs/src/java/org/apache/lucene/codecs/blocktreeords/OrdsSegmentTermsEnum.java
>
> *Jürgen Jakobitsch*
> Innovation Director
> Semantic Web Company GmbH
> EU: +43-1-4021235-0
> Mobile: +43-676-6212710
> http://www.semantic-web.at
> http://www.poolparty.biz
>
>
>
> PERSONAL INFORMATION
> | web       : http://www.turnguard.com
> | foaf      : http://www.turnguard.com/turnguard
> | g+        : https://plus.google.com/111233759991616358206/posts
> | skype     : jakobitsch-punkt
> | xmlns:tg  = "http://www.turnguard.com/turnguard#"
> | blockchain : https://onename.com/turnguard
>
