lucene-java-user mailing list archives

From Dawid Weiss <dawid.we...@gmail.com>
Subject Re: codec: accessing term dictionary
Date Fri, 10 Mar 2017 10:49:03 GMT
Or you could encode those term/n-gram frequencies in one FST and then
reuse it. This would save memory and be fairly fast (roughly comparable
to a hash table).

Dawid
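
For comparison, here is a minimal sketch of the plain hash table
approach mentioned in the quoted reply below, doing the statistics on
numbers rather than strings as the original question proposes. It is
plain Java with no Lucene calls. The class name NgramCounts and all of
its methods are hypothetical, and the integer ids are assigned locally
in order of first appearance; they are not Lucene term ordinals.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts unigrams and word bigrams
// using integer term ids instead of strings as map keys.
class NgramCounts {
    // term string -> dense integer id, assigned in order of first appearance
    private final Map<String, Integer> termIds = new HashMap<>();
    private final List<String> termsById = new ArrayList<>();
    // unigram counts keyed by term id
    private final Map<Integer, Long> unigrams = new HashMap<>();
    // bigram counts keyed by two term ids packed into one long
    private final Map<Long, Long> bigrams = new HashMap<>();

    private int idOf(String term) {
        Integer id = termIds.get(term);
        if (id == null) {
            id = termsById.size();
            termIds.put(term, id);
            termsById.add(term);
        }
        return id;
    }

    // High 32 bits hold the first id, low 32 bits hold the second id.
    private static long pack(int first, int second) {
        return ((long) first << 32) | (second & 0xFFFFFFFFL);
    }

    // Counts every token and every adjacent pair within one token sequence.
    // Pairs are not formed across separate calls.
    void addSentence(String[] tokens) {
        int prev = -1;
        for (String token : tokens) {
            int id = idOf(token);
            unigrams.merge(id, 1L, Long::sum);
            if (prev >= 0) {
                bigrams.merge(pack(prev, id), 1L, Long::sum);
            }
            prev = id;
        }
    }

    // Reverse lookup from id to term string.
    String termOf(int id) {
        return termsById.get(id);
    }

    long count(String term) {
        Integer id = termIds.get(term);
        return id == null ? 0L : unigrams.getOrDefault(id, 0L);
    }

    long count(String first, String second) {
        Integer a = termIds.get(first);
        Integer b = termIds.get(second);
        if (a == null || b == null) {
            return 0L;
        }
        return bigrams.getOrDefault(pack(a, b), 0L);
    }

    public static void main(String[] args) {
        NgramCounts counts = new NgramCounts();
        counts.addSentence("new york is a big city".split(" "));
        counts.addSentence("i love new york".split(" "));
        System.out.println("new: " + counts.count("new"));
        System.out.println("new york: " + counts.count("new", "york"));
    }
}
```

Packing two int ids into one long keeps each bigram key to a single
primitive-sized value, which is the kind of "statistics on numbers" the
question asks about. An FST would store the same term-to-count mapping
more compactly, at the cost of building it from sorted input.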

On Fri, Mar 10, 2017 at 11:41 AM, Michael McCandless
<lucene@mikemccandless.com> wrote:
> Yes, this is a reasonable way to use Lucene (to see term statistics across
> the corpus), but it may not be performant enough for your needs.
>
> E.g. spending memory on a giant hash table for one-time or periodic
> corpus analysis may be faster.
>
> If you are looking for word n-gram stats, you could index your text with
> ShingleFilter to make it faster to get n-gram counts.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Mar 9, 2017 at 3:22 PM, Jürgen Jakobitsch <
> juergen.jakobitsch@semantic-web.com> wrote:
>
>> hi,
>>
>> i'd like to ask users for their experiences with the fastest way to access
>> the term dictionary.
>>
>> what i want to do is to implement some algorithms to find phrases (e.g.
>> mutual rank ratio [1])
>> (and other statistics on term distribution, generally: corpus related
>> stuff)
>>
>> the idea would be to do statistics on numbers (i.e. long from the term
>> dictionary) to minimize memory usage. i did try this with termsEnum +
>> ordinal number of terms, which are easily retrievable, but getting a term
>> by ord then throws UnsupportedOperationException [2]. i see there's also a
>> codec blocktreeord [3].
>>
>> now before diving deeper into this (i.e. changing codecs for my indexes),
>> i'd like to ask if a workflow like the one described above is considered
>> at least semi smart, or if i'm on the wrong track with this and there's a
>> smarter way to avoid calculating collocations based on actual strings
>> or BytesRefs?
>>
>> any pointer really appreciated.
>>
>> kind regards jürgen
>>
>> [1] http://www.google.ch/patents/US20100250238
>> [2] https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/lucene/core/src/java/org/apache/lucene/codecs/blocktree/SegmentTermsEnum.java
>> [3] https://github.com/apache/lucene-solr/blob/master/lucene/codecs/src/java/org/apache/lucene/codecs/blocktreeords/OrdsSegmentTermsEnum.java
>>
