lucene-java-user mailing list archives

From Jürgen Jakobitsch <juergen.jakobit...@semantic-web.com>
Subject Re: codec: accessing term dictionary
Date Fri, 10 Mar 2017 11:03:06 GMT
Dawid, thanks for your input..

initially i was hoping to be able to use an FST somehow in this process, but
my knowledge in this area is fairly limited..
i will give it a second thought anyway... ;-)
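
for the archives, here's roughly what i understand the FST idea to be (an
untested sketch against the 6.x org.apache.lucene.util.fst API; the terms
and counts below are made up):

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.IntsRefBuilder;
    import org.apache.lucene.util.fst.Builder;
    import org.apache.lucene.util.fst.FST;
    import org.apache.lucene.util.fst.PositiveIntOutputs;
    import org.apache.lucene.util.fst.Util;

    // map terms (or shingles) to counts; inputs must be added in
    // sorted BytesRef order
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRefBuilder scratch = new IntsRefBuilder();
    builder.add(Util.toIntsRef(new BytesRef("mutual rank"), scratch), 17L);
    builder.add(Util.toIntsRef(new BytesRef("term dictionary"), scratch), 42L);
    FST<Long> fst = builder.finish();

    // later: look up a count without keeping a giant hash table around
    Long count = Util.get(fst, new BytesRef("term dictionary")); // 42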

krj

*Jürgen Jakobitsch*
Innovation Director
Semantic Web Company GmbH
EU: +43-1-4021235-0
Mobile: +43-676-6212710
http://www.semantic-web.at
http://www.poolparty.biz



PERSONAL INFORMATION
| web       : http://www.turnguard.com
| foaf      : http://www.turnguard.com/turnguard
| g+        : https://plus.google.com/111233759991616358206/posts
| skype     : jakobitsch-punkt
| xmlns:tg  = "http://www.turnguard.com/turnguard#"
| blockchain : https://onename.com/turnguard

2017-03-10 11:49 GMT+01:00 Dawid Weiss <dawid.weiss@gmail.com>:

> Or you could encode those term/ngram frequencies in one FST and then
> reuse it. This would be memory-saving and fairly fast (~comparable to
> a hash table).
>
> Dawid
>
> On Fri, Mar 10, 2017 at 11:41 AM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
> > Yes, this is a reasonable way to use Lucene (to see term statistics
> > across the corpus), but it may not be performant enough for your needs.
> >
> > E.g. spending the memory on a giant hash table may be faster for
> > one-time or periodic corpus analysis.
> >
> > If you are looking for word N-gram stats, you could index your text
> > with ShingleFilter to make it faster to get N-gram counts.
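
something like this, i guess (an untested sketch against the 6.x analysis
modules; the shingle sizes are my own choice):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // emit word 2- and 3-grams ("shingles") next to the unigrams, so
    // N-gram counts become ordinary term statistics in the index
    Analyzer shingleAnalyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        return new TokenStreamComponents(source, new ShingleFilter(source, 2, 3));
      }
    };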
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Thu, Mar 9, 2017 at 3:22 PM, Jürgen Jakobitsch <
> > juergen.jakobitsch@semantic-web.com> wrote:
> >
> >> hi,
> >>
> >> i'd like to ask users for their experiences with the fastest way to
> >> access the term dictionary.
> >>
> >> what i want to do is implement some algorithms to find phrases (e.g.
> >> mutual rank ratio [1]) and other statistics on term distribution
> >> (generally, corpus-related stuff).
> >>
> >> the idea would be to do statistics on numbers (i.e. longs from the
> >> term dictionary) to minimize memory usage. i did try this with
> >> TermsEnum + the ordinal number of terms, which are easily retrievable,
> >> but getting a term by ord then throws UnsupportedOperationException
> >> [2]. i see there's also a codec, blocktreeords [3].
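
for reference, the enumeration i tried looks roughly like this (a sketch
against the 6.x MultiFields API; "directory" and the field name are made
up):

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    IndexReader reader = DirectoryReader.open(directory);
    Terms terms = MultiFields.getTerms(reader, "body");
    TermsEnum te = terms.iterator();
    BytesRef term;
    while ((term = te.next()) != null) {
      int docFreq = te.docFreq();           // docs containing the term
      long totalFreq = te.totalTermFreq();  // total occurrences
      // te.ord() at this point throws UnsupportedOperationException
      // with the default block-tree postings format [2]
    }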
> >>
> >> now before diving deeper into this (i.e. changing codecs for my
> >> indexes), i'd like to ask whether a workflow like the one described
> >> above is considered at least semi-smart, or whether i'm on the wrong
> >> track and there's a smarter way that avoids calculating collocations
> >> based on actual strings or BytesRefs?
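
if changing codecs is the way, i assume the switch would look roughly like
this (an untested sketch; Lucene62Codec was the default codec around 6.4,
and BlockTreeOrdsPostingsFormat is the blocktreeords format from [3]):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.blocktreeords.BlockTreeOrdsPostingsFormat;
    import org.apache.lucene.codecs.lucene62.Lucene62Codec;
    import org.apache.lucene.index.IndexWriterConfig;

    // write every field's terms with the ords-capable terms dictionary,
    // so TermsEnum.ord()/seekExact(long) become supported
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setCodec(new Lucene62Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        return new BlockTreeOrdsPostingsFormat();
      }
    });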
> >>
> >> any pointers really appreciated.
> >>
> >> kind regards, jürgen
> >>
> >> [1] http://www.google.ch/patents/US20100250238
> >> [2] https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/lucene/core/src/java/org/apache/lucene/codecs/blocktree/SegmentTermsEnum.java
> >> [3] https://github.com/apache/lucene-solr/blob/master/lucene/codecs/src/java/org/apache/lucene/codecs/blocktreeords/OrdsSegmentTermsEnum.java
> >>
