lucene-java-user mailing list archives

From Juarez Sampaio <jua...@simbioseventures.com>
Subject Re: Automata and Transducer on Lucene 6
Date Wed, 19 Apr 2017 14:29:16 GMT
Thanks for the reply everyone!

I've spent some time looking at tests and source code and I've learned a
lot about Lucene's Automata and FST. Way more productive than scanning
javadocs. Thanks for the hint.

> Are you looking for a historical book on Lucene development or are you
> looking to solve a particular problem?

Dawid, the thing is that I am not even sure that Automata are the right
fit for my project, and I thought some literature on them would help me
decide whether to use them. Furthermore, I'll probably have to modify
Lucene's Automata implementation for my project, and I thought that reading
about the design choices would help me understand how to improve it.

Anyway, I am done with basic usage and whenever I find time to write about
what I've learned I'll make sure to share the link here.

I'll list some features I'll need to add to the Lucene 6 base implementation
of Automata and FST. I'd like to hear your thoughts on them, especially if
you find any of them a particular waste of time.

1. I need to be able to add data to an FST (add new strings and, in the
case of an FST, update the mapped value). I thought about a multi-layer
strategy where old data is compressed into an FST while new data is added
to a delta partition (probably a BST or a simple list). A background
process merges the delta into the closed FST. The merge consists of
materializing all strings encoded in the FST, merging that list with the
strings in the delta, and then constructing a new FST. The merge can
probably be done while enumerating, since the enumeration happens in
lexicographical order.

2. I need to have multiple FSTs loaded and to be able to search them on
demand. I thought about modifying the implementation to access data in a
memory-mapped file instead of raw in-memory byte[] or int[] arrays.
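
To make item 1 concrete, here is a minimal sketch of the merge step in plain
Java. It does not use any Lucene API: the two iterators stand in for the
sorted enumeration of the closed FST and for the sorted delta, and the class
name DeltaMerge and the String/Long types are hypothetical. On a duplicate
key the delta value wins, which is how I would model an update. The
comparison would have to match whatever order the FST was built in;
String.compareTo is used here only for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: a two-way merge of sorted streams.
final class DeltaMerge {

    /**
     * Merges two streams that are each sorted by key and free of duplicates.
     * When both streams contain the same key, the entry from delta is kept.
     * The result is sorted, so it could be fed to a builder that requires
     * sorted input.
     */
    static List<Map.Entry<String, Long>> merge(
            Iterator<Map.Entry<String, Long>> base,
            Iterator<Map.Entry<String, Long>> delta) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        Map.Entry<String, Long> b = base.hasNext() ? base.next() : null;
        Map.Entry<String, Long> d = delta.hasNext() ? delta.next() : null;

        while (b != null || d != null) {
            // An exhausted stream is treated as "greater than" the other one.
            int cmp;
            if (b == null) {
                cmp = 1;
            } else if (d == null) {
                cmp = -1;
            } else {
                cmp = b.getKey().compareTo(d.getKey());
            }

            if (cmp < 0) {
                out.add(b);
                b = base.hasNext() ? base.next() : null;
            } else if (cmp > 0) {
                out.add(d);
                d = delta.hasNext() ? delta.next() : null;
            } else {
                // Same key in both: the newer (delta) value replaces the old one.
                out.add(d);
                b = base.hasNext() ? base.next() : null;
                d = delta.hasNext() ? delta.next() : null;
            }
        }
        return out;
    }
}
```

In a real merge the output would be streamed straight into the new FST
instead of being collected into a list; the list only keeps the sketch short.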

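For item 2, this is roughly what I have in mind, again as a sketch that
uses only java.nio and no Lucene classes. MappedBytes is a hypothetical
name; it only shows random byte access backed by a read-only memory-mapped
file instead of a heap byte[]. Note that a single mapping is limited to
2 GB, so a larger file would need several mapped regions.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Lucene: read-only byte access via mmap.
final class MappedBytes {
    private final MappedByteBuffer buf;

    MappedBytes(Path file) throws IOException {
        // The mapping stays valid after the channel is closed.
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    /** Absolute read; does not change the buffer's position. */
    byte byteAt(int pos) {
        return buf.get(pos);
    }

    /** Number of mapped bytes. */
    int length() {
        return buf.capacity();
    }
}
```

With something like this, each loaded FST would cost address space rather
than heap, and the OS page cache would decide which parts stay resident.
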
Juarez

On Wed, Apr 19, 2017 at 3:39 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:

> > One small correction: we moved away from objects to more compact int[] a
> > while ago for our automata implementation.
>
> Right, forgot about that. There are still some trappy object-heavy
> utilities like this one:
>
>
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/util/automaton/Automaton.java#L127-L129
>
> and the API is using objects (Transition) for an 'inout' mutable
> holder type which may be confusing at first (but is unavoidable in
> Java).
>
> Dawid
>
-- 
Juarez
