lucene-java-user mailing list archives

From José Tomás Atria <jtat...@gmail.com>
Subject Re: Persistence/Serialization of Automaton
Date Thu, 24 Mar 2016 17:08:28 GMT
Thanks for the link to the archives, Jim, but I get the feeling that the
approach discussed in that thread may be overkill for what I want to do.
I'm not even using Query objects, just filtering Terms instances with
automata.

Short version: I basically wanted to know whether there is code somewhere in
Lucene that allows Automaton instances to be persisted and, if this is
not currently available, whether it is reasonably simple to implement or
there are gotchas I should be aware of. I am not an FSA expert, nor am I
familiar with Lucene's implementation, but from my basic understanding,
automata are basically int arrays, so it should be trivial to implement
persistence support.

I was looking at the code for Automaton.copy( Automaton other ), but it
seems that this method relies on some state variables in the copied
automaton, and that's when I decided to come here and ask before shooting
myself in the foot.

The long version:

What I am trying to do may be sort of a stretch with Lucene, but it's
actually pretty efficient and straightforward: My application needs to
define sets of terms that will be included/excluded and/or grouped to build
a lexicon, a stable list of words represented as sets of terms. Right now,
I'm doing this more or less like:

Automaton included = getIncludeRule();
Automaton excluded = getExcludeRule();
// Keep terms matched by the include rule and not by the exclude rule.
Automaton filter = Operations.intersection( included,
    Operations.complement( excluded, Operations.DEFAULT_MAX_DETERMINIZED_STATES ) );
Terms terms = leafReader.terms( "field" );
// A null start term begins the enumeration at the first matching term.
TermsEnum tEnum = terms.intersect( new CompiledAutomaton( filter ), null );
while( tEnum.next() != null ) {
    // add term to lexicon.
}
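
As an illustration of what the combined rule selects, here is a minimal sketch that uses java.util.regex patterns as stand-ins for the two automata. This is a swapped-in technique, not Lucene's automaton API: the class name LexiconFilterSketch, the sample terms and both patterns are hypothetical, and unlike terms.intersect() it tests every term one by one. It only shows the set logic: a term is kept when the include rule accepts it and the exclude rule does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only: regex patterns stand in for automata.
final class LexiconFilterSketch {

    // Keeps the terms accepted by `included` and rejected by `excluded`,
    // i.e. the language of intersection(included, complement(excluded)).
    static List<String> filter(List<String> terms, Pattern included, Pattern excluded) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            boolean in = included.matcher(term).matches();
            boolean out = excluded.matcher(term).matches();
            if (in && !out) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("running", "runner", "walking");
        // Include everything starting with "run", exclude everything ending in "ing".
        System.out.println(filter(terms, Pattern.compile("run.*"), Pattern.compile(".*ing")));
    }
}
```

The automaton version expresses the same rule as a single machine, which is what lets the terms dictionary skip non-matching terms instead of testing each one.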

Grouping terms works in a similar way:

Terms terms = leafReader.terms( "field" );
Map<Automaton,String> termSets = getTermSets();
for( Entry<Automaton,String> tSet : termSets.entrySet() ) {
     CompiledAutomaton cau = new CompiledAutomaton( tSet.getKey() );
     TermsEnum tEnum = terms.intersect( cau, null );
     MyTermGroup grp = new MyTermGroup( tSet.getKey(), tSet.getValue() );
     while( tEnum.next() != null ) {
           grp.addAnddoStuff( tEnum );
     }
}

This works extremely well. It's blazing fast, and it allows me to have a
very clean and efficient API for building sets of (groups of) terms that I
can then use as the basis for a stable corpus lexicon for downstream
analysis. Also, this allows me to avoid running searches and to rely
only on IndexReader instances (maybe this is stupid? I don't have much
experience with Lucene's search aspects).

However, lacking a way to persist the automata that define a lexicon, I
have to build the set from scratch every time, and I'm not totally
comfortable doing that. Hence the above question, since I would assume that
automata are more or less straightforward to serialize, given that they are
nothing but int arrays, right? But I am not much of an FSA expert, and I'm not
very familiar with the implementation details of Lucene's automata.
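
To make the "nothing but int arrays" idea concrete, here is a minimal sketch of a round trip through a byte array. It does not use Lucene: FlatAutomaton is a hypothetical stand-in that stores one accept flag per state and each transition as four ints (source, dest, min, max), with state 0 assumed to be the initial state. With Lucene's own Automaton, the same data could presumably be read with getNumStates(), isAccept(), initTransition() and getNextTransition(), and rebuilt through Automaton.Builder; that mapping is an assumption to check against the javadoc of the Lucene version in use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for an automaton; NOT Lucene's Automaton class.
final class FlatAutomaton {
    final boolean[] accept;   // accept[s] is true when state s is accepting
    final int[] transitions;  // groups of four ints: source, dest, min, max

    FlatAutomaton(boolean[] accept, int[] transitions) {
        if (transitions.length % 4 != 0) {
            throw new IllegalArgumentException("transitions must come in groups of four");
        }
        this.accept = accept;
        this.transitions = transitions;
    }

    // Runs the automaton from state 0, taking the first matching transition
    // (so it assumes the automaton is deterministic).
    boolean accepts(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            int next = -1;
            for (int t = 0; t < transitions.length; t += 4) {
                if (transitions[t] == state && transitions[t + 2] <= cp && cp <= transitions[t + 3]) {
                    next = transitions[t + 1];
                    break;
                }
            }
            if (next < 0) {
                return false;
            }
            state = next;
        }
        return accept[state];
    }

    // Writes the state count, the accept flags, then the raw transition ints.
    byte[] toBytes() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(accept.length);
            for (boolean a : accept) {
                out.writeBoolean(a);
            }
            out.writeInt(transitions.length);
            for (int v : transitions) {
                out.writeInt(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Reads the fields back in the order toBytes() wrote them.
    static FlatAutomaton fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            boolean[] accept = new boolean[in.readInt()];
            for (int i = 0; i < accept.length; i++) {
                accept[i] = in.readBoolean();
            }
            int[] transitions = new int[in.readInt()];
            for (int i = 0; i < transitions.length; i++) {
                transitions[i] = in.readInt();
            }
            return new FlatAutomaton(accept, transitions);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Accepts "ab" followed by any number of lowercase ASCII letters.
        FlatAutomaton original = new FlatAutomaton(
            new boolean[] { false, false, true },
            new int[] { 0, 1, 'a', 'a', 1, 2, 'b', 'b', 2, 2, 'a', 'z' });
        FlatAutomaton restored = fromBytes(original.toBytes());
        System.out.println(restored.accepts("abc"));
        System.out.println(restored.accepts("ba"));
    }
}
```

A real format would likely also want a version header, so that files written by one release can be rejected or migrated by a later one.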

Any tips would be greatly appreciated :)

On Thu, Mar 24, 2016 at 12:08 PM, Erick Erickson <erickerickson@gmail.com>
wrote:

> I'm really out of my league here, but some of the suggester stuff
> builds an image on disk and some of the implementations use FSTs,
> which are at least in the ballpark.
>
> What I'm saying here is that the code may already be in place, or at
> least a place to start.
>
> And I have to ask, "why do you want to do this in the first place?".
> What is the problem you're trying to solve anyway?
>
> Best,
> Erick
>
> On Thu, Mar 24, 2016 at 6:57 AM, McKinley, James T
> <james.mckinley@cengage.com> wrote:
> > Here's an archive link from this mailing list regarding serializing
> > queries, I guess this would work for Automaton objects as well.
> >
> > http://mail-archives.apache.org/mod_mbox/lucene-java-user/201603.mbox/browser
> >
> > Hope it helps.
> >
> > Jim
> > ________________________________________
> > From: José Tomás Atria <jtatria@gmail.com>
> > Sent: 23 March 2016 19:09
> > To: java-user@lucene.apache.org
> > Subject: Persistence/Serialization of Automaton
> >
> > Hello!
> >
> > Is it possible to serialize Lucene's Automata? I see that the javadoc for
> > the original BRICS package indicates that instances of Automaton implement
> > Serializable, but this is not the case with the Automaton class in Lucene 5+.
> >
> > I assume it is possible, considering that an FSA is basically just a set of
> > states and transitions, but how would I go about (1) extracting that data
> > from an instance of Automaton and (2) recreating the original automaton
> > given a set of transitions and states as it would be possible to obtain
> > them from a live instance?
> >
> > Alternatively, maybe there is some other place where this is implemented?
> > How can I persist lucene's automata?
> >
> > thanks,
> > jta
> >
> > --
> > entia non sunt multiplicanda praeter necessitatem
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>


-- 
entia non sunt multiplicanda praeter necessitatem
