lucene-dev mailing list archives

From "jian chen" <chenjian1...@gmail.com>
Subject Re: [jira] Updated: (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text
Date Wed, 21 Mar 2007 08:32:52 GMT
Hi, Mark,

Thanks a lot for your explanation. This code is very useful; it could even
live in a separate library for text extraction.

Again, thanks for taking the time to answer my question.

Jian

On 3/21/07, markharw00d <markharw00d@yahoo.co.uk> wrote:
>
> The Analyzer keeps a window of (by default) the last 300 documents.
> Every token created in these cached documents is stored for reference,
> and as new documents arrive their token sequences are examined to see
> whether any sequence has been seen before, in which case the analyzer
> does not emit those tokens. The sequence length is configurable, but I
> have found something like 10 to be a good value (passed to the
> constructor).
>
> If I were indexing this mailing list, for example, all of your content
> copied below would be removed automatically (because it occurred more
> than once within a 300-document window).
>
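To make the window behaviour concrete, here is a minimal sketch. It is not the NovelAnalyzer source, and it assumes a simpler technique than the one described in this thread: every run of seqLen consecutive tokens from the last windowDocs documents is kept in a set, and tokens of a new document that fall inside an already-seen run are dropped. The class name WindowedBoilerplateFilter, its method names, and the use of plain string lists instead of a Lucene TokenStream are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; not the NovelAnalyzer implementation. */
final class WindowedBoilerplateFilter {
    private final int windowDocs; // recent documents to remember (300 by default in the thread)
    private final int seqLen;     // length of a repeated run worth suppressing (about 10 in the thread)
    private final Deque<Set<String>> window = new ArrayDeque<>();   // token runs per remembered document
    private final Map<String, Integer> runCounts = new HashMap<>(); // run -> remembered documents containing it

    WindowedBoilerplateFilter(int windowDocs, int seqLen) {
        this.windowDocs = windowDocs;
        this.seqLen = seqLen;
    }

    /** Returns the tokens of one document minus any run already seen inside the window. */
    List<String> filter(List<String> tokens) {
        int n = tokens.size();
        boolean[] repeated = new boolean[n];
        Set<String> runs = new HashSet<>();
        for (int i = 0; i + seqLen <= n; i++) {
            // Assumes tokens never contain a newline, so it is a safe separator.
            String run = String.join("\n", tokens.subList(i, i + seqLen));
            runs.add(run);
            if (runCounts.containsKey(run)) {
                for (int j = i; j < i + seqLen; j++) {
                    repeated[j] = true;
                }
            }
        }
        // Remember this document, then evict the oldest one if the window is full.
        window.addLast(runs);
        for (String run : runs) {
            runCounts.merge(run, 1, Integer::sum);
        }
        if (window.size() > windowDocs) {
            for (String run : window.removeFirst()) {
                runCounts.computeIfPresent(run, (k, c) -> c == 1 ? null : c - 1);
            }
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (!repeated[i]) {
                out.add(tokens.get(i));
            }
        }
        return out;
    }
}
```

In this sketch, with a window of 300 and a run length of 10, the scenario of 100 documents with a notice, 50 without, and then one more with the notice still finds the notice inside the window, so it is dropped from the last document. The first document that contains a run always keeps it, because nothing has been seen yet.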
>
> >>My question is, would the Analyzer be able to remove the copyright notice in step 3)?
> In your example, "yes" - because it re-occurred within 300 documents.
>
> >>Could you provide some more insights how your algorithm works
>
> There are a number of optimizations to make it run fast that make the
> code trickier to read.
> The basis of it is:
> 1) A "tokens" map is maintained with a key for every unique term.
> 2) The value for each map entry is a list of ChainedTokens, each of
> which represents an occurrence of the term in a doc.
> 3) Each ChainedToken contains the current term plus a reference to the
> previous term in that document.
> 4) The analyzer periodically (i.e. not for every token) checks the tokens
> map for the current term and looks at all previous occurrences of this
> term, following the sequences of ChainedTokens looking for a common
> pattern.
> 5) As soon as a pattern looks like it is established and the analyzer is
> "onto something", it switches to a mode of concentrating solely on
> comparing the current sequence with a single likely previous sequence
> rather than testing ALL previous sequences as in step 4). If the
> repeated chain of tokens is longer than the desired sequence length,
> these tokens are not emitted as part of the output TokenStream.
> * Periodically the tokens map and ChainedToken occurrences are pruned to
> avoid bloating memory. As part of this exercise, "stop words" are also
> automatically identified and recorded to avoid the cost of chasing all
> occurrences (step 4) or recording occurrences for very common words.
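Steps 1) to 4) can be sketched roughly as follows. This is only an illustration of the data structure as described in the message, not the attached NovelAnalyzer.java: the wrapper class ChainedTokenSketch and the methods addDocument, sharedLength, and longestRepeat are hypothetical names, and the sketch checks every token rather than periodically and leaves out the single-candidate mode of step 5).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the "tokens" map and ChainedToken links. */
final class ChainedTokenSketch {
    /** One occurrence of a term, linked to the token before it in the same document (step 3). */
    static final class ChainedToken {
        final String term;
        final ChainedToken previous; // null for the first token of a document

        ChainedToken(String term, ChainedToken previous) {
            this.term = term;
            this.previous = previous;
        }
    }

    // Steps 1 and 2: one key per unique term, mapped to every recorded occurrence.
    private final Map<String, List<ChainedToken>> tokens = new HashMap<>();

    /** Records all tokens of a document and returns the last token of its chain. */
    ChainedToken addDocument(List<String> terms) {
        ChainedToken prev = null;
        for (String term : terms) {
            ChainedToken current = new ChainedToken(term, prev);
            tokens.computeIfAbsent(term, k -> new ArrayList<>()).add(current);
            prev = current;
        }
        return prev;
    }

    /** Counts how many tokens two chains have in common when walked backwards in step. */
    static int sharedLength(ChainedToken a, ChainedToken b) {
        int n = 0;
        while (a != null && b != null && a.term.equals(b.term)) {
            n++;
            a = a.previous;
            b = b.previous;
        }
        return n;
    }

    /** Step 4: longest backward match between this token and any other occurrence of its term. */
    int longestRepeat(ChainedToken current) {
        int best = 0;
        List<ChainedToken> occurrences =
                tokens.getOrDefault(current.term, Collections.<ChainedToken>emptyList());
        for (ChainedToken earlier : occurrences) {
            if (earlier != current) {
                best = Math.max(best, sharedLength(current, earlier));
            }
        }
        return best;
    }
}
```

A caller would compare the value returned by longestRepeat against the configured sequence length to decide whether to suppress the tokens. Note that this sketch also matches earlier occurrences inside the same document, which the real analyzer may or may not do.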
>
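The pruning note can be illustrated in a similar way. The message does not say how stop words are detected, so the sketch below assumes a plain occurrence-count threshold; that threshold, the class name StopWordPruner, and its methods are hypothetical and not taken from NovelAnalyzer.java.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration: very common terms stop being tracked after a pruning pass. */
final class StopWordPruner {
    private final int maxOccurrences; // assumed threshold; not a value from the thread
    private final Map<String, Integer> occurrenceCounts = new HashMap<>();
    private final Set<String> stopWords = new HashSet<>();

    StopWordPruner(int maxOccurrences) {
        this.maxOccurrences = maxOccurrences;
    }

    /** Returns true if the occurrence was recorded, false if the term is a known stop word. */
    boolean record(String term) {
        if (stopWords.contains(term)) {
            return false; // skip the cost of recording and later chasing this term
        }
        occurrenceCounts.merge(term, 1, Integer::sum);
        return true;
    }

    /** Periodic pass: promote over-frequent terms to stop words and free their entries. */
    void prune() {
        Iterator<Map.Entry<String, Integer>> it = occurrenceCounts.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> entry = it.next();
            if (entry.getValue() > maxOccurrences) {
                stopWords.add(entry.getKey());
                it.remove();
            }
        }
    }

    boolean isStopWord(String term) {
        return stopWords.contains(term);
    }

    int trackedTerms() {
        return occurrenceCounts.size();
    }
}
```

A real implementation would also need to drop occurrences that have fallen out of the document window; that part is left out here.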
> Glad you find it useful.
>
> Cheers,
> Mark
>
>
> jian chen wrote:
> > Also, how about this scenario:
> >
> > 1) The Analyzer processes 100 documents, each with a copyright notice
> > inside. I guess in this case the copyright notices will be removed
> > when indexing.
> >
> > 2) The Analyzer processes another 50 documents, each without any
> > copyright notice inside.
> >
> > 3) Then the Analyzer runs into a document that has a copyright notice
> > inside again.
> >
> > My question is, would the Analyzer be able to remove the copyright
> > notice in step 3)?
> >
> > Cheers,
> >
> > Jian
> >
> > On 3/20/07, jian chen <chenjian1227@gmail.com> wrote:
> >>
> >> Hi, Mark,
> >>
> >> Your program is very helpful. I am trying to understand your code, but
> >> it seems that would take longer than simply asking you some questions.
> >>
> >> 1) What is the sliding window used for? Is it that the Analyzer
> >> remembers the previously seen N tokens, where N is the window size?
> >>
> >> 2) As the Analyzer parses text, is it that patterns which occurred
> >> before (in the previous N-token window) are remembered, and any such
> >> pattern in the latest N-token window is recognized?
> >>
> >> Could you provide some more insight into how your algorithm removes
> >> duplicate snippets of text from many documents?
> >>
> >> Thanks, I really appreciate your help.
> >>
> >> Jian
> >>
> >>
> >> On 3/20/07, Mark Harwood (JIRA) <jira@apache.org> wrote:
> >> >
> >> >      [ https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> >> >
> >> > Mark Harwood updated LUCENE-725:
> >> > --------------------------------
> >> >
> >> >     Attachment: NovelAnalyzer.java
> >> >
> >> > Updated version can now process any number of documents and remove
> >> > "boilerplate" text tokens such as copyright notices etc.
> >> > New version automatically maintains only a sliding window of content
> >> > in which it searches for duplicate paragraphs, enabling it to process
> >> > unlimited numbers of documents.
> >> >
> >> > > NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text
> >> > > -------------------------------------------------------------------------------------------
> >> > >
> >> > >                 Key: LUCENE-725
> >> > >                 URL: https://issues.apache.org/jira/browse/LUCENE-725
> >> > >             Project: Lucene - Java
> >> > >          Issue Type: New Feature
> >> > >          Components: Analysis
> >> > >            Reporter: Mark Harwood
> >> > >            Priority: Minor
> >> > >         Attachments: NovelAnalyzer.java, NovelAnalyzer.java
> >> > >
> >> > >
> >> > > This is a class I have found to be useful for analyzing small (in the
> >> > > hundreds) collections of documents and removing any duplicate content
> >> > > such as standard disclaimers or repeated text in an exchange of emails.
> >> > > This has applications in sampling query results to identify key
> >> > > phrases, improving speed-reading of results with similar content (e.g.
> >> > > email threads/forum messages) or just removing duplicated noise from a
> >> > > search index.
> >> > > To be more generally useful it needs to scale to millions of documents,
> >> > > in which case an alternative implementation is required. See the notes
> >> > > in the Javadocs for this class for more discussion on this.
> >> >
> >> > --
> >> > This message is automatically generated by JIRA.
> >> > -
> >> > You can reply to this email to add a comment to the issue online.
> >> >
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >
> >> >
> >>
> >
>
