lucenenet-user mailing list archives

From "Kieran Logan" <kie...@roleconnect.com>
Subject [Lucene.Net] Suggestions on indexing and searching an unusual requirement
Date Fri, 30 Sep 2011 15:45:36 GMT
Hi All

I have a series of documents to index and search and could do with some
pointers on how best to achieve the desired results.

The process flow is relatively straightforward: a queue of documents already
exists. The application will pick the next document in a folder queue,
initially index that single document in a RAMDirectory, display the
document to the user for adjustment and, once the user selects 'Save',
commit the amended document to a Lucene FSDirectory index. (I'm
glossing over a few details here, and I'm aware of what needs to be done with
IndexWriter and the various indexes, locks etc.)

The document has various parts which will become fields, as follows:

Document ID

Title

Introduction

Paragraph1 ... ParagraphN. There may be anything from 1 to N paragraphs;
40 paragraphs is generally towards the maximum. Each paragraph
has a specific purpose and will have its own field for search purposes, i.e.
it may be required later to search paragraph 3 in all documents for a given
term. So, for example, legal precedents and cases which the document may
refer to will always be in paragraph 3 and only in paragraph 3.

Conclusion

Keyword(s): again, Keyword1 ... KeywordN
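
As a rough illustration of the field layout above, here is a minimal Python sketch using plain dictionaries rather than the Lucene.Net API. The function name build_fields and the field names (id, paragraph1, paragraph3 and so on) are assumptions made for this example only.

```python
def build_fields(doc_id, title, introduction, paragraphs, conclusion, keywords):
    """Map one document onto named fields (all names are hypothetical)."""
    fields = {"id": doc_id, "title": title, "introduction": introduction}
    # One field per paragraph so that, say, paragraph 3 can be searched alone.
    for number, text in enumerate(paragraphs, start=1):
        fields[f"paragraph{number}"] = text
    fields["conclusion"] = conclusion
    # Keywords are kept as a list here; one field per keyword would also work.
    fields["keywords"] = list(keywords)
    return fields


example = build_fields(
    "doc-1",
    "A title",
    "An introduction",
    ["First paragraph.", "Second paragraph.", "Precedents and cases."],
    "A conclusion",
    ["C#", "VB.Net"],
)
print(sorted(example))
```

Giving each paragraph its own named field is what would let a later query target paragraph 3 across all documents.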

Once the initial indexing to the RAMDirectory has been done, I would like to
show the user how many instances of each keyword term are contained in the
document in total (Title, Introduction, Paragraph(s), Conclusion), for
example C#(46), VB.Net(14), ASP(22), JQuery(11). It would also be really
useful, if feasible, to show other terms from the whole document which the
user could add to the Keywords or ignore, e.g. Microsoft(88), Google(109) etc.
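
The counting described above can be sketched outside Lucene.Net as follows. This is a minimal Python illustration, not the library's API: the tokenizer, the function names (tokenize, term_counts, keyword_report), the stopword set and the sample document are all assumptions. In particular, the regular expression is a hypothetical stand-in for an analyzer that keeps terms such as "C#" and "VB.Net" intact.

```python
import re
from collections import Counter

# Hypothetical tokenizer: keeps '#', '+' and '.' inside a term so that
# "C#" and "VB.Net" survive as single terms.
TOKEN_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9#+.]*")


def tokenize(text):
    # Strip sentence-ending full stops and lower-case every term.
    return [t.rstrip(".").lower() for t in TOKEN_RE.findall(text)]


def term_counts(fields):
    """Count every term across all field values of one document."""
    counts = Counter()
    for text in fields.values():
        counts.update(tokenize(text))
    return counts


def keyword_report(fields, keywords, stopwords=frozenset(), top_n=5):
    """Return (count per keyword, most frequent other terms)."""
    counts = term_counts(fields)
    wanted = {k.lower() for k in keywords}
    keyword_hits = {k: counts[k.lower()] for k in keywords}
    suggestions = [
        (term, n)
        for term, n in counts.most_common()
        if term not in wanted and term not in stopwords
    ][:top_n]
    return keyword_hits, suggestions


# Made-up sample document, for illustration only.
doc = {
    "title": "Moving from VB.Net to C#",
    "introduction": "C# and VB.Net both target the Microsoft runtime.",
    "paragraph1": "Microsoft ships C# tooling; Microsoft also maintains VB.Net.",
    "conclusion": "Prefer C# for new work.",
}
stop = {"the", "to", "for", "and", "from", "both", "also", "new"}
hits, suggestions = keyword_report(doc, ["C#", "VB.Net"], stopwords=stop)
print(hits)
print(suggestions)
```

The same counter serves both needs: keyword totals are looked up directly, and the remaining high-frequency terms become candidate keywords the user can accept or ignore.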

(I used developer terms rather than the actual application use case, in the
hope that everyone would be familiar with the examples.)

Bonus (great if anyone would like to answer but I'm reasonably ok with
this):

When searching the entire document for keywords, I will also be using synonyms,
so if, in the style of the example above, the keyword is "Java" and the
document (title, introduction, para1...n, conclusion) mentions "Java" twice,
"JavaBeans" once and "J2EE" six times, then the total count will show as
Java(9). (I have already developed a technical synonym dictionary rather than
using WordNet or alternatives, so I'm covered for creating the synonym terms.)
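
The synonym roll-up above amounts to mapping every variant to one canonical keyword before counting. A minimal Python sketch of that idea follows; it is independent of Lucene.Net, and the SYNONYMS table, the function name canonical_counts and the sample text are assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical synonym dictionary: lower-cased variant -> canonical keyword.
SYNONYMS = {"java": "Java", "javabeans": "Java", "j2ee": "Java"}


def canonical_counts(fields, synonyms):
    """Count occurrences per canonical keyword across all fields."""
    counts = Counter()
    for text in fields.values():
        for token in re.findall(r"[A-Za-z0-9]+", text):
            canonical = synonyms.get(token.lower())
            if canonical is not None:
                counts[canonical] += 1
    return counts


# Made-up document: "Java" twice, "JavaBeans" once, "J2EE" six times.
doc = {
    "title": "Java and J2EE",
    "introduction": "JavaBeans on J2EE",
    "paragraph1": "J2EE J2EE J2EE servers run Java",
    "conclusion": "J2EE",
}
totals = canonical_counts(doc, SYNONYMS)
print("Java(%d)" % totals["Java"])
```

Whole-token matching is used here so that "JavaBeans" counts once as a synonym rather than also matching as a substring hit for "Java".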

Many thanks in advance, and hopefully the example above is reasonably
self-explanatory.

Kieran Logan

