lucene-java-user mailing list archives

From "McGibbney, Lewis John" <Lewis.McGibb...@gcu.ac.uk>
Subject Keyword extraction from pdf to text
Date Tue, 30 Nov 2010 17:09:04 GMT
Hello list,

I am currently attempting to extract keywords from PDF documents. My aim is then to begin
constructing a domain ontology from the extracted words. I do not need to index anything at
this stage; I only wish to extract the keywords and write the output as plain text to a text
file. An example of input text from a PDF document is as follows:
________________________________
6.1.3 Calculating carbon dioxide emissions for the proposed dwelling
The second calculation involves establishing the carbon dioxide emissions
for the proposed dwelling (DER). To do this the values proposed for the
dwelling should be used in the methodology i.e. the U-values, air infiltration,
heating system, etc.
The exceptions to entering the dwelling specific values are:
a. it may be assumed that all glazing is orientated east/west;
b. average overshading may be assumed if not known. 'Very little' shading
should not be entered;
c. 2 sheltered sides should be assumed if not known. More than 2 sheltered
sides should not be entered;
d. where secondary heating is proposed, if a chimney or flue is present but
no appliance installed the worst case should be assumed i.e. a decorative
fuel-effect gas appliance with 20% efficiency. If there is no gas point, an
open fire with 37% efficiency should be assumed, burning solid mineral
fuel for dwellings outwith a smokeless zone and smokeless solid mineral
fuel for those that are within such a zone.
All other values can be varied, but before entering values into the
methodology, reference should be made to:
* the back-stop U-values identified in guidance to standard 6.2; and
* guidance on systems and equipment within standards 6.3 to 6.6.
________________________________
My requirements are as follows:

* drop stop words;
* pick up bigrams or n-grams such as "U-values", "super-insulated", "air infiltration", etc.;
* apply a lower-case filter.
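
As a rough illustration of the three requirements above, here is a minimal plain-Java sketch. It does not use Lucene or Tika and assumes the PDF text has already been extracted to a String. The class name KeywordSketch, the tiny stop list, and the tokenising regex are all assumptions made for illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class KeywordSketch {

    // Illustrative stop list; a real list would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "be", "etc", "for", "if", "in",
            "is", "it", "not", "of", "or", "should", "the", "this", "to"));

    // Lower-cases the text, splits on anything that is not a letter, digit
    // or hyphen (so "U-values" stays one token), and drops stop words.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}-]+")) {
            // Strip stray leading or trailing hyphens left by the split.
            String token = raw.replaceAll("^-+|-+$", "");
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // Joins each pair of adjacent surviving tokens into a candidate bigram.
    static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }
}
```

Calling KeywordSketch.tokens on a line of the sample text and passing the result to KeywordSketch.bigrams yields candidates such as "air infiltration". Because bigrams are formed after stop words and punctuation are removed, some pairs (for example "infiltration heating") span a comma and would need manual review.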

I have been using Lucene 3.0.1 with a custom filter to achieve the above, then using Luke
to pick out phrases and entities by inspecting the generated index. However, I found this
very time consuming. My intention is to pass the PDF document as input and receive the
filtered terms described above as output, which I can then use to manually construct my
ontology from entities and their relationships.

I previously posted this to the Tika list with no response, so I apologise if this is
not a question for the Lucene Java list. Can anyone suggest a possible solution to the problem?

Any help would be great ;0) Thanks

Lewis


