Researchers and students constantly face the same problem: it is almost impossible to read most, let alone all, of the newly published papers needed to stay informed of the latest progress, and when they work on a research project, the time spent reviewing the literature seems endless. The goal of this project is to design a domain-independent, automatic text extraction system to alleviate, if not entirely solve, this problem.
Without NLP tools at our disposal, we have scored the sentences of a given text both statistically and linguistically, and generated a summary comprising the most important ones. The program takes its input from a text file and writes the summary to another text file. The most daunting task was to devise an efficient scoring algorithm that would produce good results for a wide range of text types. The only way to arrive at one was to summarize texts manually and then examine the selected sentences for common traits, which were then translated into program logic.
Our program essentially works on the following principles:
a. WORD SCORING:
1. Stop Words: These are words so commonly used in the English language that hardly any text can be written without them. They therefore give no real indication of the textual theme, and have been neglected while scoring sentences. E.g. I, a, an, of, am, the, et cetera.
2. Cue Words: These are words typically used in the concluding sentences of a text, which makes sentences containing them crucial for any summary. Cue Words provide closure to a given matter, and have therefore been given prime importance while scoring sentences. E.g. thus, hence, summary, conclusion, et cetera.
3. Basic Dictionary Words: 850 words of the English language have been defined as the most frequently used words that add meaning to a sentence. These words form the backbone of our algorithm, and have been vital in the creation of a sensible summary. We have hence given these words moderate importance while scoring sentences.
4. Proper Nouns: Proper Nouns in most cases form the central theme of a given text. Although identifying proper nouns without the use of linguistic methods was difficult, we have been successful in identifying them in most cases. Proper Nouns provide semantics to the summary, and have therefore been given high importance while scoring sentences.
5. Keywords: The user has the option of generating a summary that contains a particular word, the keyword. Though this feature is greatly limited by the absence of NLP, we have tried our best to produce useful results.
6. Word Frequency: Once basic scores have been allotted to words, their final score is calculated on the basis of their frequency of occurrence in the document. Words that are repeated more often than others reflect the context of the text more strongly, and have therefore been given higher importance.
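The word-scoring rules above can be sketched roughly as follows. This is only an illustration, not the program's actual code: the numeric weights, the tiny word lists, the weight for words in no category, the rule that a capitalized word away from the start of a sentence counts as a proper noun, and the choice to multiply the basic score by the frequency are all assumptions made for the example. The keyword option is left out.

```python
import re
from collections import Counter

# Tiny stand-in word lists; the real program reads much larger lists.
STOP_WORDS = {"i", "a", "an", "of", "am", "the"}
CUE_WORDS = {"thus", "hence", "summary", "conclusion"}
BASIC_WORDS = {"make", "water", "work"}  # stand-in for the 850-word list

# Hypothetical weights: the text gives only a relative ordering.
CUE_WEIGHT = 3.0     # "prime importance"
PROPER_WEIGHT = 2.0  # "high importance"
BASIC_WEIGHT = 1.0   # "moderate importance"
OTHER_WEIGHT = 0.5   # assumption: words that fall in no category


def base_score(word, is_first_in_sentence):
    """Basic score of one word occurrence, before frequency is applied."""
    lower = word.lower()
    if lower in STOP_WORDS:
        return 0.0  # stop words are neglected
    if lower in CUE_WORDS:
        return CUE_WEIGHT
    # Assumed heuristic: capitalized and not sentence-initial => proper noun.
    if word[0].isupper() and not is_first_in_sentence:
        return PROPER_WEIGHT
    if lower in BASIC_WORDS:
        return BASIC_WEIGHT
    return OTHER_WEIGHT


def word_scores(text):
    """Map each lowercased word to its basic score times its frequency."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter()
    base = {}
    for sentence in sentences:
        for i, word in enumerate(re.findall(r"[A-Za-z]+", sentence)):
            key = word.lower()
            freq[key] += 1
            # Keep the highest basic score seen for this word.
            base[key] = max(base.get(key, 0.0), base_score(word, i == 0))
    return {word: base[word] * freq[word] for word in freq}


scores = word_scores("Thus the river Nile floods. The Nile brings water.")
```

In this sample text "Nile" is treated as a proper noun and occurs twice, so it ends up with the highest score, while "the" scores zero however often it appears.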
1. Primary Score: Using the above methods, a final word score is calculated, and the sum of the word scores in a sentence gives its sentence score. This gives long sentences a clear advantage over their shorter counterparts, which are not necessarily of lesser importance.
2. Final Score: By multiplying the score so obtained by the ratio “average length / current length”, the above drawback can be nullified to a large extent, and a final sentence score is obtained. The most noteworthy aspect has been the successful merger of frequency-based and definition-based categorisation of words into one efficient algorithm that generates as complete a summary as possible for a given sensible text.
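The two sentence-scoring steps above can be illustrated with a short sketch. It assumes that sentence length is measured in words, that sentences are split on terminal punctuation, and that the summary keeps the top-scoring sentences in their original order; the function names and the sample word scores are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? (an assumption of this sketch)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    return [w.lower() for w in re.findall(r"[A-Za-z]+", sentence)]


def score_sentences(text, word_score):
    """Return (sentence, primary score, final score) for each sentence."""
    sentences = split_sentences(text)
    token_lists = [tokenize(s) for s in sentences]
    lengths = [len(tokens) for tokens in token_lists if tokens]
    average = sum(lengths) / len(lengths) if lengths else 0.0
    results = []
    for sentence, tokens in zip(sentences, token_lists):
        # Primary score: sum of the word scores in the sentence.
        primary = sum(word_score.get(w, 0.0) for w in tokens)
        # Final score: primary * (average length / current length).
        final = primary * average / len(tokens) if tokens else 0.0
        results.append((sentence, primary, final))
    return results


def summarize(text, word_score, k):
    """Keep the k sentences with the highest final score, in text order."""
    scored = score_sentences(text, word_score)
    ranked = sorted(range(len(scored)), key=lambda i: scored[i][2], reverse=True)
    return [scored[i][0] for i in sorted(ranked[:k])]


sample = "Cats sleep. Cats sleep and dogs run around all day long."
sample_scores = {"cats": 2.0, "sleep": 1.0, "dogs": 1.0}  # hypothetical values
scored = score_sentences(sample, sample_scores)
summary = summarize(sample, sample_scores, 1)
```

Here the long second sentence has the larger primary score (4.0 against 3.0), but after the length correction the short first sentence ranks higher, which is the effect the ratio is meant to produce.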
1. The 850 Basic English Words, which are read from a file, have been sorted lexicographically, and Binary Search has been implemented on the sorted list, so that each lookup takes O(log n) time.
2. The entered text has been stored into two types of Data Structures:
a. Red Black Tree
b. Hash Table
A comparison has been drawn between the two.
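Both implementation notes above can be sketched in one short example. It is not the project's code: the binary search uses Python's standard `bisect` module, the hash table is a plain `dict`, and the tree is a left-leaning red-black tree (a simplified variant of the red-black tree) that supports insertion only. All function and class names are hypothetical.

```python
from bisect import bisect_left


def is_basic_word(sorted_words, word):
    """Binary search in a lexicographically sorted word list."""
    word = word.lower()
    i = bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word


RED, BLACK = True, False


class Node:
    def __init__(self, key):
        self.key = key
        self.count = 1
        self.left = None
        self.right = None
        self.color = RED  # new nodes are always inserted red


def is_red(node):
    return node is not None and node.color == RED


def rotate_left(h):
    x = h.right
    h.right = x.left
    x.left = h
    x.color = h.color
    h.color = RED
    return x


def rotate_right(h):
    x = h.left
    h.left = x.right
    x.right = h
    x.color = h.color
    h.color = RED
    return x


def _insert(h, key):
    if h is None:
        return Node(key)
    if key < h.key:
        h.left = _insert(h.left, key)
    elif key > h.key:
        h.right = _insert(h.right, key)
    else:
        h.count += 1  # word already stored: bump its frequency
    # Restore the left-leaning red-black invariants on the way back up.
    if is_red(h.right) and not is_red(h.left):
        h = rotate_left(h)
    if is_red(h.left) and is_red(h.left.left):
        h = rotate_right(h)
    if is_red(h.left) and is_red(h.right):
        h.color = RED
        h.left.color = BLACK
        h.right.color = BLACK
    return h


def tree_insert(root, key):
    root = _insert(root, key)
    root.color = BLACK  # the root is always black
    return root


def in_order(node):
    """Yield (word, count) pairs in sorted order."""
    if node is not None:
        yield from in_order(node.left)
        yield node.key, node.count
        yield from in_order(node.right)


def count_words(words):
    """Store the same word counts in a hash table and in the tree."""
    table, root = {}, None
    for w in words:
        table[w] = table.get(w, 0) + 1
        root = tree_insert(root, w)
    return table, root


words = "the nile floods the delta and the nile recedes".split()
table, root = count_words(words)
tree_items = list(in_order(root))
```

The hash table offers expected constant-time lookups but keeps no ordering, while the tree takes O(log n) per operation and yields the words in sorted order from a simple in-order traversal. With 850 entries, the binary search needs at most about ten comparisons per lookup.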