International Journal of Advances in Computer Science and Its Applications
Author(s) : B ESWARA REDDY, K MUNIVELU REDDY
Most document clustering techniques are based on statistical analysis of a term, either a word or a phrase. Statistical analysis of term frequency captures the importance of a term within the document only. Thus, the underlying mining model should indicate terms that capture the semantics of the text. In this case, the mining model can capture terms that represent the concepts of a sentence, which leads to the discovery of the topic of the document. A new concept-based mining model focuses on web document clustering; the model consists of three components: a concept-based statistical analyzer, COG, and a concept extractor. The statistical analyzer analyzes terms at the sentence and document levels. The COG extracts the most important terms with respect to the meaning of the text. The concepts that have maximum weights are selected by the concept extractor. The similarity between documents is calculated using the concept-based document similarity measure; it is the combination of , and . The experimental results demonstrate an extensive comparison between the concept-based analysis and the statistical analysis.
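
For orientation, the purely statistical baseline that the abstract contrasts with concept-based analysis can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' model: the function names `term_frequencies`, `cosine_similarity`, and `top_weighted_terms` are hypothetical, cosine similarity over raw term counts stands in for "statistical analysis", and the last helper only mimics the idea of selecting the highest-weighted items from weights that are supplied by the caller.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase alphabetic word tokens in a document (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    shared = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def top_weighted_terms(weights, k):
    """Return the k terms with the largest weights; ties are broken alphabetically.

    The weights are an input here; how the paper's model computes them
    is not specified in this sketch.
    """
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

A frequency-only comparison of this kind treats two documents as similar whenever they share surface words, regardless of the role those words play in each sentence, which is the limitation the concept-based model is meant to address.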