Arabic Text Categorization using Rocchio Model



Automatic text categorization is an important application of natural language processing. It is the process of assigning a document to one or more predefined categories based on its content. This research considers applying well-known techniques developed for classifying English text to Arabic, focusing on the Rocchio (centroid-based) technique. This technique uses centroids to define class boundaries, where the centroid of a class c is computed as the center of mass of its members. Arabic is a highly inflectional and derivational language, which makes text processing a complex task. In the proposed work, Arabic text is first preprocessed using tokenization and stemming techniques. The Rocchio algorithm is then adapted to classify Arabic documents. The implemented algorithm is evaluated on a corpus of actual documents. The results show that the adapted Rocchio algorithm is applicable to categorizing Arabic text. Micro-averaged recall, precision, and F-measure of 92.2%, 92.7%, and 92.1%, respectively, are achieved on a data set of 500 Arabic text documents covering five distinct categories.
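
The centroid-based classification step described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this work: the abstract does not specify the term weighting, stemmer, or similarity measure, so the sketch assumes length-normalized term-frequency vectors and cosine similarity. The function names (`tf_vector`, `train_centroids`, `classify`) and the transliterated toy tokens are hypothetical, and the input is assumed to be already tokenized and stemmed.

```python
import math
from collections import Counter, defaultdict


def tf_vector(tokens):
    """Return a unit-length term-frequency vector as a dict (assumed weighting)."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {t: v / norm for t, v in counts.items()} if norm else {}


def train_centroids(labeled_docs):
    """Compute each class centroid as the mean of its members' vectors."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter()
    for tokens, label in labeled_docs:
        sizes[label] += 1
        for term, weight in tf_vector(tokens).items():
            sums[label][term] += weight
    return {
        label: {t: w / sizes[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, centroids):
    """Assign the document to the class whose centroid is most similar."""
    vec = tf_vector(tokens)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy corpus: transliterated stems standing in for
# tokenized and stemmed Arabic documents.
training_docs = [
    (["kura", "fariq", "mubarah"], "sports"),
    (["fariq", "mubarah", "hadaf"], "sports"),
    (["suq", "bank", "siar"], "economy"),
    (["bank", "siar", "istithmar"], "economy"),
]

centroids = train_centroids(training_docs)
print(classify(["fariq", "hadaf"], centroids))
print(classify(["bank", "suq"], centroids))
```

Because each class is summarized by a single centroid, classifying a document requires only one similarity computation per category, which keeps the approach inexpensive for a small category set such as the five used here.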

