Information Retrieval (IR) systems help users locate relevant information in large document collections. While substantial research exists for major languages such as English, less-resourced languages like Amharic lack specialized IR tools. This project develops an Amharic text retrieval system based on the Vector Space Model (VSM), incorporating TF–IDF term weighting and cosine similarity to rank documents. Implemented in Python 3.6.2, the system includes tokenization, normalization, stop-word removal, and stemming tailored to Amharic morphology. Using 10 news articles from the Ethiopian Broadcasting Corporation (EBC) and Fana Broadcasting Corporation (FBC) (2009–2017), retrieval was evaluated with 3-, 6-, and 10-word queries using precision and recall. The system retrieves relevant documents, but morphological variation, synonymy, polysemy, and inconsistent spelling limit precision, underscoring the need for more advanced linguistic tools for Amharic IR.
Keywords: Information Retrieval, Amharic Language, Vector Space Model, Natural Language Processing.
1. Introduction
The rapid growth of ICT has generated vast digital data, much of which is unstructured. IR is key to retrieving information efficiently (Manning et al., 2008; Baeza-Yates & Ribeiro-Neto, 2011).
Amharic, Ethiopia’s official language, has growing digital content, yet IR systems are scarce and often English-focused, limiting access for Amharic users (Yacob, 2006; Gasser, 2011; Temtim, 2014).
Challenges for Amharic IR include:
- Morphological richness: Single roots generate many inflected forms.
- Orthographic variation: Multiple symbols for same sounds (e.g., ሀ/ሐ/ኀ).
- Dialectal variation: Regional differences in writing and speaking.
- Loanword inconsistencies: Imported words lack standard spelling.
This project develops a VSM-based system with preprocessing to rank Amharic documents by query similarity.
2. Review of Literature
2.1 Overview of Information Retrieval
IR systems represent, store, and access unstructured text. Core components:
- Indexing: Organizes documents for fast retrieval.
- Query Processing: Interprets user input.
- Retrieval & Ranking: Matches queries to documents using models like VSM (Manning et al., 2008).
VSM represents documents and queries as vectors; TF–IDF weights term importance; cosine similarity measures closeness (Salton, 1975; Sparck Jones, 1972).
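As a toy illustration of the TF–IDF weighting described above (hypothetical English tokens stand in for Amharic terms; the corpus and scores are illustrative only):

```python
import math

def tf_idf(term, doc, docs):
    """Raw term frequency times inverse document frequency (log form)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)  # document frequency
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

# Toy corpus: hypothetical tokens standing in for news articles.
docs = [["economy", "growth", "ethiopia"],
        ["economy", "trade"],
        ["sport", "growth"]]
```

A term appearing in few documents ("sport") receives a higher weight than one spread across the corpus ("growth"), which is what makes rare terms discriminative for ranking.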
2.2 IR in Under-Resourced Languages
Research has focused on major languages. African languages like Amharic face fewer tools and datasets (Gasser, 2011; De Pauw et al., 2010). Existing work targets stemming, stop-word removal, normalization, and cross-lingual IR (Temtim, 2014; Mulugeta, 2015).
2.3 Amharic Language Challenges
Amharic is morphologically rich, with a Ge’ez-derived script (34 base characters × 7 orders). Challenges include:
- Redundant characters (e.g., አ vs. ዐ)
- Compound word inconsistencies (e.g., ወጥቤት vs. ወጥ ቤት)
- Non-standard loanword spelling
- Frequent inflectional/derivational morphemes
Without preprocessing, semantic equivalents may be treated as distinct, reducing retrieval effectiveness (Abiyot, 2000; Yacob, 2006).
2.4 Related Works
- Amharic stemming algorithms for recall improvement (Gasser, 2011)
- Stop-word lists and normalization (Temtim, 2014)
- Cross-lingual IR (Mulugeta, 2015)
- Afaan Oromo IR approaches (Alemayehu, 2010)
This project builds on these by integrating preprocessing into a functional VSM-based IR system.
3. Methodology
3.1 Development Tools
Implemented in Python 3.6.2 on Windows, leveraging libraries like NLTK, gensim, and scikit-learn (Bird et al., 2009).
3.2 Corpus Preparation
Collected 10 Amharic news articles from EBC and FBC (2009–2017). Small-scale datasets are common due to limited digitized Amharic corpora (Temtim, 2014; Gasser, 2011).
3.3 Text Preprocessing
- Tokenization: Rule-based segmentation (Abate et al., 2005)
- Stop-word Removal: Removes common function words
- Stemming: Suffix-stripping reduces words to stems (e.g., መምህር, መምህራን → መምህር)
- Normalization: Maps orthographic variants to canonical forms (e.g., ሀ/ሐ/ኀ)
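The four preprocessing steps above can be sketched as a single pipeline. The normalization map, stop-word list, and suffix list below are small illustrative placeholders, not the project's actual linguistic resources:

```python
import re

# Hypothetical mapping of redundant Ethiopic characters to canonical forms.
NORMALIZE = {"ሐ": "ሀ", "ኀ": "ሀ", "ዐ": "አ"}
# Illustrative Amharic stop-words and plural/object suffixes.
STOP_WORDS = {"ነው", "እና"}
SUFFIXES = ["ኦች", "ዎች", "ን"]

def preprocess(text):
    # Tokenization: split on whitespace and Ethiopic punctuation.
    tokens = re.split(r"[\s።፡፣፤]+", text)
    result = []
    for tok in tokens:
        if not tok or tok in STOP_WORDS:
            continue  # stop-word removal
        # Normalization: collapse orthographic variants to one form.
        tok = "".join(NORMALIZE.get(ch, ch) for ch in tok)
        # Stemming: strip the first matching suffix, if any.
        for suf in SUFFIXES:
            if tok.endswith(suf) and len(tok) > len(suf) + 1:
                tok = tok[: -len(suf)]
                break
        result.append(tok)
    return result
```

After this pipeline, variants such as መጽሐፍ and መጽሀፍ map to the same index term, which is exactly what the orthographic-variation challenge in Section 2.3 requires.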
3.4 Indexing
Used an inverted index: vocabulary + posting files for mapping terms to documents (Baeza-Yates & Ribeiro-Neto, 2011).
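A minimal sketch of such an inverted index, mapping each vocabulary term to its posting list of document IDs:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        for term in set(tokens):  # one posting per (term, document)
            index[term].append(doc_id)
    return index
```

At query time, only the posting lists of the query terms need to be examined, rather than every document in the collection.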
3.5 Retrieval Model
VSM with TF–IDF weighting; cosine similarity computed as:
sim(d, q) = Σᵢ (wdi × wqi) / (√(Σᵢ wdi²) × √(Σᵢ wqi²))
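The similarity formula above translates directly into code; a minimal sketch over two weight vectors of equal length:

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between document and query weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_d * norm_q)
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, so ranking documents by this score orders them from most to least similar to the query.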
3.6 Evaluation Metrics
- Precision (P): Proportion of relevant retrieved documents
- Recall (R): Proportion of relevant documents retrieved
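Both metrics can be computed from the set of retrieved document IDs and the judged-relevant set; a minimal sketch:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query.

    retrieved: list of document IDs returned by the system.
    relevant:  set of document IDs judged relevant.
    """
    hits = [d for d in retrieved if d in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```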
4. Experiment and Results
Three queries of different lengths were tested:
- Query 1: 3 words
- Query 2: 6 words
- Query 3: 10 words
Queries represent typical user information needs. Retrieval performance was measured for each.
4.1 Results for 3-word Queries
The system retrieved 4 relevant and 6 non-relevant documents for this query; precision was low, largely due to morphological variation and synonymy.
| Doc # | Relevance | Recall | Precision |
|---|---|---|---|
| Doc1 | NR | 0.00 | 0.00 |
| Doc2 | NR | 0.00 | 0.00 |
| Doc3 | R | 0.25 | 0.33 |
| Doc4 | R | 0.50 | 0.50 |
| Doc5 | NR | 0.50 | 0.40 |
| Doc6 | NR | 0.50 | 0.33 |
| Doc7 | NR | 0.50 | 0.28 |
| Doc8 | NR | 0.50 | 0.25 |
| Doc9 | R | 0.75 | 0.33 |
| Doc10 | R | 1.00 | 0.40 |
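The recall and precision columns in the table above can be reproduced by scanning the ranked list and counting relevant hits at each rank; a minimal sketch (relevance labels taken from the table):

```python
def pr_at_ranks(relevance, total_relevant):
    """Recall and precision after each rank of a ranked result list.

    relevance: booleans, True where the document at that rank is relevant.
    total_relevant: number of relevant documents in the collection.
    """
    rows, hits = [], 0
    for rank, rel in enumerate(relevance, start=1):
        hits += rel
        rows.append((hits / total_relevant, hits / rank))  # (recall, precision)
    return rows

# Relevance pattern of the 3-word query results (NR/R column above).
labels = [False, False, True, True, False,
          False, False, False, True, True]
```

Precision falls every time a non-relevant document is encountered, while recall only ever rises, which is the pattern visible in the table.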
4.2 Results for 6-word Queries
The system retrieved 5 relevant documents; the longer query reduced ambiguity and improved precision at the top ranks.
| Doc # | Relevance | Recall | Precision |
|---|---|---|---|
| Doc1 | R | 0.20 | 1.00 |
| Doc2 | R | 0.40 | 1.00 |
| Doc3 | NR | 0.40 | 0.60 |
| Doc4 | NR | 0.40 | 0.50 |
| Doc5 | R | 0.60 | 0.60 |
| Doc6 | NR | 0.60 | 0.50 |
| Doc7 | NR | 0.60 | 0.42 |
| Doc8 | NR | 0.60 | 0.37 |
| Doc9 | R | 0.80 | 0.44 |
| Doc10 | R | 1.00 | 0.50 |
4.3 Results for 10-word Queries
Similar performance to 6-word queries; longer queries helped reduce non-relevant retrieval.
4.4 Average Performance
- Average Recall: 0.56
- Average Precision: 0.35
5. Discussion
The system demonstrates the feasibility of Amharic IR using VSM with TF–IDF weighting. Longer queries improved results by reducing ambiguity. The low precision stemmed from:
- Morphological inflection creating multiple forms for the same lemma
- Synonymy and polysemy confusing retrieval
- Orthographic variations
Future improvements:
- Morphological analyzers and lemmatizers
- Amharic synonym dictionaries or WordNet
- Embedding-based semantic IR (word2vec, BERT)
6. Conclusion and Recommendations
This project developed an Amharic IR system with tokenization, stop-word removal, stemming, and normalization. Experiments with 10 documents showed that relevant documents can be retrieved, but linguistic challenges limited overall performance.
Key Contributions
- Demonstrated feasibility of VSM-based Amharic IR
- Highlighted importance of preprocessing
- Provided recall and precision evaluation using real Amharic articles
Recommendations
- Develop larger Amharic corpora
- Integrate morphological analyzers and machine-learning stemming
- Explore semantic IR with embeddings
- Extend to multimedia sources (speech, OCR)
By: Getahun A.
References
Abate, S. T., et al. (2005). A Morphological Analyzer for Amharic. ACL Workshop on Computational Approaches to Semitic Languages.
Baeza-Yates, R., & Ribeiro-Neto, B. (2011). Modern Information Retrieval. Addison-Wesley.
Gasser, M. (2011). HornMorpho: Morphological Processing of Amharic. ACL Workshop on African NLP.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Salton, G. (1975). A Vector Space Model for Automatic Indexing. CACM, 18(11), 613–620.
Sparck Jones, K. (1972). A Statistical Interpretation of Term Specificity. Journal of Documentation, 28(1), 11–21.
Temtim, A. (2014). Development of Amharic IR Systems.
Yacob, W. (2006). Challenges of Amharic Information Processing.
