Information Retrieval (IR) systems help users locate relevant information in large document collections. While substantial research exists for major languages such as English, less-resourced languages like Amharic lack specialized IR tools. This project develops an Amharic text retrieval system based on the Vector Space Model (VSM), incorporating TF–IDF term weighting and cosine similarity to rank documents. Implemented in Python 3.6.2, the system includes tokenization, normalization, stop-word removal, and stemming tailored to Amharic morphology. Using 10 news articles from the Ethiopian Broadcasting Corporation (EBC) and Fana Broadcasting Corporation (FBC) (2009–2017), retrieval was evaluated with 3-, 6-, and 10-word queries using precision and recall. The system retrieves relevant documents, but morphological variation, synonymy, polysemy, and inconsistent spelling limit precision, underscoring the need for more advanced linguistic tools for Amharic IR.
Keywords: Information Retrieval, Amharic Language, Vector Space Model, Natural Language Processing.
1. Introduction
The rapid growth of ICT has generated vast digital data, much of which is unstructured. IR is key to retrieving information efficiently (Manning et al., 2008; Baeza-Yates & Ribeiro-Neto, 2011).
Amharic, Ethiopia’s official language, has growing digital content, yet IR systems are scarce and often English-focused, limiting access for Amharic users (Yacob, 2006; Gasser, 2011; Temtim, 2014).
Challenges for Amharic IR include:
- Morphological richness: Single roots generate many inflected forms.
- Orthographic variation: Multiple symbols for same sounds (e.g., ሀ/ሐ/ኀ).
- Dialectal variation: Regional differences in writing and speaking.
- Loanword inconsistencies: Imported words lack standard spelling.
This project develops a VSM-based system with preprocessing to rank Amharic documents by query similarity.
2. Review of Literature
2.1 Overview of Information Retrieval
IR systems represent, store, and access unstructured text. Core components:
- Indexing: Organizes documents for fast retrieval.
- Query Processing: Interprets user input.
- Retrieval & Ranking: Matches queries to documents using models like VSM (Manning et al., 2008).
VSM represents documents and queries as vectors; TF–IDF weights term importance; cosine similarity measures closeness (Salton, 1975; Sparck Jones, 1972).
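As a toy illustration of the TF–IDF weighting described above (hypothetical English tokens stand in for Amharic terms; the corpus and scores are illustrative only):

```python
import math

def tf_idf(term, doc, docs):
    """Raw term frequency times inverse document frequency (log form)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)  # document frequency
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

# Toy corpus: hypothetical tokens standing in for news articles.
docs = [["economy", "growth", "ethiopia"],
        ["economy", "trade"],
        ["sport", "growth"]]
```

A term appearing in few documents ("sport") receives a higher weight than one spread across the corpus ("growth"), which is what makes rare terms discriminative for ranking.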
2.2 IR in Under-Resourced Languages
Research has focused on major languages. African languages like Amharic face fewer tools and datasets (Gasser, 2011; De Pauw et al., 2010). Existing work targets stemming, stop-word removal, normalization, and cross-lingual IR (Temtim, 2014; Mulugeta, 2015).
2.3 Amharic Language Challenges
Amharic is morphologically rich, with a Ge’ez-derived script (34 base characters × 7 orders). Challenges include:
- Redundant characters (e.g., አ vs. ዐ)
- Compound word inconsistencies (e.g., ወጥቤት vs. ወጥ ቤት)
- Non-standard loanword spelling
- Frequent inflectional/derivational morphemes
Without preprocessing, semantic equivalents may be treated as distinct, reducing retrieval effectiveness (Abiyot, 2000; Yacob, 2006).
2.4 Related Works
- Amharic stemming algorithms for recall improvement (Gasser, 2011)
- Stop-word lists and normalization (Temtim, 2014)
- Cross-lingual IR (Mulugeta, 2015)
- Afaan Oromo IR approaches (Alemayehu, 2010)
This project builds on these by integrating preprocessing into a functional VSM-based IR system.
3. Methodology
3.1 Development Tools
Implemented in Python 3.6.2 on Windows, leveraging libraries like NLTK, gensim, and scikit-learn (Bird et al., 2009).
3.2 Corpus Preparation
Collected 10 Amharic news articles from EBC and FBC (2009–2017). Small-scale datasets are common due to limited digitized Amharic corpora (Temtim, 2014; Gasser, 2011).
3.3 Text Preprocessing
- Tokenization: Rule-based segmentation (Abate et al., 2005)
- Stop-word Removal: Removes common function words
- Stemming: Suffix-stripping reduces words to stems (e.g., መምህር, መምህራን → መምህር)
- Normalization: Maps orthographic variants to canonical forms (e.g., ሀ/ሐ/ኀ)
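The four preprocessing steps above can be sketched as a single pipeline. The normalization map, stop-word list, and suffix list below are small illustrative placeholders, not the project's actual linguistic resources:

```python
import re

# Hypothetical mapping of redundant Ethiopic characters to canonical forms.
NORMALIZE = {"ሐ": "ሀ", "ኀ": "ሀ", "ዐ": "አ"}
# Illustrative Amharic stop-words and plural/object suffixes.
STOP_WORDS = {"ነው", "እና"}
SUFFIXES = ["ኦች", "ዎች", "ን"]

def preprocess(text):
    # Tokenization: split on whitespace and Ethiopic punctuation.
    tokens = re.split(r"[\s።፡፣፤]+", text)
    result = []
    for tok in tokens:
        if not tok or tok in STOP_WORDS:
            continue  # stop-word removal
        # Normalization: collapse orthographic variants to one form.
        tok = "".join(NORMALIZE.get(ch, ch) for ch in tok)
        # Stemming: strip the first matching suffix, if any.
        for suf in SUFFIXES:
            if tok.endswith(suf) and len(tok) > len(suf) + 1:
                tok = tok[: -len(suf)]
                break
        result.append(tok)
    return result
```

After this pipeline, variants such as መጽሐፍ and መጽሀፍ map to the same index term, which is exactly what the orthographic-variation challenge in Section 2.3 requires.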
3.4 Indexing
Used an inverted index: vocabulary + posting files for mapping terms to documents (Baeza-Yates & Ribeiro-Neto, 2011).
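A minimal sketch of such an inverted index, mapping each vocabulary term to its posting list of document IDs:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        for term in set(tokens):  # one posting per (term, document)
            index[term].append(doc_id)
    return index
```

At query time, only the posting lists of the query terms need to be examined, rather than every document in the collection.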
3.5 Retrieval Model
VSM with TF–IDF weighting; cosine similarity computed as:
sim(d, q) = Σᵢ (wdi × wqi) / (√(Σᵢ wdi²) × √(Σᵢ wqi²))
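The similarity formula above translates directly into code; a minimal sketch over two weight vectors of equal length:

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between document and query weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_d * norm_q)
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, so ranking documents by this score orders them from most to least similar to the query.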
3.6 Evaluation Metrics
- Precision (P): Proportion of relevant retrieved documents
- Recall (R): Proportion of relevant documents retrieved
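Both metrics can be computed from the set of retrieved document IDs and the judged-relevant set; a minimal sketch:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query.

    retrieved: list of document IDs returned by the system.
    relevant:  set of document IDs judged relevant.
    """
    hits = [d for d in retrieved if d in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```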
4. Experiment and Results
Three queries of different lengths were tested:
- Query 1: 3 words
- Query 2: 6 words
- Query 3: 10 words
Queries represent typical user information needs. Retrieval performance was measured for each.
4.1 Results for 3-word Queries
The system retrieved 4 relevant and 6 non-relevant documents for this query; precision was low, largely due to morphological variation and synonymy.
| Doc # | Relevance | Recall | Precision |
|---|---|---|---|
| Doc1 | NR | 0.00 | 0.00 |
| Doc2 | NR | 0.00 | 0.00 |
| Doc3 | R | 0.25 | 0.33 |
| Doc4 | R | 0.50 | 0.50 |
| Doc5 | NR | 0.50 | 0.40 |
| Doc6 | NR | 0.50 | 0.33 |
| Doc7 | NR | 0.50 | 0.28 |
| Doc8 | NR | 0.50 | 0.25 |
| Doc9 | R | 0.75 | 0.33 |
| Doc10 | R | 1.00 | 0.40 |
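The recall and precision columns in the table above can be reproduced by scanning the ranked list and counting relevant hits at each rank; a minimal sketch (relevance labels taken from the table):

```python
def pr_at_ranks(relevance, total_relevant):
    """Recall and precision after each rank of a ranked result list.

    relevance: booleans, True where the document at that rank is relevant.
    total_relevant: number of relevant documents in the collection.
    """
    rows, hits = [], 0
    for rank, rel in enumerate(relevance, start=1):
        hits += rel
        rows.append((hits / total_relevant, hits / rank))  # (recall, precision)
    return rows

# Relevance pattern of the 3-word query results (NR/R column above).
labels = [False, False, True, True, False,
          False, False, False, True, True]
```

Precision falls every time a non-relevant document is encountered, while recall only ever rises, which is the pattern visible in the table.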
4.2 Results for 6-word Queries
The system retrieved 5 relevant documents; the longer query reduced ambiguity and improved precision at the top ranks.
| Doc # | Relevance | Recall | Precision |
|---|---|---|---|
| Doc1 | R | 0.20 | 1.00 |
| Doc2 | R | 0.40 | 1.00 |
| Doc3 | NR | 0.40 | 0.60 |
| Doc4 | NR | 0.40 | 0.50 |
| Doc5 | R | 0.60 | 0.60 |
| Doc6 | NR | 0.60 | 0.50 |
| Doc7 | NR | 0.60 | 0.42 |
| Doc8 | NR | 0.60 | 0.37 |
| Doc9 | R | 0.80 | 0.44 |
| Doc10 | R | 1.00 | 0.50 |
4.3 Results for 10-word Queries
Similar performance to 6-word queries; longer queries helped reduce non-relevant retrieval.
4.4 Average Performance
- Average Recall: 0.56
- Average Precision: 0.35
5. Discussion
The system demonstrates the feasibility of Amharic IR using VSM with TF–IDF weighting. Longer queries improved results by reducing ambiguity. The low precision stemmed from:
- Morphological inflection creating multiple forms for the same lemma
- Synonymy and polysemy confusing retrieval
- Orthographic variations
Future improvements:
- Morphological analyzers and lemmatizers
- Amharic synonym dictionaries or WordNet
- Embedding-based semantic IR (word2vec, BERT)
6. Conclusion and Recommendations
This project developed an Amharic IR system with tokenization, stop-word removal, stemming, and normalization. Experiments with 10 documents showed that relevant documents can be retrieved, but linguistic challenges limited overall performance.
Key Contributions
- Demonstrated feasibility of VSM-based Amharic IR
- Highlighted importance of preprocessing
- Provided recall and precision evaluation using real Amharic articles
Recommendations
- Develop larger Amharic corpora
- Integrate morphological analyzers and machine-learning stemming
- Explore semantic IR with embeddings
- Extend to multimedia sources (speech, OCR)
By: Getahun A.
References
Abate, S. T., et al. (2005). A Morphological Analyzer for Amharic. ACL Workshop on Computational Approaches to Semitic Languages.
Baeza-Yates, R., & Ribeiro-Neto, B. (2011). Modern Information Retrieval. Addison-Wesley.
Gasser, M. (2011). HornMorpho: Morphological Processing of Amharic. ACL Workshop on African NLP.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Salton, G. (1975). A Vector Space Model for Automatic Indexing. CACM, 18(11), 613–620.
Sparck Jones, K. (1972). A Statistical Interpretation of Term Specificity. Journal of Documentation, 28(1), 11–21.
Temtim, A. (2014). Development of Amharic IR Systems.
Yacob, W. (2006). Challenges of Amharic Information Processing.
