Arabic Character Recognition using Approximate Stroke Sequence
Professor Mohammed Zeki Khedher* & Dr. Gheith Abandah**
*Dept. of Electrical Engg khedher@ju.edu.jo
**Dept. of Computer Engg abandah@ju.edu.jo
1st June 2002
Abstract
Recognition of handwritten Arabic characters is addressed. A novel approach to Arabic character recognition is presented, based on a statistical analysis of a typical Arabic text. The results show that the sub-word, rather than the word, is the basic pictorial block of Arabic writing. The method of approximate stroke sequence is applied to the recognition of some Arabic characters in their stand-alone form. This method could be extended further for more accurate results. It is recommended that future research on Arabic OCR systems treat the sub-word, rather than the word, as the basic block.
1. Introduction
Automatic recognition of handwriting has become a mature discipline at the beginning of the 21st century. On-line systems are now available on handheld computers with acceptable performance. Off-line systems are less accurate than on-line systems. However, they are now good enough for specialized systems such as interpreting handwritten postal addresses on envelopes and reading currency amounts on bank checks (Plamondon 2000).
The recognition of Arabic characters is particularly difficult because segmentation is necessary even for printed text. To gain insight into the structure of the Arabic word, and to assess the problems facing those who work on Arabic OCR systems, a statistical analysis of typical Arabic text is needed. For this purpose, a reasonably sized sample of Arabic text was selected and analyzed. Based on the results of this analysis, a new procedure is suggested for building Arabic OCR systems. As a first step in the implementation of such systems, recognition of Arabic characters in their stand-alone form is addressed. The method of approximate stroke sequence matching is applied and the results are shown.
The paper first surveys previous work in the field. It then describes the main characteristics of Arabic writing, presents the importance of the sub-word structure of the Arabic word, gives the statistical results that demonstrate this phenomenon, and proposes a new procedure for Arabic OCR systems. The proposed method treats the sub-word as the basic block in the recognition of Arabic characters. The size of the sub-word should be treated as a decisive factor in choosing the method for recognizing the characters it contains. The method of approximate stroke sequence matching is described and then applied to an example in which an unknown character is compared with two standard characters. A text containing different shapes of Arabic characters was written by 48 different persons, and samples of the characters under test were copied for this study. Some results of applying this procedure to different characters are given. The paper discusses the results obtained and ends with some conclusions and suggestions for future work.
2. Previous Work in Arabic Character Recognition
Several good literature reviews have been published on various research topics in Arabic character recognition (Jambi, 1991; Ahmed, 1994; Amin, 1997). A recent comprehensive one is given by Ahmed (2000). Here we elaborate on some of the effort spent in this direction.
Classical optical off-line recognition of handwriting is composed of a pre-processing stage, character segmentation, feature extraction and a classification stage (Casey 1996). Pre-processing consists of several operations such as thresholding, noise removal, page orientation, line skew removal, line segmentation, word segmentation, and removal of pictures and figures. There is little difference in these processes between Arabic and Latin-alphabet-based languages (Abdulla, 1988; Mahmoud, 1991; Hussain & Cowel, 2000; Hussain & Zalik, 2000).
Work on isolated printed Arabic characters and numerals has taken many forms. The overall vertical level of the character relative to the baseline was studied (Talba 1987). The chain code, which describes the sequence of character strokes using 8 stroke directions, was adopted by the majority of researchers (Alshebeli, 1997). However, a hexagonally sampled procedure has also been reported (Khellah, 1994). Different methods were used for matching the unknown character against the standard characters.
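To make the idea of an 8-direction chain code concrete, the following is a minimal sketch and not the encoding used by any of the cited works. It assumes the common Freeman convention (0 = east, numbered counter-clockwise, y axis pointing up) and a stroke given as a list of grid points in which each point is an 8-neighbour of the previous one. The names `DIRECTIONS` and `chain_code` are hypothetical.

```python
# Assumed convention: Freeman 8-direction codes, 0 = east, counter-clockwise,
# with the y axis pointing up. The surveyed papers may number directions differently.
DIRECTIONS = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}


def chain_code(points):
    """Return the list of direction codes between consecutive stroke points."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        step = (x1 - x0, y1 - y0)
        if step not in DIRECTIONS:
            raise ValueError(f"points are not 8-neighbours: step {step}")
        codes.append(DIRECTIONS[step])
    return codes


# A short stroke: right, up-right, up, then left.
stroke = [(0, 0), (1, 0), (2, 1), (2, 2), (1, 2)]
print(chain_code(stroke))  # [0, 1, 2, 4]
```

Representing a stroke as a short sequence of small integers is what makes sequence matching between an unknown character and a standard character possible.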
Segmentation of characters is an important step in the recognition of cursive writing, whether handwritten or printed. There are three strategies for segmentation: the classical approach, in which segments are identified on the basis of “character-like” properties; the recognition-based strategy, in which the system searches the image for components that match classes in its alphabet; and the holistic method, in which the system seeks to recognise words as a whole, thus avoiding the need to segment into characters (Casey 1996). Different algorithms have been used to apply one or another of these methods, such as a hierarchical syntactic procedure (Haj-Hassan, 1990), quadratic discriminating functions (Udpa, 1992), and the moment invariant algorithm (El-Khaly, 1990). An accumulative invariant moment was used as an identifier in character recognition (El-Dabi, 1990), and segmentation of printed Arabic characters was even tried without the thinning process (El-Sheikh, 1988). Clustering techniques were chosen for classification (El-Desouky, 1992), as was tree representation for the description of various characters (Al-Waily, 1989; Saleh, 1994; Saleh, 1996). Tree representation and fuzzy constrained graph models, which tolerate large variations in writing styles, were also reported (Abuharba, 1994).
Recognition of different fonts of Arabic printed text was tried using pre-processing and structural feature extraction (Kavianifar, 1998). Parallel Arabic OCR systems were also proposed (Alherbish, 1997).
Hidden Markov models, which proved very successful in automatic speech recognition, were tried in an omnifont, open-vocabulary Arabic OCR system (Bazzi, 1999).
Work on a limited database of handwritten Arabic text was also tried. A system based on four types of basic features, namely end points, corners, strokes and branch points, gave reasonable results (Jambi, 1991).
Recognition of typewritten Arabic characters gave good results using external features such as character area ratio, n-th quadrant ratio, vertical line ratio, horizontal line ratio, number of upper edges, and other similar features (Al-Ohali, 1995).
Online character recognition uses the results of the feature extraction process in sequential form, which is called the chain code. Treatment of secondary characters (mainly the dots above and below the characters) is an integral part of the recognition process (El-Gwad, 1990).
Neural networks were used in some work (Said, 1998; Al-Kadi, 1995; Altuwaijri, 1995; Al-Sharaidah, 2000).
3. Main characteristics of Arabic Writing
Arabic text is written from right to left and is always cursive. The shape of an Arabic character changes according to its location in the word. An Arabic character has up to four different shapes; the shape of a character depends on the type of character to its right and its position within the word. Table 1 shows the Arabic character set in the four different shapes.
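The claim that a letter has up to four shapes can be checked against the Unicode character database, which lists separate presentation-form code points for each contextual shape. The sketch below is only an illustration and is not part of the method described in this paper. It assumes Python's standard `unicodedata` module and the Unicode naming pattern "ARABIC LETTER <name> <position> FORM". The helper name `contextual_forms` is hypothetical.

```python
import unicodedata

# Unicode names of the 28 basic letters (spellings follow the Unicode standard,
# which differ slightly from the transliterations used in this paper).
LETTERS = [
    "ALEF", "BEH", "TEH", "THEH", "JEEM", "HAH", "KHAH", "DAL", "THAL", "REH",
    "ZAIN", "SEEN", "SHEEN", "SAD", "DAD", "TAH", "ZAH", "AIN", "GHAIN", "FEH",
    "QAF", "KAF", "LAM", "MEEM", "NOON", "HEH", "WAW", "YEH",
]
POSITIONS = ["ISOLATED", "INITIAL", "MEDIAL", "FINAL"]


def contextual_forms(letter):
    """Map each position that exists for `letter` to its presentation-form character."""
    forms = {}
    for position in POSITIONS:
        try:
            forms[position] = unicodedata.lookup(
                f"ARABIC LETTER {letter} {position} FORM"
            )
        except KeyError:
            # No such presentation form: the letter never takes this shape.
            pass
    return forms


four_shapes = [name for name in LETTERS if len(contextual_forms(name)) == 4]
two_shapes = [name for name in LETTERS if len(contextual_forms(name)) == 2]
print(len(four_shapes), len(two_shapes))
print(two_shapes)
```

Letters that never join to the following letter have only an isolated and a final form in this database, so they show up in `two_shapes`.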
The Arabic character set is composed of 28 basic characters. Fifteen of them have dots and 13 are without dots. Dots above and below the characters play a major role in distinguishing characters that differ only in the number or location of their dots. Take, for example, the letters ب ت ث ي ن. In their middle form, all five letters are written the same way: ـبـ ـتـ ـثـ ـيـ ـنـ. They differ only in the number or location of the dots.
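The role of the dots can be illustrated with a small lookup. This is a hedged sketch rather than a recognition method: it assumes the shared middle-form body has already been identified, and that the number and position of the dots are supplied by some earlier stage. The names `DOTS` and `identify_by_dots` are hypothetical.

```python
# (number of dots, position relative to the body) for the five letters
# that share one middle-form body.
DOTS = {
    "ب": (1, "below"),
    "ت": (2, "above"),
    "ث": (3, "above"),
    "ي": (2, "below"),
    "ن": (1, "above"),
}


def identify_by_dots(count, position):
    """Return the letter with the given dot signature, or None if there is none."""
    for letter, signature in DOTS.items():
        if signature == (count, position):
            return letter
    return None


# The five signatures are all different, so the dots alone separate the letters.
print(len(set(DOTS.values())) == len(DOTS))  # True
print(identify_by_dots(3, "above"))
```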
There are four characters which may take the secondary character “Hamzah ء”. These are “Alif أ إ”, “Waw ؤ”, “Yaa ئ” and “Kaf ك”.
There are also some other secondary characters used above and below the characters to indicate vowels but we shall exclude them now from our discussions.
Arabic characters do not have fixed width or fixed size, even in printed form.
| Letter | Stand-alone | Initial | Middle | Final | Other shapes |
|---|---|---|---|---|---|
| Alef | ا | | | ـا | ى ـى |
| Ba’ | ب | بـ | ـبـ | ـب | |
| Ta’ | ت | تـ | ـتـ | ـت | ة ـة |
| Tha’ | ث | ثـ | ـثـ | ـث | |
| Jeem | ج | جـ | ـجـ | ـج | |
| H’a’ | ح | حـ | ـحـ | ـح | |
| Kha’ | خ | خـ | ـخـ | ـخ | |
| Dal | د | | | ـد | |
| Thal | ذ | | | ـذ | |
| Ra’ | ر | | | ـر | |
| Zai | ز | | | ـز | |
| Seen | س | سـ | ـسـ | ـس | |
| Sheen | ش | شـ | ـشـ | ـش | |
| Sad | ص | صـ | ـصـ | ـص | |
| Dhad | ض | ضـ | ـضـ | ـض | |
| Tta | ط | طـ | ـطـ | ـط | |
| Dha’ | ظ | ظـ | ـظـ | ـظ | |
| Ain | ع | عـ | ـعـ | ـع | |
| Ghain | غ | غـ | ـغـ | ـغ | |
| Fa’ | ف | فـ | ـفـ | ـف | |
| Qaf | ق | قـ | ـقـ | ـق | |
| Kaf | ك | كـ | ـكـ | ـك | |
| Lam | ل | لـ | ـلـ | ـل | |
| Meem | م | مـ | ـمـ | ـم | |
| Noon | ن | نـ | ـنـ | ـن | |
| Ha’ | ه | هـ | ـهـ | ـه | |
| Waw | و | | | ـو | |
| Ya’ | ي | يـ | ـيـ | ـي | |

Table 1: The Different Forms of the Arabic Alphabet
3.1 An important phenomenon in Arabic writing
Arabic writing is known to be cursive even in printed form. However, it differs from cursive English handwriting in that some characters can be connected from one side only. Of the 28 basic Arabic characters, six can be connected from the right side only, while the other 22 can be connected from both sides. These six characters are dal (د), raa (ر), waw (و), alef (ا), thal (ذ), and zay (ز). They have only two forms, the stand-alone form and the final form, whereas the rest of the characters can appear in any of four forms: initial, middle, final, and stand-alone. Consequently, an Arabic word may consist of one or more sub-words. A sub-word can be defined as the basic stand-alone pictorial block of Arabic writing. Any optical character recognition system for Arabic should treat the sub-word as the basic block for processing, whatever method it uses for preprocessing, segmentation, recognition, or classification.
This is because each sub-word is separated from the others by a space. Although the spaces between sub-words are usually shorter than those between successive words, each sub-word is still surrounded by space. Some sub-words consist of a single character in its stand-alone form, so their recognition needs no segmentation.
The shape of a letter depends on the location of the character in the sub-word rather than in the word: a character at the end of a sub-word has exactly the same shape as when it comes at the end of a full word. Take the example of the word رجال. It consists of 3 sub-words, containing 1, 2, and 1 characters respectively.

| Characters per sub-word | Sub-words | Sub-words % | Stand-alone characters | Initial characters | Middle characters | Final characters |
|---|---|---|---|---|---|---|
| 1 | 263,065 | 45.80% | 263,065 | 0 | 0 | 0 |
| 2 | 159,995 | 27.90% | 0 | 159,995 | 0 | 159,995 |
| 3 | 90,068 | 15.70% | 0 | 90,068 | 90,068 | 90,068 |
| 4 | 43,124 | 7.50% | 0 | 43,124 | 86,248 | 43,124 |
| 5 | 13,433 | 2.30% | 0 | 13,433 | 40,299 | 13,433 |
| 6 | 3,633 | 0.63% | 0 | 3,633 | 14,532 | 3,633 |
| 7 | 818 | 0.14% | 0 | 818 | 4,090 | 818 |
| 8 | 247 | 0.04% | 0 | 247 | 1,482 | 247 |
| Total | 574,383 | 100.00% | 263,065 | 313,318 | 236,719 | 313,318 |
| % characters | | | 23.4% | 27.8% | 21% | 27.8% |

Table 2: Sub-words and Shapes Statistics
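The decomposition of a word into sub-words described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the input is assumed to contain only the 28 basic letters (no hamzah-bearing variants, no taa marbuta, no vowel marks), and the names `NON_CONNECTING` and `split_subwords` are hypothetical.

```python
# The six letters that connect from the right side only: dal, thal, raa, zay, waw, alef.
# Assumption: the word contains only the 28 basic letters.
NON_CONNECTING = set("دذرزوا")


def split_subwords(word):
    """Split a word into sub-words; a sub-word ends after each non-connecting letter."""
    subwords = []
    current = ""
    for ch in word:
        current += ch
        if ch in NON_CONNECTING:
            subwords.append(current)
            current = ""
    if current:
        # The last letters form a sub-word even if they could connect further.
        subwords.append(current)
    return subwords


parts = split_subwords("رجال")
print([len(p) for p in parts])  # [1, 2, 1]
```

Working on the text level like this is only a way to predict the sub-word structure. In an OCR system the sub-words would be found from the spaces in the image.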
3.2 Test Sample and Results
To give a fair idea of the role of sub-words, a sample of Arabic text of about 1.4 MB was collected. It was randomly selected from old books, modern books, newspapers, and other sources available on the web. The statistics presented here give an idea of the structure of Arabic words in terms of sub-words and the four character shapes. Table 2 shows the analysis for this sample.
The sample consists of 262,647 words with 1,126,420 characters, so the average word length is 4.3 characters. The number of sub-words is 574,383, which means that on average there are 2.2 sub-words per word. The number of sub-words consisting of one character is 263,065, which is about 45.8% of the total number of sub-words (and 23.4% of the total number of characters). In optical character recognition, therefore, slightly less than one half of the sub-words need no segmentation at all, whether printed or handwritten. The number of sub-words that consist of two characters is 159,995, which is 27.9% of the total number of sub-words. This means that about 28% of the sub-words need segmentation into two characters only. The table also shows that the four character shapes occur with roughly equal frequency, with the middle form slightly less common: 23.4% stand-alone, 27.8% for each of the initial and final forms (whose counts must be equal), and 21% for the middle form.
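The averages and percentages quoted above follow directly from the reported totals, and the per-shape counts follow from the number of sub-words of each length. The sketch below reproduces this arithmetic. The totals are those stated in the text, while the small histogram used to exercise `shape_counts` is a made-up toy example, and the function name is hypothetical.

```python
# Totals as reported in the text.
WORDS = 262_647
CHARACTERS = 1_126_420
SUBWORDS = 574_383
SINGLE_CHAR_SUBWORDS = 263_065

avg_word_length = round(CHARACTERS / WORDS, 1)                      # 4.3
subwords_per_word = round(SUBWORDS / WORDS, 1)                      # 2.2
single_pct_of_subwords = round(100 * SINGLE_CHAR_SUBWORDS / SUBWORDS, 1)    # 45.8
single_pct_of_chars = round(100 * SINGLE_CHAR_SUBWORDS / CHARACTERS, 1)     # 23.4


def shape_counts(histogram):
    """Derive character-shape counts from {sub-word length: number of sub-words}.

    A sub-word of length 1 contributes one stand-alone character; a sub-word of
    length k >= 2 contributes one initial, one final and k - 2 middle characters.
    """
    stand_alone = histogram.get(1, 0)
    initial = sum(n for k, n in histogram.items() if k >= 2)
    middle = sum((k - 2) * n for k, n in histogram.items() if k >= 3)
    return {
        "stand_alone": stand_alone,
        "initial": initial,
        "middle": middle,
        "final": initial,  # every multi-character sub-word has exactly one of each
    }


# Toy histogram (not the paper's data): 4 sub-words of length 1, 3 of length 2, ...
toy = {1: 4, 2: 3, 3: 2, 4: 1}
print(shape_counts(toy))
```

The initial and final counts are equal by construction, which is why the text gives the same percentage for both.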
3.3 Proposal for a New Procedure for Recognition of Arabic Characters
According to the above discussion, the approach suggested here is to separate the text into three groups:
1. Sub-words consisting of one character that is in the stand-alone form. This is to be recognised directly without any segmentation.
2. Sub-words consisting of two characters. The first one is in the initial form and the second one is in the final form. This needs segmentation into two parts only. If the number of characters in the sub-word is known in advance, the segmentation task becomes easier.
3. Sub-words consisting of more than two characters. The first one is in the initial form, the last one in the final form, and the rest are in the middle form.
Figure 1 shows the flow diagram for this procedure.
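The three groups listed above can be expressed as a small dispatch step. The sketch below is an illustration only. It assumes the number of characters in a sub-word is already known or estimated, which the paper does not specify how to obtain, and the function names `recognition_group` and `expected_shapes` are hypothetical.

```python
def recognition_group(num_chars):
    """Return 1, 2 or 3 according to the three groups of the proposed procedure."""
    if num_chars < 1:
        raise ValueError("a sub-word contains at least one character")
    if num_chars == 1:
        return 1  # stand-alone character, recognised without segmentation
    if num_chars == 2:
        return 2  # one initial and one final character, a single cut
    return 3      # initial, one or more middle characters, final


def expected_shapes(num_chars):
    """List the character shapes, in writing order, of a sub-word of the given size."""
    if num_chars < 1:
        raise ValueError("a sub-word contains at least one character")
    if num_chars == 1:
        return ["stand-alone"]
    return ["initial"] + ["middle"] * (num_chars - 2) + ["final"]


for size in (1, 2, 4):
    print(size, recognition_group(size), expected_shapes(size))
```

Knowing the expected shape at each position narrows the set of standard characters against which each segment has to be matched.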