International Journal of Computer Science, Engineering and Applications (IJCSEA) Vol.2, No.2, April 2012
bilingual document at word level and then identify the different script forms before running an
individual OCR system.
In the context of Indian language document analysis, much of the literature is due to Pal and
Choudhari, including the automatic separation of text lines from multi-script documents using
features extracted from profiles and water reservoir concepts [1]. Santanu Choudhury, Gaurav
Harit, Shekar Madnani and R. B. Shet have proposed a method for identification of Indian
languages that combines a Gabor filter based technique with a direction distance histogram
classifier, considering Hindi, English, Malayalam, Bengali, Telugu and Urdu [2]. Chanda and Pal
have proposed an automatic technique for word-wise identification of Devnagari, English and
Urdu scripts from a single document [3]. Word-level script identification in bilingual documents
through discriminating features has been developed by B. V. Dhandra, Mallikarjun Hangarge,
Ravindra Hegadi and V. S. Malemath [4]. Vijaya and Padma have developed methods for English,
Hindi and Kannada script identification using discriminating features and top and bottom
profile based features [5]. B. V. Dhandra, H. Mallikarjun, Ravindra Hegadi and V. S. Malemath
developed a method for word-wise script identification from bilingual documents based on
morphological reconstruction (English, Hindi, Kannada) [6]. Prakash K. Aithal, Rajesh G.,
Dinesh U. Acharya, Krishnamoorthi M. and Subbareddy N. V. have proposed a method for text
line script identification in a multilingual document (English, Hindi, Kannada) [7].
This paper deals with word-wise script identification for Kannada, English and Hindi scripts
in documents from Karnataka and Uttar Pradesh. Script identification is based on features
extracted from the horizontal projection profile and the vertical projection profile of the
word segment. To discriminate Kannada, English and Hindi, two values are used: the mean of the
horizontal projection profile values lying between the first and second largest values, and the
value of the point immediately after whichever of these two largest values occurs earlier in
the horizontal projection profile.
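
The two profile-based values described above can be illustrated with a minimal sketch. It is one possible reading of the description, not the implementation used in this paper. It assumes a word segment is given as a binary image (a list of rows, with 1 for an ink pixel), and the function names `horizontal_profile`, `vertical_profile` and `peak_features` are hypothetical. Ties between equal profile values are broken in favour of the earlier row, which is also an assumption.

```python
def horizontal_profile(img):
    """Number of ink pixels in each row of a binary image (1 = ink)."""
    return [sum(row) for row in img]


def vertical_profile(img):
    """Number of ink pixels in each column of a binary image (1 = ink)."""
    return [sum(col) for col in zip(*img)]


def peak_features(hpp):
    """Return (mean of the values between the two largest profile values,
    value immediately after the earlier of the two largest values).

    Assumed interpretation of the text; ties go to the earlier row.
    """
    if len(hpp) < 2:
        raise ValueError("profile needs at least two rows")
    # Row indices sorted by profile value, largest first (stable for ties).
    order = sorted(range(len(hpp)), key=lambda i: hpp[i], reverse=True)
    first, second = order[0], order[1]
    earlier, later = min(first, second), max(first, second)
    between = hpp[earlier + 1:later]
    mean_between = sum(between) / len(between) if between else 0.0
    # earlier < later <= len(hpp) - 1, so this index is always valid.
    after_earlier = hpp[earlier + 1]
    return mean_between, after_earlier


# Small hand-made example (not data from the paper).
word = [
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
]
hpp = horizontal_profile(word)      # [4, 2, 2, 6, 1]
features = peak_features(hpp)       # (2.0, 2)
```

In this toy example the two largest row sums are 6 (row 3) and 4 (row 0), so the mean is taken over rows 1 and 2, and the "point immediately after" is row 1.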
Secondly, after calculating the above feature, we count the vertical strokes present in the word
in order to improve the accuracy of the result. After analysing all three scripts, we observe that
Hindi and English words contain vertical strokes (for example B, D, E, T, I, d, b, n, k, etc. in
English), whereas Kannada words do not. To detect vertical strokes, we first calculate the height
of every word using the horizontal projection; then, using the vertical projection, we find the
vertical strokes whose length equals the word height. These strokes are considered for feature
extraction.
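
The stroke-counting step can be sketched as follows, again as an illustration under stated assumptions and not as the exact procedure used here. The word is assumed to be a binary image (list of rows, 1 = ink). The names `word_height` and `count_full_height_strokes` are hypothetical. Treating a run of adjacent full-height columns as a single stroke is an assumption made to handle strokes wider than one pixel.

```python
def word_height(img):
    """Height of the word: span of rows with a non-zero horizontal projection."""
    hpp = [sum(row) for row in img]
    ink_rows = [i for i, v in enumerate(hpp) if v > 0]
    return ink_rows[-1] - ink_rows[0] + 1 if ink_rows else 0


def count_full_height_strokes(img):
    """Count vertical strokes whose length equals the word height.

    A column belongs to such a stroke when its vertical projection equals
    the word height. Adjacent full-height columns are merged into one
    stroke (an assumption, not stated in the text).
    """
    height = word_height(img)
    if height == 0:
        return 0
    vpp = [sum(col) for col in zip(*img)]
    strokes = 0
    in_stroke = False
    for value in vpp:
        full = value == height
        if full and not in_stroke:
            strokes += 1
        in_stroke = full
    return strokes


# Hand-made shapes (not data from the paper).
b_like = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
]
round_like = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
n_b = count_full_height_strokes(b_like)          # 1
n_round = count_full_height_strokes(round_like)  # 0
```

The shape resembling "b" has one column filled over the whole word height, while the rounded shape has none, which mirrors the distinction drawn above between scripts with and without full-height vertical strokes.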
2. RESEARCH METHODOLOGY
2.1 Discriminating Features of Hindi, English and Kannada
1. In English script, vertical strokes mostly appear on the left side of the character (for example
B, D, H, F, R, K, P, b, h, k, l), whereas in Hindi they appear on the right side of the characters,
as shown in Fig. 1.
Figure 1 Vertical strokes on the right side of the characters