Open-access, peer-reviewed article

Spelled sign word recognition using key frame

2014; Institution of Engineering and Technology; Volume: 9; Issue: 5; Language: English

10.1049/iet-ipr.2012.0691

ISSN

1751-9667

Authors

Rajeshree S. Rokade, Dharmpal D. Doye

Topic(s)

Human Pose and Action Recognition


IET Image Processing, Volume 9, Issue 5, May 2015, pp. 381-388. Research Article, Free Access.

Spelled sign word recognition using key frame

Rajeshree S. Rokade (corresponding author) and Dharmpal D. Doye
SGGS Institute of Engineering and Technology, Vishnupuri, Nanded, Maharashtra, 431606, India

First published: 01 May 2015. https://doi.org/10.1049/iet-ipr.2012.0691

Abstract

In this study, the authors present a new system for sign language hand gesture recognition. Using video input, the system can recognise any spelled word or alphabetic sequence signed in American Sign Language. The three main steps in the recognition process are detection of the region of interest (the hands), detection of key frames and recognition of gestures from these key frames. The proposed segmentation algorithm distinguishes regions of interest from both uniform and non-uniform backgrounds with an efficiency of 95%. The proposed key frame detection algorithm achieves an efficiency of 96.50%. A rotation-invariant algorithm for feature extraction is additionally proposed, which provides an overall gesture recognition efficiency of 84.2%.

1 Introduction

A number of sign languages are used in India; however, in Indian schools for the deaf, American Sign Language (ASL) is the standard for instruction. To extend the power and reach of ASL, we have aimed to develop a real-time sign recognition system that enables signers to interact with non-signers without requiring an interpreter. Such a system would prove useful for situations in which regular speech and audio are infeasible or limited, such as scuba diving, floor trading, paramilitary engagements and so on.

Sign language recognition remains a challenging problem owing to the complexity of visual analysis and the rapidity of structural changes in signed gestures. Moreover, gestures are not universal; most countries have their own version or interpretation of a sign language. An efficient sign recognition system can therefore prove very helpful toward internationalising sign-based communications.

The first challenge in any type of gesture recognition process is segmentation. Many segmentation algorithms show high efficiency rates for a particular database of gestures; however, they quickly become less effective as video backgrounds degrade, or as hand positions deviate from the centre of captured frames. Peer et al.
[1], for example, proposed an RGB-based image segmentation technique that remains highly sensitive to changes in lighting conditions. Stergiopoulou and Papamarkos [2] proposed a YCbCr-based image segmentation technique that requires a high degree of background uniformity. Rokade et al. [3] applied a segmentation algorithm in which the hand position must be relatively stable, while other approaches [4] have required specialised gloves and particular background colours. None of these systems provides a proper means for real-time human-computer interaction. El-Sawah et al. [5] designed a prototype for three-dimensional (3D) hand tracking and dynamic gesture recognition. Other visual techniques commonly used for hand detection, such as those emphasising motion, edges and background subtraction [6, 7], fail to identify the location of the hand when background objects are moving. Hasan and Mishra [6] used the hue, saturation, value (HSV) colour model to extract a 'skin-like' hand region by estimating the parameter values for skin pigments; a Laplacian filter was then applied to detect hand edges. Stergiopoulou and Papamarkos [8] used the YCbCr colour model to segment the hand, and Ghobadi et al. [9] achieved segmentation by clustering image pixels. Lamberti and Camastra [10] used the hue, saturation, intensity (HSI) colour model to segment the bare hand, while Maraqa and Abu-Zaiter [7] used HSI on input from a specially coloured glove.

The second challenge in gesture recognition is differentiating sign gestures from the transitional movements between them. If the speed of gestures varies too much, the appearance of key frames may be too unpredictable to produce good analytical estimates. Bhuyan et al. [11] used both video object plane (VOP) and key VOP trajectories to improve dynamic recognition of hand gestures. Bhuyan et al. [12] extracted trajectory length, hand orientation, average velocity, minimum velocity and maximum velocity from gesture trajectories for gestural control of a robot. Yang et al. [13] proposed an algorithm for extraction and classification of 2D motion based on image-captured motion trajectories. Kim et al. [14] proposed an algorithm for dynamic hand gesture recognition using a cellular neural network (CNN) model with 3D receptive fields. Kim and Song [15] and Verma and Dev [16] used statistical modelling for gesture recognition. Other researchers have used the hidden Markov model (Murthy and Jadon [17]), Kalman filtering (Mo et al. [18]) and artificial neural networks (Maraqa and Abu-Zaiter [7]). Unfortunately, none of these techniques addresses the difficulty of discriminating sign language gestures from inter-gestural movements.

The inherent challenges of recognising sign language gestures can be summarised as follows:

- Recognition algorithms must work for a large, open set of words.
- Segmentation algorithms must differentiate hand movements from complex, non-uniform backgrounds.
- Segmentation algorithms must function independently of hand size and colour.
- Key frame detection algorithms must distinguish gesture and non-gesture frames in near real time.
- Gesture recognition must be rotation invariant to support practical implementations.

The remainder of this paper is organised as follows. In Section 2, the proposed hand segmentation algorithm is described. In Section 3, the proposed key frame detection algorithm is presented. The feature extraction algorithm is described in Section 4.
The experimental results are outlined in Section 5, and concluding remarks are provided in Section 6.

2 Hand segmentation algorithm

All video sequences contain both gesture and non-gesture frames. Example gestures for 'd', 'o', 'w' and 'n' are shown in Fig. 1.

Fig. 1: Gestures for 'd', 'o', 'w' and 'n', respectively, from left to right

A database of videos with complex and non-uniform backgrounds is used. Erosion operations are performed according to (1) (as described in [19]) to determine the boundaries of input image A. Here B is a flat, disk-shaped structuring element and Ø is the empty set:

A ⊖ B = {z | (B)z ∩ Ac = Ø}    (1)

where z is the translation parameter of image B over image A, ⊖ indicates the erosion and Ac is the complement of A. The erosion of image A by image B is the set of all structuring-element origin locations for which the translated B has no overlap with the background of A. In reconstructing the eroded image [20], bright regions surrounded by dark regions become smaller and dark regions surrounded by bright regions become larger. Furthermore, small bright spots in the images disappear, whereas small dark spots become larger. The effect is most marked where the image intensity changes rapidly. Based on (2), we then perform dilation [19] on the reconstructed image with structuring element B:

A ⊕ B = {z | (B̂)z ∩ A ≠ Ø}    (2)

where ⊕ represents the dilation and B̂ is the reflection of image B. The effect of the dilation operation is to set to the foreground colour those background pixels that have a neighbouring foreground pixel. Such pixels lie at the edges of white regions, and the foreground regions grow. The complement of the dilated image D is calculated as

Dc = {w | w ∉ D}    (3)

At this point, the image is converted to YCbCr. For human skin colour, the Cb range is 80-105 and the Cr range is 130-165 [3]. If the values of both Cb and Cr for a pixel fall in these ranges, the pixel is set to 0; otherwise, it is set to 1. These conversions yield a binary image. We assume that the hand region is larger than any other body part present in the background of a given image. If the background contains other body parts or objects of skin colour, the binary image may contain two or more objects. Connected components in the binary image are then labelled, and the largest component is retained as the hand. A binary image with a background body part and the corresponding segmented output image are shown in Fig. 2.

Fig. 2: (a) Binary image with a background body part; (b) segmented output image
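To make the pipeline just described concrete, a minimal Python/OpenCV sketch is given below. It assumes a 7 x 7 disk-shaped structuring element, replaces the full reconstruction step with a simple erode-then-dilate smoothing, and keeps the largest skin-coloured connected component as the hand; the function name segment_hand and these simplifications are illustrative choices, not the authors' exact implementation.

import cv2
import numpy as np

def segment_hand(frame_bgr):
    # Flat, disk-shaped structuring element (size chosen for illustration).
    se = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))

    # Erosion followed by dilation (in the spirit of (1)-(2)) suppresses small
    # bright spots before colour thresholding.
    cleaned = cv2.dilate(cv2.erode(frame_bgr, se), se)

    # Skin-colour thresholding in YCbCr: Cb in 80-105 and Cr in 130-165 [3].
    ycrcb = cv2.cvtColor(cleaned, cv2.COLOR_BGR2YCrCb)
    cr, cb = ycrcb[:, :, 1], ycrcb[:, :, 2]
    skin = ((cb >= 80) & (cb <= 105) & (cr >= 130) & (cr <= 165)).astype(np.uint8)

    # Label connected components and keep the largest one, assumed to be the hand.
    n_labels, labels = cv2.connectedComponents(skin)
    if n_labels <= 1:
        return np.zeros_like(skin)
    sizes = [np.count_nonzero(labels == i) for i in range(1, n_labels)]
    return (labels == 1 + int(np.argmax(sizes))).astype(np.uint8)

The returned mask plays the role of the segmented output shown in Fig. 2b.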
3 Key frame detection algorithm

Both gesture and non-gesture frames are present in the sign language video input. Distinguishing between these two types of frame is challenging because the exact number of frames required to sign a word cannot be predicted; moreover, the size and distance of the gesturing hand may vary greatly. We begin by resizing all frames to 150 x 150 pixels. Edge detection for the binary image is performed using an EX-OR operation on every pair of consecutive pixel values:

E(r, c) = I(r, c) XOR I(r, c + 1),  r = 1, ..., tw, c = 1, ..., tc - 1    (4)

where tw is the total number of rows and tc is the total number of columns. Suppose p alphabetic signs compose a spelled word. Although only a single frame is required to recognise each sign, a number of frames are required to move from one alphabetic sign to another. The pixel-wise difference between all consecutive edge-detected frames is computed in the vector q:

q(k) = Σr Σc |Ek(r, c) - Ek+1(r, c)|,  k = 1, 2, 3, ..., T - 1    (5)

where T is the total number of frames in the video. Next, we compute the vector Z (6), whose entries Z(i), i = 2, 3, ..., p, are the indices of the frames of maximum hand movement (the peaks of q) between one alphabetic sign and the next; in addition, Z(1) = 1 and Z(p + 1) = T - 1. As i runs from 2 to p, the first, second and subsequent frames of maximum hand movement are found in turn. The frame in the middle of two consecutive frames of maximum hand movement is a 'key frame'. When an alphabetic sign is made, the hand is nearly steady for a fraction of a second; maximum hand movement occurs only between signs. Using (7), we identify the key frames as the middle frames between two consecutive frames of maximum hand movement:

key_frame(i) = (Z(i) + Z(i + 1)) / 2,  i = 1, 2, 3, ..., p    (7)

These key frames are edge-detected frames in which a meaningful gesture or alphabetic sign is present. Gestures are recognised from the key frames detected using the algorithm outlined in Fig. 3.

Fig. 3: Key frame detection algorithm
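A Python/NumPy sketch of the key frame selection follows, under stated assumptions: the edge operator XORs each pixel with its right-hand neighbour, and, since (6) is not reproduced above, the frames of maximum movement are taken here as the p - 1 largest values of q kept in temporal order; the names xor_edges and detect_key_frames are illustrative.

import cv2
import numpy as np

def xor_edges(binary_frame):
    # Edge map from an EX-OR of each pixel with its right-hand neighbour (cf. (4)).
    b = binary_frame.astype(bool)
    edges = np.zeros_like(b)
    edges[:, :-1] = np.logical_xor(b[:, :-1], b[:, 1:])
    return edges

def detect_key_frames(hand_masks, p):
    # hand_masks: list of segmented binary hand images; p: number of signs.
    edges = [xor_edges(cv2.resize(m.astype(np.uint8), (150, 150),
                                  interpolation=cv2.INTER_NEAREST))
             for m in hand_masks]

    # q(k): pixel-wise difference between consecutive edge frames (cf. (5)).
    q = np.array([np.count_nonzero(edges[k] != edges[k + 1])
                  for k in range(len(edges) - 1)])

    # Frames of maximum hand movement: assumed here to be the p - 1 largest
    # values of q in temporal order, padded with the first and last indices.
    peaks = np.sort(np.argsort(q)[-(p - 1):]) if p > 1 else np.array([], dtype=int)
    z = np.concatenate(([0], peaks, [len(q) - 1]))

    # Key frames are the midpoints of consecutive maximum-movement frames (cf. (7)).
    return [int((z[i] + z[i + 1]) // 2) for i in range(len(z) - 1)]

detect_key_frames returns p frame indices, one per alphabetic sign, which are then passed to the feature extraction stage.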
4 Proposed feature extraction algorithm

In feature extraction, the coordinates of all boundary points are transformed from Cartesian to polar coordinates. The angle of the hand, in particular, makes a significant difference to the results; accordingly, our algorithm maintains rotational invariance for hand angles between -90° and +90°. First, we find C, the centre of the hand region, by taking the average of the x and y coordinates of all black (hand) pixels, as seen in the segmented image of Fig. 4a. Next, we find the middle point m of the wrist. The wrist portion is generally connected to the leftmost, rightmost or lowermost border of the image (where the remainder of the arm has been cropped). In Fig. 4a, the wrist connects to the lowermost border of the image (cropped for a vertically aligned arm); if the subject signs at a -90° rotation with the right hand, the wrist will be connected to the rightmost border. We then find the 'tilt angle' between the vertical line through point m and the line C-m. For Fig. 4b, the tilt angle is -10°.

Fig. 4: Image and tilt angle: (a) segmented image; (b) tilt angle

The steps of the proposed feature extraction algorithm are as follows:

1. Find the centre of the hand C by averaging the edge pixels.
2. Find the wrist middle point m.
3. Draw a vertical line through point m.
4. Draw a line through points C and m.
5. Compute the tilt angle (the angle between these two lines).
6. For each edge point, compute theta, the sum of the point's actual angle and the tilt angle, using (8).
7. Compute the distance between the point C and the edge point using (9).
8. Form the feature vector from each edge point's theta and a magnitude equal to its distance from C.

Only edge-detected key frames are used as input to the feature extraction algorithm. For the key frame shown in Fig. 5a, assume the tilt angle is 0°. For random points x1 to x10 from this key frame, C is found by taking the average of the x and y coordinates of all black pixels. Based on C, the image is divided into four quadrants. For all points x1 to x10, the Cartesian coordinates are converted to polar coordinates using (8) and (9) and the tilt angle, where x is the horizontal distance of a point from C and y is the vertical distance of the point from C. If the point lies in the first quadrant, both the x and y distances are treated as positive; in the second quadrant, the x distance is negative and the y distance positive; in the third quadrant, both are negative; and in the fourth quadrant, the x distance is positive and the y distance negative.

theta = tan⁻¹(y / x) + tilt angle    (8)

distance = √(x² + y²)    (9)

Fig. 5: Edge-detected key frames: (a) key frame; (b) random points x1 to x10 and centre C for the key frame

Table 1 shows the relevant features of all edge points x1 to x10. For C at (80, 82), the x and y distances of points x1 to x10 from C are shown in the fourth and fifth columns, and the theta and distance values are shown in the sixth and seventh columns. Note that points x2 and x3 lie on the same line, yet their distances from C differ; the same holds for points x7 and x8. Theta increases from 0 to +3.141 over the first and second quadrants, and from -3.141 to 0 over the third and fourth quadrants. The polar coordinate pattern of key frame points x1 to x10 is shown in Fig. 6a; the overall pattern of the key frame is shown in Fig. 6b.

Table 1. Features of points x1 to x10 for key frames

Point   X     Y     x from C   y from C   Theta    Distance
x1      130   82    50         0          0        50.000
x2      118   36    38         46         0.880    59.665
x3      142   8     62         74         0.873    96.540
x4      80    36    0          46         1.570    46.000
x5      55    54    -25        28         2.299    37.536
x6      33    82    -47        0          3.141    47.000
x7      65    101   -15        -19        -2.239   24.207
x8      37    135   -43        -53        -2.252   68.249
x9      80    102   0          -20        -1.570   20.000
x10     99    102   19         -20        -0.811   27.586

Fig. 6: Overall pattern of the key frame: (a) key frame pattern using points x1 to x10; (b) complete key frame pattern
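The centre, wrist point, tilt angle and (theta, distance) features can be sketched in Python as follows. The sketch assumes the wrist lies on the bottom border of the cropped frame (the paper also allows the left or right border for rotated hands), and the sign conventions simply reproduce the quadrant rules above; the function name polar_features is illustrative.

import numpy as np

def polar_features(edge_mask):
    # edge_mask: binary edge-detected key frame (non-zero pixels are edges).
    ys, xs = np.nonzero(edge_mask)

    # Centre C of the hand region: average of the edge-pixel coordinates.
    cx, cy = xs.mean(), ys.mean()

    # Wrist midpoint m, assumed here to lie on the bottom row of the frame.
    bottom = ys == ys.max()
    mx = xs[bottom].mean()

    # Tilt angle between the vertical line through m and the line C-m
    # (zero for an upright hand).
    tilt = np.arctan2(mx - cx, ys.max() - cy)

    # Theta = actual angle plus tilt (cf. (8)); distance from C (cf. (9)).
    dx = xs - cx
    dy = cy - ys                     # image rows grow downwards
    theta = np.arctan2(dy, dx) + tilt
    dist = np.hypot(dx, dy)
    return theta, dist

For the upright key frame of Fig. 5 (tilt angle 0°), this reproduces the theta and distance values of Table 1; for example, x1 at (130, 82) with C at (80, 82) gives theta 0 and distance 50.000.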
The number of edge points varies across key frames. Rather than using every edge point, only one theta value per division of width 0.05 is retained from the edge-detected key frame. In Fig. 6a, for example, the thetas of x2 and x3 are 0.880 and 0.873, respectively; instead of considering both, only the representative value 0.875 is kept, together with the distance of the point whose theta is closer to 0.875, and the remaining point distances are ignored. In doing so, important information may be lost. To avoid this loss, a thick border edge of three pixels is used, so that if the information of one pixel is lost it is likely to be compensated by information from a nearby pixel. More theta values may be used to achieve higher recognition rates; however, even if the theta division is made finer than 0.05, recognition does not significantly improve. We tested the number of features for each alphabetic sign using theta divisions between 0.01 and 0.1 in multiples of 0.01; the resulting feature counts for the 'A' to 'Z' alphabetic signs are shown in Table 2. As the theta division increased, the number of features decreased and ultimately degraded recognition efficiency. For divisions of 0.05, 0.04 and 0.03, the number of features was nearly the same; therefore, a theta division of 0.05 was established as the standard for an image size of 150 x 150 pixels. The features of the key frames from our 20 training video sequences were averaged and used as training features.

Table 2. Non-zero features for 'A' to 'Z' alphabetic signs

Sign  0.03  0.04  0.05  0.06  0.07  0.08  0.09  0.1
A     110   110   110   91    78    69    60    55
B     109   108   109   92    74    62    62    54
C     107   105   106   94    75    60    61    53
D     109   109   109   95    80    59    62    54
E     105   107   106   90    81    69    58    53
F     111   110   112   95    79    68    54    56
G     114   119   119   92    76    67    61    59
H     111   110   111   94    75    65    59    55
I     102   106   107   91    74    64    54    53
J     104   108   106   90    78    69    61    53
K     110   117   114   87    81    72    60    57
L     110   112   114   88    72    72    64    57
M     101   104   105   91    71    64    57    52
N     105   106   106   92    69    66    58    53
O     119   108   107   94    72    60    60    53
P     119   120   121   90    74    62    59    60
Q     112   108   109   89    75    61    56    54
R     109   108   109   91    78    60    54    54
S     112   111   115   90    81    64    57    57
T     110   111   111   84    82    66    62    55
U     104   105   108   89    75    68    60    54
V     117   119   119   91    79    68    57    59
W     120   125   120   92    80    70    62    60
X     119   100   112   95    81    71    60    56
Y     119   120   117   97    82    72    58    58
Z     115   111   110   91    81    68    59    55

In this way, different alphabetic gestures can be represented by distinct patterns of features (theta and distance). Gestures are recognised using (10), where n = 1, 2, 3, ..., N; N = 26; and m indexes all edge points in a frame or image.
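A minimal sketch of the quantisation and matching steps follows. The 0.05 division width is taken from the text above; because the matching formula (10) itself is not reproduced here, the classifier below uses a simple minimum mean absolute distance over shared theta bins as one plausible reading, and the names quantise_features and recognise are illustrative.

import math

def quantise_features(theta, dist, step=0.05):
    # One (theta, distance) pair is kept per theta division of width `step`;
    # the representative theta is the bin centre (0.880 and 0.873 both map to
    # 0.875), and the distance kept is that of the point closest to the centre.
    bins = {}
    for t, d in zip(theta, dist):
        centre = round(math.floor(t / step) * step + step / 2, 3)
        if centre not in bins or abs(t - centre) < abs(bins[centre][0] - centre):
            bins[centre] = (t, d)
    return {c: d for c, (t, d) in bins.items()}

def recognise(test, training):
    # training: dict mapping each of the N = 26 signs to its averaged
    # {theta_bin: distance} pattern over the 20 training sequences.
    # One plausible reading of (10): choose the sign whose stored distances
    # are, on average, closest to the test distances over shared bins.
    def score(pattern):
        shared = set(test) & set(pattern)
        if not shared:
            return float('inf')
        return sum(abs(test[b] - pattern[b]) for b in shared) / len(shared)
    return min(training, key=lambda sign: score(training[sign]))

quantise_features would be applied to the (theta, distance) arrays produced by the feature extraction step, and recognise compares the result with the stored 'A' to 'Z' training patterns.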
5 Experimental results

Videos containing signed words are typically recorded at 25 fps. We used a database of videos containing two to four alphabetic signs each, covering 100 frames per video. To measure the efficiency of the proposed segmentation algorithm, 100 images with complex backgrounds were tested. All images were first manually segmented, and the results of the segmentation algorithm were then compared with these manual segmentations; the algorithm achieved a 95% efficiency rate. Two hundred key frames were tested against our key frame detection algorithm, which achieved a 96.50% efficiency rate. The proposed gesture recognition system was verified using 22 ASL videos of words signed by two individuals. The features of 20 images of each alphabetic sign were averaged and used for training. Some groups of alphabetic signs (e.g. 'C', 'E', 'S', 'N', 'V' and 'K') are quite similar, and these similarities directly impeded the efficiency of the system; for the remaining signs, the efficiency rate was very high. The recognition results for all alphabetic signs are given in Table 3 in the form of a confusion matrix. The rows represent the expressed alphabetic signs, and each entry gives the number of times (out of ten test instances, i.e. in steps of 10%) the sign was recognised as the sign in that column. For instance, the 'C' sign was correctly recognised 50% of the time, but was mistaken for an 'A' 10% of the time and for an 'E' 40% of the time. Overall, the proposed algorithm achieved an 84.2% recognition rate, whereas the Bhuyan method [21] and the Norbert and Peter method [22] achieved 80.7% and 77.30%, respectively. When the background was moving, the Bhuyan method completely failed to detect the hand region, whereas the proposed algorithm readily detected it. The Norbert and Peter method showed low recognition rates for signs without raised fingers (e.g. 'C', 'E', 'O', 'S' and 'T'). A comparison of the recognition rates of the three methods is given in Table 4. Fig. 7 shows the recognition rates and computational times for signs 'A' to 'Z'. With the proposed system, the largest amount of time was required for segmentation; the overall time required for segmentation, key frame detection and feature extraction was nevertheless lower than for the other methods.

Table 3. Confusion matrix for all alphabetic signs (entries are counts out of ten test instances per sign)

    A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z
A   10  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
B   0   10  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
C   1   0   5   0   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
D   0   0   0   10  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
E   1   0   0   0   6   0   0   0   0   0   0   0   0   0   3   0   0   0   0   0   0   0   0   0   0   0
F   0   0   0   0   0   10  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
G   0   0   0   0   0   0   10  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
H   0   0   0   0   0   0   0   10  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
I   0   0   0   0   0   0   0   0   10  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
J   0   0   0   0   0   0   0   0   0   10  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
K   0   0   0   0   0   0   0   0   0   0   5   0   0   0   0   0   0   0   0   0   0   5   0   0   0   0
L   0   0   0   0   0   0   0   0   0   0   0   10  0   0   0   0   0   0   0   0   0   0   0   0   0   0
M   0   0   0   0   0   0   0   0   0   0   0   0   10  0   0   0   0   0   0   0   0   0   0   0   0   0
N   0   0   0   0   0   0   0   0   0   0   0   0   0   10  0   0   0   0   0   0   0   0   0   0   0   0
O   0   0   0   0   0   0   0   0   0   0   0   0   0   2   5   0   0   0   3   0   0   0   0   0   0   0
P   0   0   0   0   0   0   1   0   0   0   0   0   0   1   0   7   0   0   0   0   1   0   0   0   0   0
Q   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   4   6   0   0   0   0   0   0   0   0   0
R   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   8   0   0   2   0   0   0   0   0
S   3   0   0   0   0   0   0   0   0   0   0   0   0   4   0   0   0   0   3   0   0   0   0   0   0   0
T   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   9   0   0   0   0   0   0
U   0   0   0   1   0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   7   0   0   0   0   0
V   0   0   0   0   0   0   0   0   0   0   5   0   0   0   0   0   0   0   0   0   0   5   0   0   0   0
W   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   10  0   0   0
X   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   10  0   0
Y   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   10  0
Z   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   10

Table 4. Comparison of recognition rates

Gesture   Proposed method           Bhuyan's method [21]      Norbert and Peter's method [22]
          Recognition, %  Time, s   Recognition, %  Time, s   Recognition, %  Time, s
A         100             41        100             49        100             51
B         100             40        100             50        90              53
C         50              39        40              48        40              54
D         100             39        100             47        100             54
E         60              40        50              46        40              53
F         100             40        100             48        70              52
G         100             39        100             51        100             53
H         100             39        100             52        90              54
I         100             38        100             48        80              51
J         100             39        90              49        80              52
K         50              41        50              50        50              53
L         100             42        100             48        100             54
M         100             39        100             47        90              52
N         100             39        100             47        90              53
O         50              41        60              48        50              53
P         70              39        60              50        100             53
Q         60              38        60              51        80              53
R         80              41        70              50        60              52
S         30              39        40              48        20              52
T         90              40        80              50        60              53
U         70              39        60              50        90              52
V         50              42        50              45        40              53
W         100             40        100             48        100             51
X         100             40        100             46        90              51
Y         100             41        90              50        100             52
Z         100             39        100             50        100             53
Average   84.2%           39.7      80.7%           48.69     77.30%          52.64

Fig. 7: Recognition rate and computational times: (a) recognition rate for 'A' to 'Z' signs; (b) computational time for 'A' to 'Z' signs
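For completeness, the small sketch below shows the bookkeeping behind the quoted percentages, assuming each confusion-matrix row holds counts over ten test instances per sign; the function name and data layout are illustrative, not part of the paper.

import numpy as np

def per_sign_rates(confusion):
    # confusion[i, j]: number of times sign i was recognised as sign j.
    confusion = np.asarray(confusion, dtype=float)
    row_totals = confusion.sum(axis=1)              # ten test instances per sign
    rates = 100.0 * np.diag(confusion) / row_totals
    return rates, rates.mean()                      # per-sign %, unweighted mean

# Example: the 'C' row of Table 3 (1, 0, 5, 0, 4, 0, ...) gives 5/10 = 50%,
# with 10% confused as 'A' and 40% as 'E'.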
Further experiments were performed on frequently confused words formed from similar signs: 'case', 'cave', 'nest', 'down', 'rock' and 'xavi'. These words were expressed by two signers in 20 videos each, and the results are shown in Table 5. The proposed algorithm maintained its average recognition rate for these frequently confused words, whereas the performance of the other algorithms degraded noticeably. In addition, the computation times and recognition rates were the same for both subjects. The Bhuyan and Norbert/Peter algorithms tended to fail at recognising rotated hand gestures, even when initialisation or masking for segmentation was used (specifically in the Norbert and Peter method [22]). In contrast, the proposed algorithm handled segmentation effectively even when the background was in motion, a capability not possible with the other two algorithms [21, 22]. Furthermore, our key frame detection algorithm reduced computational time because it allowed feature extraction to be applied to fewer frames.

Table 5. Recognition rates for frequently confused words

Test word   Recognised as   Proposed method       Bhuyan's method       Norbert and Peter's method
                            Person 1   Person 2   Person 1   Person 2   Person 1   Person 2
case        Case            16         17         15         17         14         14
            Sase            1          1          2          1          1          1
            Nase            1          1          1          1          2          2
            Sace            1          0          1          1          1          2
            Sacs            1          0          0          0          2          0
            Saes            0          1          1          0          0          1
cave        Cave            17         17         16         17         15         14
            Save            1          0          1          1          1          2
            Nave            1          1          1          1          1          2
            Eave            0          1          1          1          1          2
            Cake            1          1          1          0          2          0
nest        Nest            16         15         16         15         13         15
            Sest            1          1          1          0          2          1
            Nent            0          1          1          2          1          1
            Nect            1          1          1          1          1          1
            Ncst            1          1          0          1          1          1
            Ncet            1          1          1          1          2          1
down        Down            16         18         15         15         16         14
            Nown            2          1          3          3          2          2
            Sown            1          1          1          2          1          2
            Dowc            1          0          1          0          1          2
rock        Rock            17         16         15         15         15         14
            Rnck            1          1          1          2          2          1
            Rsck            1          1          1          2          1          1
            Rock            1          1          2          1          1          1
            Rocv            0          1          0          0          1          2
            Rncv            1          0          1          0          0          1
xavi        Xavi            18         17         17         16         14         14
            Xaki            2          3          3          4          6          6
Percentage                  83.33      83.33      78.33      79.16      72.5       70.83
Average percentage          83.33%                78.74%                71.66%

All of these benefits, summarised below, highlight the novelty and value of the proposed system:

- The proposed system offers a single feature that directly corresponds to the location of the finger or palm.
- Its feature extraction is simple and fast.
- Its feature size is relatively small; therefore, its automatic classification is faster and more accurate.
- Its features are rotation-, size- and colour-invariant.
- It reliably recognises frequently confused words, even across different signers.

6 Conclusions

In this paper, we presented a novel system for sign language word recognition and demonstrated that the proposed system is faster and more robust than existing methods. The major contributions of our work are twofold. First, our hand segmentation method can handle time-varying illumination and achieves better performance than other segmentation methods. Second, our feature extraction algorithm is simpler yet more effective than other algorithms. The proposed work is easy to implement and can be efficiently incorporated in a gesture recognition system. Our future research will focus on extending hand detection to the recognition of short sentences.

7 References

1 Peer P., Kovac J., and Solina F.: 'Human skin colour clustering for face detection'. EUROCON - Int. Conf. on Computer as a Tool, August 2010, pp. 144-148
2 Stergiopoulou E., and Papamarkos N.: 'A new technique for hand gesture recognition'. Proc. IEEE Int. Conf. on Image Processing, Atlanta, October 2006, pp. 2657-2660
3 Rokade R., Doye D., and Kokare M.: 'Hand gesture recognition by thinning method'. Proc. IEEE Int. Conf. on Digital Image Processing, March 2009, pp. 284-287
4 Xu D.: 'A neural network approach for hand gesture recognition in virtual reality driving training system of SPG'. 18th Int. Conf. on Pattern Recognition (ICPR'06), 2006, pp. 519-522
5 El-Sawah A., Georganas N.D., and Petriu E.M.: 'A prototype for 3-D hand tracking and posture estimation', IEEE Trans. Instrum. Meas., 2008, 57, (8), pp. 1627-1636 (doi: 10.1109/TIM.2008.925725)
6 Hasan M.M., and Mishra P.K.: 'HSV brightness factor matching for gesture recognition system', Int. J. Image Process., 2010, 4, (5), pp. 456-467
7 Maraqa M., and Abu-Zaiter R.: 'Recognition of Arabic Sign Language (ArSL) using recurrent neural networks'. IEEE First Int. Conf. on the Applications of Digital Information and Web Technologies (ICADIWT), 2008, pp. 478-48
8 Stergiopoulou E., and Papamarkos N.: 'Hand gesture recognition using a neural network shape fitting technique', Eng. Appl. Artif. Intell., 2009, 22, (8), pp. 1141-1158 (doi: 10.1016/j.engappai.2009.03.008)
9 Ghobadi S.E., Loepprich O.E., Ahmadov F., Bernshausen J., Hartmann K., and Loffeld O.: 'Real time hand based robot control using multimodal images', Int. J. Comput. Sci., 2008, 35, (4), pp. 110-121
10 Lamberti L., and Camastra F.: 'Real-time hand gesture recognition using a colour glove' (Springer-Verlag, Berlin, Heidelberg, ICIAP, 2011), pp. 365-373
11 Bhuyan M.K., Ghosh D., and Bora P.K.: 'Finite state representation of hand gesture using key video object plane'. Proc. IEEE Region 10 Asia-Pacific Conf., 2004, pp. 579-582
12 Bhuyan M.K., Ghosh D., and Bora P.K.: 'Designing of human computer interactive platform for robotic applications'. TENCON IEEE Region 10, November 2005, pp. 1-5
13 Yang M.-H., Ahuja N., and Tabb M.: 'Extraction of 2D motion trajectories and its application to hand gesture recognition', IEEE Trans. Pattern Anal. Mach. Intell., 2002, 24, (8), pp. 1061-1074 (doi: 10.1109/TPAMI.2002.1023803)
14 Kim H.-J., Lee J.S., and Park J.-H.: 'Dynamic hand gesture recognition using a CNN model with 3D receptive fields'. IEEE Int. Conf. on Neural Networks and Signal Processing, Zhenjiang, China, 8-10 June 2008, pp. 14-19
15 Kim J., and Song M.: 'Three dimensional gesture recognition using PCA of stereo images and modified matching algorithm'. IEEE Fifth Int. Conf. on Fuzzy Systems and Knowledge Discovery (FSKD), Jinan, Shandong, 2008, pp. 116-120
16 Verma R., and Dev A.: 'Vision based hand gesture recognition using finite state machines and fuzzy logic'. IEEE Int. Conf. on Ultra Modern Telecommunications and Workshops (ICUMT), St. Petersburg, 2009, pp. 1-6
17 Murthy G.R.S., and Jadon R.S.: 'A review of vision based hand gestures recognition', Int. J. Inf. Technol. Knowl. Manage., 2009, 2, (2), pp. 405-410
18 Mo S., Cheng S., and Xing X.: 'Hand gesture segmentation based on improved Kalman filter and TSL skin colour model'. Int. Conf. on Multimedia Technology, Hangzhou, 2011, pp. 111-116
19 Gonzalez R.C., Woods R.E., and Eddins S.L.: 'Digital image processing using MATLAB' (Pearson Education, 2004), pp. 337-345
20 Guo H.: 'Image restoration with morphological erosion and exemplar-based texture synthesis'. Sixth Int. Conf. on Wireless Communications, Networking and Mobile Computing (WiCOM), 2010, pp. 1-4
21 Bhuyan M.K.: 'FSM-based recognition of dynamic hand gestures via gesture summarization using key video object planes', World Acad. Sci. Eng. Technol., 2012, 68, pp. 724-735
22 Norbert B., and Szolgay P.: 'Vision based human-machine interface via hand gestures'. ECCTD 2007, Seville, Spain, August 2007, pp. 496-499
