Adaptive scene‐text binarisation on images captured by smartphones
2016; Institution of Engineering and Technology; Volume 10, Issue 7; Language: English
DOI: 10.1049/iet-ipr.2015.0695
ISSN: 1751-9667
Authors: Amira Belhedi, Beatriz Marcotegui
Topic(s): Image Retrieval and Classification Techniques
IET Image Processing, Volume 10, Issue 7, pp. 515-523. Research Article.

Amira Belhedi (corresponding author, belhedi.amira@yahoo.fr) and Beatriz Marcotegui, MINES ParisTech, PSL Research University, CMM – Centre for Mathematical Morphology, 35 rue Saint Honoré, Fontainebleau, France. First published: 01 July 2016, https://doi.org/10.1049/iet-ipr.2015.0695.

Abstract

The authors present, in this study, a new adaptive binarisation method for images captured by smartphones. This work is part of an application for assisting visually impaired people, which aims at making text information accessible to people who cannot read it. The main advantage of the proposed method is that the windows underlying the local thresholding process are automatically adapted to the image content. This avoids the problematic parameter setting of local thresholding approaches, which is difficult to adapt to a heterogeneous database. The adaptive windows are extracted with the ultimate opening (a morphological operator) and then used as thresholding windows for a local Otsu's algorithm. The authors' method is evaluated and compared with the Niblack, Sauvola, Wolf, toggle mapping morphological segmentation (TMMS) and maximally stable extremal regions methods on a new challenging database introduced by them. Their database is acquired by visually impaired people in real conditions. It contains 4000 annotated characters (available online for research purposes). Experiments show that the proposed method outperforms classical binarisation methods on degraded images, such as low-contrasted or blurred images, which are very common in their application.

1 Introduction

Smartphones are opening new possibilities for enhancing users' view of the world, providing applications such as geo-localisation, augmented reality and so on. This has become possible thanks to the high sensor resolution and computational power of these devices. In the framework of the LINX project, we develop a smartphone application allowing visually impaired people to get access to textual information in their everyday life. A critical step for this project is to identify regions of interest in the images. One way to do this is to produce a binary image.
However, image binarisation in this context is hard: in addition to the absence of prior information on image content, the acquired images can be of low quality. Indeed, the acquisition conditions are not under control: since the pictures are taken by visually impaired people, several issues can arise, such as blur, noise, bad lighting conditions and so on.

Several works have been devoted to finding a relevant and efficient binarisation method. Some of them operate globally, applying the same threshold to the whole image. One of the best known in the literature is Otsu's algorithm [1]. Despite its performance on clean documents, it is not well suited to uneven illumination or to the presence of random noise. Other works operate locally, adapting the threshold to each image region. A popular method in the literature was proposed by Niblack [2]. It computes a pixel-wise threshold by gliding a rectangular window over the grey-level image. The threshold T for the centre pixel of the window is defined as

T = m + k · s     (1)

with m and s, respectively, the mean and the standard deviation of the grey values in the window, and k a negative constant. The main limitation of this method is the noise created in regions that do not contain any text, since a threshold is applied in those regions as well. Sauvola et al. [3] address this problem by normalising the standard deviation by its dynamic range. Their method outperforms Niblack's, except in the case of low-contrast text regions. A solution is proposed by Wolf et al. [4] to overcome this drawback: the threshold formula is changed in order to normalise the contrast and the mean grey level of the image. More recent local methods have also been proposed (e.g. [5, 6]). An interesting method that solves the limitations of Sauvola's algorithm is proposed by Lazzara and Géraud [7]. It is a multi-scale version of Sauvola's method and, according to its authors, it outperforms the previous ones. However, it is not robust to very blurred text regions: these can be entirely removed or only partially detected (as demonstrated in Section 4).
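For concreteness, here is a minimal sketch of the Niblack rule (1). This is our own illustrative code, not the implementation evaluated in Section 4, and the window size and k are precisely the parameters that, as discussed below, are hard to tune without prior knowledge of the text size:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def niblack_binarise(gray, window=40, k=-0.2):
    """Niblack's rule T = m + k*s over a sliding window.

    gray: 2-D uint8 array; returns a boolean mask, True for (dark) text pixels.
    """
    img = gray.astype(np.float64)
    m = uniform_filter(img, size=window)         # local mean of grey values
    m2 = uniform_filter(img ** 2, size=window)   # local mean of squared values
    s = np.sqrt(np.maximum(m2 - m ** 2, 0.0))    # local standard deviation
    return img <= m + k * s                      # threshold applied everywhere
```

Because a threshold is computed and applied in every window, flat non-text regions are binarised too, which is exactly the source of the false alarms mentioned above.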
Local approaches give better results compared with global ones, but often require more parameters to be tuned. The most difficult part is to find the optimal parameter values for the set of images to be processed. Adjusting those parameters with no prior knowledge of the image content is difficult; in particular, adjusting the window size parameter with no information about the text size is not possible. That is the main reason why we propose a new adaptive scene-text binarisation method that does not require a window size adjustment, since adaptive local windows (regions of interest) are automatically extracted from the image. Two further observations motivate this method: (i) the amount of false alarms created by existing methods in regions that do not contain any text and (ii) the missed detections they produce in blurred and low-contrasted text regions. Our approach is mainly based on a simple local Otsu's algorithm [1] performed on regions of interest automatically extracted from the image. The regions of interest are detected with the ultimate opening (UO) [8] weighted by the area stability of regions. The UO is a residual morphological operator that detects the regions with the highest contrast and has been used successfully for text detection [9].

In addition to the greyscale information used by the UO, the proposed method uses area stability information, derived from the maximally stable extremal regions (MSER) method, to favour the detection of regions with a more stable area. The MSER method [10-12] is commonly used for scene-text segmentation [13-18]. It detects regions that are stable over a range of thresholds. It performs well most of the time, but has problems on blurry images and on characters with very low contrast [13]. The area stability weight is introduced in order to avoid the detection of regions with great changes in area, which are probably related to the merging of different characters or to the presence of noise.

Our binarisation technique is the first step of the LINX processing chain. Further steps consist in characterising the extracted regions and classifying them into text or non-text regions. Therefore, our objective is to maximise the recall (the number of detected characters).

2 LINX dataset and ground truth generation

Some public text databases with ground truth are available. However, most of them provide ground truth at the level of words (e.g. the ICDAR'15 database [19]): they are suitable to evaluate text localisation, but not text binarisation. Other databases with character-level ground truth exist, such as DIBCO [20], EPITA [21], IIIT 5K-word [22] and so on, but they only contain text documents. In the LINX project, we are not limited to text documents. Therefore, we had to produce our own annotated dataset (from the LINX database).

2.1 LINX dataset

The dataset used is a subset of the LINX database. It contains 16 images acquired with smartphone cameras by visually impaired people. Some images are shown in Fig. 1. In spite of the reduced number of images, an important number of words is present: about 1200 words with 4000 characters. The content varies from text documents to products exposed in a supermarket, very blurred text, noisy regions, shadowed regions, highly saturated regions, small and large texts, light and dark texts and so on.

Fig. 1: Some images from the LINX dataset.

2.2 Ground truth generation

To validate our method (described in the following section) and to compare it with existing methods, it is crucial to generate an accurate ground truth. This is a challenging problem, especially for the blurred regions of the dataset. We therefore used a semi-automatic approach. For each image of the dataset, a manually selected global threshold that maximises the number of detected characters (i.e. maximising the recall) is first applied and a binary image is generated. After that, all split or merged characters are, respectively, manually connected or disconnected, and all falsely detected regions are manually deleted. We note that, for blurred regions, global thresholding does not work well. To solve that, different local thresholding methods [2-4] are applied to these regions and the best binarisation result, for each region, is selected. However, even after applying several local thresholding methods, some words are not correctly segmented into characters by any of them. We then chose to keep them as such and to add a flag (GTLevel) in the ground truth file indicating the ground truth level (if GTLevel = 1 it is a character-level annotation, otherwise it is a word-level one). Note that the overall procedure is applied on the image and on its inverse in order to detect both light and dark characters.
The results are then merged into the same binary image and another flag (textPolarity) is added in the generated ground truth (if textPolarity = 1 it is light text, otherwise it is dark text). Finally, rectangular bounding boxes are computed from each connected component of the obtained binary images, since we have chosen a rectangle-based validation (described in Section 4). Even if a pixel-wise validation seems a better approach, we gave up the idea as it requires an accurate pixel-level ground truth that is extremely difficult to obtain in practice and may favour the method used for its generation.

The generated ground truth contains 3913 rectangular bounding boxes: 3253 characters and 660 words; 2216 bounding boxes with dark text and 1697 with light text. Some ground truth rectangles, cropped from our dataset, are shown in Fig. 2. The LINX dataset and ground truth are available online [http://www.cmm.mines-paristech.fr/Projects/LINX].

Fig. 2: Ground truth rectangles cropped from our dataset. It is a very challenging dataset that varies from very large and high-contrast text regions to very small, noisy and blurred text regions. Its annotation was a hard task, especially for low-contrasted or blurred regions. As explained in the text, all local binarisation methods failed to segment some words into single letters; they were kept as such in the generated ground truth.

3 Adaptive scene-text binarisation

We propose a new adaptive scene-text binarisation that does not require a manual adjustment of the window size to the image content, and that reduces the number of false alarms. The proposed method performs a local Otsu's algorithm on adaptive windows that are automatically extracted from the image. The adaptive window extraction is mainly based on mathematical morphology. In this section, we first describe the dataset pre-processing, then we detail the adaptive window detection and finally we present the binarisation method based on adaptive windows.

3.1 Pre-processing

The dataset used in this study is a set of smartphone images, usually colour images. Given on the one hand that luminance contains the essential information for the binarisation process, and on the other hand that we need a computationally efficient method, we start by converting colour images to greyscale. For this purpose, we use the following conversion formula: Luma = 0.2126 × R + 0.7152 × G + 0.0722 × B. A pre-processing filtering step is also required to reduce image noise. We perform a bilateral filter [23] of size 3, with σgrey empirically fixed to 20.
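A minimal sketch of this pre-processing step using OpenCV. This is our own illustrative code: the Rec. 709 luma weights and the size-3 / σgrey = 20 settings are those stated above, while the spatial sigma of the bilateral filter is our assumption, as it is not reported here:

```python
import cv2
import numpy as np

def preprocess(bgr):
    """Convert a colour image to greyscale (Rec. 709 luma) and denoise it."""
    b, g, r = cv2.split(bgr.astype(np.float32))
    luma = 0.2126 * r + 0.7152 * g + 0.0722 * b   # Luma = 0.2126 R + 0.7152 G + 0.0722 B
    gray = np.clip(luma, 0, 255).astype(np.uint8)
    # Bilateral filter of size 3 with sigma_grey = 20; sigmaSpace is assumed.
    return cv2.bilateralFilter(gray, d=3, sigmaColor=20, sigmaSpace=3)
```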
3.2 Adaptive windows detection

In general, text contrasts with its background (light text on a dark background or dark text on a light background), otherwise it cannot be read. The proposed method is based on this property: it detects the high-contrast regions in the image and, obviously, in its inverse, based on the UO operator introduced by Beucher [8]. The UO is a morphological operator based on numerical residues that detects the highest-contrasted connected components in the image. The operator successively applies a series of openings γi (openings of size i) with structuring elements of increasing sizes i. Then, the residues between successive openings are computed, ri = γi − γi+1, and the maximum residue is kept for each pixel.

Thus, this operator has two significant outputs for each pixel x: R(I), which gives the value of the maximal residue (contrast information), called the transformation in the literature, and q(I), which indicates the size of the opening leading to this residue (the size of the structure containing the considered pixel), called the associated function:

R(I)(x) = max{ri(x), i ≥ 1},  q(I)(x) = imax + 1, with imax the index of the residue achieving this maximum (and q(I)(x) = 0 where R(I)(x) = 0)     (2)

The UO has been extended by Retornaz and Marcotegui [24] to use attribute openings [25] based on width, height and so on. The new definitions of the transformation R and the associated function q are obtained by replacing, in (2), γi by the considered attribute opening. In this case, the associated function q carries information linked with the considered attribute. An example illustrating the intermediate steps of the UO computation is shown in Fig. 3.

Fig. 3: UO computation step by step. The UO attribute used in this example is the height of the connected component. (a) Input image I. (b)-(e) Results of height openings of sizes 1, 2, 3 and 4. (f)-(h) Computed residues r1, r2 and r3. The opening of size 1 (γ1) does not change the image; γ2 removes one maximum and generates the first residue r1. The opening of size 3 (γ3) removes larger regions and generates the second residue r2. At the end, γ4 removes all regions and generates the residue r3. The last step of the UO computation consists in generating the two resulting images: (i) transformation R(I) and (j) associated function q(I). For each pixel, the maximum residue ri is selected and recorded in R(I), and the size of the opening leading to this residue is recorded in q(I). For example, the maximum residue of the pixel located in the third line of the last column (=4) was selected from r1, and the opening size leading to r1 is equal to 2.

For our application, we use the extension of the UO with the height attribute, since it is the best suited to text detection. We set the largest opening considered to one third of the image height, since characters are rarely taller; this choice avoids the artefacts that occur with larger opening sizes. Very small regions (area < 15) are also discarded in order to avoid useless processing. Note that the associated function q is not used in this study; we only use the transformation R.

We chose this morphological operator for several reasons. First, it highlights the regions with the highest contrast, which suits text detection. Second, it is a non-parametric multi-scale operator that does not require any prior knowledge of the image content. Finally, it can be computed in real time using the fast implementation based on the image max-tree representation [26].
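To make the operator concrete, here is a naive sketch of the (unweighted) UO. This is our own simplification: a vertical-line structuring element stands in for the true connected-component height attribute opening, and a real implementation would use the max-tree algorithm of [26]:

```python
import numpy as np
import cv2

def ultimate_opening(gray, max_size):
    """Naive ultimate opening: keep, per pixel, the maximum residue between
    successive openings (R) and the opening size that produced it (q).

    A vertical-line structuring element approximates a height opening.
    """
    prev = gray.astype(np.int32)                 # gamma_1 (size-1 opening = identity)
    R = np.zeros_like(prev)                      # transformation: maximal residue
    q = np.zeros_like(prev)                      # associated function: opening size
    for i in range(1, max_size):
        se = cv2.getStructuringElement(cv2.MORPH_RECT, (1, i + 1))
        cur = cv2.morphologyEx(gray, cv2.MORPH_OPEN, se).astype(np.int32)
        r = prev - cur                           # residue r_i = gamma_i - gamma_{i+1}
        update = r > R
        R[update] = r[update]
        q[update] = i + 1                        # size of the opening leading to r_i
        prev = cur
    return R, q
```

The area-stability weighting introduced next simply multiplies each residue ri by a per-structure factor αi before the maximum is taken.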
Despite its performance, this operator has the limitation of producing a lot of connected components, most of which do not correspond to characters. To reduce the number of false positives appearing with the UO, we propose to introduce a weighting function αi within the residue computation, αi being the region area stability inspired by the MSER method [10] (successfully used as a weighting function in [27]). The weighted residue is defined as

rαi(x) = αi(x) × ri(x)     (3)

with αi the weighting function computed, for each pixel x, as

αi(x) = areai(x)/areai+1(x)     (4)

with areai(x) the area of the structure (containing x) obtained from the opening γi. With this weighting, the residue of a connected component with low area stability is artificially reduced compared with that of a component with high area stability. The computation of the UO with the area stability weight is illustrated on simulated data in Fig. 4.

Fig. 4: Step-by-step computation of the UO weighted by area stability, on the same input image I as in Fig. 3. First, the weighting functions (area stabilities of the regions) α1, α2 and α3 are computed (a)-(c). Then, they are combined with the corresponding openings (Figs. 3b-e) to generate the weighted residues rα1, rα2 and rα3 (d)-(f). Finally, the outputs of the UO weighted by the area stability function, (g) Rα(I) and (h) qα(I), are deduced. Comparing Figs. 3i and 4g, we can observe that Rα contains less noise than R.

An example on real images is shown in Fig. 5c: R produces many spurious regions in the background (in red) that do not correspond to text regions. Thresholding R could remove these regions, but it might also remove low-contrast text regions. The use of the area stability weight avoids the detection of regions with important changes in area, which are probably related to the presence of noise, artefacts and contrast variations in the background (Fig. 5c), or to an unintended connection between components (see, e.g. the characters 'INE' in Fig. 5g). Another example of the obtained transformation Rα is shown in Fig. 5d: the number of false alarms is considerably reduced. We also observe that characters are better separated with Rα: the transformation R connects three characters ('INE') as shown in Fig. 5g, whereas Rα disconnects them (Fig. 5h). Note that the area stability weighting function is easily introduced in the max-tree-based UO implementation: in the tree, we only need to add to each node the area of its corresponding region and then, at each UO iteration, to weight the computed residue by the node's area stability value (see [26] for more details about tree-based UO computation).

Fig. 5: Comparison between the transformations obtained (c), (g) with the UO and (d), (h) with the UO weighted by the area stability function. The colour map of these images is modified for a better illustration (random grey levels are used to better distinguish the different residue values). The number of false alarms is reduced with the area stability weight and characters are better segmented. (a) Input image, (b) crop from I, (c) R, (d) Rα, (e) input image, (f) crop from I, (g) R, (h) Rα.

The transformation Rα (weighted by area stability) is thresholded with a low global threshold (set to 1) to remove the regions with very low residues (low contrast or low area stability), leading to the binary image Bα defined as

Bα(x) = 1 if Rα(I)(x) > 1, and 0 otherwise.

The binary image Bα is used to extract the adaptive windows for the next binarisation step. These adaptive windows correspond to the rectangular bounding boxes of the connected components of Bα. An example is shown in Fig. 6c.

Fig. 6: An example of (a) Rα > 1, (b) Bα, (c) adaptive windows: rectangular bounding boxes (in grey). Results obtained on the image shown in Fig. 5b.

The substantial improvement brought by the UO-based adaptive windows is shown in Section 4: when the fixed-size windows used by the Niblack, Sauvola and Wolf methods are replaced by these adaptive windows, better results are obtained.
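A minimal sketch of this window extraction step (our own illustrative code; R_alpha stands for the weighted transformation Rα computed above):

```python
import cv2
import numpy as np

def adaptive_windows(R_alpha, min_area=15):
    """Threshold the weighted UO transformation at 1 and return the bounding
    boxes of the connected components of B_alpha: the adaptive windows."""
    B_alpha = (R_alpha > 1).astype(np.uint8)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(B_alpha, connectivity=8)
    boxes = []
    for cc in range(1, n):                    # label 0 is the background
        x, y, w, h, area = stats[cc]
        if area >= min_area:                  # drop very small regions
            boxes.append((x, y, w, h))
    return boxes
```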
3.3 Binarisation

A binarisation process is applied to the original image within the adaptive windows defined in the previous section. In the simplest case, an adaptive window contains a single character on its background, and a simple Otsu thresholding performs well. More complex windows may correspond to a set of characters gathered with their containing support. This may happen because we have chosen a very low threshold (Rα > 1) in order to ensure the detection of low-contrast characters. An example of this situation is shown in Fig. 7. Applying an Otsu threshold on such an adaptive window does not detect the characters, but only the region containing them (see, e.g. Fig. 7d). A multi-level Otsu algorithm with three classes (one for the characters, a second one for the merged region and a third one for the background) is required in this case.

To detect this merging situation, we analyse Rα in each connected component (CC) of Bα, which we note RαCC. If the CC contains very different Rα values and the most common value (the mode) is significantly lower than the maximum value, we assume that the CC has merged significant regions. Thus, a merging situation is declared if the following two conditions are satisfied:

(i) mode ≤ max/2, with mode the value that appears most often in RαCC and max the maximum value of RαCC. If this condition holds, the CC contains regions with a contrast at least twice as high as that of the largest part of the CC (the mode). This is the first hint that significant regions are being missed.

(ii) modePercentage > 0.7, with modePercentage the percentage of pixels whose RαCC value equals the mode. This condition confirms that the low-contrasted region covers a significant part of the CC, which is generally the case when a low-contrasted region surrounds a significant area.

This process is illustrated in Fig. 7. The input is a crop from the original image corresponding to an adaptive window (Fig. 7a); the corresponding RαCC is shown in Fig. 7b. If Otsu's algorithm is performed in this window, the characters are not detected, as shown in Fig. 7d. Analysing RαCC, a merging situation is detected: the maximum value of RαCC is 22 (max = 22) and the characters are surrounded by a low-contrast region whose RαCC value is 8 (mode = 8). Thus, both merging conditions are satisfied, and the multi-level Otsu with three classes is performed in this adaptive window of the input image. The obtained result is shown in Fig. 7e: the word 'RENAULT' is well segmented. We observe that the character edges after the binarisation step are cleaner than in Bα (Fig. 6b) and that merged characters of Bα are correctly segmented.

Fig. 7: Example of the binarisation merging problem. (a) Crop from the original image (Fig. 5a); this crop corresponds to an adaptive window defined by the Bα of (c). The word ('RENAULT') is contained in a low-contrast region. (b) RαCC. (c) Bα. (d) Otsu's algorithm applied to (a): all characters are missing. The merging conditions are satisfied (mode = 8 ≤ max/2 = 22/2 = 11 and modePercentage = 0.8 > 0.7). (e) Multi-level Otsu with three classes: all characters are correctly segmented.
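A sketch of this per-window decision, assuming scikit-image's threshold_otsu and threshold_multiotsu are available; the function layout and the exact way the mode is computed are our own illustrative choices:

```python
import numpy as np
from skimage.filters import threshold_otsu, threshold_multiotsu

def binarise_window(window, R_alpha_cc):
    """Binarise one adaptive window: plain Otsu, or three-class multi-level
    Otsu when the merging conditions on the R_alpha values of the CC hold."""
    vals = R_alpha_cc[R_alpha_cc > 0]            # R_alpha values inside the CC
    values, counts = np.unique(vals, return_counts=True)
    mode = values[np.argmax(counts)]
    mode_pct = counts.max() / vals.size
    if mode <= vals.max() / 2 and mode_pct > 0.7:
        # Three classes: background, merged low-contrast region, characters.
        t = threshold_multiotsu(window, classes=3)
        return np.digitize(window, bins=t)       # labels 0, 1, 2
    return (window > threshold_otsu(window)).astype(np.uint8)
```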
3.4 Post-processing

A post-processing step is applied to the image obtained from the previous step in order to remove very small regions. We use a small area opening of size 15 (the image resolution is about 3200 × 2400). Fig. 8 shows an example of the final result. Comparing this final result with the Bα image (Fig. 6b), we can state that the binarisation step improves character detection.

Fig. 8: Binary image obtained with our method (input image shown in Fig. 5a).

4 Validation

In this section, we validate the performance of the proposed method and compare it with the best-known methods in the literature. The dataset used and its ground truth generation are detailed in Section 2. In the following, the experimental protocol is first presented and the obtained results are then discussed.

4.1 Evaluation protocol

The evaluation is performed by comparing a list G of ground truth bounding boxes Gi, i = 1..|G|, with a list D of binarised object bounding boxes Dj, j = 1..|D| (with |G| and |D| the number of bounding boxes in G and D, respectively). The rectangle-based evaluation method presented by Wolf and Jolion [28] is used. This choice is made for several reasons. First, it supports one-to-one, as well as one-to-many (splits) and many-to-one (merges) matches. Second, the precision and recall measures are computed at the object level by imposing quality constraints (recall constraint tr and precision constraint tp) on the matched rectangles. This gives a better estimation of false alarms and correct detections than the direct accumulation of rectangle overlaps. This is briefly explained in the following.

The matching between G and D rectangles is determined according to the conditions of the different matching types (one-to-one, split and merge) based on tr and tp. Then, for each Gi, the recall value ri is defined as

ri = 1 if Gi matches a single detected rectangle, ri = fsc(k) if Gi matches k detected rectangles (split), and ri = 0 otherwise,

and for each Dj, the precision value pj is defined as

pj = 1 if Dj matches a single ground truth rectangle, pj = fsc(k) if Dj matches k ground truth rectangles (merge), and pj = 0 otherwise,

with fsc(k) a function that controls the amount of punishment. In our experiments, merges and splits are severely punished by setting fsc(k) = 1/(1 + log(k)), which corresponds to the fragmentation metric introduced in [29]. The recall constraint tr is set to 0.7 and the precision constraint tp to 0.4 (values recommended by Wolf). Obviously, splits are not punished in the case of word-level annotations, i.e. if Gi matches several detected rectangles and its ground truth flag GTLevel (introduced in Section 2.2) is not equal to 1, then its recall ri is set to 1.

Another flag saved in the ground truth file is used for the recall and precision computation: the textPolarity flag. As mentioned above, the same image can contain dark and light texts. We therefore run each tested method twice, on the image and on its inverse. Then, for each Gi, we compute the recall value from the appropriate resulting image according to the textPolarity flag value, and we select its matching rectangle(s) from D for the precision computation. The false alarms, i.e. the Dj that do not match any Gi, must be taken into account in the precision computation. They can be selected from either of the two resulting images. In our experiments, we select them from the dominant polarity (based on the textPolarity flag). This choice seems appropriate and obviously does not influence the recall measure.
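A small sketch of how these per-rectangle scores aggregate into mean recall and precision (our own illustration; the gt_matches/det_matches inputs stand for the output of the Wolf-Jolion matching step, which is not reproduced here):

```python
import math

def f_sc(k):
    """Fragmentation penalty for one-to-many / many-to-one matches."""
    return 1.0 / (1.0 + math.log(k))

def object_recall_precision(gt_matches, det_matches):
    """gt_matches[i]  = number of detected rectangles matched by ground truth i,
    det_matches[j] = number of ground truth rectangles matched by detection j
    (0 means unmatched). Returns (mean recall, mean precision)."""
    r = [1.0 if k == 1 else (f_sc(k) if k > 1 else 0.0) for k in gt_matches]
    p = [1.0 if k == 1 else (f_sc(k) if k > 1 else 0.0) for k in det_matches]
    recall = sum(r) / len(r) if r else 0.0
    precision = sum(p) / len(p) if p else 0.0
    return recall, precision
```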
4.2 Results

For all the methods presented in this section, the images are pre- and post-processed as described in Sections 3.1 and 3.4. We first verify that the use of adaptive windows based on the UO, extracted from Bα, substantially enhances the results of local binarisation methods. For that, three methods from the literature that use fixed window sizes (Niblack [2], Sauvola et al. [3] and Wolf et al. [4]) are tested on our dataset. Note that these methods are better adapted to document binarisation purposes, even if they are frequently cited in scene-text localisation approaches too. We use the implementation provided by Wolf [30]. For each of them, the optimal k value recommended by its author is used, i.e. −0.2, 0.34 and 0.5 for, respectively, the Niblack, Sauvola and Wolf methods. Concerning the window size, we set it to 40 × 40 (default value of the distributed code [30]).

Applying the Niblack and Wolf methods on the adaptive windows Bα, instead of fixed ones, substantially enhances the results, mainly the mean precision (Table 1). It increases from 8.4 to 50.2% in the case of Niblack's algorithm, and from 41.9 to 52.4% in the case of Wolf's algorithm. The mean recall is also improved for both of them and exceeds 90% in the case of Wolf's algorithm. However, we do not observe an improvement of Sauvola's algorithm with adaptive windows. This is probably due to its main limitation: the missed detection of low-contrast regions.

Table 1. Mean recall (R), precision (P) and F-measure (F) comparison of fixed- and adaptive-window binarisation methods

                 Fixed window size         Adaptive windows
          R, %    P, %    F, %       R, %    P, %    F, %
Niblack   86.1    8.4     15.3       89.8    50.2    64.4
Sauvola   65.4    36.9    47.2       62.1    37.3    46.7
Wolf      67.7    41.9    51.7       93.0    52.4    67.0

We now compare our approach with three methods that do not require a window size adjustment to the image content: the multi-scale version of Sauvola's method presented by Lazzara and Géraud [7], the morphological algorithm based on the toggle mapping operator, TMMS [31] (ranked 2nd out of 43 in the DIBCO 2009 challenge [32]), and the MSER method [10] (the most cited for scene-text localisation [33]). For the TMMS and multi-scale Sauvola methods, the implementations provided by the respective authors with their recommended parameters are used. For the multi-scale Sauvola algorithm, k = 0.34, the window size at scale 1 is w = 101 and the first subsampling ratio is s = 3. For the TMMS algorithm, the hysteresis thresholds are cminL = 20 and cminH = 45, and the thickness parameter is p = 50. For MSER, the OpenCV implementation [34] is used with the parameters leading to the best results on our dataset: Δ = 0, minimum area = 15, maximum variation between areas = 0.25 and minimum MSER diversity = 0.2. The obtained results are presented in Table 2. A recall and pre