Script-Independent Text Segmentation from Document Images

Parul Sahare, Jitendra V. Tembhurne, Mayur R. Parate, Tausif Diwan, Sanjay B. Dhok
Copyright: © 2022 | Pages: 21
DOI: 10.4018/IJACI.313967

Abstract

Document image analysis is widely used in the digital world for information retrieval, with applications including optical character recognition (OCR), indexing of digital libraries, and web image processing. Text segmentation is one of the important steps in this field, and it becomes complicated for documents containing unevenly spaced text and characters of varying font sizes. In this paper, script-independent text-line segmentation and word segmentation algorithms are presented. The fast marching method is used for text-line segmentation as a region growing process that detects potential text-lines, whereas the wavelet transform with connected components (CCs) labeling is used for word segmentation. For word segmentation, an energy map is calculated using the wavelet transform to create text-blocks. Both proposed algorithms are evaluated on different databases containing documents of different scripts, where the highest text-line and word segmentation accuracies obtained are 98.9% and 99.1%, respectively.

1. Introduction

Document image analysis proceeds in two stages, i.e., text segmentation and text recognition. The main aim of the overall process is to localize text regions, which in turn supports recognition (Sahare & Dhok, 2018a, 2019a). Printed document images (e.g., bank drafts, stamp papers, etc.) contain text with uniform inter and intra spacing, and their characters have definite height and width. Consequently, parameters such as centroids and character spacing help in text segmentation (Louloudis et al., 2008, 2009). In contrast, handwritten or freestyle handwritten documents (e.g., historical documents, examination answer sheets, etc.) exhibit non-uniform skew as well as non-uniform character and line spacing. As a result, the complexity of text segmentation increases. The situation becomes worse when the two kinds of text (printed and handwritten) are intermixed (Sahare & Dhok, 2017b, 2018b, 2019b). It is observed that approaches that use Connected Components (CCs) (Louloudis et al., 2008, 2009) and projection profiles (Manmatha & Rothfeder, 2005) for text-line segmentation are script-dependent. These approaches find it hard to handle handwritten documents and skew. In addition, very few works, such as (Y. Li et al., 2008) and (Soora & Deshpande, 2018), have addressed noise- and skew-related problems.

To tackle these issues, a text-line segmentation algorithm is proposed using the fast marching method, which does not depend on the structural properties of the text. Many papers address text-line segmentation; however, none of them performs this task using the fast marching method. To the best of our knowledge, this is one of the first works to carry out a detailed study and implementation of the fast marching method for text-line segmentation from document images. The motivation for using the fast marching method is that a document image generally consists of two regions, namely text and background. For text-line segmentation, these regions can be considered as wave fronts that move in the outward direction. Each particle of these wave fronts is like a black pixel of the text region, which is considered as a node of a graph. These black pixels move towards other nodes according to the cost function described by the fast marching method. Therefore, the fast marching method segments text-lines more precisely, in the form of growing regions within the document image. The algorithm extracts text-lines from documents without prior knowledge of the script geometry. This is an advantage of the proposed text-line segmentation algorithm over script-dependent algorithms.

Further, a word segmentation algorithm is designed using the wavelet transform and CCs analysis (Sahare & Dhok, 2018b). This algorithm is applied to each segmented text-line. An energy map is calculated using the wavelet transform, and Gaussian filtering is then applied to it. This is followed by CCs analysis to segment words in the form of text-blocks.
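The region-growing idea described above can be illustrated with a minimal sketch. It assumes a standard first-order fast marching solver on a pixel grid, seeded at the black (ink) pixels, with the front allowed to move faster where the horizontal ink density is high. The cost function and guiding map of the proposed algorithm are not specified in this section, so `guiding_speed`, its box-shaped window (a stand-in for an asymmetric Gaussian filter), and the threshold `tau` are hypothetical choices made only for illustration.

```python
import heapq
import math

def fast_marching(speed, seeds):
    """First-order fast marching: arrival times T with |grad T| * speed = 1."""
    rows, cols = len(speed), len(speed[0])
    T = [[math.inf] * cols for _ in range(rows)]
    frozen = [[False] * cols for _ in range(rows)]
    heap = []
    for r, c in seeds:
        T[r][c] = 0.0
        heapq.heappush(heap, (0.0, r, c))

    def accepted(r, c):
        # Only frozen (finalised) neighbours may feed the upwind update.
        if 0 <= r < rows and 0 <= c < cols and frozen[r][c]:
            return T[r][c]
        return math.inf

    def solve(r, c):
        tv = min(accepted(r - 1, c), accepted(r + 1, c))
        th = min(accepted(r, c - 1), accepted(r, c + 1))
        a, b = min(tv, th), max(tv, th)
        h = 1.0 / speed[r][c]
        if b - a >= h:  # one-sided update (also covers b == inf)
            return a + h
        return 0.5 * (a + b + math.sqrt(2.0 * h * h - (a - b) ** 2))

    while heap:
        _, r, c = heapq.heappop(heap)
        if frozen[r][c]:
            continue  # stale heap entry
        frozen[r][c] = True
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and not frozen[nr][nc]:
                t_new = solve(nr, nc)
                if t_new < T[nr][nc]:
                    T[nr][nc] = t_new
                    heapq.heappush(heap, (t_new, nr, nc))
    return T

def guiding_speed(ink, half_width=2, floor=0.1):
    """Hypothetical guiding map: horizontal ink density plus a small floor."""
    window = 2 * half_width + 1
    return [[floor + sum(row[max(0, c - half_width):c + half_width + 1]) / window
             for c in range(len(row))] for row in ink]

def grow_text_regions(ink, tau=1.5):
    """Mask of pixels reached by the front before the (assumed) time tau."""
    seeds = [(r, c) for r, row in enumerate(ink) for c, v in enumerate(row) if v]
    T = fast_marching(guiding_speed(ink), seeds)
    return [[t <= tau for t in row] for row in T]

# Toy page: two text-lines, each made of two "words" separated by a small gap.
page = ["111001110", "000000000", "000000000", "000000000", "011100111"]
ink = [[int(ch) for ch in row] for row in page]
mask = grow_text_regions(ink)
```

In this toy page the front bridges the two-pixel gap between words on the same line, because the horizontal density keeps the speed high there, while it crosses the blank rows between lines only slowly. Labeling the connected regions of `mask` would then give candidate text-lines, under the stated assumptions.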

1.1. Contribution

In this paper, the following contributions are made:

  • (i) With the prior understanding that text-lines are horizontal in nature, a guiding map is formed using a state-of-the-art Gaussian low-pass filter in asymmetric form, which helps to determine the text-line boundary.

  • (ii) Unlike state-of-the-art approaches, a closed curve is estimated here through two-dimensional interface propagation and is represented as a growing text-line region.

  • (iii) The proposed text-line segmentation approach is able to process noisy, complex-layout, and oriented documents, and it is script-independent.

  • (iv) To capture information that differentiates text from background regions, an energy map is computed using the wavelet transform during word segmentation.

  • (v) Focus-of-attention regions (words within the text-line) are raised to form a text-block representation by convolving the energy map with a Gaussian low-pass filter with precisely chosen standard deviation values.

  • (vi) The word segmentation framework can be applied to noisy documents and also directly to the document image.
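The word segmentation steps named in contributions (iv) and (v) can be sketched as follows. The sketch assumes a one-level Haar wavelet (the wavelet family is not stated in this section), sums the squared detail coefficients into an energy map, smooths it with a separable Gaussian filter, thresholds it, and labels connected components to obtain text-blocks. The function names, the standard deviations `sigma_x` and `sigma_y`, and the threshold `thresh` are hypothetical values chosen for the toy input, not parameters taken from the paper.

```python
import math

def haar_energy(img):
    """Detail energy (LH^2 + HL^2 + HH^2) of a one-level Haar transform on 2x2 blocks."""
    rows, cols = len(img) // 2, len(img[0]) // 2
    energy = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            a, b = img[2 * r][2 * c], img[2 * r][2 * c + 1]
            d, e = img[2 * r + 1][2 * c], img[2 * r + 1][2 * c + 1]
            lh = (a + b - d - e) / 2.0  # responds to horizontal edges
            hl = (a - b + d - e) / 2.0  # responds to vertical edges
            hh = (a - b - d + e) / 2.0  # responds to diagonal detail
            energy[r][c] = lh * lh + hl * hl + hh * hh
    return energy

def gaussian_kernel(sigma):
    radius = max(1, math.ceil(3 * sigma))
    w = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(w)
    return [v / total for v in w]

def smooth_rows(grid, sigma):
    """1-D Gaussian smoothing along each row, zero-padded at the borders."""
    k = gaussian_kernel(sigma)
    radius = len(k) // 2
    return [[sum(k[j + radius] * row[c + j]
                 for j in range(-radius, radius + 1) if 0 <= c + j < len(row))
             for c in range(len(row))] for row in grid]

def gaussian_smooth(grid, sigma_x, sigma_y):
    """Separable 2-D Gaussian: smooth rows, then columns via a transpose."""
    horiz = smooth_rows(grid, sigma_x)
    cols_as_rows = [list(col) for col in zip(*horiz)]
    return [list(col) for col in zip(*smooth_rows(cols_as_rows, sigma_y))]

def component_boxes(mask):
    """Bounding boxes (r0, c0, r1, c1) of 4-connected True regions, left to right."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                stack, r0, r1, c0, c1 = [(r, c)], r, r, c, c
                while stack:
                    y, x = stack.pop()
                    r0, r1, c0, c1 = min(r0, y), max(r1, y), min(c0, x), max(c1, x)
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                boxes.append((r0, c0, r1, c1))
    return sorted(boxes, key=lambda box: box[1])

def segment_words(img, sigma_x=0.5, sigma_y=0.5, thresh=0.5):
    """Word boxes in pixel coordinates of the input text-line image."""
    smooth = gaussian_smooth(haar_energy(img), sigma_x, sigma_y)
    mask = [[v > thresh for v in row] for row in smooth]
    # Each wavelet coefficient covers a 2x2 pixel block, so scale boxes back up.
    return [(2 * r0, 2 * c0, 2 * r1 + 1, 2 * c1 + 1)
            for r0, c0, r1, c1 in component_boxes(mask)]

# Toy text-line: two words drawn as vertical strokes, separated by 8 blank columns.
line = ["10101010" + "0" * 8 + "10101010"] * 4
img = [[int(ch) for ch in row] for row in line]
words = segment_words(img)
```

For this toy text-line, `words` holds two boxes, one per stroke group. In this sketch a larger `sigma_x` would merge words that lie closer together, which suggests why the choice of standard deviation matters for the text-block representation.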
