JBIG2 Supported by OCR

Radim Hatlapatka

Masaryk University, Faculty of Informatics, Botanická 68a, 602 00 Brno, Czech Republic
208155@mail.muni.cz

Abstract. Digital mathematical libraries contain a large volume of PDF documents with scanned text. In this paper we describe how these documents can be compressed and thus delivered to users more efficiently. We introduce the JBIG2 standard for compressing bitonal images such as scanned text, and we discuss the issues that arise when OCR is used to improve the compression ratio of the open-source encoder jbig2enc. For this purpose we have designed an API for using OCR in jbig2enc, which we describe in this paper together with the results achieved so far.

Keywords: jbig2enc, JBIG2, PDF size optimization, compression, DML, OCR, pdfJbIm, DML-CZ, EuDML

Smaller is faster and safer too. (Stephen Adams, Google)

1 Motivation

Digital mathematical libraries (DMLs) contain a large volume of documents with scanned text (more than 80% of EuDML is scanned). These are mostly created by scanning older papers that were published before digital workflows existed, and whose digital source versions have been lost. Documents created this way are referred to as retro-born digital documents.

Research in mathematics is greatly influenced by older articles and papers: new discoveries often build on older papers and results. To make research more comfortable, users require easy access to these kinds of documents, so DMLs need to provide documents that are easy both to find and to access.

One user demand is quick and easy access to documents. This means not only the ability to find where a document can be downloaded from, but also the ability to access it from the user's computer. The time to access a document depends strongly on its size, which can be reduced using a good compression method. Documents in DMLs are mostly stored as PDF, probably the most widely used document format on the Internet.
In PDF, images are stored using various compression methods. One supported method is JBIG2, which offers excellent compression ratios [1]. The bachelor thesis JBIG2 Compression by Radim Hatlapatka [2] introduced a method for compressing PDF documents using the JBIG2 standard with the open-source encoder jbig2enc. This tool has been renamed pdfJbIm [3]. The improvement to jbig2enc introduced in that thesis improves the compression ratio of jbig2enc by a further 10% on average. The results of a comparison of this tool with other tools have been published in [6]. At DML 2010, an article was published about the newer version of pdfJbIm and the results achieved with data stored in DML-CZ [7].

We are developing additional improvements to the jbig2enc encoder which use the results of an OCR engine to decide whether two symbols are equivalent and thus should be stored only once in the dictionary. If they are found equivalent, the OCR engine can also help decide which of them should be stored in the dictionary and used for reconstructing the image when it is decompressed, thereby improving the quality of the image.

In this paper, we introduce the JBIG2 standard (see Section 2) and discuss issues that need to be addressed when OCR is used during compression to achieve the best possible results. We focus on issues connected to documents with mathematics (see Section 3), and we describe a jbig2enc interface designed for using an OCR engine (see Section 4). Finally, we show some experimental results achieved when using Tesseract as the OCR engine (see Section 5).

2 Introduction to JBIG2

JBIG2 is a standard for the compression of bitonal images, developed by the Joint Bi-level Image Experts Group. Bitonal images consist of only two colours (usually black and white); scanned text is their most common source. JBIG2 was published in 2000 as the international standard ITU-T T.88 [9] and one year later as ISO/IEC 14492 [1].
It typically generates files that are three to five times smaller than Fax Group 4 and two to four times smaller than JBIG1, the previous standard released by the Joint Bi-level Image Experts Group [5].

JBIG2 also supports "perceptually lossless" coding. This is a special kind of lossy compression that causes no visually noticeable loss. Scanned text often contains flyspecks (tiny pieces of dirt), and perceptually lossless coding can remove them and thus increase the quality of the output image.

The content of each page may be segmented into several regions with specific types of data. Most often a page is segmented into a text region for textual data, a halftone region for halftone images (more about halftone can be found at http://en.wikipedia.org/wiki/Halftone), and a generic region for the rest. In some situations it is better to use the generic region for a specific type of data rather than a specialized region such as the halftone region; in other situations, the converse is true.

The JBIG2 encoder segments text regions into components that most often correspond to symbols. For each set of equivalent symbols, one representant is chosen. The representant stores the bitmap data, and each occurrence of the symbol then refers to that representant together with information about its position. These symbols must be encoded (both in the dictionary containing the representants and in their occurrences). JBIG2 uses modified versions of Arithmetic and Huffman coding. Huffman coding is used mostly by fax machines because of its lower computational demands, even though Arithmetic coding produces slightly better results.

JBIG2 supports multi-page compression for symbol coding (the coding of text regions). Any symbol that is used on more than one page is stored in a global dictionary. Such symbols need to be stored only once; the space needed to store documents is thereby further reduced.
3 Specific Aspects of Using OCR in JBIG2 and jbig2enc

OCR and image compression according to the JBIG2 standard are very similar. In both cases, it is necessary to segment an image into components that are then processed further. In OCR, text blocks and individual symbols must be detected so that they can be recognized. In JBIG2, text blocks and ideally individual symbols must likewise be detected in order to achieve the maximum compression ratio.

There is one main difference between image compression that follows the JBIG2 standard and OCR. An OCR engine must produce a result even when it is uncertain, and it must be trained to know the symbols contained in the image. A perceptually lossless encoder, by contrast, cannot afford recognition errors, but it has the advantage that whenever it is uncertain about a symbol it can simply classify it as a new symbol, thereby preventing unwanted errors. It also does not need to know any font information in advance.

When extracting text from images using OCR, the most important part is to recognize what is written, not its format and fonts; this information is welcome, but not essential. When compressing an image, however, it is necessary to differentiate between symbols in different fonts, because the font information can be of value to the user. When OCR is used to detect equivalent symbols, this must be taken into account, and if the OCR engine itself does not provide this information, it needs to be handled by additional methods.

It is also necessary to take into account that atypical symbols can appear in documents, and they need to be handled correctly as well. From the OCR point of view, math symbols can be considered atypical. It is possible either to use a specialized OCR engine that handles mathematics, such as the Infty Reader [8], or to detect that the content is mathematical and process it specially.
4 Jbig2enc API for Using OCR

Our API consists of two parts (modules): one represents the data structure holding the results of OCR, and one represents the methods for running the OCR engine and retrieving its results. Our goal is to make both the API for holding OCR results and the API for using an OCR engine as adaptable as possible, in order to allow easy interchangeability of OCR engines and thus prevent unnecessary modifications of existing code.

Because jbig2enc is written in C++, our improvement and API also need to be in C/C++. We decided to use an object hierarchy, which allows the creation of a common class with the required methods; new modules are created using inheritance. A new module using a different OCR engine is easily made by inheriting from the relevant class and implementing the defined methods. These methods are implemented specifically for the OCR engine that is actually used.

For holding OCR results, we need to allow the storage of additional data specific to the particular OCR engine, which can be used to improve the comparison of representants and thus to create a similarity function best suited to that OCR engine.

Figure 1 shows the classes representing the interface for using an OCR engine. On the left are the classes representing the module for using an OCR engine and its functions; the class TesseractOcr is an example of a module which uses Tesseract as the OCR engine. On the right are the classes holding the results of OCR recognition. There is a main class storing just a simple structure with representative symbols. For holding OCR results, it is necessary to store additional information such as the text recognized by the OCR engine and its confidence level. For this purpose, the class OcrResult is created; it can be extended, so new classes can easily be created to store additional information provided by the OCR engine.

Fig. 1. Jbig2enc API for using an OCR engine
5 Experimental Results

In this section, we present the results achieved using our prototype version of the improved jbig2enc encoder, which uses Tesseract as the OCR engine. To demonstrate the prototype, we compressed a set of more than 800 PDF documents selected randomly from the collection of the Czech Digital Mathematics Library. The documents were selected to cover different types of PDFs from different eras; for this purpose, PDFs were chosen from different journals and from papers published in different years.

For compressing the PDF documents, pdfJbIm [3] is used with our prototype version of the improved jbig2enc encoder. The prototype improves the compression ratio by an additional two percent in comparison with the previous improvement of the jbig2enc encoder [6]. The results are shown in Table 1 and in the graph in Figure 2; all sizes are in kB. In order to prevent errors even on documents of extremely bad quality, the default thresholding value (the value used by the jbig2enc encoder even if no improvement is used; it determines whether two symbols should be considered equivalent) is set to minimize potential loss of data. The documents also contain non-bitonal images, which are ignored.

Table 1. Results of the enhanced jbig2enc encoder (sizes in kB, percentage of original in parentheses)

Pages  Original PDF  Original jbig2enc  Improved, no OCR  Improved, with OCR
  1        107.11      88.64 (82.8%)     86.72 (81.0%)      84.63 (79.0%)
  2        240.76     203.19 (84.3%)    198.47 (82.4%)     193.83 (80.5%)
  3        353.87     296.73 (83.9%)    288.11 (81.4%)     281.21 (79.5%)
  4        476.82     401.13 (84.1%)    388.85 (81.6%)     379.38 (79.6%)
  5        592.42     499.82 (84.4%)    484.31 (81.7%)     472.61 (79.8%)
  6        722.71     609.02 (84.3%)    590.66 (81.7%)     576.42 (79.8%)
  7        822.41     691.49 (84.1%)    667.13 (81.1%)     650.51 (79.1%)
  8        949.18     800.55 (84.3%)    775.36 (81.7%)     756.16 (79.7%)
  9      1,080.05     913.35 (84.6%)    880.55 (81.5%)     858.71 (79.5%)
 10      1,161.09     975.56 (84.0%)    936.53 (80.6%)     913.19 (78.6%)

Figure 3 shows an original image (TIFF G4 compressed) that is further compressed according to the JBIG2 standard.
The image size is 118 kB. Figures 4 (size 8.6 kB) and 5 (size 8.2 kB) show the image compressed according to the JBIG2 standard using the open-source encoder jbig2enc [4]. Their sizes are around 8 kB, a significant reduction from the original image. The difference between Figures 4 and 5 is that Figure 5 is compressed by the improved jbig2enc described in Section 4, using Tesseract as the OCR engine. In these figures, there is no visible loss of data or image quality; without searching for differences in detail, they look the same. Figure 6 shows the difference between the original image and the image compressed using the jbig2enc encoder with Tesseract as the OCR engine. Figure 7 shows how the output image changes when the jbig2enc encoder uses an OCR engine to improve its compression ratio; as a side effect, it has the potential to improve the quality of the output image.

Fig. 2. Compression results of jbig2enc and its improved versions (size in kB plotted against the number of pages, for the original PDFs, original jbig2enc, and the improved jbig2enc without and with OCR)

6 Conclusion and Future Work

As we have shown, using OCR in jbig2enc further improves its compression ratio. We defined an API which is independent of the OCR engine used. As the default OCR engine we used Tesseract, but it can easily be replaced by another engine by creating a new module for that engine which implements the defined API, and by changing one line in the existing code to select which OCR engine is used. We have also shown that by using OCR we are able to choose which of several equivalent representants is better in terms of visual quality, thus improving the quality of the image.
The similarity distance function is not yet fully balanced for simultaneously maximizing the compression ratio and preventing errors. This needs to be solved and well tested. For this purpose, we intend to create semi-automatic testing: we will experiment on images for which we know the output will be created correctly without quality reduction, and for which we know the number of distinct symbols recognized. We shall then simulate the decrease in quality caused by scanning the image and see to what extent the result changes.

We intend to create a universal language dictionary containing all of the symbols used in European languages, including math symbols, and train Tesseract for it. This should remove the need for the user to set all the languages used, and should improve speed when more languages are used: more dictionaries mean recurrence of the same symbols, and thus a bigger collection than necessary. While recognizing symbols, the OCR engine compares data with the existing collection of symbols, created from the dictionaries provided, to determine which symbol is represented by the image. The size of this collection influences the number of comparisons needed to determine the best candidate.

There is also a plan to create a module using Infty as the OCR engine and to use its math recognition support to improve the compression ratio of documents containing a lot of mathematics.

Fig. 3. Original image before JBIG2 compression (TIFF G4 compressed, size: 118 kB)
Fig. 4. Image compressed according to the JBIG2 standard without OCR usage (size: 8.6 kB)
Fig. 5. Image compressed according to the JBIG2 standard with OCR usage (size: 8.2 kB)
Fig. 6. Difference between Fig. 3 and Fig. 5
Fig. 7. Difference between Fig. 4 and Fig. 5
Acknowledgement

This work has been financed in part by the European Union through its Competitiveness and Innovation Programme (Information and Communication Technologies Policy Support Programme, "Open access to scientific information", Grant Agreement No. 250503).

References

1. JBIG Committee: 14492 FCD. ISO/IEC JTC 1/SC 29/WG 1 (1999), http://www.jpeg.org/public/fcd14492.pdf
2. Hatlapatka, R.: JBIG2 komprese (JBIG2 Compression; bachelor thesis, in Czech). Masaryk University, Faculty of Informatics (advisor Petr Sojka), Brno, Czech Republic (2010)
3. Hatlapatka, R.: PDF Recompression using JBIG2. [online] (2012), http://nlp.fi.muni.cz/projekty/eudml/pdfRecompression/
4. Langley, A.: Homepage of the jbig2enc encoder. [online], http://github.com/agl/jbig2enc
5. JBIG SG: JBIG Maui Meeting Press Release (December 1999), http://www.jpeg.org/public/mauijbig.pdf
6. Sojka, P., Hatlapatka, R.: Document engineering for a digital library: PDF recompression using JBIG2 and other optimizations of PDF documents. In: Proceedings of the 10th ACM Symposium on Document Engineering. pp. 3–12. DocEng '10, ACM, New York, NY, USA (2010), http://doi.acm.org/10.1145/1860559.1860563
7. Sojka, P., Hatlapatka, R.: PDF Enhancements Tools for a Digital Library: pdfJbIm and pdfsign. In: DML 2010 – Towards a Digital Mathematics Library. pp. 45–55. Masaryk University, Brno, Czech Republic (2010)
8. Suzuki, M., Tamari, F., Fukuda, R., Uchida, S., Kanahori, T.: INFTY – An Integrated OCR System for Mathematical Documents. In: Proceedings of the ACM Symposium on Document Engineering 2003. ACM, Grenoble (2003), http://www.inftyproject.org/articles/2003_DocEng_Suzuki.zip
9. International Telecommunication Union: ITU-T Recommendation T.88 (2000), http://www.itu.int/rec/T-REC-T.88-200002-I/en