OCReract.jl

OCReract is a simple Julia wrapper for the well-known Tesseract OCR engine. It is intended to be a very simple package with two use cases:

In disk: Run the tesseract command from a Julia session to read an image from disk and write the results to a text file.
In memory: Process an image loaded in memory and get the OCR results as a string in the Julia session.

Installation

The Tesseract OCR engine must be installed manually. On Ubuntu, this may be as simple as

sudo apt-get install -y tesseract-ocr

but Tesseract's own installation instructions are the authoritative source.

The Julia wrapper can be installed using the Julia package manager. From the Julia REPL, type ] to enter the Pkg REPL mode and run:

pkg> add OCReract

Usage

In this simple example, we will process the following image with both of the options described above:

Test Image

In disk

Let's execute run_tesseract to process the image from the repository's test folder, and then cat the resulting text file.

julia> using OCReract
julia> img_path = "test/files/noisy.png";
julia> res_path = "/tmp/res.txt";
julia> run_tesseract(img_path, res_path);
julia> read(`cat $res_path`, String)
"Noisy image\nto test\nOCReract.jl\n\f"

In memory

OCReract uses the JuliaImages ecosystem to process images in memory. Load the image with the Images package (or the lighter-weight combination of ImageCore and FileIO), then call run_tesseract on it to get the result as a String.

julia> using Images
julia> using OCReract
julia> img_path = "https://raw.githubusercontent.com/leferrad/OCReract.jl/master/test/files/noisy.png";
julia> img = load(download(img_path));
julia> res_text = run_tesseract(img);
julia> println(strip(res_text))
Noisy image
to test
OCReract.jl
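
The raw result in the on-disk example ends with a newline and a form feed ("\n\f"), which is why the in-memory example calls strip before printing. A minimal sketch of that cleanup, using only Base Julia and the output string shown above (the variable names are illustrative):

```julia
# Raw text exactly as shown in the on-disk example above
raw = "Noisy image\nto test\nOCReract.jl\n\f"

# strip removes leading/trailing whitespace, which includes '\n' and the form feed '\f'
clean = strip(raw)

# One entry per recognized line of text
lines = split(clean, '\n')

for (i, line) in enumerate(lines)
    println(i, ": ", line)
end
```

Splitting after stripping avoids a trailing empty entry caused by the final newline.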

OCReract.OCReract — Module

OCReract is a simple Julia wrapper for the well-known Tesseract OCR engine.

A simple usage example:

Example

julia> using Images
julia> using OCReract
julia> img_path = "/path/to/img.png";
# In disk
julia> run_tesseract(img_path, "/tmp/res.txt", psm=3, oem=1)
# In memory
julia> img = load(img_path);
julia> res_text = run_tesseract(img, psm=3, oem=1);
julia> println(strip(res_text));

For more information, check the homepage at https://github.com/leferrad/OCReract.jl.


OCReract.check_tesseract_installed — Function

check_tesseract_installed()

This function checks whether Tesseract is installed on the system by running the command tesseract --version. If the command is not recognized, an error is logged.

Examples

julia> using OCReract;
julia> check_tesseract_installed()
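
To make the behavior described above concrete, here is a minimal sketch of such a check written with Base Julia only. The helper name tesseract_available and its return value are assumptions made for illustration; this is not OCReract's actual implementation.

```julia
# Hypothetical helper (not part of OCReract): returns true when
# `<cmd_name> --version` can be run, and logs an error otherwise.
function tesseract_available(cmd_name::AbstractString="tesseract")
    try
        # Discard the command's own output; only success or failure matters here
        run(pipeline(`$cmd_name --version`; stdout=devnull, stderr=devnull))
        return true
    catch err
        # Mirrors the documented behavior: log the problem instead of throwing
        @error "Could not run '$cmd_name --version'. Is Tesseract installed and on PATH?" exception = err
        return false
    end
end

println("Tesseract available: ", tesseract_available())
```

Catching the exception and logging it keeps the calling code free of try/catch blocks, at the cost of having to check the returned flag.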


OCReract.get_tesseract_version — Function

get_tesseract_version() -> String

Returns the version of Tesseract installed on the system. The version is extracted from the first line of the output of the command tesseract --version.

Returns

String: version of Tesseract installed

Examples

julia> using OCReract;
julia> get_tesseract_version()
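
As an illustration of the extraction described above, the sketch below pulls a version number out of the first line of text shaped like the output of tesseract --version. The helper name parse_tesseract_version, the regular expression, and the sample text are assumptions for illustration, not OCReract's actual implementation.

```julia
# Hypothetical helper (not part of OCReract): extract a dotted version number
# from the first line of `tesseract --version` style output.
function parse_tesseract_version(output::AbstractString)
    first_line = first(split(output, '\n'))
    m = match(r"\d+(\.\d+)*", first_line)
    # Return an empty string when no version-like token is found
    return m === nothing ? "" : String(m.match)
end

# Illustrative sample text, not captured from a real installation
sample_output = "tesseract 4.1.1\n leptonica-1.79.0\n"
println(parse_tesseract_version(sample_output))

# With Tesseract installed, the real text could be obtained with, for example:
#   read(`tesseract --version`, String)
```

Only the first line is inspected, so version numbers of bundled libraries listed on later lines are ignored.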


OCReract.run_tesseract — Method

run_tesseract(image, extra_args...; kwargs...) -> String

Runs Tesseract on an image in memory and returns the results as a String. Errors and warnings are reported through Logging, so no exceptions are thrown.

Arguments

image: Image to be processed, in a format compatible with the Images package.
extra_args::String...: Optional arguments to change the nature of the output (e.g., "tsv")

Keywords

lang::Union{String, Nothing}: Language to be configured in Tesseract (optional).
psm::Integer: Page segmentation modes (PSM):
- psm=0: Orientation and script detection (OSD) only.
- psm=1: Automatic page segmentation with OSD.
- psm=2: Automatic page segmentation, but no OSD, or OCR.
- psm=3: Fully automatic page segmentation, but no OSD. (Default)
- psm=4: Assume a single column of text of variable sizes.
- psm=5: Assume a single uniform block of vertically aligned text.
- psm=6: Assume a single uniform block of text.
- psm=7: Treat the image as a single text line.
- psm=8: Treat the image as a single word.
- psm=9: Treat the image as a single word in a circle.
- psm=10: Treat the image as a single character.
- psm=11: Sparse text. Find as much text as possible in no particular order.
- psm=12: Sparse text with OSD.
- psm=13: Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
oem::Integer: OCR Engine modes (OEM):
- oem=0: Legacy engine only.
- oem=1: Neural nets LSTM engine only. (Default)
- oem=2: Legacy + LSTM engines.
- oem=3: Default, based on what is available.
kwargs: Other key-value pairs to be sent to the Tesseract command as "-c" config variables. You can list the available options with tesseract --print-parameters.

Returns

String: the extracted text, or an empty string if an error occurs

Examples

julia> using Images;
julia> using OCReract;
julia> img_path = "/path/to/img.png";
julia> img = load(img_path);
julia> res_text = run_tesseract(img, psm=3, oem=1);
julia> println(strip(res_text));
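
The "tsv" extra argument mentioned under Arguments changes the output to Tesseract's tab-separated layout. Assuming that run_tesseract(img, "tsv") returns that TSV text as its String result, the sketch below shows one way to post-process it with Base Julia. The helper parse_tsv is hypothetical, and the sample rows are made up to match Tesseract's TSV column names; they are not real OCR output.

```julia
# Made-up sample in the column layout of Tesseract's TSV output (not real OCR results)
sample_tsv = join([
    join(["level", "page_num", "block_num", "par_num", "line_num", "word_num",
          "left", "top", "width", "height", "conf", "text"], '\t'),
    join(["4", "1", "1", "1", "1", "0", "10", "12", "180", "30", "-1", ""], '\t'),
    join(["5", "1", "1", "1", "1", "1", "10", "12", "80", "30", "96", "Noisy"], '\t'),
    join(["5", "1", "1", "1", "1", "2", "100", "12", "90", "30", "91", "image"], '\t'),
], '\n')

# Hypothetical helper: parse TSV text into one Dict per row, keyed by column name
function parse_tsv(tsv::AbstractString)
    lines = split(strip(tsv, ['\n', '\r', '\f']), '\n')
    header = split(lines[1], '\t')
    rows = Dict{String,String}[]
    for line in lines[2:end]
        fields = split(line, '\t')
        # Skip malformed rows whose field count does not match the header
        length(fields) == length(header) || continue
        push!(rows, Dict(String(h) => String(f) for (h, f) in zip(header, fields)))
    end
    return rows
end

# Level 5 rows describe individual words in Tesseract's TSV layout
words = [row["text"] for row in parse_tsv(sample_tsv) if row["level"] == "5"]
println(words)
```

Keeping every field as a String avoids guessing column types; numeric columns such as conf can be converted with parse(Int, ...) where needed.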


OCReract.run_tesseract — Method

run_tesseract(input_path, output_path, extra_args...; kwargs...) -> Bool

Wrapper function to run Tesseract on an image stored on disk and write the results to a given path. Errors and warnings are reported through Logging, so no exceptions are thrown.

Arguments

input_path::String: Path to the image to be processed
output_path::String: Path to the text result to be written
extra_args::String...: Optional arguments to change the nature of the output (e.g., "tsv")

Keywords

lang::Union{String, Nothing}: Language to be configured in Tesseract (optional).
psm::Integer: Page segmentation modes (PSM):
- psm=0: Orientation and script detection (OSD) only.
- psm=1: Automatic page segmentation with OSD.
- psm=2: Automatic page segmentation, but no OSD, or OCR.
- psm=3: Fully automatic page segmentation, but no OSD. (Default)
- psm=4: Assume a single column of text of variable sizes.
- psm=5: Assume a single uniform block of vertically aligned text.
- psm=6: Assume a single uniform block of text.
- psm=7: Treat the image as a single text line.
- psm=8: Treat the image as a single word.
- psm=9: Treat the image as a single word in a circle.
- psm=10: Treat the image as a single character.
- psm=11: Sparse text. Find as much text as possible in no particular order.
- psm=12: Sparse text with OSD.
- psm=13: Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
oem::Integer: OCR Engine modes (OEM):
- oem=0: Legacy engine only.
- oem=1: Neural nets LSTM engine only. (Default)
- oem=2: Legacy + LSTM engines.
- oem=3: Default, based on what is available.
kwargs: Other key-value pairs to be sent to the Tesseract command as "-c" config variables. You can list the available options with tesseract --print-parameters.
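
To illustrate how the keywords above relate to the Tesseract command line, the sketch below assembles a command from them using the CLI options -l, --psm, --oem, and -c. The helper build_tesseract_cmd is hypothetical and only builds the command without running it; OCReract may assemble its command differently. Note that the tesseract CLI treats its second argument as an output base name and appends the extension itself.

```julia
# Hypothetical helper (not part of OCReract): build, but do not run, a tesseract command
function build_tesseract_cmd(input_path::AbstractString, output_base::AbstractString,
                             extra_args::AbstractString...;
                             lang::Union{AbstractString,Nothing}=nothing,
                             psm::Integer=3, oem::Integer=1, kwargs...)
    args = String["tesseract", input_path, output_base]
    # Language is optional; omit the flag when it is not set
    lang === nothing || append!(args, ["-l", String(lang)])
    append!(args, ["--psm", string(psm), "--oem", string(oem)])
    # Every remaining keyword becomes a "-c key=value" config variable
    for (key, value) in kwargs
        append!(args, ["-c", "$(key)=$(value)"])
    end
    # Config names such as "tsv" go last on the command line
    for arg in extra_args
        push!(args, String(arg))
    end
    return Cmd(args)
end

cmd = build_tesseract_cmd("img.png", "out"; lang="eng", psm=6,
                          tessedit_char_whitelist="0123456789")
println(cmd)
```

Building a Cmd from a vector of strings, rather than interpolating into one shell string, keeps paths with spaces intact as single arguments.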

Returns

Bool: whether the execution was successful

Examples

julia> using OCReract;
julia> img_path = "/path/to/img.png";
julia> out_path = "/tmp/tesseract_result.txt";
julia> run_tesseract(img_path, out_path, psm=3, oem=1)
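
Because this method reports failure through its Bool return value rather than an exception, callers may want to check that value before reading the output file. The sketch below shows one possible pattern. The helper ocr_file_to_string is hypothetical and takes the OCR function as an argument, so run_tesseract could be passed as runner; a stand-in runner is used here so the snippet runs without Tesseract.

```julia
# Hypothetical helper: run an on-disk OCR function into a temporary file and
# return the text, or nothing when the runner reports failure.
function ocr_file_to_string(runner, input_path::AbstractString; kwargs...)
    mktempdir() do dir
        output_path = joinpath(dir, "result.txt")
        ok = runner(input_path, output_path; kwargs...)
        # Only read the file when the runner reported success and wrote it
        (ok && isfile(output_path)) ? read(output_path, String) : nothing
    end
end

# Stand-in runner for demonstration only; it writes fixed text and reports success
fake_runner(input_path, output_path; kwargs...) = (write(output_path, "hello\n"); true)

text = ocr_file_to_string(fake_runner, "unused.png"; psm=3, oem=1)
println(repr(text))

# With OCReract loaded, the call might look like:
#   ocr_file_to_string(run_tesseract, "/path/to/img.png"; psm=3, oem=1)
```

Returning nothing on failure makes the error case explicit at the call site instead of relying on log output alone.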
