OCReract.jl

OCReract is a simple Julia wrapper for the well-known Tesseract OCR engine. It is intended to be a very simple package with two uses:

  1. In disk: Run the tesseract command from a Julia session to load an image from disk and write the results to a text file.
  2. In memory: Process an image loaded in memory and get the OCR results as a string in a Julia session.

Installation

The Tesseract OCR engine must be installed manually. On Ubuntu, this may be as simple as

$ sudo apt-get install -y tesseract-ocr

but the installation instructions are the authoritative source.
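
Before calling the wrapper, it can be handy to confirm from Julia that the Tesseract executable is reachable. This is a minimal sketch, assuming the binary is named `tesseract` and lives on the PATH; `tesseract_available` is a hypothetical helper, not part of OCReract.

```julia
# Hypothetical helper (not part of OCReract): report whether an executable
# called `name` can be found on the current PATH.
tesseract_available(name::AbstractString="tesseract") = Sys.which(name) !== nothing

if tesseract_available()
    println("tesseract found at ", Sys.which("tesseract"))
else
    println("tesseract not found; install it before using OCReract")
end
```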

The Julia wrapper can be installed using the Julia package manager. From the Julia REPL, type ] to enter the Pkg REPL mode and run:

pkg> add OCReract

Usage

In this simple example, we will process the following image through the two options mentioned:

Test Image

In disk

Let's execute run_tesseract to process the image from the repository's test folder, and then cat the resulting text file.

julia> using OCReract
julia> img_path = "test/files/noisy.png";
julia> res_path = "/tmp/res.txt";
julia> run_tesseract(img_path, res_path);
julia> read(`cat $res_path`, String)
"Noisy image\nto test\nOCReract.jl\n\f"

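The `cat` call above relies on a Unix utility. As a portable alternative, the result file can be read with Julia's own `read`, and the trailing newline and form feed (`\f`) can be removed with `strip`. The sketch below writes the string shown above to a temporary file so that it runs without Tesseract; that temporary file is only a stand-in for `res_path`.

```julia
# Stand-in for the file written by run_tesseract: the same text shown above.
res_path, io = mktemp()
write(io, "Noisy image\nto test\nOCReract.jl\n\f")
close(io)

raw = read(res_path, String)      # portable replacement for `cat`
lines = split(strip(raw), '\n')   # strip drops the trailing "\n\f"
foreach(println, lines)
```
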
In memory

OCReract uses the JuliaImages ecosystem to process images in memory. The image should therefore be loaded with the Images package (or the lighter-weight combination of ImageCore and FileIO) and then passed to run_tesseract, which returns the result as a String.

julia> using Images
julia> using OCReract
julia> img_path = "https://raw.githubusercontent.com/leferrad/OCReract.jl/master/test/files/noisy.png";
julia> img = load(download(img_path));  # fetch the remote file first, then load it
julia> res_text = run_tesseract(img);
julia> println(strip(res_text))
Noisy image
to test
OCReract.jl

API Reference

OCReract.run_tesseract - Method
run_tesseract(image, extra_args...; kwargs...) -> String

Function to run Tesseract over an image in memory, and get the results in a String. Errors / Warnings are reported through Logging, so no exceptions are thrown.

Arguments

  • image: Image to be processed, in a format compatible with Images module.
  • extra_args::String...: Optional arguments to change the nature of the output (e.g., "tsv")
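
When "tsv" is requested, Tesseract emits tab-separated rows rather than plain text. Here is a minimal sketch of parsing such output, assuming the standard Tesseract TSV header (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text). `parse_tsv_words` is a hypothetical helper, and the sample string is hand-written for illustration rather than real OCR output.

```julia
# Hypothetical helper (not part of OCReract): pull recognised words and their
# confidence out of Tesseract-style TSV text.
function parse_tsv_words(tsv::AbstractString)
    rows = split(strip(tsv), '\n')
    header = split(rows[1], '\t')
    conf_col = findfirst(==("conf"), header)
    text_col = findfirst(==("text"), header)
    (conf_col === nothing || text_col === nothing) &&
        error("unexpected TSV header: missing conf or text column")
    words = NamedTuple{(:text, :conf), Tuple{String, Float64}}[]
    for row in rows[2:end]
        fields = split(row, '\t')
        length(fields) < max(conf_col, text_col) && continue
        text = strip(fields[text_col])
        isempty(text) && continue   # skip structural rows (page, block, line)
        push!(words, (text=String(text), conf=parse(Float64, fields[conf_col])))
    end
    return words
end

# Hand-written sample input, for illustration only.
sample = join([
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext",
    "1\t1\t0\t0\t0\t0\t0\t0\t200\t60\t-1\t",
    "5\t1\t1\t1\t1\t1\t10\t12\t80\t20\t95.5\tNoisy",
    "5\t1\t1\t1\t1\t2\t100\t12\t85\t20\t91.0\timage",
], "\n")

words = parse_tsv_words(sample)
println(join((w.text for w in words), " "))
```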

Keywords

  • lang::Union{String, Nothing}: Language to be configured in Tesseract (optional)
  • psm::Integer: Page segmentation modes (PSM):
    • psm=0: Orientation and script detection (OSD) only.
    • psm=1: Automatic page segmentation with OSD.
    • psm=2: Automatic page segmentation, but no OSD, or OCR.
    • psm=3: Fully automatic page segmentation, but no OSD. (Default)
    • psm=4: Assume a single column of text of variable sizes.
    • psm=5: Assume a single uniform block of vertically aligned text.
    • psm=6: Assume a single uniform block of text.
    • psm=7: Treat the image as a single text line.
    • psm=8: Treat the image as a single word.
    • psm=9: Treat the image as a single word in a circle.
    • psm=10: Treat the image as a single character.
    • psm=11: Sparse text. Find as much text as possible in no particular order.
    • psm=12: Sparse text with OSD.
    • psm=13: Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
  • oem::Integer: OCR Engine modes (OEM):
    • oem=0: Legacy engine only.
    • oem=1: Neural nets LSTM engine only. (Default)
    • oem=2: Legacy + LSTM engines.
    • oem=3: Tesseract's own default, based on what is available.
  • kwargs: Other key-value pairs to be sent to Tesseract command as "-c" config variables. You can check the options with tesseract --print-parameters.
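
To make the mapping from keywords to the command line concrete, here is a sketch of how such options could be assembled into a tesseract invocation. It is not OCReract's actual implementation: `build_tesseract_cmd` is a hypothetical helper that only uses flags documented by the Tesseract CLI (`-l`, `--psm`, `--oem`, `-c`), and the defaults mirror the ones marked above.

```julia
# Hypothetical helper (not part of OCReract): assemble a tesseract command
# from the keyword options described above, without running it.
function build_tesseract_cmd(input::AbstractString, output::AbstractString,
                             extra_args::AbstractString...;
                             lang::Union{AbstractString, Nothing}=nothing,
                             psm::Integer=3, oem::Integer=1, kwargs...)
    0 <= psm <= 13 || throw(ArgumentError("psm must be in 0:13"))
    0 <= oem <= 3 || throw(ArgumentError("oem must be in 0:3"))
    args = String["tesseract", input, output]
    lang === nothing || push!(args, "-l", lang)
    push!(args, "--psm", string(psm), "--oem", string(oem))
    for (key, value) in kwargs
        push!(args, "-c", "$(key)=$(value)")   # one -c per config variable
    end
    for arg in extra_args
        push!(args, arg)                       # config names such as "tsv" go last
    end
    return Cmd(args)
end

cmd = build_tesseract_cmd("in.png", "out", "tsv";
                          lang="eng", psm=6,
                          tessedit_char_whitelist="0123456789")
println(cmd)
```

Building the `Cmd` separately from running it makes it easy to inspect exactly which arguments Tesseract would receive.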

Returns

  • String: text extracted, or empty string in case an error occurs
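
Because failures are logged rather than thrown, callers need to check for an empty result themselves. In this minimal sketch, `text_or_nothing` is a hypothetical helper and the input strings stand in for values returned by run_tesseract. Note that an empty result can also mean the image simply contains no recognisable text.

```julia
# Hypothetical helper (not part of OCReract): turn an empty or
# whitespace-only OCR result into `nothing` so callers can branch on it.
function text_or_nothing(res::AbstractString)
    cleaned = strip(res)
    return isempty(cleaned) ? nothing : String(cleaned)
end

# Stand-ins for run_tesseract results: one with text, one empty.
for res in ("Noisy image\nto test\n\f", "")
    text = text_or_nothing(res)
    if text === nothing
        @warn "OCR returned no text; check the log for Tesseract errors"
    else
        println(text)
    end
end
```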

Examples

julia> using Images;
julia> using OCReract;
julia> img_path = "/path/to/img.png";
julia> img = load(img_path);
julia> res_text = run_tesseract(img, psm=3, oem=1);
julia> println(strip(res_text));
OCReract.run_tesseract - Method
run_tesseract(input_path, output_path, extra_args...; kwargs...) -> Bool

Wrapper function to run Tesseract over an image stored on disk, and write the results to a given path. Errors / Warnings are reported through Logging, so no exceptions are thrown.

Arguments

  • input_path::String: Path to the image to be processed
  • output_path::String: Path to the text result to be written
  • extra_args::String...: Optional arguments to change the nature of the output (e.g., "tsv")

Keywords

  • lang::Union{String, Nothing}: Language to be configured in Tesseract (optional).
  • psm::Integer: Page segmentation modes (PSM):
    • psm=0: Orientation and script detection (OSD) only.
    • psm=1: Automatic page segmentation with OSD.
    • psm=2: Automatic page segmentation, but no OSD, or OCR.
    • psm=3: Fully automatic page segmentation, but no OSD. (Default)
    • psm=4: Assume a single column of text of variable sizes.
    • psm=5: Assume a single uniform block of vertically aligned text.
    • psm=6: Assume a single uniform block of text.
    • psm=7: Treat the image as a single text line.
    • psm=8: Treat the image as a single word.
    • psm=9: Treat the image as a single word in a circle.
    • psm=10: Treat the image as a single character.
    • psm=11: Sparse text. Find as much text as possible in no particular order.
    • psm=12: Sparse text with OSD.
    • psm=13: Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
  • oem::Integer: OCR Engine modes (OEM):
    • oem=0: Legacy engine only.
    • oem=1: Neural nets LSTM engine only. (Default)
    • oem=2: Legacy + LSTM engines.
    • oem=3: Tesseract's own default, based on what is available.
  • kwargs: Other key-value pairs to be sent to Tesseract command as "-c" config variables. You can check the options with tesseract --print-parameters.

Returns

  • Bool: indicating whether execution was successful or not
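
Since success is signalled by the returned Bool rather than by an exception, a batch job has to collect failures explicitly. The sketch below shows one way to do that. `ocr_batch` is a hypothetical helper that takes the OCR function as an argument, and `stub_ocr` is a stand-in so the example runs without Tesseract installed.

```julia
# Hypothetical helper (not part of OCReract). `ocr` is any function with the
# shape ocr(input_path, output_path) -> Bool, such as run_tesseract.
function ocr_batch(ocr, inputs, outdir::AbstractString)
    failed = String[]
    for input in inputs
        name = first(splitext(basename(input)))
        ocr(input, joinpath(outdir, name * ".txt")) || push!(failed, input)
    end
    return failed
end

# Stub standing in for run_tesseract: it "succeeds" only for .png paths.
stub_ocr(input, output) = endswith(input, ".png")

failed = ocr_batch(stub_ocr, ["a.png", "b.txt"], tempdir())
println("failed inputs: ", failed)
```

With OCReract loaded, the same helper could be called as `ocr_batch(run_tesseract, paths, outdir)`.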

Examples

julia> using OCReract;
julia> img_path = "/path/to/img.png";
julia> out_path = "/tmp/tesseract_result.txt";
julia> run_tesseract(img_path, out_path, psm=3, oem=1)