OCReract.jl
OCReract is a simple Julia wrapper for the well-known Tesseract OCR engine. It is intended to be a very simple package with two goals:
- In disk: Run the tesseract command from a Julia session to load an image from disk and write the results to a text file.
- In memory: Process an image loaded in memory and get the OCR results as a string in a Julia session.
Installation
The Tesseract OCR engine must be installed manually. On Ubuntu, this may be as simple as
$ sudo apt-get install -y tesseract-ocr
but the installation instructions are the authoritative source.
The Julia wrapper can be installed using the Julia package manager. From the Julia REPL, type ] to enter the Pkg REPL mode and run:
pkg> add OCReract
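Since the wrapper runs the tesseract command, it can help to confirm that the executable is visible from Julia before calling the package. The following is a minimal sketch, assuming the binary is looked up on PATH; the helper names find_tesseract and tesseract_available are hypothetical and not part of OCReract.

```julia
# Hypothetical helpers (not part of OCReract): check that `tesseract` is on PATH.

# Returns the full path of the executable, or `nothing` when it cannot be found.
find_tesseract() = Sys.which("tesseract")

function tesseract_available()
    path = find_tesseract()
    if path === nothing
        @warn "tesseract executable not found on PATH; install Tesseract OCR first"
        return false
    end
    @info "Found tesseract" path
    return true
end

tesseract_available()
```

If this returns false, installing Tesseract (or fixing PATH) is probably the first thing to check.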
Usage
In this simple example, we will process a sample image (test/files/noisy.png in the repository) through the two options mentioned above:
In disk
Let's execute run_tesseract to process the image from the repository's test folder, and then cat the resulting text file.
julia> using OCReract
julia> img_path = "test/files/noisy.png";
julia> res_path = "/tmp/res.txt";
julia> run_tesseract(img_path, res_path);
julia> read(`cat $res_path`, String)
"Noisy image\nto test\nOCReract.jl\n\f"
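The result above ends with a newline and a form feed ("\n\f"). As a small sketch of post-processing, the result file can also be read directly with read instead of shelling out to cat, and then cleaned with strip. The helper name read_ocr_lines is hypothetical, and the demo file below is written by hand to mirror the output shown above, so the sketch runs without Tesseract.

```julia
# Hypothetical helper (not part of OCReract): read an OCR result file
# and return its non-empty lines.
function read_ocr_lines(path::AbstractString)
    raw = read(path, String)     # whole file as a String, no external `cat` needed
    text = strip(raw)            # strip drops surrounding whitespace, including '\n' and '\f'
    return [String(l) for l in split(text, '\n') if !isempty(strip(l))]
end

# Demo on a hand-written file that mirrors the result shown above.
demo_path = tempname()
write(demo_path, "Noisy image\nto test\nOCReract.jl\n\f")
lines = read_ocr_lines(demo_path)
println(lines)
```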
In memory
OCReract uses the JuliaImages ecosystem to process images in memory. So, the image should be loaded with the Images module (or the lighter-weight combination of ImageCore and FileIO) and then passed to run_tesseract to retrieve the result as a String.
julia> using Images
julia> using OCReract
julia> img_path = "https://raw.githubusercontent.com/leferrad/OCReract.jl/master/test/files/noisy.png";
julia> img = load(download(img_path));
julia> res_text = run_tesseract(img);
julia> println(strip(res_text))
Noisy image
to test
OCReract.jl
API Reference
OCReract.run_tesseract
Method: run_tesseract(image, extra_args...; kwargs...) -> String
Function to run Tesseract over an image in memory and get the results as a String. Errors and warnings are reported through Logging, so no exceptions are thrown.
Arguments
- image: Image to be processed, in a format compatible with the Images module.
- extra_args::String...: Optional arguments to change the nature of the output (e.g., "tsv").
Keywords
- lang::Union{String, Nothing}: Language to be configured in Tesseract (optional).
- psm::Integer: Page segmentation modes (PSM):
  - psm=0: Orientation and script detection (OSD) only.
  - psm=1: Automatic page segmentation with OSD.
  - psm=2: Automatic page segmentation, but no OSD, or OCR.
  - psm=3: Fully automatic page segmentation, but no OSD. (Default)
  - psm=4: Assume a single column of text of variable sizes.
  - psm=5: Assume a single uniform block of vertically aligned text.
  - psm=6: Assume a single uniform block of text.
  - psm=7: Treat the image as a single text line.
  - psm=8: Treat the image as a single word.
  - psm=9: Treat the image as a single word in a circle.
  - psm=10: Treat the image as a single character.
  - psm=11: Sparse text. Find as much text as possible in no particular order.
  - psm=12: Sparse text with OSD.
  - psm=13: Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
- oem::Integer: OCR Engine modes (OEM):
  - oem=0: Legacy engine only.
  - oem=1: Neural nets LSTM engine only. (Default)
  - oem=2: Legacy + LSTM engines.
  - oem=3: Default, based on what is available.
- kwargs: Other key-value pairs to be sent to the Tesseract command as "-c" config variables. You can check the available options with tesseract --print-parameters.
Returns
- String: the extracted text, or an empty string if an error occurs.
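Because failures surface as log messages and an empty return value rather than exceptions, a caller may want to collect the log output next to the result. The sketch below uses the standard Logging library for that. It assumes the package emits its messages through the usual logging macros; fake_failing_ocr is a stand-in written for this example (not run_tesseract), so the code runs without Tesseract.

```julia
using Logging

# Run `f()` while collecting log records of level Warn and above into a String.
function with_captured_logs(f)
    buf = IOBuffer()
    result = with_logger(f, SimpleLogger(buf, Logging.Warn))
    return result, String(take!(buf))
end

# Stand-in for a failing OCR call: logs an error and returns "" (placeholder only).
fake_failing_ocr() = (@error "OCR failed"; "")

text, logs = with_captured_logs(fake_failing_ocr)
if isempty(text)
    println("No text extracted; captured log output:\n", logs)
end
```

In real use, the call would look something like with_captured_logs(() -> run_tesseract(img)). Note that an empty string alone may not distinguish a failure from an image with no recognizable text, which is why inspecting the captured log can be useful.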
Examples
julia> using Images;
julia> using OCReract;
julia> img_path = "/path/to/img.png";
julia> img = load(img_path);
julia> res_text = run_tesseract(img, psm=3, oem=1);
julia> println(strip(res_text));
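The extra_args argument described above (for example "tsv") changes the output format. As a hedged sketch of what could be done with such output, the code below parses text in Tesseract's TSV column layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text) into simple records. It assumes that run_tesseract(img, "tsv") returns that TSV text as a String; the OcrWord type, the parse_tsv_words helper, and the sample values are invented for illustration only.

```julia
# Hypothetical record type for one recognized word (not part of OCReract).
struct OcrWord
    text::String
    conf::Float64
    left::Int
    top::Int
    width::Int
    height::Int
end

# Parse Tesseract-style TSV text, keeping only rows that carry non-empty text.
function parse_tsv_words(tsv::AbstractString)
    rows = filter(!isempty, split(tsv, '\n'))
    header = split(rows[1], '\t')
    col = Dict(String(name) => i for (i, name) in enumerate(header))
    words = OcrWord[]
    for row in rows[2:end]
        fields = split(row, '\t')
        length(fields) == length(header) || continue   # skip malformed rows
        text = strip(fields[col["text"]])
        isempty(text) && continue                      # page/block/line rows have no text
        push!(words, OcrWord(String(text),
                             parse(Float64, fields[col["conf"]]),
                             parse(Int, fields[col["left"]]),
                             parse(Int, fields[col["top"]]),
                             parse(Int, fields[col["width"]]),
                             parse(Int, fields[col["height"]])))
    end
    return words
end

# Illustrative sample (made-up values), built with explicit tab separators.
sample_tsv = join([
    join(["level", "page_num", "block_num", "par_num", "line_num", "word_num",
          "left", "top", "width", "height", "conf", "text"], '\t'),
    join(["1", "1", "0", "0", "0", "0", "0", "0", "200", "100", "-1", ""], '\t'),
    join(["5", "1", "1", "1", "1", "1", "10", "12", "60", "20", "96", "Noisy"], '\t'),
    join(["5", "1", "1", "1", "1", "2", "80", "12", "62", "20", "95", "image"], '\t'),
], '\n')

words = parse_tsv_words(sample_tsv)
println([w.text for w in words])
```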
OCReract.run_tesseract
Method: run_tesseract(input_path, output_path, extra_args...; kwargs...) -> Bool
Wrapper function to run Tesseract over an image stored on disk and write the results to a given path. Errors and warnings are reported through Logging, so no exceptions are thrown.
Arguments
- input_path::String: Path to the image to be processed.
- output_path::String: Path to the text result to be written.
- extra_args::String...: Optional arguments to change the nature of the output (e.g., "tsv").
Keywords
- lang::Union{String, Nothing}: Language to be configured in Tesseract (optional).
- psm::Integer: Page segmentation modes (PSM):
  - psm=0: Orientation and script detection (OSD) only.
  - psm=1: Automatic page segmentation with OSD.
  - psm=2: Automatic page segmentation, but no OSD, or OCR.
  - psm=3: Fully automatic page segmentation, but no OSD. (Default)
  - psm=4: Assume a single column of text of variable sizes.
  - psm=5: Assume a single uniform block of vertically aligned text.
  - psm=6: Assume a single uniform block of text.
  - psm=7: Treat the image as a single text line.
  - psm=8: Treat the image as a single word.
  - psm=9: Treat the image as a single word in a circle.
  - psm=10: Treat the image as a single character.
  - psm=11: Sparse text. Find as much text as possible in no particular order.
  - psm=12: Sparse text with OSD.
  - psm=13: Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
- oem::Integer: OCR Engine modes (OEM):
  - oem=0: Legacy engine only.
  - oem=1: Neural nets LSTM engine only. (Default)
  - oem=2: Legacy + LSTM engines.
  - oem=3: Default, based on what is available.
- kwargs: Other key-value pairs to be sent to the Tesseract command as "-c" config variables. You can check the available options with tesseract --print-parameters.
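To make the keyword list above more concrete, here is a rough sketch of how such keywords could map onto a tesseract command line, using the CLI's -l, --psm, --oem and -c key=value options. This is not OCReract's actual implementation; the helper sketch_tesseract_cmd and its extra_args keyword are hypothetical, and the command is only built, never run.

```julia
# Hypothetical helper: NOT OCReract's implementation, only an illustration
# of how the documented keywords could translate into CLI arguments.
function sketch_tesseract_cmd(input, output_base; lang=nothing, psm=3, oem=1,
                              extra_args=String[], kwargs...)
    0 <= psm <= 13 || throw(ArgumentError("psm must be between 0 and 13"))
    0 <= oem <= 3 || throw(ArgumentError("oem must be between 0 and 3"))
    args = String[input, output_base]
    lang === nothing || append!(args, ["-l", lang])
    append!(args, ["--psm", string(psm), "--oem", string(oem)])
    for (key, value) in kwargs
        append!(args, ["-c", "$(key)=$(value)"])   # remaining keywords become -c variables
    end
    append!(args, extra_args)                      # e.g. "tsv" to change the output format
    return `tesseract $args`
end

cmd = sketch_tesseract_cmd("in.png", "out"; lang="eng", psm=6,
                           tessedit_char_whitelist="0123456789",
                           extra_args=["tsv"])
println(cmd)
```

Here tessedit_char_whitelist is one of the parameters listed by tesseract --print-parameters; any other parameter name would be passed through in the same way.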
Returns
- Bool: indicates whether execution was successful.
Examples
julia> using OCReract;
julia> img_path = "/path/to/img.png";
julia> out_path = "/tmp/tesseract_result.txt";
julia> run_tesseract(img_path, out_path, psm=3, oem=1)
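Since this method reports success as a Bool instead of throwing, a batch job can simply collect the inputs that failed. The following sketch shows one possible way to do that. The helper ocr_directory is hypothetical; it takes the OCR routine as its first argument so the example can run with a stub (stub_ocr, also invented here) when Tesseract is not installed.

```julia
# Hypothetical helper (not part of OCReract). `ocr` is any function of the form
# (input_path, output_path) -> Bool, the shape documented above.
function ocr_directory(ocr, in_dir::AbstractString, out_dir::AbstractString;
                       exts = (".png", ".jpg", ".jpeg", ".tif", ".tiff"))
    mkpath(out_dir)
    failed = String[]
    for name in sort(readdir(in_dir))
        base, ext = splitext(name)
        lowercase(ext) in exts || continue        # skip files that are not images
        ok = ocr(joinpath(in_dir, name), joinpath(out_dir, base * ".txt"))
        ok || push!(failed, name)                 # remember inputs that reported failure
    end
    return failed
end

# Stub standing in for the real OCR call, so the sketch runs anywhere.
stub_ocr(input, output) = endswith(input, "bad.png") ? false : (write(output, "stub text"); true)

in_dir = mktempdir()
out_dir = joinpath(mktempdir(), "out")
foreach(n -> touch(joinpath(in_dir, n)), ["good.png", "bad.png", "notes.md"])
failed = ocr_directory(stub_ocr, in_dir, out_dir)
println(failed)
```

With the package loaded, the stub would be replaced by run_tesseract itself, or by a closure such as (i, o) -> run_tesseract(i, o; psm=3, oem=1) to pass keywords.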