Neural spell checking using Transformers and Graph Neural Networks

This project is about detecting and correcting spelling errors using Transformers and Graph Neural Networks. Visit the documentation (which also includes this README) for more information on how to reproduce results and train your own models.

Installation

Clone the repository

git clone git@github.com:bastiscode/spell_check.git

Install from source (alternatively you can use Docker)

make install

Usage

There are three main ways to use this project:

  1. Command line

  2. Python API

  3. Web app

We recommend starting with the Web app via the Docker setup, which provides the best user experience overall.

Command line interfaces

After installation, the following four commands are available in your environment:

  1. nsec for neural spelling error correction

  2. nsed for neural spelling error detection

  3. ntr for neural tokenization repair

  4. nserver for running a spell checking server

By default, the first three commands read input from stdin, run their respective task on it line by line, and print their output line by line to stdout. The fourth command starts a spell checking server that gives you access to all models via a JSON API; it is used, for example, as the backend for the Web app. The server can be configured using a YAML file (see server_config/server.yaml for the configuration options).

Let’s look at some basic examples of how to use the command line tools.

Spelling error correction using nsec

# correct text by piping into nsec,
echo "This is an incorect sentense!" | nsec
cat path/to/file.txt | nsec

# by using the -c flag,
nsec -c "This is an incorect sentense!"

# by passing a file,
nsec -f path/to/file.txt

# or by starting an interactive session
nsec -i

# you can also use spelling correction to detect errors on word and sequence level
nsec -c "This is an incorect sentense!" --convert-output word_detections
nsec -c "This is an incorect sentense!" --convert-output sequence_detections

Spelling error detection using nsed

# detect errors by piping into nsed,
echo "This is an incorect sentense!" | nsed
cat path/to/file.txt | nsed

# by using the -d flag,
nsed -d "This is an incorect sentense!"

# by passing a file,
nsed -f path/to/file.txt

# or by starting an interactive session
nsed -i

# you can also use word level spelling error detections to detect sequence level errors
nsed -d "This is an incorect sentense!" --convert-sequence

Tokenization repair using ntr

# repair text by piping into ntr,
echo "Thisis an inc orect sentens e!" | ntr
cat path/to/file.txt | ntr

# by using the -r flag,
ntr -r "Thisis an inc orect sentens e!"

# by passing a file,
ntr -f path/to/file.txt

# or by starting an interactive session
ntr -i

You can also combine the ntr, nsed, and nsec commands in a variety of ways. Some examples are shown below.

# repair and detect
echo "Repi arand core ct tihs sen tens!" | ntr | nsed
# to view both the repaired text and the detections use
echo "Repi arand core ct tihs sen tens!" | ntr | nsed --sec-out

# repair and correct
echo "Repi arand core ct tihs sen tens!" | ntr | nsec

# repair and correct a file and save the output
ntr -f path/to/file.txt | nsec --progress -o path/to/output_file.txt

# repair, detect and correct
# (this pipeline uses the spelling error detection output
# to guide the spelling error correction model to correct only the misspelled words)
echo "Repi arand core ct tihs sen tens!" | ntr | nsed --sec-out | nsec --sed-in

# some detection and correction models (e.g. tokenization repair+, tokenization repair++, transformer with tokenization repair nmt)
# can natively deal with incorrect whitespacing in text, so there is no need to use ntr before them if you want to process
# text with whitespacing errors
nsed -d "core ct thissen tense!" -m "sed words:tokenization repair+" --sec-out
nsec -c "core ct thissen tense!" -m "tokenization repair++"
nsec -c "core ct thissen tense!" -m "transformer with tokenization repair nmt"

There are a few other command line options available for the nsec, nsed and ntr commands. Inspect them by passing the -h / --help flag to the commands.

Running a spell checking server using nserver

# start the spell checking server, we provide a default config at server_config/server.yaml
nserver -c path/to/config.yaml
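The server exposes the models via a JSON API; its exact routes and payload format are defined by the server implementation and config. As a rough sketch of how a client might query such a server from Python. Note that the /correct endpoint and the request payload below are hypothetical placeholders, not the documented nserver API; only the default port 44444 is taken from this README:

import json
import urllib.request

# NOTE: the endpoint path and payload format are hypothetical placeholders
request = urllib.request.Request(
    "http://localhost:44444/correct",  # hypothetical endpoint
    data=json.dumps({"text": "This is an incorect sentense!"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.load(response))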

Python API

We also provide a Python API for you to use spell checking models directly in code. Below are basic code examples of how to use the API. For the full documentation of all classes, methods, etc. provided by the Python API, see the nsc package documentation.

Spelling error correction

from nsc import SpellingErrorCorrector, get_available_spelling_error_correction_models

# show all spelling error correction models
print(get_available_spelling_error_correction_models())

# use a pretrained model
sec = SpellingErrorCorrector.from_pretrained()
# correct errors in text
correction = sec.correct_text("Tihs text has erors!")
print(correction)
# correct errors in file
corrections = sec.correct_file("path/to/file.txt")
print(corrections)
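The correction methods also accept a search strategy and a scoring configuration (both documented in the nsc package section below). For example, a minimal sketch using beam search with length-normalized scoring:

from nsc import BeamSearch, Score, SpellingErrorCorrector

sec = SpellingErrorCorrector.from_pretrained()
# decode with beam search instead of the default greedy search and
# normalize path scores by sequence length
correction = sec.correct_text(
    "Tihs text has erors!",
    search=BeamSearch(beam_width=5),
    score=Score(normalize_by_length=True, alpha=1.0),
)
print(correction)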

Spelling error detection

from nsc import SpellingErrorDetector, get_available_spelling_error_detection_models

# show all spelling error detection models
print(get_available_spelling_error_detection_models())

# use a pretrained model
sed = SpellingErrorDetector.from_pretrained()
# detect errors in text
detection = sed.detect_text("Tihs text has erors!")
print(detection)
# detect errors in file
detections = sed.detect_file("path/to/file.txt")
print(detections)
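The detection threshold can be adjusted to make the detector more or less conservative. A small sketch, assuming that a higher threshold flags fewer words:

from nsc import SpellingErrorDetector

sed = SpellingErrorDetector.from_pretrained()
# raise the detection threshold; a higher threshold should flag
# fewer words as misspelled
detections, output = sed.detect_text("Tihs text has erors!", threshold=0.8)
print(detections)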

Tokenization repair

from nsc import TokenizationRepairer, get_available_tokenization_repair_models

# show all tokenization repair models
print(get_available_tokenization_repair_models())

# use a pretrained model
tr = TokenizationRepairer.from_pretrained()
# repair tokenization in text
repaired_text = tr.repair_text("Ti hstext h aserors!")
print(repaired_text)
# repair tokenization in file
repaired_file = tr.repair_file("path/to/file.txt")
print(repaired_file)
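The three classes can be chained just like the command line tools. Below is a minimal sketch of the repair, detect, and correct pipeline from the command line examples above, where the word level detections guide the corrector:

from nsc import (
    SpellingErrorCorrector,
    SpellingErrorDetector,
    TokenizationRepairer,
)

tr = TokenizationRepairer.from_pretrained()
sed = SpellingErrorDetector.from_pretrained()
sec = SpellingErrorCorrector.from_pretrained()

text = "Repi arand core ct tihs sen tens!"
# first repair the whitespacing
repaired = tr.repair_text(text)
# then detect misspelled words; detect_text returns the detections
# together with the (possibly repaired) output text
detections, detected_text = sed.detect_text(repaired)
# finally, let the detections guide the corrector so that only the
# flagged words are corrected
correction = sec.correct_text(detected_text, detections=detections)
print(correction)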

Web app

[Web app screenshot]

The web app files can be found under webapp/build/web. You can host them using, for example, Python's built-in HTTP server:

python -m http.server -d webapp/build/web 8080

The web app expects a spell checking server running under the same hostname on port 44444 (we are currently looking into ways to set the hostname and port of the server endpoint at runtime). If you are running the web app locally, you can simply start the spell checking server using nserver:

nserver -c server_config/server.yaml

Then open your browser on port 8080 (e.g. http://localhost:8080) to use the web app.

Docker

This project can also be run using Docker. Inside the Docker container, both the command line interfaces and the Python API are available for you to use. You can also evaluate model predictions on benchmarks.

Build the Docker image:

make build_docker

Start a Docker container:

make run_docker

You can also pass additional Docker arguments to the make commands by specifying DOCKER_ARGS. For example, to access your GPUs inside the container, run make run_docker DOCKER_ARGS="--gpus all". As another example, to mount an additional directory inside the container, use make run_docker DOCKER_ARGS="-v /path/to/outside/directory:/path/to/container/directory".

We provide two convenience commands for starting a spell checking server or hosting the web app:

Start a spell checking server on port <outside_port>:

make run_docker_server DOCKER_ARGS="-p <outside_port>:44444"

Start the web app on port <outside_port> (also runs the spell checking server):

make run_docker_webapp DOCKER_ARGS="-p <outside_port>:8080"

Hint

If you build the Docker image on an AD Server, you probably want to use wharfer instead of Docker. To do that, call the make commands with the additional argument DOCKER_CMD=wharfer, e.g. make build_docker DOCKER_CMD=wharfer.

Note

The Docker setup is only intended for running the command line tools, the Python API, or the web app with pretrained or your own models and for evaluating benchmarks, not for training.

Note

Running the Docker container with GPU support assumes that you have the NVIDIA Container Toolkit installed.

Pretrained models

The following list shows all pretrained models available for the CLI, Python API, and web app, grouped by task; the default model per task is marked accordingly:

tokenization repair (ntr/TokenizationRepairer)

  • eo large arxiv with errors (default): Large-sized Transformer model (12 layers) that repairs sequences by predicting repair tokens for each character (ported from https://github.com/ad-freiburg/trt).

  • eo medium arxiv with errors: Medium-sized Transformer model (6 layers) that repairs sequences by predicting repair tokens for each character (ported from https://github.com/ad-freiburg/trt).

  • eo small arxiv with errors: Small-sized Transformer model (3 layers) that repairs sequences by predicting repair tokens for each character (ported from https://github.com/ad-freiburg/trt).

  • tokenization repair+: Same as eo medium arxiv with errors, available here for completeness.

  • tokenization repair++: Same as eo medium arxiv with errors, available here for completeness.

sed words (nsed/SpellingErrorDetector)

  • gnn+ (default): Attentional Graph Neural Network which processes language graphs with fully connected word nodes, word features, and fully connected sub-word cliques. Predicts spelling errors on word level using the word node representations.

  • gnn+ neuspell: Same as gnn+, but pretrained without Neuspell and BEA misspellings and finetuned on Neuspell training data.

  • transformer: Regular Transformer processing a sequence of sub-word tokens. Predicts spelling errors on word level using the aggregated sub-word representations per word.

  • transformer neuspell: Same as transformer, but pretrained without Neuspell and BEA misspellings and finetuned on Neuspell training data.

  • transformer+: Regular Transformer processing a sequence of sub-word tokens. Before predicting spelling errors, sub-word representations within a word are aggregated and enriched with word features to obtain word representations. Predicts spelling errors on word level using those word representations.

  • transformer+ neuspell: Same as transformer+, but pretrained without Neuspell and BEA misspellings and finetuned on Neuspell training data.

  • gnn: Attentional Graph Neural Network which processes language graphs with fully connected word nodes and fully connected sub-word cliques. Predicts spelling errors on word level using the word node representations.

  • gnn neuspell: Same as gnn, but pretrained without Neuspell and BEA misspellings and finetuned on Neuspell training data.

  • tokenization repair+: Transformer based model that detects errors in sequences by first correcting the tokenization and then detecting spelling errors for each word in the repaired text.

  • tokenization repair++: Transformer based model that detects errors in sequences by first correcting the tokenization and then detecting spelling errors for each word in the repaired text. Differs from tokenization repair+ in that it was also trained to correct spelling errors (it is also available in nsec).

sec (nsec/SpellingErrorCorrector)

  • transformer words nmt (default): Transformer model that corrects sequences by translating each word individually from misspelled to correct.

  • transformer words nmt neuspell: Same as transformer words nmt, but pretrained without Neuspell and BEA misspellings and finetuned on Neuspell training data.

  • tokenization repair++: Transformer based model that corrects sequences by first correcting the tokenization, then detecting spelling errors for each word in the repaired text, and finally translating every detected misspelled word to its corrected version.

  • transformer nmt: Transformer model that translates a sequence with spelling errors into a sequence without spelling errors.

  • transformer nmt neuspell: Same as transformer nmt, but pretrained without Neuspell and BEA misspellings and finetuned on Neuspell training data.

  • transformer with tokenization repair nmt: Transformer model that translates a sequence with spelling and tokenization errors into a sequence without spelling and tokenization errors. Differs from transformer nmt in that it tokenizes into characters and was trained on text with spelling and tokenization errors, whereas transformer nmt tokenizes into sub-words and was trained only on text with spelling errors.
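Each of these models can be selected by name, on the command line via the -m flag (see the examples above) and in the Python API via from_pretrained. A short sketch:

from nsc import SpellingErrorDetector

# load a specific pretrained model from the list above instead of the default
sed = SpellingErrorDetector.from_pretrained(task="sed words", model="transformer+")
detections, output = sed.detect_text("This is an incorect sentense!")
print(detections)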

Training and reproducing results

You can skip the following training section if you only want to use pretrained models and not train your own models.

Training

Before starting training you need to get the training data. Everything you need (preprocessed samples, tokenizer, dictionaries, etc.) can be found under /nfs/students/sebastian-walter/masters_thesis/data.

You also need to set the following two special environment variables:

# set the data directory
export NSC_DATA_DIR=/nfs/students/sebastian-walter/masters_thesis/data

# set the config directory (necessary to be able to compose
# the final training config from sub-configs)
export NSC_CONFIG_DIR=spell_checking/configs

Note

Of course you can also copy the training data folder to any other place you like and adjust NSC_DATA_DIR accordingly. But keep in mind that this folder is very large (> 1 TB).

After that you can train your own models using a training config. All of the training configs this project used to train models can be found in the spell_checking/configs directory.

You might have to further configure a training config by setting additional environment variables. Let’s look at an example where we want to train a spelling error detection Graph Neural Network. The config for this task looks like the following:

variant: ${from_file:variant/sed_words.yaml}
model: ${from_file:model/model_for_sed_words.yaml}
optimizer: ${from_file:optimizer/adamw.yaml}
lr_scheduler: ${from_file:lr_scheduler/step_with_0.05_warmup.yaml}

experiment_dir: ${oc.env:NSC_EXPERIMENT_DIR}
data_dir: ${oc.env:NSC_DATA_DIR}
datasets:
  - ${oc.env:NSC_DATA_DIR}/processed/wikidump_paragraphs_sed_words_and_sec
  - ${oc.env:NSC_DATA_DIR}/processed/bookcorpus_paragraphs_sed_words_and_sec
dataset_limits:
  - ${oc.env:NSC_DATA_LIMIT}
  - ${oc.env:NSC_DATA_LIMIT}
val_splits:
  - 2500
  - 2500

experiment_name: ${oc.env:NSC_EXPERIMENT_NAME,dummy}
epochs: ${oc.env:NSC_EPOCHS,1}
batch_max_length: ${oc.env:NSC_BATCH_MAX_LENGTH,4096}
bucket_span: ${oc.env:NSC_BUCKET_SPAN,4}
log_per_epoch: ${oc.env:NSC_LOG_PER_EPOCH,100}
eval_per_epoch: ${oc.env:NSC_EVAL_PER_EPOCH,4}
seed: 22
mixed_precision: ${oc.env:NSC_MIXED_PRECISION,true}

Values looking like ${from_file:<file_path>} refer to other config files, relative to NSC_CONFIG_DIR. When the training config is composed, the contents of the referenced config files replace these values.

Values looking like ${oc.env:<env_var_name>,<default>} refer to environment variables, with an optional default that is used if the environment variable is not set. If there is no default, you are required to set the environment variable; otherwise you will receive an error message.

In our example we need to set values for the environment variables NSC_DATA_LIMIT (can be used to limit the number of samples per training dataset) and NSC_EXPERIMENT_DIR (directory path where the logs and checkpoints will be saved). Once we have set these variables, we can start the training. Since the training script is written to support distributed training, we need to use torchrun to launch it:

# set the environment variables
export NSC_DATA_LIMIT=100000 # set data limit to 100,000 samples per dataset
export NSC_EXPERIMENT_DIR=experiments # directory path where the experiment will be saved

# to train locally / on a single node
torchrun --nnodes=1 nsc/train.py --config spell_checking/configs/sed_words.yaml

You can also resume training for an existing experiment if you had to abort training for some reason:

# resume training from latest checkpoint of an experiment
torchrun --nnodes=1 nsc/train.py --resume <path_to_experiment_directory>

As an alternative, you can set one of the NSC_CONFIG or NSC_RESUME environment variables and use the train.sh script to start training. This script additionally provides functionality to start distributed training on SLURM clusters. Training using this script looks something like this:

# set the environment variables
export NSC_DATA_LIMIT=100000 # set data limit to 100,000 samples per dataset
export NSC_EXPERIMENT_DIR=experiments # directory path where the experiment will be saved

## LOCAL training
# start new training run using a config
NSC_CONFIG=spell_checking/configs/sed_words.yaml spell_checking/scripts/train.sh

# resume training from latest checkpoint of an experiment
NSC_RESUME=<path_to_experiment_directory> spell_checking/scripts/train.sh

## SLURM training
# starting distributed training on a SLURM cluster using sbatch
# requires you to set the NSC_WORLD_SIZE environment variable (total number of GPUs used for training)
# if you e.g. want to train on 4 nodes with 2 GPUs each set NSC_WORLD_SIZE=8
NSC_CONFIG=spell_checking/configs/sed_words.yaml NSC_WORLD_SIZE=8 sbatch --nodes=4 --ntasks-per-node=2 --gres=gpu:2 spell_checking/scripts/train.sh

# if you are in an interactive SLURM session (started e.g. with srun)
# you probably want to train as if you are running locally, set NSC_FORCE_LOCAL=true and
# start training without sbatch
NSC_FORCE_LOCAL=true NSC_CONFIG=spell_checking/configs/sed_words.yaml spell_checking/scripts/train.sh

To retrain the models of this project, see the train_slurm_<task>.sh scripts in the scripts directory, which were used to train all models. These scripts do nothing more than set some environment variables and call the train.sh script.

Note

Using the train_slurm_<task>.sh scripts for training is only possible on a SLURM cluster since they call the train.sh script using SLURM's sbatch command.

Reproduce

We make all models needed to reproduce the results on the project's benchmarks available as pretrained models. All pretrained models can be accessed through the command line interfaces (nsec, nsed, ntr), the Python API, or the web app.

All benchmarks, as well as our results and the baseline results on them, can be found in the spell_checking/benchmarks directory. Every benchmark follows the same directory structure:

  • <task>/<benchmark_group>/<benchmark_split>/corrupt.txt

  • <task>/<benchmark_group>/<benchmark_split>/correct.txt

Here corrupt.txt is the input containing misspelled text and correct.txt is the groundtruth output. We provide an evaluation script that can be used to evaluate model predictions on a given benchmark. Alternatively, the web app also features a section where you can evaluate model predictions on benchmarks.

As an example, let's look at the steps necessary to evaluate our gnn+ model for word level spelling error detection on the wikidump realistic benchmark using the command line interface:

  1. Run the model on the benchmark:

# choose the model, pass the benchmark's corrupt file as input
# and save the predictions to an output file
nsed -m "sed words:gnn+" \
  -f spell_checking/benchmarks/test/sed_words/wikidump/realistic/corrupt.txt \
  -o gnn_plus_predictions.txt

  2. Evaluate the model predictions:

# arguments: benchmark type, input file, groundtruth file, prediction file
python spell_checking/benchmarks/scripts/evaluate.py \
  sed_words \
  spell_checking/benchmarks/test/sed_words/wikidump/realistic/corrupt.txt \
  spell_checking/benchmarks/test/sed_words/wikidump/realistic/correct.txt \
  gnn_plus_predictions.txt

Hint

By default, a pretrained model is downloaded as a zip file and extracted when you first use it. Since some models are quite large, this can take some time. To cut this time, all pretrained models can also be found as zip files in the directory /nfs/students/sebastian-walter/masters_thesis/zipped. If you set the environment variable NSC_DOWNLOAD_DIR to this directory, the models are loaded from there and do not need to be downloaded first. If you are running this project using Docker, you can mount the directory to the container's download directory by specifying an additional volume flag: make run_docker DOCKER_ARGS="-v /nfs/students/sebastian-walter/masters_thesis/zipped:/download".

Hint

To access the benchmarks if you are running this project with Docker you can mount the benchmark directory inside the Docker container using make run_docker DOCKER_ARGS="-v $(pwd)/spell_checking/benchmarks/test:/benchmarks". The Docker container also provides additional commands for evaluating benchmarks that are basically wrappers around the evaluation script mentioned above.

nsc package

class nsc.BeamSearch

Bases: nsc.api.sec.Search

Beam search: Maintain and expand the top beam_width paths.

__init__(beam_width=5)
Parameters

beam_width (int) –

Return type

None

beam_width: int = 5

class nsc.BestFirstSearch

Bases: nsc.api.sec.Search

Best first search: Always expand the highest scoring path out of all paths encountered so far.

__init__()
Return type

None

class nsc.GreedySearch

Bases: nsc.api.sec.Search

Greedy search: Always go to the highest scoring next node.

__init__()
Return type

None

class nsc.SampleSearch

Bases: nsc.api.sec.Search

Sample search: Go to a random node of the top_k highest scoring next nodes.

__init__(top_k=5)
Parameters

top_k (int) –

Return type

None

top_k: int = 5

class nsc.Score

Bases: object

Determines how paths during/after decoding are scored/rescored.

The default mode “log_likelihood” is to score a candidate using the token sequence log likelihood given as the sum of all token log probabilities. For some search methods we optionally support normalization by the token sequence length (so shorter paths are not preferred):

score = sum(log_probabilities) / (len(log_probabilities)^alpha)

Alpha here can be used to state a preference for shorter or longer sequences, if alpha > 1 longer sequences are preferred, if alpha < 1 shorter sequences are preferred, and if alpha = 0 the score equals the sum of the log probabilities.

Other supported modes are “dictionary” and “diff_from_input”: they restrict candidates to those that contain only dictionary words or to those that differ from the input text, respectively. For “dictionary” mode a prefix index must be specified, so that we are able to check whether a decoded text is a prefix of or equal to a dictionary word.
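As a toy illustration of the scoring formula above (plain Python, not part of the API):

import math

# three token log probabilities
log_probabilities = [math.log(p) for p in (0.9, 0.8, 0.7)]

alpha = 1.0
# alpha = 1.0 normalizes by the full sequence length,
# alpha = 0.0 recovers the plain sum of log probabilities
score = sum(log_probabilities) / (len(log_probabilities) ** alpha)
print(score)  # roughly -0.228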

__init__(normalize_by_length=True, alpha=1.0, mode='log_likelihood', prefix_index=None)
Parameters
  • normalize_by_length (bool) –

  • alpha (float) –

  • mode (str) –

  • prefix_index (Optional[nsc.data.index.PrefixIndex]) –

Return type

None

alpha: float = 1.0
mode: str = 'log_likelihood'
normalize_by_length: bool = True
prefix_index: Optional[nsc.data.index.PrefixIndex] = None

class nsc.SpellingErrorCorrector

Bases: nsc.api.utils._APIBase

Spelling error correction

Class to run spelling error correction models.

__init__(model_dir, device, **kwargs)

Spelling error correction constructor.

Do not use this explicitly. Use the static SpellingErrorCorrector.from_pretrained() and SpellingErrorCorrector.from_experiment() methods instead.

Parameters
  • model_dir (str) – directory of the model to load

  • device (Union[str, int]) – device to load the model in

  • kwargs (Dict[str, Any]) –

Return type

None

correct_file(input_file_path, output_file_path=None, detections=None, search=GreedySearch(), score=Score(normalize_by_length=True, alpha=1.0, mode='log_likelihood', prefix_index=None), batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=True, **kwargs)

Correct spelling errors in a file.

Parameters
  • input_file_path (str) – path to an input file, which will be corrected line by line

  • output_file_path (Optional[str]) – path to an output file, where corrected text will be saved line by line

  • detections (Optional[Union[str, List[List[int]]]]) – spelling error detections (from a SpellingErrorDetector) to guide the correction, can either be a path to a file containing detections or a list of lists of integers

  • search (nsc.api.sec.Search) – Search instance to determine the search method to use for decoding

  • score (nsc.api.sec.Score) – Score instance to determine how to score search paths during decoding

  • batch_size (int) – how many sequences to process at once

  • batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to be batch_max_length_factor * model_max_input_length, if a model e.g. has a max input length of 512 tokens and batch_max_length_factor is 4 then one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens (takes precedence over batch_size if specified)

  • sort_by_length (bool) – sort the inputs by length before processing them

  • show_progress (bool) – display progress bar

  • kwargs (Any) –

Return type

Optional[List[str]]

Returns: corrected file as list of strings if output_file_path is not specified else None

correct_text(inputs, detections=None, search=GreedySearch(), score=Score(normalize_by_length=True, alpha=1.0, mode='log_likelihood', prefix_index=None), batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=False, **kwargs)

Correct spelling errors in text.

Parameters
  • inputs (Union[str, List[str]]) – text to correct given as a single string or a list of strings

  • detections (Optional[Union[List[int], List[List[int]]]]) – spelling error detections (from a SpellingErrorDetector) to guide the correction, if inputs is a single str, detections must be a list of integers, otherwise if inputs is a list of strings, detections should be a list of lists of integers

  • search (nsc.api.sec.Search) – Search instance to determine the search method to use for decoding

  • score (nsc.api.sec.Score) – Score instance to determine how to score search paths during decoding

  • batch_size (int) – how many sequences to process at once

  • batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to be batch_max_length_factor * model_max_input_length, if a model e.g. has a max input length of 512 tokens and batch_max_length_factor is 4 then one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens (takes precedence over batch_size if specified)

  • sort_by_length (bool) – sort the inputs by length before processing them

  • show_progress (bool) – display progress bar

  • kwargs (Any) –

Return type

Union[str, List[str]]

Returns: corrected text as string or list of strings

static from_experiment(experiment_dir, device='cuda')

Create a new NSC instance using your own experiment.

Parameters
  • experiment_dir (str) – path to the experiment directory

  • device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)

Return type

nsc.api.sec.SpellingErrorCorrector

Returns: NSC instance

static from_pretrained(task='sec', model='transformer words nmt', device='cuda', download_dir=None, cache_dir=None, force_download=False)

Create a new NSC instance using a pretrained model.

Parameters
  • task (str) – name of the task

  • model (str) – name of the pretrained model

  • device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)

  • download_dir (Optional[str]) – directory where the compressed model will be downloaded to

  • cache_dir (Optional[str]) – directory where the downloaded, compressed model will be extracted to

  • force_download (bool) – force download the pretrained model again even if it already exists in the cache_dir

Return type

nsc.api.sec.SpellingErrorCorrector

Returns: NSC instance
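For example, a minimal usage sketch loading the default correction model on the CPU:

from nsc import SpellingErrorCorrector

# the arguments shown are the documented defaults, except for the device
sec = SpellingErrorCorrector.from_pretrained(
    task="sec", model="transformer words nmt", device="cpu"
)
print(sec.correct_text("Tihs text has erors!"))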

property mixed_precision_enabled: bool

Check if mixed precision is enabled (precision is set to something else than fp32).

Returns: bool whether mixed precision is enabled

property model_name: str

Gives the name of the NSC model in use.

Returns: name of the model

set_precision(precision)

Set the inference precision to use. The default is standard 32-bit full precision. 16-bit floats or 16-bit brain floats should only be used when you have a supported NVIDIA GPU. When running on CPU, fp16 will be overwritten with bfp16 since only bfp16 is supported on CPU for now.

Parameters

precision (str) – precision identifier (one of {fp32, fp16, bfp16})

Return type

None

Returns: None

suggest(text, n, detections=None, **kwargs)

Generate suggestions for possible corrections of a text.

Note that if you use this with a model that translates each word individually (e.g. transformer words nmt), the i-th suggestion string is built by joining the i-th suggestions for all words (they do not necessarily form a useful sentence together). To obtain n suggestions for each word in the text you can simply split every suggestion string on whitespace.

Parameters
  • text (str) – text to generate suggestions for

  • n (int) – number of suggestions to generate

  • detections (Optional[List[int]]) – optional detections to exclude some words

  • kwargs (Any) –

Return type

List[str]

Returns: suggestions as list of strings
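A small usage sketch of the joining behavior described above:

from nsc import SpellingErrorCorrector

sec = SpellingErrorCorrector.from_pretrained()
# get 3 suggestion strings for the whole text
suggestions = sec.suggest("Tihs text has erors!", n=3)
# for word-level models, splitting a suggestion on whitespace
# recovers that suggestion for every individual word
for suggestion in suggestions:
    print(suggestion.split())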

property task_name: str

Check which NSC task is run.

Returns: name of the task

to(device)

Move the model to a different device.

Parameters

device (Union[str, int]) – device specifier (e.g. “cuda”, “cpu” or integer denoting GPU device index)

Return type

nsc.api.utils._APIBase

Returns: self

class nsc.SpellingErrorDetector

Bases: nsc.api.utils._APIBase

Spelling error detection

Class to run spelling error detection models.

__init__(model_dir, device, **kwargs)

Spelling error detection constructor.

Do not use this explicitly. Use the static SpellingErrorDetector.from_pretrained() and SpellingErrorDetector.from_experiment() methods instead.

Parameters
  • model_dir (str) – directory of the model to load

  • device (Union[str, int]) – device to load the model in

  • kwargs (Dict[str, Any]) –

Return type

None

detect_file(input_file_path, output_file_path=None, threshold=0.5, batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=True, **kwargs)

Detect spelling errors in a file.

Parameters
  • input_file_path (str) – path to an input file, which will be checked for spelling errors line by line

  • output_file_path (Optional[str]) – path to an output file, where the detections will be saved line by line

  • threshold (float) – set detection threshold (0 < threshold < 1)

  • batch_size (int) – how many sequences to process at once

  • batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to be batch_max_length_factor * model_max_input_length, if a model e.g. has a max input length of 512 tokens and batch_max_length_factor is 4 then one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens (takes precedence over batch_size if specified)

  • sort_by_length (bool) – sort the inputs by length before processing them

  • show_progress (bool) – display progress bar

  • kwargs (Any) –

Return type

Optional[Tuple[List[List[int]], List[str]]]

Returns: tuple of detections as list of lists of integers and output strings as list of strings if output_file_path is not specified, else None

detect_text(inputs, threshold=0.5, batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=False, **kwargs)

Detect spelling errors in text.

Parameters
  • inputs (Union[str, List[str]]) – text to check for errors given as a single string or a list of strings

  • threshold (float) – set detection threshold (0 < threshold < 1)

  • batch_size (int) – how many sequences to process at once

  • batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to be batch_max_length_factor * model_max_input_length, if a model e.g. has a max input length of 512 tokens and batch_max_length_factor is 4 then one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens (takes precedence over batch_size if specified)

  • sort_by_length (bool) – sort the inputs by length before processing them

  • show_progress (bool) – display progress bar

  • kwargs (Any) –

Return type

Tuple[Union[List[int], List[List[int]]], Union[str, List[str]]]

Returns: tuple of detections as list of integers or list of lists of integers and output strings as str or list of strings

static from_experiment(experiment_dir, device='cuda')

Create a new NSC instance using your own experiment.

Parameters
  • experiment_dir (str) – path to the experiment directory

  • device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)

Return type

nsc.api.sed.SpellingErrorDetector

Returns: NSC instance

static from_pretrained(task='sed words', model='gnn+', device='cuda', download_dir=None, cache_dir=None, force_download=False)

Create a new NSC instance using a pretrained model.

Parameters
  • task (str) – name of the task

  • model (str) – name of the pretrained model

  • device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)

  • download_dir (Optional[str]) – directory where the compressed model will be downloaded to

  • cache_dir (Optional[str]) – directory where the downloaded, compressed model will be extracted to

  • force_download (bool) – force download the pretrained model again even if it already exists in the cache_dir

Return type

nsc.api.sed.SpellingErrorDetector

Returns: NSC instance

property mixed_precision_enabled: bool

Check if mixed precision is enabled (precision is set to something else than fp32).

Returns: bool whether mixed precision is enabled

property model_name: str

Gives the name of the NSC model in use.

Returns: name of the model

set_precision(precision)

Set the inference precision to use. The default is standard 32-bit full precision. 16-bit floats or 16-bit brain floats should only be used when you have a supported NVIDIA GPU. When running on CPU, fp16 will be overwritten with bfp16 since only bfp16 is supported on CPU for now.

Parameters

precision (str) – precision identifier (one of {fp32, fp16, bfp16})

Return type

None

Returns: None

property task_name: str

Check which NSC task is run.

Returns: name of the task

to(device)

Move the model to a different device.

Parameters

device (Union[str, int]) – device specifier (e.g. “cuda”, “cpu” or integer denoting GPU device index)

Return type

nsc.api.utils._APIBase

Returns: self

class nsc.TokenizationRepairer

Bases: nsc.api.utils._APIBase

Tokenization repair

Class to run tokenization repair models.

__init__(model_dir, device, **kwargs)

Tokenization repair constructor.

Do not use this explicitly. Use the static TokenizationRepairer.from_pretrained() and TokenizationRepairer.from_experiment() methods instead.

Parameters
  • model_dir (str) – directory of the model to load

  • device (Union[str, int]) – device to load the model in

  • kwargs (Dict[str, Any]) –

Return type

None

static from_experiment(experiment_dir, device='cuda')

Create a new NSC instance using your own experiment.

Parameters
  • experiment_dir (str) – path to the experiment directory

  • device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)

Return type

nsc.api.tokenization_repair.TokenizationRepairer

Returns: NSC instance

static from_pretrained(task='tokenization repair', model='eo large arxiv with errors', device='cuda', download_dir=None, cache_dir=None, force_download=False)

Create a new NSC instance using a pretrained model.

Parameters
  • task (str) – name of the task

  • model (str) – name of the pretrained model

  • device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)

  • download_dir (Optional[str]) – directory where the compressed model will be downloaded to

  • cache_dir (Optional[str]) – directory where the downloaded, compressed model will be extracted to

  • force_download (bool) – force download the pretrained model again even if it already exists in the cache_dir

Return type

nsc.api.tokenization_repair.TokenizationRepairer

Returns: NSC instance

property mixed_precision_enabled: bool

Check if mixed precision is enabled (precision is set to something else than fp32).

Returns: bool whether mixed precision is enabled

property model_name: str

Gives the name of the NSC model in use.

Returns: name of the model

repair_file(input_file_path, output_file_path=None, batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=True)

Repair whitespaces in a file.

Parameters
  • input_file_path (str) – path to an input file, which will be repaired line by line

  • output_file_path (Optional[str]) – path to an output file, where repaired text will be saved line by line

  • batch_size (int) – how many sequences to process at once

  • batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to be batch_max_length_factor * model_max_input_length, if a model e.g. has a max input length of 512 tokens and batch_max_length_factor is 4 then one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens (takes precedence over batch_size if specified)

  • sort_by_length (bool) – sort the inputs by length before processing them

  • show_progress (bool) – display progress bar

Return type

Optional[List[str]]

Returns: repaired file as list of strings if output_file_path is not specified else None

repair_text(inputs, batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=False)

Repair whitespaces in text.

Parameters
  • inputs (Union[str, List[str]]) – text to repair given as a single string or a list of strings

  • batch_size (int) – how many sequences to process at once

  • batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to be batch_max_length_factor * model_max_input_length, if a model e.g. has a max input length of 512 tokens and batch_max_length_factor is 4 then one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens (takes precedence over batch_size if specified)

  • sort_by_length (bool) – sort the inputs by length before processing them

  • show_progress (bool) – display progress bar

Return type

Union[str, List[str]]

Returns: repaired text as string or list of strings

set_precision(precision)

Set the inference precision to use. The default is standard 32-bit full precision. 16-bit floats or 16-bit brain floats should only be used when you have a supported NVIDIA GPU. When running on CPU, fp16 will be overwritten with bfp16 since only bfp16 is supported on CPU for now.

Parameters

precision (str) – precision identifier (one of {fp32, fp16, bfp16})

Return type

None

Returns: None

property task_name: str

Check which NSC task is run.

Returns: name of the task

to(device)

Move the model to a different device.

Parameters

device (Union[str, int]) – device specifier (e.g. “cuda”, “cpu” or integer denoting GPU device index)

Return type

nsc.api.utils._APIBase

Returns: self

nsc.get_available_spelling_error_correction_models()

Get available spelling error correction models

Returns: list of spelling error correction model infos each containing the task name, model name and a short description

Return type

List[nsc.api.utils.ModelInfo]

nsc.get_available_spelling_error_detection_models()

Get available spelling error detection models

Returns: list of spelling error detection model infos each containing the task name, model name and a short description

Return type

List[nsc.api.utils.ModelInfo]

nsc.get_available_tokenization_repair_models()

Get available tokenization repair models

Returns: list of tokenization repair model infos each containing the task name, model name and a short description

Return type

List[nsc.api.utils.ModelInfo]