Neural spell checking using Transformers and Graph Neural Networks

This project is about detecting and correcting spelling errors using Transformers and Graph Neural Networks. Visit the documentation (which also includes this README) for more information on how to reproduce results and train your own models.

Installation

Clone the repository

git clone git@github.com:bastiscode/spell_check.git

Install from source (alternatively you can use Docker)

make install

Usage

There are three main ways to use this project:

  1. Command line

  2. Python API

  3. Web app

We recommend starting with the Web app via the Docker setup, which provides the best user experience overall.

Command line interfaces

After installation, the following four commands are available in your environment:

  1. nsec for neural spelling error correction

  2. nsed for neural spelling error detection

  3. ntr for neural tokenization repair

  4. nserver for running a spell checking server

By default, the first three commands read input from stdin, run their respective task on it line by line, and print their output line by line to stdout. The fourth command starts a spell checking server that gives you access to all models via a JSON API; it is used, for example, as the backend for the Web app. The server can be configured using a YAML file (see server_config/server.yaml for the configuration options).

Let’s look at some basic examples of how to use the command line tools.

Spelling error correction using nsec

# correct text by piping into nsec,
echo "This is an incorect sentense!" | nsec
cat path/to/file.txt | nsec

# by using the -c flag,
nsec -c "This is an incorect sentense!"

# by passing a file,
nsec -f path/to/file.txt

# or by starting an interactive session
nsec -i

# you can also use spelling correction to detect errors on word and sequence level
nsec -c "This is an incorect sentense!" --convert-output word_detections
nsec -c "This is an incorect sentense!" --convert-output sequence_detections

Spelling error detection using nsed

# detect errors by piping into nsed,
echo "This is an incorect sentense!" | nsed
cat path/to/file.txt | nsed

# by using the -d flag,
nsed -d "This is an incorect sentense!"

# by passing a file,
nsed -f path/to/file.txt

# or by starting an interactive session
nsed -i

# you can also use word level spelling error detections to detect sequence level errors
nsed -d "This is an incorect sentense!" --convert-sequence

Tokenization repair using ntr

# repair text by piping into ntr,
echo "Thisis an inc orect sentens e!" | ntr
cat path/to/file.txt | ntr

# by using the -r flag,
ntr -r "Thisis an inc orect sentens e!"

# by passing a file,
ntr -f path/to/file.txt

# or by starting an interactive session
ntr -i

You can also combine the ntr, nsed, and nsec commands in a variety of ways. Some examples are shown below.

# repair and detect
echo "Repi arand core ct tihs sen tens!" | ntr | nsed
# to view both the repaired text and the detections use
echo "Repi arand core ct tihs sen tens!" | ntr | nsed --sec-out

# repair and correct
echo "Repi arand core ct tihs sen tens!" | ntr | nsec

# repair and correct a file and save the output
ntr -f path/to/file.txt | nsec --progress -o path/to/output_file.txt

# repair, detect and correct
# (this pipeline uses the spelling error detection output
# to guide the spelling error correction model to correct only the misspelled words)
echo "Repi arand core ct tihs sen tens!" | ntr | nsed --sec-out | nsec --sed-in

# some detection and correction models (e.g. tokenization repair+, tokenization repair++, transformer with tokenization repair nmt)
# can natively deal with incorrect whitespacing in text, so there is no need to use ntr before them if you want to process
# text with whitespacing errors
nsed -d "core ct thissen tense!" -m "sed words:tokenization repair+" --sec-out
nsec -c "core ct thissen tense!" -m "tokenization repair++"
nsec -c "core ct thissen tense!" -m "transformer with tokenization repair nmt"

There are a few other command line options available for the nsec, nsed and ntr commands. Inspect them by passing the -h / --help flag to the commands.

Running a spell checking server using nserver

# start the spell checking server, we provide a default config at server_config/server.yaml
nserver -c path/to/config.yaml
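The server exposes the models via a JSON API; its exact routes and payload format are defined by the server implementation and config. As a rough sketch of how a client might query such a server from Python. Note that the /correct endpoint and the request payload below are hypothetical placeholders, not the documented nserver API; only the default port 44444 is taken from this README:

import json
import urllib.request

# NOTE: the endpoint path and payload format are hypothetical placeholders
request = urllib.request.Request(
    "http://localhost:44444/correct",  # hypothetical endpoint
    data=json.dumps({"text": "This is an incorect sentense!"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.load(response))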

Python API

We also provide a Python API for you to use spell checking models directly in code. Below are basic code examples of how to use the API. For the full documentation of all classes, methods, etc. provided by the Python API, see the nsc package documentation.

Spelling error correction

from nsc import SpellingErrorCorrector, get_available_spelling_error_correction_models

# show all spelling error correction models
print(get_available_spelling_error_correction_models())

# use a pretrained model
sec = SpellingErrorCorrector.from_pretrained()
# correct errors in text
correction = sec.correct_text("Tihs text has erors!")
print(correction)
# correct errors in file
corrections = sec.correct_file("path/to/file.txt")
print(corrections)
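The correction methods also accept a search strategy and a scoring configuration (both documented in the nsc package section below). For example, a minimal sketch using beam search with length-normalized scoring:

from nsc import BeamSearch, Score, SpellingErrorCorrector

sec = SpellingErrorCorrector.from_pretrained()
# decode with beam search instead of the default greedy search and
# normalize path scores by sequence length
correction = sec.correct_text(
    "Tihs text has erors!",
    search=BeamSearch(beam_width=5),
    score=Score(normalize_by_length=True, alpha=1.0),
)
print(correction)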

Spelling error detection

from nsc import SpellingErrorDetector, get_available_spelling_error_detection_models

# show all spelling error detection models
print(get_available_spelling_error_detection_models())

# use a pretrained model
sed = SpellingErrorDetector.from_pretrained()
# detect errors in text
detection = sed.detect_text("Tihs text has erors!")
print(detection)
# detect errors in file
detections = sed.detect_file("path/to/file.txt")
print(detections)
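The detection threshold can be adjusted to make the detector more or less conservative. A small sketch, assuming that a higher threshold flags fewer words:

from nsc import SpellingErrorDetector

sed = SpellingErrorDetector.from_pretrained()
# raise the detection threshold; a higher threshold should flag
# fewer words as misspelled
detections, output = sed.detect_text("Tihs text has erors!", threshold=0.8)
print(detections)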

Tokenization repair

from nsc import TokenizationRepairer, get_available_tokenization_repair_models

# show all tokenization repair models
print(get_available_tokenization_repair_models())

# use a pretrained model
tr = TokenizationRepairer.from_pretrained()
# repair tokenization in text
repaired_text = tr.repair_text("Ti hstext h aserors!")
print(repaired_text)
# repair tokenization in file
repaired_file = tr.repair_file("path/to/file.txt")
print(repaired_file)
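The three classes can be chained just like the command line tools. Below is a minimal sketch of the repair, detect, and correct pipeline from the command line examples above, where the word level detections guide the corrector:

from nsc import (
    SpellingErrorCorrector,
    SpellingErrorDetector,
    TokenizationRepairer,
)

tr = TokenizationRepairer.from_pretrained()
sed = SpellingErrorDetector.from_pretrained()
sec = SpellingErrorCorrector.from_pretrained()

text = "Repi arand core ct tihs sen tens!"
# first repair the whitespacing
repaired = tr.repair_text(text)
# then detect misspelled words; detect_text returns the detections
# together with the (possibly repaired) output text
detections, detected_text = sed.detect_text(repaired)
# finally, let the detections guide the corrector so that only the
# flagged words are corrected
correction = sec.correct_text(detected_text, detections=detections)
print(correction)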

Web app

[Web app screenshot]

The web app files can be found under webapp/build/web. You can host them using, for example, Python's built-in HTTP server:

python -m http.server -d webapp/build/web 8080

The web app expects a spell checking server running under the same hostname on port 44444 (we are currently looking into ways to set the hostname and port of the server endpoint at runtime). If you are running the web app locally, you can simply start the spell checking server using nserver:

nserver -c server_config/server.yaml

Then open your browser on port 8080 (e.g. http://localhost:8080) to use the web app.

Docker

This project can also be run using Docker. Inside the Docker container, both the command line interfaces and the Python API are available for you to use. You can also evaluate model predictions on benchmarks.

Build the Docker image:

make build_docker

Start a Docker container:

make run_docker

You can also pass additional Docker arguments to the make commands by specifying DOCKER_ARGS. For example, to access your GPUs inside the container, run make run_docker DOCKER_ARGS="--gpus all". As another example, to mount an additional directory inside the container, use make run_docker DOCKER_ARGS="-v /path/to/outside/directory:/path/to/container/directory".

We provide two convenience commands for starting a spell checking server or hosting the web app:

Start a spell checking server on port <outside_port>:

make run_docker_server DOCKER_ARGS="-p <outside_port>:44444"

Start the web app on port <outside_port> (also runs the spell checking server):

make run_docker_webapp DOCKER_ARGS="-p <outside_port>:8080"

Hint

If you build the Docker image on an AD Server, you probably want to use wharfer instead of Docker. To do that, call the make commands with the additional argument DOCKER_CMD=wharfer, e.g. make build_docker DOCKER_CMD=wharfer.

Note

The Docker setup is only intended for running the command line tools, the Python API, or the web app with pretrained or your own models and for evaluating benchmarks, not for training.

Note

Running the Docker container with GPU support assumes that you have the NVIDIA Container Toolkit installed.

Pretrained models

The following list shows all pretrained models available for the CLI, Python API, and web app, grouped by task; the default model per task is marked accordingly:

tokenization repair (ntr/TokenizationRepairer)

  • eo large arxiv with errors (default): Large-sized Transformer model (12 layers) that repairs sequences by predicting repair tokens for each character (ported from https://github.com/ad-freiburg/trt).

  • eo medium arxiv with errors: Medium-sized Transformer model (6 layers) that repairs sequences by predicting repair tokens for each character (ported from https://github.com/ad-freiburg/trt).

  • eo small arxiv with errors: Small-sized Transformer model (3 layers) that repairs sequences by predicting repair tokens for each character (ported from https://github.com/ad-freiburg/trt).

  • tokenization repair+: Same as eo medium arxiv with errors, available here for completeness.

  • tokenization repair++: Same as eo medium arxiv with errors, available here for completeness.

sed words (nsed/SpellingErrorDetector)

  • gnn+ (default): Attentional Graph Neural Network which processes language graphs with fully connected word nodes, word features, and fully connected sub-word cliques. Predicts spelling errors on word level using the word node representations.

  • gnn+ neuspell: Same as gnn+, but pretrained without Neuspell and BEA misspellings and finetuned on Neuspell training data.

  • transformer: Regular Transformer processing a sequence of sub-word tokens. Predicts spelling errors on word level using the aggregated sub-word representations per word.

  • transformer neuspell: Same as transformer, but pretrained without Neuspell and BEA misspellings and finetuned on Neuspell training data.

  • transformer+: Regular Transformer processing a sequence of sub-word tokens. Before predicting spelling errors, sub-word representations within a word are aggregated and enriched with word features to obtain word representations. Predicts spelling errors on word level using those word representations.

  • transformer+ neuspell: Same as transformer+, but pretrained without Neuspell and BEA misspellings and finetuned on Neuspell training data.

  • gnn: Attentional Graph Neural Network which processes language graphs with fully connected word nodes and fully connected sub-word cliques. Predicts spelling errors on word level using the word node representations.

  • gnn neuspell: Same as gnn, but pretrained without Neuspell and BEA misspellings and finetuned on Neuspell training data.

  • tokenization repair+: Transformer based model that detects errors in sequences by first correcting the tokenization and then detecting spelling errors for each word in the repaired text.

  • tokenization repair++: Transformer based model that detects errors in sequences by first correcting the tokenization and then detecting spelling errors for each word in the repaired text. Differs from tokenization repair+ in that it was also trained to correct spelling errors (it is also available in nsec).

sec (nsec/SpellingErrorCorrector)

  • transformer words nmt (default): Transformer model that corrects sequences by translating each word individually from misspelled to correct.

  • transformer words nmt neuspell: Same as transformer words nmt, but pretrained without Neuspell and BEA misspellings and finetuned on Neuspell training data.

  • tokenization repair++: Transformer based model that corrects sequences by first correcting the tokenization, then detecting spelling errors for each word in the repaired text, and finally translating every detected misspelled word to its corrected version.

  • transformer nmt: Transformer model that translates a sequence with spelling errors into a sequence without spelling errors.

  • transformer nmt neuspell: Same as transformer nmt, but pretrained without Neuspell and BEA misspellings and finetuned on Neuspell training data.

  • transformer with tokenization repair nmt: Transformer model that translates a sequence with spelling and tokenization errors into a sequence without spelling and tokenization errors. Differs from transformer nmt in that it tokenizes into characters and was trained on text with spelling and tokenization errors, whereas transformer nmt tokenizes into sub-words and was trained only on text with spelling errors.
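Each of these models can be selected by name, on the command line via the -m flag (see the examples above) and in the Python API via from_pretrained. A short sketch:

from nsc import SpellingErrorDetector

# load a specific pretrained model from the list above instead of the default
sed = SpellingErrorDetector.from_pretrained(task="sed words", model="transformer+")
detections, output = sed.detect_text("This is an incorect sentense!")
print(detections)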

Training and reproducing results

You can skip the following training section if you only want to use pretrained models and not train your own models.

Training

Before starting training you need to get the training data. Everything you need (preprocessed samples, tokenizer, dictionaries, etc.) can be found under /nfs/students/sebastian-walter/masters_thesis/data.

You also need to set the following two special environment variables:

# set the data directory
export NSC_DATA_DIR=/nfs/students/sebastian-walter/masters_thesis/data

# set the config directory (necessary to be able to compose
# the final training config from sub-configs)
export NSC_CONFIG_DIR=spell_checking/configs

Note

Of course you can also copy the training data folder to any other place you like and adjust NSC_DATA_DIR accordingly. But keep in mind that this folder is very large (> 1 TB).

After that you can train your own models using a training config. All of the training configs this project used to train models can be found in the spell_checking/configs directory.

You might have to further configure a training config by setting additional environment variables. Let’s look at an example where we want to train a spelling error detection Graph Neural Network. The config for this task looks like the following:

variant: ${from_file:variant/sed_words.yaml}
model: ${from_file:model/model_for_sed_words.yaml}
optimizer: ${from_file:optimizer/adamw.yaml}
lr_scheduler: ${from_file:lr_scheduler/step_with_0.05_warmup.yaml}

experiment_dir: ${oc.env:NSC_EXPERIMENT_DIR}
data_dir: ${oc.env:NSC_DATA_DIR}
datasets:
  - ${oc.env:NSC_DATA_DIR}/processed/wikidump_paragraphs_sed_words_and_sec
  - ${oc.env:NSC_DATA_DIR}/processed/bookcorpus_paragraphs_sed_words_and_sec
dataset_limits:
  - ${oc.env:NSC_DATA_LIMIT}
  - ${oc.env:NSC_DATA_LIMIT}
val_splits:
  - 2500
  - 2500

experiment_name: ${oc.env:NSC_EXPERIMENT_NAME,dummy}
epochs: ${oc.env:NSC_EPOCHS,1}
batch_max_length: ${oc.env:NSC_BATCH_MAX_LENGTH,4096}
bucket_span: ${oc.env:NSC_BUCKET_SPAN,4}
log_per_epoch: ${oc.env:NSC_LOG_PER_EPOCH,100}
eval_per_epoch: ${oc.env:NSC_EVAL_PER_EPOCH,4}
seed: 22
mixed_precision: ${oc.env:NSC_MIXED_PRECISION,true}

Values looking like ${from_file:<file_path>} refer to other config files, relative to NSC_CONFIG_DIR. When the training config is composed, the contents of the referenced config files replace these values.

Values looking like ${oc.env:<env_var_name>,<default>} refer to environment variables, with an optional default that is used if the environment variable is not set. If there is no default, you are required to set the environment variable; otherwise you will receive an error message.

In our example we need to set values for the environment variables NSC_DATA_LIMIT (can be used to limit the number of samples per training dataset) and NSC_EXPERIMENT_DIR (directory path where the logs and checkpoints will be saved). Once we have set these variables, we can start the training. Since the training script is written to support distributed training, we need to use torchrun to launch it:

# set the environment variables
export NSC_DATA_LIMIT=100000 # set data limit to 100,000 samples per dataset
export NSC_EXPERIMENT_DIR=experiments # directory path where the experiment will be saved

# to train locally / on a single node
torchrun --nnodes=1 nsc/train.py --config spell_checking/configs/sed_words.yaml

You can also resume training for an existing experiment if you had to abort training for some reason:

# resume training from latest checkpoint of an experiment
torchrun --nnodes=1 nsc/train.py --resume <path_to_experiment_directory>

As an alternative, you can set one of the NSC_CONFIG or NSC_RESUME environment variables and use the train.sh script to start training. This script additionally provides functionality to start distributed training on SLURM clusters. Training using this script looks something like this:

# set the environment variables
export NSC_DATA_LIMIT=100000 # set data limit to 100,000 samples per dataset
export NSC_EXPERIMENT_DIR=experiments # directory path where the experiment will be saved

## LOCAL training
# start new training run using a config
NSC_CONFIG=spell_checking/configs/sed_words.yaml spell_checking/scripts/train.sh

# resume training from latest checkpoint of an experiment
NSC_RESUME=<path_to_experiment_directory> spell_checking/scripts/train.sh

## SLURM training
# starting distributed training on a SLURM cluster using sbatch
# requires you to set the NSC_WORLD_SIZE environment variable (total number of GPUs used for training)
# if you e.g. want to train on 4 nodes with 2 GPUs each set NSC_WORLD_SIZE=8
NSC_CONFIG=spell_checking/configs/sed_words.yaml NSC_WORLD_SIZE=8 sbatch --nodes=4 --ntasks-per-node=2 --gres=gpu:2 spell_checking/scripts/train.sh

# if you are in an interactive SLURM session (started e.g. with srun)
# you probably want to train as if you are running locally, set NSC_FORCE_LOCAL=true and
# start training without sbatch
NSC_FORCE_LOCAL=true NSC_CONFIG=spell_checking/configs/sed_words.yaml spell_checking/scripts/train.sh

To retrain the models of this project, see the train_slurm_<task>.sh scripts in the scripts directory, which were used to train all models. These scripts do nothing more than set some environment variables and call the train.sh script.

Note

Using the train_slurm_<task>.sh scripts for training is only possible on a SLURM cluster since they call the train.sh script using SLURM's sbatch command.

Reproduce

We make all models needed to reproduce the results on the project's benchmarks available as pretrained models. All pretrained models can be accessed through the command line interfaces (nsec, nsed, ntr), the Python API, or the web app.

All benchmarks, as well as our results and the baseline results on them, can be found in the spell_checking/benchmarks directory. Every benchmark follows the same directory structure:

  • <task>/<benchmark_group>/<benchmark_split>/corrupt.txt

  • <task>/<benchmark_group>/<benchmark_split>/correct.txt

Here corrupt.txt is the input containing misspelled text and correct.txt is the groundtruth output. We provide an evaluation script that can be used to evaluate model predictions on a given benchmark. Alternatively, the web app also features a section where you can evaluate model predictions on benchmarks.

As an example, let's look at the steps necessary to evaluate our gnn+ model for word level spelling error detection on the wikidump realistic benchmark using the command line interface:

  1. Run the model on the benchmark:

# choose the model, pass the benchmark's corrupt file as input
# and save the predictions to an output file
nsed -m "sed words:gnn+" \
  -f spell_checking/benchmarks/test/sed_words/wikidump/realistic/corrupt.txt \
  -o gnn_plus_predictions.txt

  2. Evaluate the model predictions:

# arguments: benchmark type, input file, groundtruth file, prediction file
python spell_checking/benchmarks/scripts/evaluate.py \
  sed_words \
  spell_checking/benchmarks/test/sed_words/wikidump/realistic/corrupt.txt \
  spell_checking/benchmarks/test/sed_words/wikidump/realistic/correct.txt \
  gnn_plus_predictions.txt

Hint

By default, a pretrained model is downloaded as a zip file and extracted when you first use it. Since some models are quite large, this can take some time. To cut this time, all pretrained models can also be found as zip files in the directory /nfs/students/sebastian-walter/masters_thesis/zipped. If you set the environment variable NSC_DOWNLOAD_DIR to this directory, the models are loaded from there and do not need to be downloaded first. If you are running this project using Docker, you can mount the directory to the container's download directory by specifying an additional volume flag: make run_docker DOCKER_ARGS="-v /nfs/students/sebastian-walter/masters_thesis/zipped:/download".

Hint

To access the benchmarks if you are running this project with Docker you can mount the benchmark directory inside the Docker container using make run_docker DOCKER_ARGS="-v $(pwd)/spell_checking/benchmarks/test:/benchmarks". The Docker container also provides additional commands for evaluating benchmarks that are basically wrappers around the evaluation script mentioned above.

nsc package

class nsc.BeamSearch

Bases: nsc.api.sec.Search

Beam search: Maintain and expand the top beam_width paths.

__init__(beam_width=5)
Parameters

beam_width (int) –

Return type

None

beam_width: int = 5

class nsc.BestFirstSearch

Bases: nsc.api.sec.Search

Best first search: Always expand the highest scoring path out of all paths encountered so far.

__init__()
Return type

None

class nsc.GreedySearch

Bases: nsc.api.sec.Search

Greedy search: Always go to the highest scoring next node.

__init__()
Return type

None

class nsc.SampleSearch

Bases: nsc.api.sec.Search

Sample search: Go to a random node of the top_k highest scoring next nodes.

__init__(top_k=5)
Parameters

top_k (int) –

Return type

None

top_k: int = 5

class nsc.Score

Bases: object

Determines how paths during/after decoding are scored/rescored.

The default mode “log_likelihood” is to score a candidate using the token sequence log likelihood given as the sum of all token log probabilities. For some search methods we optionally support normalization by the token sequence length (so shorter paths are not preferred):

score = sum(log_probabilities) / (len(log_probabilities)^alpha)

Alpha here can be used to state a preference for shorter or longer sequences, if alpha > 1 longer sequences are preferred, if alpha < 1 shorter sequences are preferred, and if alpha = 0 the score equals the sum of the log probabilities.

Other supported modes are “dictionary” and “diff_from_input”: they restrict candidates to those that contain only dictionary words or to those that differ from the input text, respectively. For “dictionary” mode a prefix index must be specified, so that we are able to check whether a decoded text is a prefix of or equal to a dictionary word.
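As a toy illustration of the scoring formula above (plain Python, not part of the API):

import math

# three token log probabilities
log_probabilities = [math.log(p) for p in (0.9, 0.8, 0.7)]

alpha = 1.0
# alpha = 1.0 normalizes by the full sequence length,
# alpha = 0.0 recovers the plain sum of log probabilities
score = sum(log_probabilities) / (len(log_probabilities) ** alpha)
print(score)  # roughly -0.228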

__init__(normalize_by_length=True, alpha=1.0, mode='log_likelihood', prefix_index=None)
Parameters
  • normalize_by_length (bool) –

  • alpha (float) –

  • mode (str) –

  • prefix_index (Optional[nsc.data.index.PrefixIndex]) –

Return type

None

alpha: float = 1.0
mode: str = 'log_likelihood'
normalize_by_length: bool = True
prefix_index: Optional[nsc.data.index.PrefixIndex] = None

class nsc.SpellingErrorCorrector

Bases: nsc.api.utils._APIBase

Spelling error correction

Class to run spelling error correction models.

__init__(model_dir, device, **kwargs)

Spelling error correction constructor.

Do not use this explicitly. Use the static SpellingErrorCorrector.from_pretrained() and SpellingErrorCorrector.from_experiment() methods instead.

Parameters
  • model_dir (str) – directory of the model to load

  • device (Union[str, int]) – device to load the model in

  • kwargs (Dict[str, Any]) –

Return type

None

correct_file(input_file_path, output_file_path=None, detections=None, search=GreedySearch(), score=Score(normalize_by_length=True, alpha=1.0, mode='log_likelihood', prefix_index=None), batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=True, **kwargs)

Correct spelling errors in a file.

Parameters
  • input_file_path (str) – path to an input file, which will be corrected line by line

  • output_file_path (Optional[str]) – path to an output file, where corrected text will be saved line by line

  • detections (Optional[Union[str, List[List[int]]]]) – spelling error detections (from a SpellingErrorDetector) to guide the correction, can either be a path to a file containing detections or a list of lists of integers

  • search (nsc.api.sec.Search) – Search instance to determine the search method to use for decoding

  • score (nsc.api.sec.Score) – Score instance to determine how to score search paths during decoding

  • batch_size (int) – how many sequences to process at once

  • batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to be batch_max_length_factor * model_max_input_length, if a model e.g. has a max input length of 512 tokens and batch_max_length_factor is 4 then one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens (takes precedence over batch_size if specified)

  • sort_by_length (bool) – sort the inputs by length before processing them

  • show_progress (bool) – display progress bar

  • kwargs (Any) –

Return type

Optional[List[str]]

Returns: corrected file as list of strings if output_file_path is not specified else None

correct_text(inputs, detections=None, search=GreedySearch(), score=Score(normalize_by_length=True, alpha=1.0, mode='log_likelihood', prefix_index=None), batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=False, **kwargs)

Correct spelling errors in text.

Parameters
  • inputs (Union[str, List[str]]) – text to correct given as a single string or a list of strings

  • detections (Optional[Union[List[int], List[List[int]]]]) – spelling error detections (from a SpellingErrorDetector) to guide the correction, if inputs is a single str, detections must be a list of integers, otherwise if inputs is a list of strings, detections should be a list of lists of integers

  • search (nsc.api.sec.Search) – Search instance to determine the search method to use for decoding

  • score (nsc.api.sec.Score) – Score instance to determine how to score search paths during decoding

  • batch_size (int) – how many sequences to process at once

  • batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to be batch_max_length_factor * model_max_input_length, if a model e.g. has a max input length of 512 tokens and batch_max_length_factor is 4 then one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens (takes precedence over batch_size if specified)

  • sort_by_length (bool) – sort the inputs by length before processing them

  • show_progress (bool) – display progress bar

  • kwargs (Any) –

Return type

Union[str, List[str]]

Returns: corrected text as string or list of strings

static from_experiment(experiment_dir, device='cuda')

Create a new NSC instance using your own experiment.

Parameters
  • experiment_dir (str) – path to the experiment directory

  • device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)

Return type

nsc.api.sec.SpellingErrorCorrector

Returns: NSC instance

static from_pretrained(task='sec', model='transformer words nmt', device='cuda', download_dir=None, cache_dir=None, force_download=False)

Create a new NSC instance using a pretrained model.

Parameters
  • task (str) – name of the task

  • model (str) – name of the pretrained model

  • device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)

  • download_dir (Optional[str]) – directory where the compressed model will be downloaded to

  • cache_dir (Optional[str]) – directory where the downloaded, compressed model will be extracted to

  • force_download (bool) – force download the pretrained model again even if it already exists in the cache_dir

Return type

nsc.api.sec.SpellingErrorCorrector

Returns: NSC instance
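For example, a minimal usage sketch loading the default correction model on the CPU:

from nsc import SpellingErrorCorrector

# the arguments shown are the documented defaults, except for the device
sec = SpellingErrorCorrector.from_pretrained(
    task="sec", model="transformer words nmt", device="cpu"
)
print(sec.correct_text("Tihs text has erors!"))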

property mixed_precision_enabled: bool

Check if mixed precision is enabled (precision is set to something else than fp32).

Returns: bool whether mixed precision is enabled

property model_name: str

Gives the name of the NSC model in use.

Returns: name of the model

set_precision(precision)

Set the inference precision to use. The default is standard 32-bit full precision. 16-bit floats or 16-bit brain floats should only be used when you have a supported NVIDIA GPU. When running on CPU, fp16 will be overwritten with bfp16 since only bfp16 is supported on CPU for now.

Parameters

precision (str) – precision identifier (one of {fp32, fp16, bfp16})

Return type

None

Returns: None

suggest(text, n, detections=None, **kwargs)

Generate suggestions for possible corrections of a text.

Note that if you use this with a model that translates each word individually (e.g. transformer words nmt), the i-th suggestion string is built by joining the i-th suggestions for all words (they do not necessarily form a useful sentence together). To obtain n suggestions for each word in the text you can simply split every suggestion string on whitespace.

Parameters
  • text (str) – text to generate suggestions for

  • n (int) – number of suggestions to generate

  • detections (Optional[List[int]]) – optional detections to exclude some words

  • kwargs (Any) –

Return type

List[str]

Returns: suggestions as list of strings
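A small usage sketch of the joining behavior described above:

from nsc import SpellingErrorCorrector

sec = SpellingErrorCorrector.from_pretrained()
# get 3 suggestion strings for the whole text
suggestions = sec.suggest("Tihs text has erors!", n=3)
# for word-level models, splitting a suggestion on whitespace
# recovers that suggestion for every individual word
for suggestion in suggestions:
    print(suggestion.split())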

property task_name: str

Check which NSC task is run.

Returns: name of the task

to(device)

Move the model to a different device.

Parameters

device (Union[str, int]) – device specifier (e.g. “cuda”, “cpu” or integer denoting GPU device index)

Return type

nsc.api.utils._APIBase

Returns: self

class nsc.SpellingErrorDetector

Bases: nsc.api.utils._APIBase

Spelling error detection

Class to run spelling error detection models.

__init__(model_dir, device, **kwargs)

Spelling error detection constructor.

Do not use this explicitly. Use the static SpellingErrorDetector.from_pretrained() and SpellingErrorDetector.from_experiment() methods instead.

Parameters
  • model_dir (str) – directory of the model to load

  • device (Union[str, int]) – device to load the model in

  • kwargs (Dict[str, Any]) –

Return type

None

detect_file(input_file_path, output_file_path=None, threshold=0.5, batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=True, **kwargs)

Detect spelling errors in a file.

Parameters
  • input_file_path (str) – path to an input file, which will be checked for spelling errors line by line

  • output_file_path (Optional[str]) – path to an output file, where the detections will be saved line by line

  • threshold (float) – set detection threshold (0 < threshold < 1)

  • batch_size (int) – how many sequences to process at once

  • batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to be batch_max_length_factor * model_max_input_length, if a model e.g. has a max input length of 512 tokens and batch_max_length_factor is 4 then one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens (takes precedence over batch_size if specified)

  • sort_by_length (bool) – sort the inputs by length before processing them

  • show_progress (bool) – display progress bar

  • kwargs (Any) –

Return type

Optional[Tuple[List[List[int]], List[str]]]

Returns: tuple of detections as list of lists of integers and output strings as list of strings if output_file_path is not specified, else None

detect_text(inputs, threshold=0.5, batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=False, **kwargs)

Detect spelling errors in text.

Parameters
  • inputs (Union[str, List[str]]) – text to check for errors given as a single string or a list of strings

  • threshold (float) – set detection threshold (0 < threshold < 1)

  • batch_size (int) – how many sequences to process at once

  • batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to be batch_max_length_factor * model_max_input_length, if a model e.g. has a max input length of 512 tokens and batch_max_length_factor is 4 then one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens (takes precedence over batch_size if specified)

  • sort_by_length (bool) – sort the inputs by length before processing them

  • show_progress (bool) – display progress bar

  • kwargs (Any) –

Return type

Tuple[Union[List[int], List[List[int]]], Union[str, List[str]]]

Returns: tuple of detections as list of integers or list of lists of integers and output strings as str or list of strings

static from_experiment(experiment_dir, device='cuda')

Create a new NSC instance using your own experiment.

Parameters
  • experiment_dir (str) – path to the experiment directory

  • device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)

Return type

nsc.api.sed.SpellingErrorDetector

Returns: NSC instance

static from_pretrained(task='sed words', model='gnn+', device='cuda', download_dir=None, cache_dir=None, force_download=False)

Create a new NSC instance using a pretrained model.

Parameters
  • task (str) – name of the task

  • model (str) – name of the pretrained model

  • device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)

  • download_dir (Optional[str]) – directory where the compressed model will be downloaded to

  • cache_dir (Optional[str]) – directory where the downloaded, compressed model will be extracted to

  • force_download (bool) – force download the pretrained model again even if it already exists in the cache_dir

Return type

nsc.api.sed.SpellingErrorDetector

Returns: NSC instance

property mixed_precision_enabled: bool

Check if mixed precision is enabled (precision is set to something else than fp32).

Returns: bool whether mixed precision is enabled

property model_name: str

Gives the name of the NSC model in use.

Returns: name of the model

set_precision(precision)

Set the inference precision to use. The default is standard 32-bit full precision. 16-bit floats or 16-bit brain floats should only be used when you have a supported NVIDIA GPU. When running on CPU, fp16 will be overwritten with bfp16 since only bfp16 is supported on CPU for now.

Parameters

precision (str) – precision identifier (one of {fp32, fp16, bfp16})

Return type

None

Returns: None

property task_name: str

Check which NSC task is run.

Returns: name of the task

to(device)

Move the model to a different device.

Parameters

device (Union[str, int]) – device specifier (e.g. “cuda”, “cpu” or integer denoting GPU device index)

Return type

nsc.api.utils._APIBase

Returns: self

class nsc.TokenizationRepairer

Bases: nsc.api.utils._APIBase

Tokenization repair

Class to run tokenization repair models.

__init__(model_dir, device, **kwargs)

Tokenization repair constructor.

Do not use this explicitly. Use the static TokenizationRepairer.from_pretrained() and TokenizationRepairer.from_experiment() methods instead.

Parameters
  • model_dir (str) – directory of the model to load

  • device (Union[str, int]) – device to load the model in

  • kwargs (Dict[str, Any]) –

Return type

None

static from_experiment(experiment_dir, device='cuda')

Create a new NSC instance using your own experiment.

Parameters
  • experiment_dir (str) – path to the experiment directory

  • device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)

Return type

nsc.api.tokenization_repair.TokenizationRepairer

Returns: NSC instance

static from_pretrained(task='tokenization repair', model='eo large arxiv with errors', device='cuda', download_dir=None, cache_dir=None, force_download=False)

Create a new NSC instance using a pretrained model.

Parameters
  • task (str) – name of the task

  • model (str) – name of the pretrained model

  • device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)

  • download_dir (Optional[str]) – directory where the compressed model will be downloaded to

  • cache_dir (Optional[str]) – directory where the downloaded, compressed model will be extracted to

  • force_download (bool) – force download the pretrained model again even if it already exists in the cache_dir

Return type

nsc.api.tokenization_repair.TokenizationRepairer

Returns: NSC instance

property mixed_precision_enabled: bool

Check if mixed precision is enabled (precision is set to something else than fp32).

Returns: bool whether mixed precision is enabled

property model_name: str

Gives the name of the NSC model in use.

Returns: name of the model

repair_file(input_file_path, output_file_path=None, batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=True)

Repair whitespaces in a file.

Parameters
  • input_file_path (str) – path to an input file, which will be repaired line by line

  • output_file_path (Optional[str]) – path to an output file, where repaired text will be saved line by line

  • batch_size (int) – how many sequences to process at once

  • batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to be batch_max_length_factor * model_max_input_length, if a model e.g. has a max input length of 512 tokens and batch_max_length_factor is 4 then one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens (takes precedence over batch_size if specified)

  • sort_by_length (bool) – sort the inputs by length before processing them

  • show_progress (bool) – display progress bar

Return type

Optional[List[str]]

Returns: repaired file as list of strings if output_file_path is not specified else None

repair_text(inputs, batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=False)

Repair whitespaces in text.

Parameters
  • inputs (Union[str, List[str]]) – text to repair given as a single string or a list of strings

  • batch_size (int) – how many sequences to process at once

  • batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to be batch_max_length_factor * model_max_input_length, if a model e.g. has a max input length of 512 tokens and batch_max_length_factor is 4 then one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens (takes precedence over batch_size if specified)

  • sort_by_length (bool) – sort the inputs by length before processing them

  • show_progress (bool) – display progress bar

Return type

Union[str, List[str]]

Returns: repaired text as string or list of strings

set_precision(precision)

Set the inference precision to use. The default is standard 32-bit full precision. 16-bit floats or 16-bit brain floats should only be used when you have a supported NVIDIA GPU. When running on CPU, fp16 will be overwritten with bfp16 since only bfp16 is supported on CPU for now.

Parameters

precision (str) – precision identifier (one of {fp32, fp16, bfp16})

Return type

None

Returns: None

property task_name: str

Check which NSC task is run.

Returns: name of the task

to(device)

Move the model to a different device.

Parameters

device (Union[str, int]) – device specifier (e.g. “cuda”, “cpu” or integer denoting GPU device index)

Return type

nsc.api.utils._APIBase

Returns: self

nsc.get_available_spelling_error_correction_models()

Get available spelling error correction models

Returns: list of spelling error correction model infos each containing the task name, model name and a short description

Return type

List[nsc.api.utils.ModelInfo]

nsc.get_available_spelling_error_detection_models()

Get available spelling error detection models

Returns: list of spelling error detection model infos each containing the task name, model name and a short description

Return type

List[nsc.api.utils.ModelInfo]

nsc.get_available_tokenization_repair_models()

Get available tokenization repair models

Returns: list of tokenization repair model infos each containing the task name, model name and a short description

Return type

List[nsc.api.utils.ModelInfo]