Neural spell checking using Transformers and Graph Neural Networks
This project is about detecting and correcting spelling errors using Transformers and Graph Neural Networks. Visit the documentation (which also includes this README) for more information on how to reproduce results and train your own models.
Installation
Clone the repository
git clone git@github.com:bastiscode/spell_check.git
Install from source (alternatively you can use Docker)
make install
Usage
There are three main ways to use this project:
Command line
Python API
Web app
We recommend starting with the Web app using the Docker setup, which provides the best user experience overall.
Command line interfaces
After installation there will be four commands available in your environment:
nsec for neural spelling error correction
nsed for neural spelling error detection
ntr for neural tokenization repair
nserver for running a spell checking server
By default, the first three commands read input from stdin, run their respective task on the input line by line, and print their output line by line to stdout. The fourth command starts a spell checking server that gives you access to all models via a JSON API; it is used, for example, as the backend for the Web app. The server can be configured using a YAML file (see server_config/server.yaml for the configuration options).
Let’s look at some basic examples on how to use the command line tools.
Spelling error correction using nsec
# correct text by piping into nsec,
echo "This is an incorect sentense!" | nsec
cat path/to/file.txt | nsec
# by using the -c flag,
nsec -c "This is an incorect sentense!"
# by passing a file,
nsec -f path/to/file.txt
# or by starting an interactive session
nsec -i
# you can also use spelling correction to detect errors on word and sequence level
nsec -c "This is an incorect sentense!" --convert-output word_detections
nsec -c "This is an incorect sentense!" --convert-output sequence_detections
Spelling error detection using nsed
# detect errors by piping into nsed,
echo "This is an incorect sentense!" | nsed
cat path/to/file.txt | nsed
# by using the -d flag,
nsed -d "This is an incorect sentense!"
# by passing a file,
nsed -f path/to/file.txt
# or by starting an interactive session
nsed -i
# you can also use word level spelling error detections to detect sequence level errors
nsed -d "This is an incorect sentense!" --convert-sequence
Tokenization repair using ntr
# repair text by piping into ntr,
echo "Thisis an inc orect sentens e!" | ntr
cat path/to/file.txt | ntr
# by using the -r flag,
ntr -r "Thisis an inc orect sentens e!"
# by passing a file,
ntr -f path/to/file.txt
# or by starting an interactive session
ntr -i
You can also combine the ntr, nsed, and nsec commands in a variety of ways.
Some examples are shown below.
# repair and detect
echo "Repi arand core ct tihs sen tens!" | ntr | nsed
# to view both the repaired text and the detections use
echo "Repi arand core ct tihs sen tens!" | ntr | nsed --sec-out
# repair and correct
echo "Repi arand core ct tihs sen tens!" | ntr | nsec
# repair and correct a file and save the output
ntr -f path/to/file.txt | nsec --progress -o path/to/output_file.txt
# repair, detect and correct
# (this pipeline uses the spelling error detection output
# to guide the spelling error correction model to correct only the misspelled words)
echo "Repi arand core ct tihs sen tens!" | ntr | nsed --sec-out | nsec --sed-in
# some detection and correction models (e.g. tokenization repair+, tokenization repair++, transformer with tokenization repair nmt)
# can natively deal with incorrect whitespacing in text, so there is no need to use ntr before them if you want to process
# text with whitespacing errors
nsed -d "core ct thissen tense!" -m "sed words:tokenization repair+" --sec-out
nsec -c "core ct thissen tense!" -m "tokenization repair++"
nsec -c "core ct thissen tense!" -m "transformer with tokenization repair nmt"
There are a few other command line options available for the nsec, nsed and ntr commands. Inspect them by passing the -h / --help flag to the commands.
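The shell pipelines above can also be driven from a script. The following is a minimal sketch, assuming the ntr, nsed and nsec commands are on your PATH; the helper name run_pipeline and the constant REPAIR_DETECT_CORRECT are hypothetical and not part of this project. It passes the output of each command as stdin to the next one, just like a shell pipe.

```python
import subprocess
from typing import List, Sequence


def run_pipeline(text: str, commands: Sequence[List[str]]) -> str:
    """Feed text to the first command and pipe each output into the next command."""
    data = text
    for command in commands:
        # check=True raises CalledProcessError if a stage exits with a non-zero code
        result = subprocess.run(
            command, input=data, capture_output=True, text=True, check=True
        )
        data = result.stdout
    return data


# stages of the "repair, detect and correct" pipeline shown above
# (hypothetical constant, only the flags are taken from the examples)
REPAIR_DETECT_CORRECT = [["ntr"], ["nsed", "--sec-out"], ["nsec", "--sed-in"]]

# usage, requires the command line tools to be installed:
# print(run_pipeline("Repi arand core ct tihs sen tens!\n", REPAIR_DETECT_CORRECT), end="")
```

Unlike a shell pipe, this sketch runs the stages one after another and keeps each intermediate result in memory, which is fine for short inputs but not for very large files.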
Running a spell checking server using nserver
# start the spell checking server, we provide a default config at server_config/server.yaml
nserver -c path/to/config.yaml
Python API
We also provide a Python API for you to use spell checking models directly in code. Below are basic code examples on how to use the API. For the full documentation of all classes, methods, etc. provided by the Python API see the nsc package documentation.
Spelling error correction
from nsc import SpellingErrorCorrector, get_available_spelling_error_correction_models
# show all spelling error correction models
print(get_available_spelling_error_correction_models())
# use a pretrained model
sec = SpellingErrorCorrector.from_pretrained()
# correct errors in text
correction = sec.correct_text("Tihs text has erors!")
print(correction)
# correct errors in file
corrections = sec.correct_file("path/to/file.txt")
print(corrections)
Spelling error detection
from nsc import SpellingErrorDetector, get_available_spelling_error_detection_models
# show all spelling error detection models
print(get_available_spelling_error_detection_models())
# use a pretrained model
sed = SpellingErrorDetector.from_pretrained()
# detect errors in text
detection = sed.detect_text("Tihs text has erors!")
print(detection)
# detect errors in file
detections = sed.detect_file("path/to/file.txt")
print(detections)
Tokenization repair
from nsc import TokenizationRepairer, get_available_tokenization_repair_models
# show all tokenization repair models
print(get_available_tokenization_repair_models())
# use a pretrained model
tr = TokenizationRepairer.from_pretrained()
# repair tokenization in text
repaired_text = tr.repair_text("Ti hstext h aserors!")
print(repaired_text)
# repair tokenization in file
repaired_file = tr.repair_file("path/to/file.txt")
print(repaired_file)
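The three API classes shown above can be chained in the same way as the command line tools. The following is a minimal sketch, not an official recipe: the function name repair_detect_correct is hypothetical, and it assumes that repair_text returns a single repaired string when given a single string. It relies on detect_text returning a tuple of detections and output strings, and on the detections parameter of correct_text, as described in the nsc package documentation.

```python
from typing import Any, List, Tuple


def repair_detect_correct(
    text: str, tr: Any, sed: Any, sec: Any
) -> Tuple[str, List[int], str]:
    """Repair tokenization, detect misspelled words, then correct guided by the detections."""
    # assumption: repair_text returns the repaired string for a single input string
    repaired = tr.repair_text(text)
    # detect_text returns a tuple of (detections, output strings)
    detections, _ = sed.detect_text(repaired)
    # the detections guide the correction model
    corrected = sec.correct_text(repaired, detections=detections)
    return repaired, detections, corrected


# usage with pretrained models, requires the nsc package:
# from nsc import TokenizationRepairer, SpellingErrorDetector, SpellingErrorCorrector
# result = repair_detect_correct(
#     "Ti hstext h aserors!",
#     TokenizationRepairer.from_pretrained(),
#     SpellingErrorDetector.from_pretrained(),
#     SpellingErrorCorrector.from_pretrained(),
# )
```

The models are passed in as arguments so that they are loaded only once and can be reused across many calls.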
Web app

The web app source can be found under webapp/build/web. You can host it, for example, using Python's built-in HTTP server:
python -m http.server -d webapp/build/web 8080
The web app expects a spell checking server running under the same hostname on port 44444 (we are currently looking into ways to be able to set the hostname and port for the server endpoint at runtime). If you are running the web app locally you can simply start the spell checking server using nserver:
nserver -c server_config/server.yaml
Then open port 8080 of that host in your browser (e.g. http://localhost:8080 when running locally) to use the web app.
Docker
This project can also be run using Docker. Inside the Docker container both the Command line interfaces and Python API are available for you to use. You can also evaluate model predictions on benchmarks.
Build the Docker image:
make build_docker
Start a Docker container:
make run_docker
You can also pass additional Docker arguments to the make commands by specifying DOCKER_ARGS. For example, to be able to access your GPUs inside the container run make run_docker DOCKER_ARGS="--gpus all". As another example, to mount an additional directory inside the container use make run_docker DOCKER_ARGS="-v /path/to/outside/directory:/path/to/container/directory".
We provide two convenience commands for starting a spell checking server or hosting the web app:
Start a spell checking server on port <outside_port>:
make run_docker_server DOCKER_ARGS="-p <outside_port>:44444"
Start the web app on port <outside_port> (also runs the spell checking server):
make run_docker_webapp DOCKER_ARGS="-p <outside_port>:8080"
Hint
If you build the Docker image on an AD Server you probably want to use wharfer instead of Docker. To do that call the make commands with the additional argument DOCKER_CMD=wharfer, e.g. make build_docker DOCKER_CMD=wharfer.
Note
The Docker setup is only intended to be used for running the command line tools, Python API, or web app with pretrained or your own models and evaluating benchmarks, but not for training.
Note
Running the Docker container with GPU support assumes that you have the NVIDIA Container Toolkit installed.
Pretrained models
The following table shows all pretrained models available for the CLI, Python API, and web app:
Task | Model | Description | Default |
---|---|---|---|
tokenization repair | eo large arxiv with errors | Large-sized Transformer model (12 layers) that repairs sequences by predicting repair tokens for each character (ported from https://github.com/ad-freiburg/trt). | X |
tokenization repair | eo medium arxiv with errors | Medium-sized Transformer model (6 layers) that repairs sequences by predicting repair tokens for each character (ported from https://github.com/ad-freiburg/trt). | |
tokenization repair | eo small arxiv with errors | Small-sized Transformer model (3 layers) that repairs sequences by predicting repair tokens for each character (ported from https://github.com/ad-freiburg/trt). | |
tokenization repair | tokenization repair+ | Same as eo medium arxiv with errors, available here for completeness. | |
tokenization repair | tokenization repair++ | Same as eo medium arxiv with errors, available here for completeness. | |
sed words | gnn+ | Attentional Graph Neural Network which processes language graphs with fully connected word nodes, word features and fully connected sub-word cliques. Predicts spelling errors on word level using the word node representations. | X |
sed words | gnn+ neuspell | Attentional Graph Neural Network which processes language graphs with fully connected word nodes, word features and fully connected sub-word cliques. Predicts spelling errors on word level using the word node representations. (pretrained without Neuspell and BEA misspellings, finetuned on Neuspell training data) | |
sed words | transformer | Regular transformer processing a sequence of sub-word tokens. Predicts spelling errors on word level using the aggregated sub-word representations per word. | |
sed words | transformer neuspell | Regular transformer processing a sequence of sub-word tokens. Predicts spelling errors on word level using the aggregated sub-word representations per word. (pretrained without Neuspell and BEA misspellings, finetuned on Neuspell training data) | |
sed words | transformer+ | Regular transformer processing a sequence of sub-word tokens. Before predicting spelling errors, sub-word representations within a word are aggregated and enriched with word features to obtain word representations. Predicts spelling errors on word level using those word representations. | |
sed words | transformer+ neuspell | Regular transformer processing a sequence of sub-word tokens. Before predicting spelling errors, sub-word representations within a word are aggregated and enriched with word features to obtain word representations. Predicts spelling errors on word level using those word representations. (pretrained without Neuspell and BEA misspellings, finetuned on Neuspell training data) | |
sed words | gnn | Attentional Graph Neural Network which processes language graphs with fully connected word nodes and fully connected sub-word cliques. Predicts spelling errors on word level using the word node representations. | |
sed words | gnn neuspell | Attentional Graph Neural Network which processes language graphs with fully connected word nodes and fully connected sub-word cliques. Predicts spelling errors on word level using the word node representations. (pretrained without Neuspell and BEA misspellings, finetuned on Neuspell training data) | |
sed words | tokenization repair+ | Transformer based model that detects errors in sequences by first correcting the tokenization and then detecting spelling errors for each word in the repaired text. | |
sed words | tokenization repair++ | Transformer based model that detects errors in sequences by first correcting the tokenization and then detecting spelling errors for each word in the repaired text. Different from tokenization repair+ because this model was trained to also correct spelling errors (it is also available in nsec). | |
sec | transformer words nmt | Transformer model that corrects sequences by translating each word individually from misspelled to correct. | X |
sec | transformer words nmt neuspell | Transformer model that corrects sequences by translating each word individually from misspelled to correct. (pretrained without Neuspell and BEA misspellings, finetuned on Neuspell training data) | |
sec | tokenization repair++ | Transformer based model that corrects sequences by first correcting the tokenization, then detecting spelling errors for each word in the repaired text and then translating every detected misspelled word to its corrected version. | |
sec | transformer nmt | Transformer model that translates a sequence with spelling errors into a sequence without spelling errors. | |
sec | transformer nmt neuspell | Transformer model that translates a sequence with spelling errors into a sequence without spelling errors. (pretrained without Neuspell and BEA misspellings, finetuned on Neuspell training data) | |
sec | transformer with tokenization repair nmt | Transformer model that translates a sequence with spelling and tokenization errors into a sequence without spelling and tokenization errors. Different from transformer nmt because this model tokenizes into characters and was trained on text with spelling and tokenization errors, whereas transformer nmt tokenizes into sub-words and was trained only on text with spelling errors. | |
Training and reproducing results
You can skip the following training section if you only want to use pretrained models and not train your own models.
Training
Before starting training you need to get the training data. Everything you need (preprocessed samples, tokenizer, dictionaries, etc.) can be found under /nfs/students/sebastian-walter/masters_thesis/data.
You also need to set the following two special environment variables:
# set the data directory
export NSC_DATA_DIR=/nfs/students/sebastian-walter/masters_thesis/data
# set the config directory (necessary to be able to compose
# the final training config from sub-configs)
export NSC_CONFIG_DIR=spell_checking/configs
Note
Of course you can also copy the training data folder to any other place you like and adjust NSC_DATA_DIR accordingly. But keep in mind that this folder is very large (> 1TB).
After that you can train your own models using a training config. All of the training configs this project used to train models can be found here.
You might have to further configure a training config by setting additional environment variables. Let’s look at an example where we want to train a spelling error detection Graph Neural Network. The config for this task looks like the following:
variant: ${from_file:variant/sed_words.yaml}
model: ${from_file:model/model_for_sed_words.yaml}
optimizer: ${from_file:optimizer/adamw.yaml}
lr_scheduler: ${from_file:lr_scheduler/step_with_0.05_warmup.yaml}
experiment_dir: ${oc.env:NSC_EXPERIMENT_DIR}
data_dir: ${oc.env:NSC_DATA_DIR}
datasets:
- ${oc.env:NSC_DATA_DIR}/processed/wikidump_paragraphs_sed_words_and_sec
- ${oc.env:NSC_DATA_DIR}/processed/bookcorpus_paragraphs_sed_words_and_sec
dataset_limits:
- ${oc.env:NSC_DATA_LIMIT}
- ${oc.env:NSC_DATA_LIMIT}
val_splits:
- 2500
- 2500
experiment_name: ${oc.env:NSC_EXPERIMENT_NAME,dummy}
epochs: ${oc.env:NSC_EPOCHS,1}
batch_max_length: ${oc.env:NSC_BATCH_MAX_LENGTH,4096}
bucket_span: ${oc.env:NSC_BUCKET_SPAN,4}
log_per_epoch: ${oc.env:NSC_LOG_PER_EPOCH,100}
eval_per_epoch: ${oc.env:NSC_EVAL_PER_EPOCH,4}
seed: 22
mixed_precision: ${oc.env:NSC_MIXED_PRECISION,true}
Values looking like ${from_file:<file_path>} refer to other config files relative to the NSC_CONFIG_DIR. When the training config is composed, the contents of the referred config files will replace these values.
Values looking like ${oc.env:<env_var_name>,<default>} refer to environment variables and an optional default that will be set if the environment variable is not found. If there is no default you will be required to set the environment variable, otherwise you receive an error message.
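To make the lookup rule for environment variables concrete, here is a small standard-library sketch of the behaviour described above. It only illustrates the semantics (use the environment variable if set, else the default, else fail) and is not how the project's config library resolves values; the function name resolve_env and the regular expression are assumptions made for this example.

```python
import os
import re
from typing import Mapping, Optional

# matches ${oc.env:NAME} and ${oc.env:NAME,default}
_ENV_PATTERN = re.compile(r"\$\{oc\.env:([A-Za-z_][A-Za-z0-9_]*)(?:,([^}]*))?\}")


def resolve_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace every ${oc.env:NAME,default} placeholder in value."""
    env = os.environ if env is None else env

    def _replace(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            # the environment variable wins over the default
            return env[name]
        if default is not None:
            return default
        # no value and no default: the user has to set the variable
        raise KeyError(f"environment variable {name} is not set and has no default")

    return _ENV_PATTERN.sub(_replace, value)
```

For the example config, resolve_env("${oc.env:NSC_EPOCHS,1}") falls back to "1" when NSC_EPOCHS is unset, whereas resolve_env("${oc.env:NSC_DATA_LIMIT}") raises an error until NSC_DATA_LIMIT is exported.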
In our example we need to set values for the environment variables NSC_DATA_LIMIT (can be used to limit the number of samples per training dataset) and NSC_EXPERIMENT_DIR (directory path where the logs and checkpoints will be saved). Once we have set these variables we can start the training. Since the training script is written to support distributed training we need to use torchrun to launch the script:
# set the environment variables
export NSC_DATA_LIMIT=100000 # set data limit to 100,000 samples per dataset
export NSC_EXPERIMENT_DIR=experiments # directory path where the experiment will be saved
# to train locally / on a single node
torchrun --nnodes=1 nsc/train.py --config spell_checking/configs/sed_words.yaml
You can also resume training for an existing experiment if you had to abort training for some reason:
# resume training from latest checkpoint of an experiment
torchrun --nnodes=1 nsc/train.py --resume <path_to_experiment_directory>
As an alternative you can set one of the NSC_CONFIG or NSC_RESUME environment variables and use the train.sh script to start training. This script additionally provides functionality to start distributed training on SLURM clusters. Training using this script would look something like this:
# set the environment variables
export NSC_DATA_LIMIT=100000 # set data limit to 100,000 samples per dataset
export NSC_EXPERIMENT_DIR=experiments # directory path where the experiment will be saved
## LOCAL training
# start new training run using a config
NSC_CONFIG=spell_checking/configs/sed_words.yaml spell_checking/scripts/train.sh
# resume training from latest checkpoint of an experiment
NSC_RESUME=<path_to_experiment_directory> spell_checking/scripts/train.sh
## SLURM training
# starting distributed training on a SLURM cluster using sbatch
# requires you to set the NSC_WORLD_SIZE environment variable (total number of GPUs used for training)
# if you e.g. want to train on 4 nodes with 2 GPUs each set NSC_WORLD_SIZE=8
NSC_CONFIG=spell_checking/configs/sed_words.yaml NSC_WORLD_SIZE=8 sbatch --nodes=4 --ntasks-per-node=2 --gres=gpu:2 spell_checking/scripts/train.sh
# if you are in an interactive SLURM session (started e.g. with srun)
# you probably want to train as if you are running locally, set NSC_FORCE_LOCAL=true and
# start training without sbatch
NSC_FORCE_LOCAL=true NSC_CONFIG=spell_checking/configs/sed_words.yaml spell_checking/scripts/train.sh
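The relation between the sbatch flags and NSC_WORLD_SIZE in the SLURM example above can be made explicit with a small helper. This is a sketch only: the function name slurm_training_command is hypothetical and not part of this project, and it merely assembles the environment variables and the command line shown above.

```python
from typing import Dict, List, Tuple


def slurm_training_command(
    config: str,
    nodes: int,
    gpus_per_node: int,
    script: str = "spell_checking/scripts/train.sh",
) -> Tuple[Dict[str, str], List[str]]:
    """Return the environment variables and sbatch command for a distributed run."""
    if nodes < 1 or gpus_per_node < 1:
        raise ValueError("nodes and gpus_per_node must be positive")
    # NSC_WORLD_SIZE is the total number of GPUs used for training
    world_size = nodes * gpus_per_node
    env = {"NSC_CONFIG": config, "NSC_WORLD_SIZE": str(world_size)}
    command = [
        "sbatch",
        f"--nodes={nodes}",
        f"--ntasks-per-node={gpus_per_node}",
        f"--gres=gpu:{gpus_per_node}",
        script,
    ]
    return env, command
```

With 4 nodes and 2 GPUs per node this yields NSC_WORLD_SIZE=8 and the same sbatch flags as in the example above.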
To retrain the models of this project see the train_slurm_<task>.sh scripts in the scripts directory which were used for training all models. These scripts do nothing more than setting some environment variables and calling the train.sh script.
Note
Using the train_slurm_<task>.sh scripts for training is only possible on a SLURM cluster since they call the train.sh script using SLURM's sbatch command.
Reproduce
We make all models that are needed to reproduce the results on the project's benchmarks available as pretrained models. All pretrained models can be accessed either through the command line interface (nsec, nsed, ntr), the Python API, or the web app.
All benchmarks as well as our and the baseline results on them can be found in this directory. Every benchmark follows the same directory structure:
<task>/<benchmark_group>/<benchmark_split>/corrupt.txt
<task>/<benchmark_group>/<benchmark_split>/correct.txt
Here corrupt.txt is the input containing misspelled text and correct.txt is the groundtruth output. We provide an evaluation script that can be used to evaluate model predictions on a given benchmark. Alternatively the web app also features a section where you can evaluate model predictions on benchmarks.
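Because every benchmark follows the same directory layout, the available benchmarks can be listed with a few lines of Python. The following is a minimal sketch based only on the layout described above; the function name iter_benchmarks is hypothetical and not part of this project.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_benchmarks(root: str) -> Iterator[Tuple[str, str, str, Path, Path]]:
    """Yield (task, group, split, corrupt_path, correct_path) for every complete benchmark."""
    # layout: <task>/<benchmark_group>/<benchmark_split>/corrupt.txt
    for corrupt in sorted(Path(root).glob("*/*/*/corrupt.txt")):
        correct = corrupt.with_name("correct.txt")
        if not correct.is_file():
            continue  # skip directories without a groundtruth file
        task, group, split = corrupt.parent.relative_to(root).parts
        yield task, group, split, corrupt, correct


# usage:
# for task, group, split, corrupt, correct in iter_benchmarks("spell_checking/benchmarks/test"):
#     print(task, group, split)
```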
As an example, let’s look at the steps that are necessary to evaluate our gnn+ model for word level spelling error detection on the wikidump realistic benchmark using the command line interface:
Run model on benchmark:
# choose the model with -m, the input file with -f and the output file with -o
nsed -m "sed words:gnn+" \
  -f spell_checking/benchmarks/test/sed_words/wikidump/realistic/corrupt.txt \
  -o gnn_plus_predictions.txt
Evaluate model predictions:
# positional arguments: benchmark type, input file, groundtruth file, predicted file
python spell_checking/benchmarks/scripts/evaluate.py \
  sed_words \
  spell_checking/benchmarks/test/sed_words/wikidump/realistic/corrupt.txt \
  spell_checking/benchmarks/test/sed_words/wikidump/realistic/correct.txt \
  gnn_plus_predictions.txt
Hint
By default a pretrained model is downloaded as a zip file and then extracted when you first use it. Since some models are quite large this can take some time. To cut this time all pretrained models can also be found as zip files in the directory /nfs/students/sebastian-walter/masters_thesis/zipped. If you set the env variable NSC_DOWNLOAD_DIR to this directory, the models are loaded from this directory and do not need to be downloaded first. If you are running this project using Docker you can mount the directory to the container's download directory by specifying an additional volume flag: make run_docker DOCKER_ARGS="-v /nfs/students/sebastian-walter/masters_thesis/zipped:/download".
Hint
To access the benchmarks if you are running this project with Docker you can mount the benchmark directory inside the Docker container using make run_docker DOCKER_ARGS="-v $(pwd)/spell_checking/benchmarks/test:/benchmarks". The Docker container also provides additional commands for evaluating benchmarks that are basically wrappers around the evaluation script mentioned above.
nsc package
- class nsc.BeamSearch
Bases:
nsc.api.sec.Search
Beam search: Maintain and expand the top beam_width paths.
- __init__(beam_width=5)
- Parameters
beam_width (int) –
- Return type
None
- beam_width: int = 5
- class nsc.BestFirstSearch
Bases:
nsc.api.sec.Search
Best first search: Always expand the highest scoring path out of all paths encountered so far.
- __init__()
- Return type
None
- class nsc.GreedySearch
Bases:
nsc.api.sec.Search
Greedy search: Always go to the highest scoring next node.
- __init__()
- Return type
None
- class nsc.SampleSearch
Bases:
nsc.api.sec.Search
Sample search: Go to a random node of the top_k highest scoring next nodes.
- __init__(top_k=5)
- Parameters
top_k (int) –
- Return type
None
- top_k: int = 5
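To illustrate the difference between the search methods listed above, here is a toy beam search over a generic next-token function. It is a sketch of the general technique only and not the implementation used by nsc: the function name beam_search, the next_log_probs callback and the "<eos>" end marker are assumptions made for this example. Paths are scored by the plain sum of token log probabilities, and greedy search corresponds to beam_width=1.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Tokens = Tuple[str, ...]


def beam_search(
    next_log_probs: Callable[[Tokens], Dict[str, float]],
    beam_width: int = 5,
    max_steps: int = 20,
    eos: str = "<eos>",
) -> List[Tuple[float, Tokens]]:
    """Maintain and expand the top beam_width paths, best path first."""
    beams: List[Tuple[float, Tokens]] = [(0.0, ())]
    for _ in range(max_steps):
        candidates: List[Tuple[float, Tokens]] = []
        for score, tokens in beams:
            if tokens and tokens[-1] == eos:
                # finished paths are kept unchanged and compete with open ones
                candidates.append((score, tokens))
                continue
            for token, log_prob in next_log_probs(tokens).items():
                candidates.append((score + log_prob, tokens + (token,)))
        # keep only the beam_width highest scoring paths
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        if all(tokens[-1] == eos for _, tokens in beams):
            break
    return beams
```

A wider beam can recover paths that start with a less likely token but end up with a higher total score, at the cost of expanding more paths per step.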
- class nsc.Score
Bases:
object
Determines how paths during/after decoding are scored/rescored.
The default mode “log_likelihood” is to score a candidate using the token sequence log likelihood given as the sum of all token log probabilities. For some search methods we optionally support normalization by the token sequence length (so shorter paths are not preferred):
score = sum(log_probabilities) / (len(log_probabilities)^alpha)
Alpha can be used to state a preference for shorter or longer sequences: if alpha > 1, longer sequences are preferred; if alpha < 1, shorter sequences are preferred; and if alpha = 0, the score equals the sum of the log probabilities.
Other supported modes are “dictionary” and “diff_from_input”. They only allow candidates that either contain dictionary words only or differ from the input text. For “dictionary” mode a prefix index must be specified, so that we are able to check if a decoded text is a prefix of or equal to a dictionary word.
- __init__(normalize_by_length=True, alpha=1.0, mode='log_likelihood', prefix_index=None)
- Parameters
normalize_by_length (bool) –
alpha (float) –
mode (str) –
prefix_index (Optional[nsc.data.index.PrefixIndex]) –
- Return type
None
- alpha: float = 1.0
- mode: str = 'log_likelihood'
- normalize_by_length: bool = True
- prefix_index: Optional[nsc.data.index.PrefixIndex] = None
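The length-normalized score formula given in the description of nsc.Score can be written as a short function. This is a direct transcription of that formula for illustration; the function name length_normalized_score is hypothetical, and the "dictionary" and "diff_from_input" modes are not covered.

```python
from typing import Sequence


def length_normalized_score(
    log_probabilities: Sequence[float],
    normalize_by_length: bool = True,
    alpha: float = 1.0,
) -> float:
    """score = sum(log_probabilities) / (len(log_probabilities) ^ alpha)"""
    total = sum(log_probabilities)
    if not normalize_by_length or len(log_probabilities) == 0:
        # without normalization the score is the sequence log likelihood
        return total
    return total / (len(log_probabilities) ** alpha)
```

With alpha = 1 the score is the average token log probability, and with alpha = 0 the divisor is 1, so the score equals the plain sum.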
- class nsc.SpellingErrorCorrector
Bases:
nsc.api.utils._APIBase
Spelling error correction
Class to run spelling error correction models.
- __init__(model_dir, device, **kwargs)
Spelling error correction constructor.
Do not use this explicitly. Use the static SpellingErrorCorrector.from_pretrained() and SpellingErrorCorrector.from_experiment() methods instead.
- Parameters
model_dir (str) – directory of the model to load
device (Union[str, int]) – device to load the model in
kwargs (Dict[str, Any]) –
- Return type
None
- correct_file(input_file_path, output_file_path=None, detections=None, search=GreedySearch(), score=Score(normalize_by_length=True, alpha=1.0, mode='log_likelihood', prefix_index=None), batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=True, **kwargs)
Correct spelling errors in a file.
- Parameters
input_file_path (str) – path to an input file, which will be corrected line by line
output_file_path (Optional[str]) – path to an output file, where corrected text will be saved line by line
detections (Optional[Union[str, List[List[int]]]]) – spelling error detections (from a SpellingErrorDetector) to guide the correction, can either be a path to a file containing detections or a list of lists of integers
search (nsc.api.sec.Search) – Search instance to determine the search method to use for decoding
score (nsc.api.sec.Score) – Score instance to determine how to score search paths during decoding
batch_size (int) – how many sequences to process at once
batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to be batch_max_length_factor * model_max_input_length, if a model e.g. has a max input length of 512 tokens and batch_max_length_factor is 4 then one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens (takes precedence over batch_size if specified)
sort_by_length (bool) – sort the inputs by length before processing them
show_progress (bool) – display progress bar
kwargs (Any) –
- Return type
Optional[List[str]]
Returns: corrected file as list of strings if output_file_path is not specified else None
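The effect of batch_max_length_factor can be pictured as packing sequences into batches under a token budget. The sketch below is an illustration under stated assumptions, not the batching code used by nsc: the function names are hypothetical, and it assumes that the budget is compared against the sum of the sequence lengths in a batch (the actual implementation may, for example, account for padding).

```python
from typing import List, Sequence


def max_batch_length(batch_max_length_factor: float, model_max_input_length: int) -> int:
    """Token budget of one batch, e.g. 4 * 512 = 2048."""
    return int(batch_max_length_factor * model_max_input_length)


def make_batches(
    lengths: Sequence[int], budget: int, sort_by_length: bool = True
) -> List[List[int]]:
    """Group sequence indices into batches whose total length stays within budget."""
    order = list(range(len(lengths)))
    if sort_by_length:
        # sorting puts sequences of similar length into the same batch
        order.sort(key=lambda i: lengths[i])
    batches: List[List[int]] = []
    current: List[int] = []
    current_length = 0
    for i in order:
        # start a new batch once the next sequence would exceed the budget
        if current and current_length + lengths[i] > budget:
            batches.append(current)
            current, current_length = [], 0
        current.append(i)
        current_length += lengths[i]
    if current:
        batches.append(current)
    return batches
```

The batches contain indices into the original input, so the outputs can be put back into input order after processing.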
- correct_text(inputs, detections=None, search=GreedySearch(), score=Score(normalize_by_length=True, alpha=1.0, mode='log_likelihood', prefix_index=None), batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=False, **kwargs)
Correct spelling errors in text.
- Parameters
inputs (Union[str, List[str]]) – text to correct given as a single string or a list of strings
detections (Optional[Union[List[int], List[List[int]]]]) – spelling error detections (from a SpellingErrorDetector) to guide the correction, if inputs is a single str, detections must be a list of integers, otherwise if inputs is a list of strings, detections should be a list of lists of integers
search (nsc.api.sec.Search) – Search instance to determine the search method to use for decoding
score (nsc.api.sec.Score) – Score instance to determine how to score search paths during decoding
batch_size (int) – how many sequences to process at once
batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to be batch_max_length_factor * model_max_input_length, if a model e.g. has a max input length of 512 tokens and batch_max_length_factor is 4 then one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens (takes precedence over batch_size if specified)
sort_by_length (bool) – sort the inputs by length before processing them
show_progress (bool) – display progress bar
kwargs (Any) –
- Return type
Union[str, List[str]]
Returns: corrected text as string or list of strings
- static from_experiment(experiment_dir, device='cuda')
Create a new NSC instance using your own experiment.
- Parameters
experiment_dir (str) – path to the experiment directory
device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)
- Return type
Returns: NSC instance
- static from_pretrained(task='sec', model='transformer words nmt', device='cuda', download_dir=None, cache_dir=None, force_download=False)
Create a new NSC instance using a pretrained model.
- Parameters
task (str) – name of the task
model (str) – name of the pretrained model
device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)
download_dir (Optional[str]) – directory where the compressed model will be downloaded to
cache_dir (Optional[str]) – directory where the downloaded, compressed model will be extracted to
force_download (bool) – force download the pretrained model again even if it already exists in the cache_dir
- Return type
Returns: NSC instance
- property mixed_precision_enabled: bool
Check if mixed precision is enabled (precision is set to something other than fp32).
Returns: bool whether mixed precision is enabled
- property model_name: str
Gives the name of the NSC model in use.
Returns: name of the model
- set_precision(precision)
Set the inference precision to use. Default is standard 32bit full precision. 16bit floats or 16bit brain floats should only be used when you have a supported NVIDIA GPU. When running on CPU, fp16 will be overwritten with bfp16 since only bfp16 is supported on CPU for now.
- Parameters
precision (str) – precision identifier (one of {fp32, fp16, bfp16})
- Return type
None
Returns: None
- suggest(text, n, detections=None, **kwargs)
Generate suggestions for possible corrections of a text.
Note that if you use this with a model that translates each word individually (e.g. transformer words nmt), the i-th suggestion string is built by joining the i-th suggestions for all words (they do not necessarily form a useful sentence together). To obtain n suggestions for each word in the text you can simply split every suggestion string on whitespace.
- Parameters
text (str) – text to generate suggestions for
n (int) – number of suggestions to generate
detections (Optional[List[int]]) – optional detections to exclude some words
kwargs (Any) –
- Return type
List[str]
Returns: suggestions as list of strings
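Splitting the suggestion strings on whitespace, as described for suggest above, gives the suggestions per word. A minimal sketch, assuming a word-wise model so that every suggestion string contains the same number of words; the function name suggestions_per_word is hypothetical.

```python
from typing import List


def suggestions_per_word(suggestions: List[str]) -> List[List[str]]:
    """Turn n suggestion strings into one list of n suggestions per word."""
    split = [suggestion.split() for suggestion in suggestions]
    # zip groups the i-th word of every suggestion string together
    return [list(words) for words in zip(*split)]


# usage, requires the nsc package:
# suggestions = sec.suggest("Tihs text has erors!", n=3)
# per_word = suggestions_per_word(suggestions)
```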
- property task_name: str
Check which NSC task is run.
Returns: name of the task
- to(device)
Move the model to a different device.
- Parameters
device (Union[str, int]) – device specifier (e.g. “cuda”, “cpu” or integer denoting GPU device index)
- Return type
nsc.api.utils._APIBase
Returns: self
- class nsc.SpellingErrorDetector
Bases:
nsc.api.utils._APIBase
Spelling error detection
Class to run spelling error detection models.
- __init__(model_dir, device, **kwargs)
Spelling error detection constructor.
Do not use this explicitly. Use the static SpellingErrorDetector.from_pretrained() and SpellingErrorDetector.from_experiment() methods instead.
- Parameters
model_dir (str) – directory of the model to load
device (Union[str, int]) – device to load the model in
kwargs (Dict[str, Any]) –
- Return type
None
- detect_file(input_file_path, output_file_path=None, threshold=0.5, batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=True, **kwargs)
Detect spelling errors in a file.
- Parameters
input_file_path (str) – path to an input file, which will be checked for spelling errors line by line
output_file_path (Optional[str]) – path to an output file, where the detections will be saved line by line
threshold (float) – set detection threshold (0 < threshold < 1)
batch_size (int) – how many sequences to process at once
batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to be batch_max_length_factor * model_max_input_length, if a model e.g. has a max input length of 512 tokens and batch_max_length_factor is 4 then one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens (takes precedence over batch_size if specified)
sort_by_length (bool) – sort the inputs by length before processing them
show_progress (bool) – display progress bar
kwargs (Any) –
- Return type
Optional[Tuple[List[List[int]], List[str]]]
- Returns: tuple of detections as list of lists of integers and output strings as list of strings if output_file_path is not specified else None
- detect_text(inputs, threshold=0.5, batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=False, **kwargs)
Detect spelling errors in text.
- Parameters
inputs (Union[str, List[str]]) – text to check for errors given as a single string or a list of strings
threshold (float) – set detection threshold (0 < threshold < 1)
batch_size (int) – how many sequences to process at once
batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to batch_max_length_factor * model_max_input_length. For example, if a model has a maximum input length of 512 tokens and batch_max_length_factor is 4, one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens. Takes precedence over batch_size if specified.
sort_by_length (bool) – sort the inputs by length before processing them
show_progress (bool) – display progress bar
kwargs (Any) –
- Return type
Tuple[Union[List[int], List[List[int]]], Union[str, List[str]]]
- Returns: tuple of detections as list of integers or list of lists of integers and output strings as str or list of strings
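A minimal sketch of how the returned tuple might be consumed when a list of strings is passed in. StubDetector is a hypothetical stand-in so that the snippet runs without a downloaded model; only the detect_text call and the (detections, outputs) tuple shape come from the signature above. The sketch also assumes word-level detections with one integer per whitespace-separated word, where 1 marks an error.

```python
from typing import List, Tuple


class StubDetector:
    """Hypothetical stand-in for a loaded detector; flags a fixed set of words."""

    def detect_text(
        self, inputs: List[str], threshold: float = 0.5, **kwargs
    ) -> Tuple[List[List[int]], List[str]]:
        misspelled = {"incorect", "sentense!"}
        detections = [
            [int(word in misspelled) for word in line.split()] for line in inputs
        ]
        return detections, inputs


def flagged_words(detector, lines: List[str], threshold: float = 0.5) -> List[List[str]]:
    """Return, for every input line, the words that were flagged as errors."""
    # For list input the documented return value is (list of lists, list of strings).
    detections, _outputs = detector.detect_text(lines, threshold=threshold)
    return [
        [word for word, flag in zip(line.split(), flags) if flag == 1]
        for line, flags in zip(lines, detections)
    ]


result = flagged_words(
    StubDetector(), ["This is an incorect sentense!", "All good here."]
)
print(result)  # [['incorect', 'sentense!'], []]
```

Any object exposing a compatible detect_text method can be passed in place of the stub, for example a detector created with SpellingErrorDetector.from_pretrained().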
- static from_experiment(experiment_dir, device='cuda')
Create a new NSC instance using your own experiment.
- Parameters
experiment_dir (str) – path to the experiment directory
device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)
- Return type
Returns: NSC instance
- static from_pretrained(task='sed words', model='gnn+', device='cuda', download_dir=None, cache_dir=None, force_download=False)
Create a new NSC instance using a pretrained model.
- Parameters
task (str) – name of the task
model (str) – name of the pretrained model
device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)
download_dir (Optional[str]) – directory where the compressed model will be downloaded to
cache_dir (Optional[str]) – directory where the downloaded, compressed model will be extracted to
force_download (bool) – force download the pretrained model again even if it already exists in the cache_dir
- Return type
Returns: NSC instance
- property mixed_precision_enabled: bool
Check if mixed precision is enabled (precision is set to something other than fp32).
Returns: bool whether mixed precision is enabled
- property model_name: str
Gives the name of the NSC model in use.
Returns: name of the model
- set_precision(precision)
Set the inference precision to use. The default is standard 32-bit full precision. Use 16-bit floats (fp16) or 16-bit brain floats (bfp16) only when you have a supported NVIDIA GPU. When running on CPU, fp16 is replaced with bfp16, since only bfp16 is supported on CPU for now.
- Parameters
precision (str) – precision identifier (one of {fp32, fp16, bfp16})
- Return type
None
Returns: None
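The CPU fallback described for set_precision can be sketched as a small pure function. The name resolve_precision and the way the device is inspected are assumptions made for illustration; the library's internal logic is not shown in this reference.

```python
from typing import Union

# Precision identifiers listed in the parameter description above.
VALID_PRECISIONS = {"fp32", "fp16", "bfp16"}


def resolve_precision(precision: str, device: Union[str, int]) -> str:
    """Return the precision that would effectively be used on the given device.

    Hypothetical helper: mirrors the documented rule that fp16 becomes bfp16 on CPU.
    """
    if precision not in VALID_PRECISIONS:
        raise ValueError(
            f"unknown precision {precision!r}, expected one of {sorted(VALID_PRECISIONS)}"
        )
    # Assumption: string specifiers starting with "cpu" denote CPU execution,
    # while integers denote a GPU device index.
    on_cpu = isinstance(device, str) and device.startswith("cpu")
    if on_cpu and precision == "fp16":
        return "bfp16"
    return precision


print(resolve_precision("fp16", "cpu"))   # bfp16
print(resolve_precision("fp16", "cuda"))  # fp16
```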
- property task_name: str
Check which NSC task is run.
Returns: name of the task
- to(device)
Move the model to a different device.
- Parameters
device (Union[str, int]) – device specifier (e.g. “cuda”, “cpu” or integer denoting GPU device index)
- Return type
nsc.api.utils._APIBase
Returns: self
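Because the device parameter accepts either a string or an integer GPU index, a caller may want to bring both forms into one string representation. The sketch below is an assumption about how such a specifier could be normalized (following the common "cuda:N" naming); normalize_device is a hypothetical helper and not part of the nsc API.

```python
from typing import Union


def normalize_device(device: Union[str, int]) -> str:
    """Turn a device specifier into a string such as "cpu", "cuda" or "cuda:1"."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(device, bool):
        raise TypeError("device must be a str or an int, not bool")
    if isinstance(device, int):
        if device < 0:
            raise ValueError("GPU device index must be non-negative")
        # Assumption: an integer denotes the index of a CUDA GPU.
        return f"cuda:{device}"
    if isinstance(device, str):
        return device
    raise TypeError(f"unsupported device specifier: {device!r}")


print(normalize_device(1))      # cuda:1
print(normalize_device("cpu"))  # cpu
```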
- class nsc.TokenizationRepairer
Bases:
nsc.api.utils._APIBase
Tokenization repair
Class to run tokenization repair models.
- __init__(model_dir, device, **kwargs)
Tokenization repair constructor.
Do not use this explicitly. Use the static TokenizationRepairer.from_pretrained() and TokenizationRepairer.from_experiment() methods instead.
- Parameters
model_dir (str) – directory of the model to load
device (Union[str, int]) – device to load the model on
kwargs (Dict[str, Any]) –
- Return type
None
- static from_experiment(experiment_dir, device='cuda')
Create a new NSC instance using your own experiment.
- Parameters
experiment_dir (str) – path to the experiment directory
device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)
- Return type
Returns: NSC instance
- static from_pretrained(task='tokenization repair', model='eo large arxiv with errors', device='cuda', download_dir=None, cache_dir=None, force_download=False)
Create a new NSC instance using a pretrained model.
- Parameters
task (str) – name of the task
model (str) – name of the pretrained model
device (Union[str, int]) – device to load the model to (e.g. “cuda”, “cpu” or integer denoting GPU device index)
download_dir (Optional[str]) – directory where the compressed model will be downloaded to
cache_dir (Optional[str]) – directory where the downloaded, compressed model will be extracted to
force_download (bool) – force download the pretrained model again even if it already exists in the cache_dir
- Return type
Returns: NSC instance
- property mixed_precision_enabled: bool
Check if mixed precision is enabled (precision is set to something other than fp32).
Returns: bool whether mixed precision is enabled
- property model_name: str
Gives the name of the NSC model in use.
Returns: name of the model
- repair_file(input_file_path, output_file_path=None, batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=True)
Repair whitespace in a file.
- Parameters
input_file_path (str) – path to an input file, which will be repaired line by line
output_file_path (Optional[str]) – path to an output file, where repaired text will be saved line by line
batch_size (int) – how many sequences to process at once
batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to batch_max_length_factor * model_max_input_length. For example, if a model has a maximum input length of 512 tokens and batch_max_length_factor is 4, one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens. Takes precedence over batch_size if specified.
sort_by_length (bool) – sort the inputs by length before processing them
show_progress (bool) – display progress bar
- Return type
Optional[List[str]]
Returns: repaired file as list of strings if output_file_path is not specified else None
- repair_text(inputs, batch_size=16, batch_max_length_factor=None, sort_by_length=True, show_progress=False)
Repair whitespace in text.
- Parameters
inputs (Union[str, List[str]]) – text to repair given as a single string or a list of strings
batch_size (int) – how many sequences to process at once
batch_max_length_factor (Optional[float]) – sets the maximum total length of a batch to batch_max_length_factor * model_max_input_length. For example, if a model has a maximum input length of 512 tokens and batch_max_length_factor is 4, one batch will contain as many input sequences as fit within 512 * 4 = 2048 tokens. Takes precedence over batch_size if specified.
sort_by_length (bool) – sort the inputs by length before processing them
show_progress (bool) – display progress bar
- Return type
Union[str, List[str]]
Returns: repaired text as string or list of strings
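The idea behind sort_by_length can be sketched as follows: inputs are sorted by length so that each batch holds sequences of similar size, and the outputs are then written back in the original input order. This is a sketch under assumptions; process_sorted_by_length and collapse_spaces are hypothetical names, and collapse_spaces is only a placeholder for a model call, not a tokenization repair method.

```python
from typing import Callable, List


def process_sorted_by_length(
    inputs: List[str],
    process_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,
) -> List[str]:
    """Process inputs in length-sorted batches and restore the original order."""
    # Indices of the inputs, ordered from shortest to longest text.
    order = sorted(range(len(inputs)), key=lambda i: len(inputs[i]))
    outputs: List[str] = [""] * len(inputs)
    for start in range(0, len(order), batch_size):
        batch_indices = order[start:start + batch_size]
        batch_outputs = process_batch([inputs[i] for i in batch_indices])
        # Write each result back to the position of its original input.
        for i, output in zip(batch_indices, batch_outputs):
            outputs[i] = output
    return outputs


def collapse_spaces(batch: List[str]) -> List[str]:
    """Placeholder for a model call: collapses runs of whitespace."""
    return [" ".join(text.split()) for text in batch]


texts = ["a  much   longer line", "short", "mid  size"]
repaired = process_sorted_by_length(texts, collapse_spaces, batch_size=2)
print(repaired)  # ['a much longer line', 'short', 'mid size']
```

Grouping sequences of similar length typically reduces the amount of padding per batch, which is the usual motivation for such an option.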
- set_precision(precision)
Set the inference precision to use. The default is standard 32-bit full precision. Use 16-bit floats (fp16) or 16-bit brain floats (bfp16) only when you have a supported NVIDIA GPU. When running on CPU, fp16 is replaced with bfp16, since only bfp16 is supported on CPU for now.
- Parameters
precision (str) – precision identifier (one of {fp32, fp16, bfp16})
- Return type
None
Returns: None
- property task_name: str
Check which NSC task is run.
Returns: name of the task
- to(device)
Move the model to a different device.
- Parameters
device (Union[str, int]) – device specifier (e.g. “cuda”, “cpu” or integer denoting GPU device index)
- Return type
nsc.api.utils._APIBase
Returns: self
- nsc.get_available_spelling_error_correction_models()
Get available spelling error correction models
Returns: list of spelling error correction model infos each containing the task name, model name and a short description
- Return type
List[nsc.api.utils.ModelInfo]
- nsc.get_available_spelling_error_detection_models()
Get available spelling error detection models
Returns: list of spelling error detection model infos each containing the task name, model name and a short description
- Return type
List[nsc.api.utils.ModelInfo]
- nsc.get_available_tokenization_repair_models()
Get available tokenization repair models
Returns: list of tokenization repair model infos each containing the task name, model name and a short description
- Return type
List[nsc.api.utils.ModelInfo]
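Putting the pieces together, the following sketch lists the available detection models and loads one of them using only names documented in this reference. The wrapper function load_default_detector is a hypothetical name, and the import is placed inside it so that the snippet can be defined without nsc installed. Calling it requires an installed nsc package and may download the pretrained model.

```python
def load_default_detector(device="cuda"):
    """List the available detection models and load the documented default one."""
    # Imported lazily; requires the nsc package from this project.
    from nsc import (
        SpellingErrorDetector,
        get_available_spelling_error_detection_models,
    )

    # Each info object contains the task name, model name and a short description.
    for info in get_available_spelling_error_detection_models():
        print(info)

    # "sed words" and "gnn+" are the defaults shown in the from_pretrained signature.
    return SpellingErrorDetector.from_pretrained(
        task="sed words", model="gnn+", device=device
    )
```

A detector obtained this way could then be used as, for example, load_default_detector("cpu").detect_text("This is an incorect sentense!").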