spaCy
| spaCy | |
|---|---|
| Original author | Matthew Honnibal |
| Developers | Explosion AI, various |
| Initial release | February 2015[1] |
| Stable release | 3.8.4[2] |
| Repository | |
| Written in | Python, Cython |
| Operating system | Linux, Windows, macOS |
| Platform | Cross-platform |
| Type | Natural language processing |
| License | MIT License |
| Website | spacy.io |
spaCy (/speɪˈsiː/ spay-SEE) is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.[3][4] The library is published under the MIT license and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion.
Unlike NLTK, which is widely used for teaching and research, spaCy focuses on providing software for production usage.[5][6] spaCy also supports deep learning workflows that allow connecting statistical models trained by popular machine learning libraries like TensorFlow, PyTorch or MXNet through its own machine learning library Thinc.[7][8] Using Thinc as its backend, spaCy features convolutional neural network models for part-of-speech tagging, dependency parsing, text categorization and named entity recognition (NER). Prebuilt statistical neural network models to perform these tasks are available for 23 languages, including English, Portuguese, Spanish, Russian and Chinese, and there is also a multi-language NER model. spaCy additionally provides tokenization support for more than 65 languages, which lets users train custom models on their own datasets.[9]
History
- Version 1.0 was released on October 19, 2016, and included preliminary support for deep learning workflows by supporting custom processing pipelines.[10] It further included a rule matcher that supported entity annotations, and an officially documented training API.
- Version 2.0 was released on November 7, 2017, and introduced convolutional neural network models for 7 different languages.[11] It also supported custom processing pipeline components and extension attributes, and featured a built-in trainable text classification component.
- Version 3.0 was released on February 1, 2021, and introduced state-of-the-art transformer-based pipelines.[12] It also introduced a new configuration system and training workflow, as well as type hints and project templates. This version dropped support for Python 2.
Main features
- Non-destructive tokenization
- "Alpha tokenization" support for over 65 languages[13]
- Built-in support for trainable pipeline components such as named entity recognition, part-of-speech tagging, dependency parsing, text classification, entity linking and more
- Statistical models for 19 languages[14]
- Multi-task learning with pretrained transformers like BERT
- Support for custom models in PyTorch, TensorFlow and other frameworks
- State-of-the-art speed and accuracy[15]
- Production-ready training system
- Built-in visualizers for syntax and named entities
- Easy model packaging, deployment and workflow management
Extensions and visualizers
spaCy comes with several extensions and visualizations that are available as free, open-source libraries:
- Thinc: A machine learning library optimized for CPU usage and deep learning with text input.
- sense2vec: A library for computing word similarities, based on Word2vec.[16]
- displaCy: An open-source dependency parse tree visualizer built with JavaScript, CSS and SVG.
- displaCyENT: An open-source named entity visualizer built with JavaScript and CSS.
References
[edit]- ^ "Introducing spaCy". explosion.ai. 19 February 2015. Retrieved 2016-12-18.
- ^ "Release 3.8.4". 14 January 2025. Retrieved 29 January 2025.
- ^ Choi et al. (2015). It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool.
- ^ "Google's new artificial intelligence can't understand these sentences. Can you?". Washington Post. Retrieved 2016-12-18.
- ^ "Facts & Figures - spaCy". spacy.io. Retrieved 2020-04-04.
- ^ Bird, Steven; Klein, Ewan; Loper, Edward; Baldridge, Jason (2008). "Multidisciplinary instruction with the Natural Language Toolkit" (PDF). Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, ACL: 62. doi:10.3115/1627306.1627317. ISBN 9781932432145. S2CID 16932735.
- ^ "PyTorch, TensorFlow & MXNet". thinc.ai. Retrieved 2020-04-04.
- ^ "explosion/thinc". GitHub. Retrieved 2016-12-30.
- ^ "Models & Languages | spaCy Usage Documentation". spacy.io. Retrieved 2020-03-10.
- ^ "explosion/spaCy". GitHub. Retrieved 2021-02-08.
- ^ "explosion/spaCy". GitHub. Retrieved 2021-02-08.
- ^ "explosion/spaCy". GitHub. Retrieved 2021-02-08.
- ^ "Models & Languages - spaCy". spacy.io. Retrieved 2021-02-08.
- ^ "Models & Languages | spaCy Usage Documentation". spacy.io. Retrieved 2021-02-08.
- ^ "Benchmarks | spaCy Usage Documentation". spacy.io. Retrieved 2021-02-08.
- ^ Trask et al. (2015). sense2vec - A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings.
Introduction
Overview
spaCy is a free, open-source Python library designed for advanced natural language processing (NLP), emphasizing efficiency and scalability for production environments.[1] It provides tools for key text processing tasks, including named entity recognition (NER), part-of-speech (POS) tagging, dependency parsing, and more, enabling developers to build robust NLP applications with minimal overhead.[2] Unlike research-oriented libraries, spaCy prioritizes speed and real-world usability, processing large volumes of text while maintaining high accuracy through optimized Cython implementations.[6] The library supports over 75 languages, with a primary focus on English and comprehensive trained pipelines available for more than 25 of them, allowing multilingual applications without extensive reconfiguration.[1] Developed by Explosion AI under the MIT license, spaCy fosters community contributions and commercial integration, making it accessible for both academic and industry use.[7] As of November 2025, the latest version is 3.8.8, which includes updates for compatibility with Python 3.10+ and reduced dependencies.[5] First released in 2015, spaCy has evolved into a modular framework that supports rapid prototyping and deployment of NLP systems.
Design Philosophy
spaCy's design philosophy centers on enabling efficient, production-grade natural language processing (NLP) applications, prioritizing developer productivity and real-world performance over experimental breadth. Developed by Explosion AI, the library emphasizes high-throughput processing in industrial settings, where speed and reliability are paramount for tasks like large-scale information extraction. This approach favors optimized, opinionated components that deliver state-of-the-art accuracy without overwhelming users with configuration choices, ensuring pipelines can handle substantial text volumes in deployable systems.[1]

A core principle is modularity and extensibility, allowing interchangeable components such as tokenizers, parsers, and entity recognizers to be customized or replaced without altering the underlying logic. This is facilitated by a bottom-up configuration system and a global function registry, which supports serializable, programmable pipelines that integrate custom functions seamlessly. By avoiding leaky abstractions, spaCy embraces the complexities of machine learning while maintaining flexibility for advanced users to extend functionality, such as adding bespoke model architectures.[8]

From its early versions, spaCy has integrated with modern machine learning frameworks through its Thinc library, enabling support for backends like PyTorch and TensorFlow to power trainable statistical models. This design choice underscores a user-centric approach, featuring a minimalist API for rapid setup and configuration-driven pipelines that minimize boilerplate code, enhanced by tools like type hints and auto-generated configs for reproducibility. In contrast to more research-oriented libraries like NLTK, which offer broad algorithmic options, spaCy streamlines workflows for practical deployment.[2]

Efficiency is achieved through deliberate trade-offs, including a Cython implementation for core performance-critical parts, which optimizes memory management and processing speed via techniques like hash-based string encoding and shared vocabularies. This avoids unnecessary abstractions typical in academic tools, focusing instead on binary serialization and single-threaded efficiency to support scalable applications without sacrificing usability.[2][8]
History
Origins and Founding
spaCy was created by computational linguist Matthew Honnibal in 2014, emerging from his research on syntactic parsers during his PhD and postdoctoral work, with the goal of building an efficient natural language processing (NLP) library tailored for production environments in Python.[9] Motivated by the limitations of existing NLP tools, which were often slow, inaccurate, or overly complex for commercial applications outside large tech companies, Honnibal began development in July 2014 to address the need for scalable text analysis that could handle real-world demands without requiring extensive expertise.[10] This initiative stemmed from his observation that small companies lacked access to high-quality, practical NLP solutions, prompting him to design spaCy as a fast, production-ready alternative.[3]

The library's first public release occurred on February 19, 2015, and it quickly gained attention for its speed and accuracy in tasks like part-of-speech tagging and dependency parsing.[3] From the outset, spaCy was committed to open-source principles, released under the permissive MIT license to encourage community involvement and widespread adoption within the Python ecosystem.[7] Early development emphasized balancing cutting-edge research advancements, such as the later integration of neural networks in version 2.0, with practical usability, ensuring the library remained efficient and accessible for developers integrating custom machine learning models without sacrificing performance.[9][11]

In October 2016, Honnibal co-founded Explosion AI with Ines Montani to sustain spaCy's growth through commercial consulting services and the development of complementary tools, such as the annotation platform Prodigy, which facilitates active learning for training NLP models.[12][13] This company formation addressed early challenges in funding and scaling the project, allowing focused efforts on enhancing spaCy's capabilities while maintaining its open-source core, and it solidified the library's role as a bridge between academic NLP research and industrial applications.[9]
Major Version Releases
spaCy 1.0 was released on October 18, 2016, marking the library's stable debut with a core processing pipeline that included preliminary support for deep learning workflows through integration with the Thinc machine learning library, as well as custom pipeline components and an entity-aware rule matcher for pattern-based entity recognition. This version established spaCy's foundation for efficient, production-ready NLP, emphasizing speed and modularity while supporting convolutional neural networks for core tasks like part-of-speech tagging and dependency parsing.

Version 2.0 arrived on November 7, 2017, introducing convolutional neural network architectures for improved accuracy in tagging, parsing, and named entity recognition, alongside Bloom embeddings for subword features to handle out-of-vocabulary words effectively.[14][11] It expanded multilingual support with models for languages like Japanese and added a flexible rule-based matcher capable of operating on entities and other annotations, enabling more sophisticated pattern matching beyond simple token sequences.[11] These updates significantly boosted performance, with new pre-trained models achieving higher benchmarks on standard NLP tasks compared to v1.x, while maintaining backward compatibility for most pipelines.[15]

spaCy 3.0 was released on February 1, 2021, representing a major overhaul with a new configuration system using declarative .cfg files for fully reproducible training runs, eliminating hidden defaults and simplifying custom model development.[16][17] It deeply integrated the Thinc library for enhanced modularity, allowing seamless incorporation of PyTorch or TensorFlow models, and introduced transformer-based pipelines that leveraged Hugging Face models for state-of-the-art accuracy, such as 89.9% F-score on NER for English.[18] This version also added spaCy Projects for workflow management, bridging prototyping to production.[16]

Subsequent updates from v3.1 to v3.4, spanning July 2021 to July 2022, emphasized stability enhancements, bug fixes, and expansion of pre-trained models, including new transformer pipelines for languages like Catalan, Danish, Croatian, Finnish, Korean, and Swedish, alongside improvements in typing, speed, and component sourcing for better integration.[19][20][21][22] Versions 3.5 through 3.8, released from January 2023 to November 2025, further optimized performance with new CLI commands for benchmarking and thresholding, fuzzy matching capabilities, entity linking improvements, and enhanced GPU acceleration via Thinc's CuPy backend, enabling efficient hybrid CPU/GPU pipelines for large-scale processing. The latest version, 3.8.8 (as of November 2025), includes updates for compatibility with Python 3.10 and later, along with reduced dependencies.[23][5]

spaCy's development, sustained by Explosion AI and a vibrant open-source community, follows a bi-annual cadence for major releases with frequent patches, incorporating feedback primarily through GitHub issues and pull requests to address evolving NLP needs.
Architecture
Core Components
The core components of spaCy form the foundational data structures and systems that enable efficient natural language processing, emphasizing memory sharing and lazy evaluation to handle large-scale text analysis. At the heart of this architecture is the Doc object, which serves as the central container for processed text, representing a sequence of Token objects while storing linguistic annotations such as entities and spans. This design leverages shared memory through an array of TokenC structs, allowing multiple views (such as tokens and spans) to access the same underlying data without duplication, thereby optimizing performance for applications involving extensive corpora.[24]
The Vocab system acts as a hash-based dictionary that manages lexical information across documents, storing strings, lemmas, and vectors via a StringStore for mapping strings to unique hash values. This approach enables rapid lookups and avoids redundant full-string storage, with Lexeme objects representing word forms that can be shared among multiple Doc instances for memory efficiency. Key operations, such as retrieving word vectors or pruning unused entries, further support scalable vocabulary handling without compromising speed.[25]
Individual tokens within a Doc are encapsulated by the Token class, which provides attributes like .text for the verbatim content, .lemma_ for the base form, and .pos_ for coarse-grained part-of-speech tags drawn from the Universal POS tag set. These properties are accessed through getter methods that enable lazy computation, meaning values such as morphological features or syntactic relations (e.g., .children or .ancestors) are derived on-demand only when a model is available, reducing overhead in unprocessed contexts. Custom extensions can also be added to tokens via methods like .set_extension, allowing flexible attribute management.[26][27]
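A minimal sketch of these structures in use, assuming the small English pipeline en_core_web_sm has been downloaded: token attributes read annotations stored on the Doc, and the shared Vocab round-trips strings through hash values.

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Token attributes such as .lemma_ and .pos_ read annotations stored on the Doc
for token in doc[:4]:
    print(token.text, token.lemma_, token.pos_, token.dep_)

# The shared Vocab interns strings as 64-bit hashes in its StringStore
h = nlp.vocab.strings["Apple"]
print(h, nlp.vocab.strings[h])  # hash value, then the original string back
```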
spaCy's language-specific classes, such as English or Spanish subclasses in the spacy/lang module, customize core processing by defining tailored rules for tokenization and morphology. For instance, tokenizers incorporate language-dependent exceptions—like splitting contractions in English via tokenizer_exceptions.py—along with rules for prefixes, suffixes, and infixes to handle punctuation and special cases accurately. Morphology is supported through rule-based mappers for POS tags and features, or statistical models for feature assignment, ensuring annotations align with linguistic nuances of each language.[28][29]
Underpinning these components is the integration with Thinc, spaCy's underlying machine learning framework, which facilitates the serialization of components into binary formats and efficient model loading for neural network operations. Thinc enables shared model architectures, such as token-to-vector mappings across pipeline elements, while supporting GPU acceleration through libraries like CuPy to enhance computational efficiency without altering the core data structures.[30]
Processing Pipeline
spaCy's processing pipeline assembles a sequence of components that transform input text into annotated documents through successive stages of analysis. The pipeline typically begins with the tokenizer, which segments the text into tokens, followed by components such as the part-of-speech tagger, dependency parser, and named entity recognizer, each building on the outputs of prior stages. Users construct and customize pipelines by adding components using the nlp.add_pipe() method, allowing for flexible ordering and inclusion of both built-in and user-defined elements.[31]
Since version 3.0, spaCy has used a declarative configuration system based on INI-style .cfg files, parsed by Thinc's config machinery, to define pipeline architectures, ensuring reproducibility in setup and execution. These configurations specify the sequence of components in the [nlp] block (e.g., pipeline = ["tok2vec", "tagger", "parser", "ner"]), along with detailed settings for each component in dedicated sections like [components.tagger], including hyperparameters and model architectures. Training parameters, such as the number of epochs or component freezing, are also outlined in the [training] section, with support for variable interpolation and CLI overrides to facilitate consistent experimentation and deployment.[32]
The pipeline operates in a stateless manner, processing texts independently to enable efficient batch handling via methods like nlp.pipe(), which is optimized for large-scale text analysis without retaining session-specific state. Custom components are ordinary functions that receive and return a Doc; they are registered with decorators such as @Language.component (for stateless functions) or @Language.factory (for configurable components) and then added by name with nlp.add_pipe(), as the sketch below illustrates.[31]
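A short sketch combining both ideas, a registered stateless component plus batched processing, assuming en_core_web_sm is installed; the component name doc_length_logger is hypothetical.

```python
import spacy
from spacy.language import Language

@Language.component("doc_length_logger")
def doc_length_logger(doc):
    # Stateless pass-through component: inspect the Doc, return it unchanged
    print(f"{len(doc)} tokens")
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("doc_length_logger", last=True)

# nlp.pipe streams documents in batches rather than one nlp(text) call at a time
texts = ["First document.", "Second document, slightly longer."]
docs = list(nlp.pipe(texts, batch_size=32))
```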
Components declare the annotations they assign and require, and spaCy can analyze the assembled pipeline (for example with nlp.analyze_pipes()) to verify that each component's requirements, such as entities for an entity linker, are produced by earlier components; unmet requirements are surfaced as warnings. This analysis helps prevent runtime errors and promotes pipeline integrity.[31]
For deployment, entire pipelines can be serialized to disk using nlp.to_disk(), which writes the configuration, component weights, and language data to a directory; the spacy package command can additionally bundle a pipeline into an installable Python package. Saved pipelines are loaded via spacy.load() or nlp.from_disk(), enabling straightforward integration into production environments while preserving the configured pipeline structure.[31]
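A round-trip sketch under the same en_core_web_sm assumption; the directory name ./my_pipeline is arbitrary.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.to_disk("./my_pipeline")  # writes config.cfg, component weights and vocab data

# Later, or on another machine with a compatible spaCy version installed:
nlp_restored = spacy.load("./my_pipeline")  # spacy.load also accepts a directory path
doc = nlp_restored("The restored pipeline produces the same annotations.")
```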
Features
Linguistic Processing Capabilities
spaCy's linguistic processing capabilities encompass a range of core natural language processing tasks, enabling the analysis of text at multiple levels, from tokens to syntactic structures and semantic relations. These features are implemented through a modular pipeline that applies rule-based and model-based methods to annotate and interpret input text. The library supports multilingual processing, with language-specific rules and trained models ensuring accurate handling of diverse linguistic phenomena.[28]

Tokenization in spaCy serves as the foundational step, breaking down text into individual tokens using a rule-based splitter. This approach accounts for contractions, such as splitting "don't" into "do" and "n't", while treating punctuation as separate tokens and recognizing multi-word expressions like "U.K." as single units. Language-specific rules and exceptions, defined in the spacy/lang module, allow for precise segmentation; for example, the sentence "Apple is looking at buying U.K. startup for $1 billion" yields 11 tokens, preserving contextually meaningful units.[28]

Part-of-speech (POS) tagging and morphological analysis assign grammatical categories and features to tokens using trained statistical models, such as those in the en_core_web_sm pipeline. POS tags include universal categories like NOUN, VERB, ADJ, and language-specific fine-grained labels, while morphological features capture attributes such as tense (e.g., Past), number (e.g., Sing), and verb form (e.g., Ger). For instance, in the sentence "Apple is looking at buying U.K. startup," the token "Apple" receives the tags PROPN (coarse) and NNP (fine-grained), along with the dependency label nsubj. These annotations provide insights into word classes and inflections essential for downstream tasks.[28]
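The examples above can be reproduced with a few lines, assuming en_core_web_sm is installed.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

print(len(doc))  # 11 tokens; "U.K." stays whole, "$" is split off
for token in doc:
    # token.morph carries features such as Number=Sing or VerbForm=Ger
    print(f"{token.text}\t{token.pos_}\t{token.tag_}\t{token.morph}")
```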
Dependency parsing constructs syntactic dependency trees that represent grammatical relationships between words, employing an efficient transition-based algorithm. Relations are labeled according to the Universal Dependencies scheme, including nsubj (nominal subject), dobj (direct object), ROOT (root), and others, accessible via attributes like Token.dep_. In the example "Autonomous cars shift insurance risk," "cars" is linked to "shift" as nsubj, forming a tree that elucidates sentence structure without relying on constituency parsing. This method enables applications like information extraction by highlighting head-dependent pairs.[28]
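A sketch of reading the parse, under the same en_core_web_sm assumption.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance risk toward manufacturers")

for token in doc:
    # Every token has exactly one head; the sentence root is its own head
    print(f"{token.text:<13} {token.dep_:<8} head: {token.head.text}")
```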
Named entity recognition (NER) identifies and classifies named entities in text, such as PERSON, ORG, GPE (geopolitical entity), and MONEY, with trained models predicting entity boundaries and types; spaCy's training data encodes entity spans using the BILUO scheme (Begin, In, Last, Unit, Out). Entities are exposed as spans in the Doc object via doc.ents, with labels indicating entity boundaries and types. For the sentence "Apple is looking at buying U.K. startup for $1 billion," "Apple" is annotated as an ORG entity spanning characters 0 to 5, while "$1 billion" is tagged as MONEY. This capability supports tasks like entity linking and relation extraction in real-world texts.[28]
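A sketch of iterating over predicted entities; the exact labels depend on the model.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    # Each entity is a Span with character offsets and a predicted label
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
# Typical output (model-dependent): Apple ORG, U.K. GPE, $1 billion MONEY
```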
Lemmatization reduces words to their base or dictionary form using a rule-based system that combines lookup tables from the spacy-lookups-data package with POS-informed rules. Unlike stemming, it produces valid lemmas; for example, "I was reading the books" lemmatizes to ["I", "be", "read", "the", "book"]. This process relies on morphological features to disambiguate forms, enhancing normalization for search and analysis. Complementing this, semantic similarity is computed via vector-based representations from word embeddings in models like en_core_web_md, yielding scores between 0 and 1 for tokens, spans, or documents. For instance, the similarity between "fries" and "burgers" might score around 0.6, reflecting shared semantic space in the embedding model, accessible through methods like doc1.similarity(doc2).[28]
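A short sketch of both features; similarity needs static vectors, so this assumes the medium model en_core_web_md is installed.

```python
import spacy

# Assumes: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

doc = nlp("I was reading the books")
print([t.lemma_ for t in doc])  # ['I', 'be', 'read', 'the', 'book']

# Document-level similarity compares averaged token vectors
print(nlp("fries").similarity(nlp("burgers")))  # roughly 0.6 with md vectors
```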
Performance Optimizations
spaCy achieves high performance through its implementation in Cython, a superset of Python that compiles to C code, enabling efficient handling of core operations such as tokenization and feature hashing. This approach minimizes Python interpreter overhead by leveraging C-level data structures and memory pools, resulting in native-like execution speeds for computationally intensive tasks. For instance, the tokenizer employs a rule-based system with prefix, suffix, and special-case rules, all implemented in Cython for rapid processing.[33][34]

Batch processing is facilitated via the nlp.pipe method, which processes texts in streams or batches, supporting vectorized operations across CPU and GPU for parallelism. This allows for efficient handling of large volumes of data by disabling unused pipeline components and utilizing multiprocessing with configurable batch sizes and process counts, achieving up to 10,014 words per second on CPU hardware for the en_core_web_lg pipeline. GPU acceleration further boosts throughput, reaching 14,954 words per second for the same model.[31][35]
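A sketch of throughput-oriented processing along these lines; the corpus, batch size, and process count are placeholders, and en_core_web_lg is assumed to be installed.

```python
import spacy

nlp = spacy.load("en_core_web_lg")
texts = ["Apple is looking at buying U.K. startup for $1 billion"] * 1000

# Keep only the components needed for NER; disabling the rest saves time.
# n_process spreads batches across worker processes on CPU.
with nlp.select_pipes(enable=["tok2vec", "ner"]):
    for doc in nlp.pipe(texts, batch_size=128, n_process=2):
        ents = [(e.text, e.label_) for e in doc.ents]
```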
Memory efficiency is enhanced by lazy loading mechanisms, where language data and models are loaded only when required, such as during import or pipeline initialization. This on-demand computation extends to document and token attributes, which are calculated as needed rather than pre-computed for entire corpora, reducing the initial memory footprint for large-scale text processing. Additionally, the memory_zone context manager introduced in spaCy 3.8 frees vocabulary cache entries accumulated during a processing block once the block exits.[11][36]
To optimize model size and inference speed, spaCy incorporates techniques like quantization through its underlying Thinc library, converting model weights to lower-precision formats (e.g., INT8) while preserving accuracy, though this feature is selectively enabled in certain releases. Model distillation is also supported to derive compact models from larger transformers, enabling reductions in size for deployment without significant performance degradation. These methods complement pruning strategies in custom development, though built-in pipelines prioritize efficiency via optimized architectures.[37]
In benchmarks, spaCy significantly outperforms NLTK in parsing and tokenization speeds, with spaCy's Cython-based implementation enabling approximately 10x faster word tokenization on comparable hardware. For dependency parsing on the OntoNotes 5.0 dataset, spaCy's RoBERTa-based pipeline achieves 95 unlabeled attachment score (UAS) at high throughput, contrasting NLTK's slower, pure-Python execution suitable for smaller-scale or educational use.[38][39]
| Pipeline/Model | Words per Second (CPU) | Words per Second (GPU) | Source |
|---|---|---|---|
| spaCy en_core_web_lg | 10,014 | 14,954 | spaCy Facts & Figures |
| spaCy en_core_web_trf | 684 | 3,768 | spaCy Facts & Figures |
| Stanza en_ewt | 878 | 2,180 | spaCy Facts & Figures |
Models and Training
Pre-trained Models
spaCy provides a range of pre-trained models that enable immediate use of its natural language processing capabilities without requiring custom training. These models are statistical pipelines trained on large corpora of text data, supporting tasks such as tokenization, part-of-speech tagging, dependency parsing, and named entity recognition. They are available in different sizes to balance accuracy, speed, and resource usage, with variants including small (sm), medium (md), and large (lg) models, as well as transformer-based options (trf).[40][41]

The model sizes vary significantly to accommodate different deployment needs. For English, the small model (en_core_web_sm) is approximately 12 MB and focuses on CPU efficiency with core components but no static word vectors, making it suitable for lightweight applications. The medium model (en_core_web_md) is about 31 MB and includes 20,000 unique word vectors (685,000 keys) covering around 500,000 words, offering a balance of performance and size. The large model (en_core_web_lg) spans roughly 382 MB with a vector table of about 343,000 unique entries (685,000 keys), prioritizing high accuracy for production environments. Transformer models (en_core_web_trf) are about 436 MB, do not include static vectors but leverage deeper architectures like RoBERTa for superior results, though they require more computational resources.[39][42]

In terms of architectures, spaCy's pre-trained models evolved from convolutional neural network (CNN)-based designs in version 2, where components like the tagger and parser rely on a shared tok2vec layer for token representations. Version 3 introduced compatibility with transformer architectures, allowing seamless integration of models from the Hugging Face Transformers library, such as BERT or RoBERTa, via the spacy-transformers package. This enables users to incorporate state-of-the-art pretrained transformers directly into spaCy pipelines for enhanced embedding quality and task performance.[43][18][44]

spaCy offers core pre-trained pipelines for 25 languages, including English, German, Chinese, Spanish, French, and others like Croatian, Finnish, Korean, and Ukrainian, covering diverse linguistic structures from Indo-European to Sino-Tibetan families. Additionally, the spaCy Universe provides community-contributed models extending support to over 75 languages, allowing users to access specialized pipelines for less common languages or domains. These models are trained on web-scale text data, such as blogs, news, and comments, to ensure broad applicability. As of November 2025, the latest model releases are version 3.8.0.[41][35][45]

Downloading and installing these models is straightforward through spaCy's command-line interface or package managers. Users can run python -m spacy download en_core_web_sm to fetch and link the model automatically, or install via pip with pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl for the latest release. Once installed, models load efficiently with nlp = spacy.load("en_core_web_sm"), integrating directly into processing pipelines for immediate inference.[41][46]
Accuracy metrics for these models demonstrate their effectiveness on standard benchmarks. For instance, the large English model (en_core_web_lg) achieves an F1-score of approximately 85.5 for named entity recognition on the OntoNotes 5.0 dataset, while transformer-based variants like en_core_web_trf reach about 90, highlighting the impact of advanced architectures. Smaller models trade some precision for speed, with overall NER F1-scores ranging from 84% to 86% across the non-transformer variants on OntoNotes, establishing their reliability for real-world applications as of 2025 evaluations.[35][39]
| Model Variant | Approximate Size | Key Features | Example Accuracy (NER F1 on OntoNotes 5.0) |
|---|---|---|---|
| en_core_web_sm | 12 MB | CPU-focused, no vectors | 84%[39] |
| en_core_web_md | 31 MB | 20k unique vectors (685k keys) | 85%[39] |
| en_core_web_lg | 382 MB | 343k unique vectors (685k keys), high accuracy | 85.5%[35] |
| en_core_web_trf | 436 MB | RoBERTa integration | 90%[39] |
Custom Model Development
Custom model development in spaCy primarily involves training or fine-tuning pipelines using supervised learning techniques for tasks such as named entity recognition (NER), part-of-speech (POS) tagging, and dependency parsing. The process begins with generating a configuration file via the spacy init config command, which outlines the pipeline architecture, hyperparameters, and training settings in a structured .cfg format. This file supports variable interpolation for paths and allows overrides via command-line arguments. Training is then executed using the spacy train command, which loads the config, processes training and development data, and iteratively updates model weights through minibatches, incorporating validation loops to monitor performance and prevent overfitting.[32]
Training data must be prepared in the efficient binary .spacy format, serialized as DocBin objects containing tokenized texts and annotations. Annotations are provided via Example objects, which specify gold-standard labels for entities (as spans with labels), dependencies (as arcs with relations), and POS tags (as per-token labels), enabling supervised learning across components. For existing datasets in formats like JSON or CoNLL-U, spaCy offers the spacy convert command-line tool to transform them into the required .spacy files, ensuring compatibility with the training pipeline.[47]
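A minimal sketch of building a .spacy training file by hand; the text and the ORG span are illustrative.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer only; gold annotations are attached manually
db = DocBin()

doc = nlp.make_doc("Apple is looking at buying U.K. startup")
# char_span aligns character offsets to token boundaries (returns None on mismatch)
doc.ents = [doc.char_span(0, 5, label="ORG")]
db.add(doc)

db.to_disk("./train.spacy")  # binary corpus consumed by `spacy train`
```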
Hyperparameter tuning is facilitated by the config system's built-in validation during training, where parameters like learning rate, dropout (e.g., 0.1), batch size schedules (using compounding functions from 100 to 1000 examples), and maximum epochs are adjusted iteratively. The command-line interface supports experimentation by allowing config overrides, and integration with Weights & Biases (W&B) is achieved by configuring a WandbLogger in the [training.logger] section of the config, enabling automated sweeps for metrics like loss and task-specific scores while logging artifacts such as model checkpoints. Pre-trained models can serve as initialization points for these workflows to leverage transfer learning.[32][48]
Fine-tuning transformer-based models involves specifying a Hugging Face pretrained architecture by name (e.g., "bert-base-cased") in the config's [components.transformer] block; the spacy-transformers package resolves and loads it into the pipeline. Task-specific heads such as NER or parsing components are stacked on the shared representations, with options to freeze the underlying transformer weights during initial training phases to stabilize adaptation. Sharing the transformer encoder across components in this way improves efficiency, with listener layers propagating its representations downstream.[49]
Evaluation of custom models is performed using the spacy evaluate command, which compares predictions against gold data and computes per-task metrics such as token accuracy for POS tagging (measuring correct tag assignments per token) and Labeled Attachment Score (LAS) for dependency parsing (assessing both head and label correctness). Additional scores include F1 for NER entities and Unlabeled Attachment Score (UAS) for parsing structure, with customizable weights in the config (e.g., emphasizing NER F-score at 0.4). These metrics are output during training and can be detailed post-training for model selection.[46][50]
Extensions and Tools
Visualization and Debugging
spaCy provides a suite of built-in visualization tools, primarily through the displaCy library, to inspect and debug natural language processing outputs such as dependency parses, named entity recognition spans, and rule-based matches. These tools render results in interactive, browser-based formats, facilitating rapid iteration during model development and pipeline debugging. By generating HTML or serving web pages, they allow users to visually trace token relationships and annotations without external dependencies.[51]

The core displaCy visualizer employs a JavaScript renderer to display syntactic dependency parses, highlighting part-of-speech tags and arrows indicating head-child relationships between tokens. It can visualize named entity recognition (NER) spans alongside dependencies when using the combined style. Outputs are generated via the displacy.render() function, which produces HTML markup suitable for embedding in web pages or Jupyter notebooks, or displacy.serve(), which launches a local web server for interactive viewing in a browser. For example, processing a document with doc = nlp("This is a sentence.") followed by displacy.serve(doc, style="dep") starts a server at localhost:5000 displaying the parse tree. Customization options include compact layouts, color schemes, and font adjustments to enhance readability during debugging sessions. In Jupyter environments, displaCy automatically detects the notebook context and returns directly renderable HTML, enabling inline interactive plots without additional setup.[51][52]
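A compact sketch of both entry points; the compact and distance settings are documented dependency-visualizer options.

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")

# In Jupyter, render() displays inline; elsewhere it returns HTML/SVG markup
markup = displacy.render(doc, style="dep", options={"compact": True, "distance": 90})

# From a script, serve an interactive page at http://localhost:5000 instead:
# displacy.serve(doc, style="dep")
```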
displaCy ENT serves as a specialized entity-focused visualizer, overlaying NER spans directly on the original text context with color-coded labels for entity types such as PERSON or ORG. It supports filtering by specific entity types and custom color mappings through the options argument. For instance, displacy.render(doc, style="ent", options={"ents": ["PERSON"]}) highlights only person entities, aiding targeted debugging of entity extraction rules or model predictions. Like the dependency visualizer, it integrates seamlessly with Jupyter for interactive exploration and can export to standalone HTML pages. This tool is particularly useful for verifying span boundaries and label accuracy in longer texts, where sentence-by-sentence rendering via doc.sents prevents overwhelming displays.[51][53]
Debugging utilities in spaCy leverage these visualizers for console-based and interactive inspection. The displacy.serve() method acts as a console renderer by hosting visualizations on a lightweight web server, allowing real-time updates as documents are processed—ideal for iterative testing of pipeline components. Jupyter integration extends this to notebook workflows, where rendered outputs update dynamically with cell executions, supporting exploratory data analysis and error tracing in token attributes or parse failures. Additionally, manual rendering mode (manual=True) enables custom visualizations from raw annotation data, useful for comparing spaCy outputs against external parsers during development.[51][52]
For rule-based matching, spaCy offers visualization through displaCy to highlight pattern matches as custom entities in text, tracing how rules apply to tokens. After applying the Matcher or EntityRuler, matches can either be written to a Doc's entities and rendered with displacy.render(doc, style="ent"), or converted to raw annotation dictionaries and rendered with manual=True, displaying spans with labels like "MATCH" to reveal exact token sequences and overlaps (see the sketch below). The official Rule-based Matcher Explorer provides an interactive web demo for building and testing token patterns in real-time, showing processed text, attribute predictions, and match results side-by-side to debug pattern syntax and coverage. Users input text and define rules via a graphical interface, instantly visualizing hits and misses without coding. This tool, developed by the spaCy team, complements programmatic debugging by offering a low-code environment for pattern refinement.[54][55]
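A sketch of visualizing Matcher hits via manual rendering; the pattern and the BUY_NOUN name are hypothetical.

```python
import spacy
from spacy import displacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Hypothetical pattern: any inflection of "buy" followed by a noun
matcher.add("BUY_NOUN", [[{"LEMMA": "buy"}, {"POS": "NOUN"}]])

doc = nlp("Apple is buying startups again.")
ents = [{"start": doc[start].idx,
         "end": doc[end - 1].idx + len(doc[end - 1]),
         "label": "MATCH"}
        for _, start, end in matcher(doc)]

# manual=True takes raw annotation dicts instead of a processed Doc
displacy.render({"text": doc.text, "ents": ents}, style="ent", manual=True)
```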
In version 3.3 and later, displaCy added support for overlapping spans with a new span style, improving visualization of complex annotations such as multi-label entities or rule-based matches.[56]