Sun. Nov 28th, 2021
Machine learning: spaCy 3.1 passes predictions along the pipeline

The Berlin-based company Explosion AI has released version 3.1 of spaCy, its Python library for Natural Language Processing (NLP). Among the new features is the option to pass predictions from one component to another as annotations during training. A new component can also label arbitrary and potentially overlapping text passages.

Passing on predictions

Components are typically trained in isolation: an individual component cannot see the predictions of the components ahead of it in the pipeline. The current release allows components to write annotations during training, which later components can then access. The new configuration setting training.annotating_components specifies which components write annotations.

In this way, for example, the grammatical structure predicted by the dependency parser can be used as a feature in the tagger's Tok2Vec layer, as the following example from the spaCy documentation shows:

[nlp]
pipeline = ["parser", "tagger"]

[components.tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tagger.model.tok2vec.encode.width}
attrs = ["NORM","DEP"]
rows = [5000,2500]
include_static_vectors = false

[training]
annotating_components = ["parser"]
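To inspect a config like this without spaCy installed, the standard library is enough for a rough look, since the format is INI-like with JSON-style values. The following is only a sketch under that assumption: spaCy uses its own config loader, the variable reference ${...} is left unresolved here, and the check that every annotating component also appears in the pipeline is an illustrative sanity check, not spaCy's own validation.

```python
import configparser
import json

# The config from the spaCy documentation example, as plain text.
CONFIG_TEXT = """
[nlp]
pipeline = ["parser", "tagger"]

[components.tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tagger.model.tok2vec.encode.width}
attrs = ["NORM","DEP"]
rows = [5000,2500]
include_static_vectors = false

[training]
annotating_components = ["parser"]
"""

# interpolation=None keeps "${...}" as a literal string instead of
# trying to resolve it; spaCy's real loader resolves such references.
config = configparser.ConfigParser(interpolation=None)
config.read_string(CONFIG_TEXT)

# Values are JSON-style, so json.loads turns them into Python lists.
pipeline = json.loads(config["nlp"]["pipeline"])
annotating = json.loads(config["training"]["annotating_components"])
embed_attrs = json.loads(
    config["components.tagger.model.tok2vec.embed"]["attrs"]
)

# Illustrative sanity check: annotating components should be in the pipeline.
missing = [name for name in annotating if name not in pipeline]

print("pipeline:", pipeline)
print("annotating components:", annotating)
print("embedding attributes:", embed_attrs)
print("annotating components missing from pipeline:", missing)
```

The embedding attributes list is where the dependency label (DEP) written by the parser enters the tagger's token representation.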

Annotations can come from both regular and frozen components (frozen_components); the latter are not updated during training. Non-frozen annotating components add overhead because they run twice on each training batch: the first pass updates the model, and the second pass uses the updated model to make the predictions that are passed on.
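The interplay of updating, annotating, and freezing can be illustrated with a toy model in plain Python. This is a simplified sketch, not spaCy's implementation: ToyDoc, ToyComponent, and train_step are hypothetical names, and the "model" is reduced to a counter so that the order of operations is visible.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a document that carries annotations."""
    words: List[str]
    annotations: Dict[str, str] = field(default_factory=dict)


class ToyComponent:
    """Hypothetical pipeline component; 'reads' names an upstream component."""

    def __init__(self, name: str, reads: Optional[str] = None) -> None:
        self.name = name
        self.reads = reads
        self.updates = 0       # number of training updates performed
        self.predictions = 0   # number of prediction passes performed
        self.seen: List[Optional[str]] = []  # upstream annotation seen per update

    def update(self, doc: ToyDoc) -> None:
        # Record which upstream annotation (if any) was visible while training.
        self.seen.append(doc.annotations.get(self.reads) if self.reads else None)
        self.updates += 1

    def predict(self, doc: ToyDoc) -> None:
        # Write a prediction that reflects the current state of the model.
        self.predictions += 1
        doc.annotations[self.name] = f"{self.name}@update{self.updates}"


def train_step(
    pipeline: Sequence[ToyComponent],
    doc: ToyDoc,
    annotating: Sequence[str] = (),
    frozen: Sequence[str] = (),
) -> None:
    """One toy training step over the pipeline, in order."""
    for component in pipeline:
        if component.name not in frozen:
            component.update(doc)    # first pass: update the model
        if component.name in annotating:
            component.predict(doc)   # second pass: write annotations


parser = ToyComponent("parser")
tagger = ToyComponent("tagger", reads="parser")
train_step([parser, tagger], ToyDoc(["A", "test"]), annotating=["parser"])

print(parser.updates, parser.predictions)  # the parser ran twice in total
print(tagger.seen)  # the tagger saw the parser's post-update prediction
```

In this toy setup, listing the parser under frozen skips its update but still lets it annotate, so only the prediction pass remains.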

Overlapping text passages

spaCy 3.1 introduces the new SpanCategorizer component for labeling arbitrary text passages that may overlap or be nested. The component, which is still marked as experimental, is meant to cover cases where Named Entity Recognition (NER) reaches its limits. NER categorizes the individual entities of a text, but these must be cleanly separable from one another.
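The difference between cleanly separable entities and overlapping passages can be made concrete with a small sketch in plain Python. It does not use spaCy; the span representation as (start, end, label) tuples, the function names, and the example labels are assumptions chosen for illustration.

```python
from itertools import combinations
from typing import List, Tuple

# Hypothetical span format: (start_token, end_token_exclusive, label)
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """True if two spans share at least one token (nesting counts as overlap)."""
    return a[0] < b[1] and b[0] < a[1]


def is_cleanly_separable(spans: List[Span]) -> bool:
    """True if no two spans overlap, as NER-style annotation requires."""
    return not any(overlaps(a, b) for a, b in combinations(spans, 2))


tokens = ["New", "York", "City", "Council"]

# Flat annotation: the two spans do not share tokens.
flat_spans = [(0, 3, "GPE"), (3, 4, "ORG_PART")]

# Nested annotation: "New York City" sits inside "New York City Council".
nested_spans = [(0, 3, "GPE"), (0, 4, "ORG")]

print(is_cleanly_separable(flat_spans))
print(is_cleanly_separable(nested_spans))
```

Only the first set of spans could be expressed as a single layer of non-overlapping entities; the nested set is the kind of case a span-based categorizer is intended for.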

In parallel with the new component, Explosion AI has released a pre-release version of the annotation tool Prodigy, which offers, among other things, a new UI for annotating nested and overlapping passages. The annotations defined in it can be used as training data for SpanCategorizer.

Prodigy enables the labeling of overlapping text passages.

Other new features in spaCy 3.1, such as the additional pipeline packages for Catalan and Danish and the direct connection to the Hugging Face Hub, are described on the Explosion AI blog.
