DocuScope

docuscospacy: Support for spaCy models trained on DocuScope and the CLAWS7 tagset


The docuscospacy package contains a set of functions to facilitate the processing of corpora tagged using spaCy models trained on DocuScope and the CLAWS7 tagset.

The documentation for docuscospacy is available on docuscospacy.readthedocs.org and the GitHub code repository is on github.com/browndw/docuscospacy.

Requirements and installation

docuscospacy works with Python 3.8 or newer (tested up to Python 3.10). It also requires spacy >= 3.3.

The recommended way to install docuscospacy is with pip:

pip install docuscospacy

Features

Corpus analysis

The docuscospacy package supports the post-tagging generation of:

Outputs can be controlled either by part-of-speech tag or by DocuScope tag. Thus "can" as a noun and "can" as a verb, for example, can be distinguished.
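As a rough illustration of why tag-aware counting matters, the sketch below counts (token, tag) pairs in plain Python. It does not use the docuscospacy API; the function name tag_frequencies and the sample data are hypothetical, and the tags are CLAWS7-style labels chosen for illustration.

```python
from collections import Counter

# Hypothetical tagged tokens as (token, part-of-speech tag) pairs.
# This is illustrative data, not output from a real tagging run.
tagged = [
    ("The", "AT"), ("can", "NN1"), ("is", "VBZ"), ("empty", "JJ"),
    ("We", "PPIS2"), ("can", "VM"), ("refill", "VVI"),
    ("the", "AT"), ("can", "NN1"),
]


def tag_frequencies(pairs):
    """Count each (lowercased token, tag) pair separately."""
    return Counter((token.lower(), tag) for token, tag in pairs)


freq = tag_frequencies(tagged)

# The noun and the modal verb get separate rows in the count.
for (token, tag), count in sorted(freq.items()):
    print(f"{token}\t{tag}\t{count}")
```

Keying the counter on the pair rather than on the token alone is what keeps the two uses of "can" apart.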

Additionally, tagged multi-token sequences are aggregated for analysis. For example, where "in spite of" is tagged as a single sequence, its tokens are combined into one token.
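The aggregation step can be pictured with a minimal sketch. It assumes, purely for illustration, that sequences are marked with B-/I- style labels (begin/inside) and that untagged tokens carry the label "O"; the function merge_sequences and the label name "Prep" are hypothetical and are not part of the docuscospacy API.

```python
def merge_sequences(tokens):
    """Join each B-/I- run of (text, label) pairs into a single token."""
    merged = []
    open_label = None  # label of the sequence currently being built, if any
    for text, label in tokens:
        prefix, _, name = label.partition("-")
        if prefix == "I" and name == open_label:
            # Continue the open sequence: append the text to the last token.
            merged[-1] = (merged[-1][0] + " " + text, name)
        elif prefix == "B" and name:
            # Start a new sequence and remember its label.
            merged.append((text, name))
            open_label = name
        else:
            # Ordinary token: keep it unchanged and close any open sequence.
            merged.append((text, label))
            open_label = None
    return merged


# Hypothetical input: "in spite of" marked as one three-token sequence.
tokens = [
    ("in", "B-Prep"), ("spite", "I-Prep"), ("of", "I-Prep"), ("rain", "O"),
]
merged = merge_sequences(tokens)
print(merged)
```

After merging, frequency counts treat "in spite of" as one item instead of three unrelated words.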

Other features

  • KWIC tables that locate a node word in a center column with context columns on either side
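To make the KWIC (key word in context) layout concrete, here is a minimal plain-Python sketch. It is not the package's own implementation; the function kwic and its width parameter are hypothetical names used only for this example.

```python
def kwic(tokens, node, width=3):
    """Return (left context, node word, right context) rows for each match."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


tokens = "We can refill the can before we leave".split()
rows = kwic(tokens, "can", width=2)

# Right-align the left context so the node words line up in a center column.
for left, node_word, right in rows:
    print(f"{left:>12} | {node_word} | {right}")
```

Each row keeps the node word in its own column, which makes it easy to scan or sort by what comes immediately before or after it.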

Limits

  • the model that this package is designed for has only been trained on English

  • all data must reside in memory; large data cannot be streamed from disk (which Gensim, for example, supports)

License

Code licensed under Apache License 2.0. See LICENSE file.
