Overview

corpkit is a tool for doing corpus linguistics.

It does a lot of the usual things, like parsing, concordancing and keywording, but also extends their potential significantly: you can concordance by searching for combinations of lexical and grammatical features, and can do keywording of lemmas, of subcorpora compared to corpora, or of words in certain positions within clauses.
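As a rough illustration of keywording a subcorpus against the rest of a corpus, here is a minimal sketch in plain Python. It does not use corpkit's own API: the helper names `log_likelihood` and `keywords` and the toy word counts are invented for this example, and log-likelihood is just one common keyness measure, which may differ from the one corpkit applies.

```python
import math
from collections import Counter

def log_likelihood(target_count, target_total, ref_count, ref_total):
    """Signed log-likelihood keyness for one word (hypothetical helper)."""
    combined = target_count + ref_count
    total = target_total + ref_total
    # Expected counts if the word were spread evenly across both samples.
    expected_target = target_total * combined / total
    expected_ref = ref_total * combined / total
    score = 0.0
    for observed, expected in ((target_count, expected_target),
                               (ref_count, expected_ref)):
        if observed > 0:
            score += observed * math.log(observed / expected)
    score *= 2
    # Negative scores mark words that are under-used in the target.
    if target_count / target_total < ref_count / ref_total:
        score = -score
    return score

def keywords(target, reference):
    """Rank every word by keyness, most characteristic of the target first."""
    target_total = sum(target.values())
    ref_total = sum(reference.values())
    scores = {
        word: log_likelihood(target[word], target_total,
                             reference[word], ref_total)
        for word in set(target) | set(reference)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data: one subcorpus compared with the rest of the corpus.
subcorpus = Counter("risk risk risk danger the the of".split())
rest = Counter("the the the of of risk danger danger hope hope".split())

ranked = keywords(subcorpus, rest)
for word, score in ranked:
    print(f"{word:8s} {score:7.3f}")
```

Words with high positive scores are over-represented in the subcorpus; words with negative scores are under-represented there.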

Corpus interrogations can be quickly edited and visualised in complex ways, or saved and loaded within projects, or exported to formats that can be handled by other tools.
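The kind of editing and exporting described here can be sketched with the Python standard library alone. This is an assumption-laden illustration rather than corpkit's implementation (which relies on pandas): the `results` dictionary, the `edit` and `to_csv` helpers and all the counts are hypothetical.

```python
import csv
import io

# Hypothetical interrogation result: subcorpus -> {word: absolute count}.
results = {
    "2013": {"risk": 30, "danger": 10, "threat": 10},
    "2014": {"risk": 45, "danger": 5, "threat": 0},
    "2015": {"risk": 12, "danger": 6, "threat": 6},
}

def edit(result, skip_subcorpora=()):
    """Drop unwanted subcorpora, then convert counts to percentages of each subcorpus total."""
    edited = {}
    for name, counts in result.items():
        if name in skip_subcorpora:
            continue
        total = sum(counts.values())
        edited[name] = {
            word: (100 * n / total if total else 0.0)
            for word, n in counts.items()
        }
    return edited

def to_csv(table):
    """Serialise the table as CSV text, one row per subcorpus."""
    words = sorted({word for counts in table.values() for word in counts})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["subcorpus"] + words)
    for name in sorted(table):
        writer.writerow([name] + [table[name].get(word, 0.0) for word in words])
    return buffer.getvalue()

relative = edit(results, skip_subcorpora={"2015"})
csv_text = to_csv(relative)
print(csv_text)
```

The CSV text can then be opened in a spreadsheet or read by another analysis tool.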

corpkit accomplishes all of this by leveraging a number of sophisticated programming libraries, including pandas, matplotlib, scipy, Tkinter, tkintertable and Stanford CoreNLP.

Screenshots


Searching a corpus using constituency parses

Making relative frequencies, skipping subcorpora

Visualising results as a line graph

Concordancing with constituency queries, manually coding results

Defining pre-installed and custom wordlists

Building corpora, viewing parse tree output

Example figures: Risk Semantics project

Changing frequencies of risk processes

Nominalisation of risk

How often do certain social actors do risking?

Modal auxiliaries in the NYT

Sayers in verbal processes, sorted by increasing frequency

An ocean of modals in the NYT

Using keywording with a list of politicians' names and no external reference corpus

Using subplots to demonstrate the rise of "to put at risk" in U.S. news

Key features

The main difference from other tools is that corpkit is designed to look at combinations of lexical and grammatical features in structured corpora. You can easily count or concordance the subjects of passive clauses, or the verbal groups that occur when a participant is pronominal. Furthermore, you can do this for every subcorpus in your dataset in turn, in order to understand how language might be similar or different across the different parts of your dataset.
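To make the idea of counting a grammatical feature in every subcorpus concrete, the following sketch counts the subjects of passive clauses per subcorpus. It is plain Python over invented data, not corpkit's query interface: the `corpus` dictionary, its (word, lemma, dependency label) tuples and the `count_feature` helper are all assumptions made for illustration. The label `nsubjpass` is the Stanford Dependencies relation for passive subjects.

```python
from collections import Counter

# Hypothetical toy data: subcorpus -> list of (word, lemma, dependency label).
corpus = {
    "1990": [
        ("Lives", "life", "nsubjpass"), ("were", "be", "auxpass"),
        ("risked", "risk", "root"),
        ("Jobs", "job", "nsubjpass"), ("were", "be", "auxpass"),
        ("lost", "lose", "root"),
    ],
    "2000": [
        ("Lives", "life", "nsubjpass"), ("are", "be", "auxpass"),
        ("put", "put", "root"),
        ("She", "she", "nsubj"), ("risked", "risk", "root"),
        ("Lives", "life", "nsubjpass"), ("were", "be", "auxpass"),
        ("saved", "save", "root"),
    ],
}

def count_feature(data, deprel):
    """Count lemmas carrying the given dependency label, one Counter per subcorpus."""
    return {
        name: Counter(lemma for _word, lemma, label in tokens if label == deprel)
        for name, tokens in data.items()
    }

passive_subjects = count_feature(corpus, "nsubjpass")
for name in sorted(passive_subjects):
    print(name, passive_subjects[name].most_common())
```

Comparing the per-subcorpus counters side by side shows how a feature's distribution shifts across the parts of a dataset.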

Also unique to corpkit are:

  • Sophisticated editing and plotting tools (via pandas and matplotlib)
  • Immediately editable results (via tkintertable)
  • Thematic coding and colouring of concordance lines
  • Tool for building re-usable wordlists, getting spelling variants and inflections
  • Simple integration of wordlists and corpus queries
  • Auto-storage of results from investigations, as well as all the options used to generate them

The final key difference between corpkit and most current corpus linguistic software (AntConc, WMatrix, Sketch Engine, UAM Corpus Tool, Wordsmith Tools, etc.) is that corpkit is free and open-source, hackable, and provides both graphical and command-line interfaces, so that it may be useful for geek and non-geek alike.

Download

To download the most recent OSX version, use the link in the menu bar. See the Setup page for (very simple) installation instructions.

Linux users can run the graphical interface by installing corpkit with pip install corpkit and then opening the GUI with python -m corpkit.gui.

Windows users will first need to install a Python interpreter and pip, and can then run pip install corpkit followed by python -m corpkit.gui.

Cite

If you want to cite corpkit, please use:

McDonald, D. (2015). corpkit: a toolkit for corpus linguistics. Retrieved from https://www.github.com/interrogator/corpkit. DOI: http://doi.org/10.5281/zenodo.28361