corpkit: a tool for investigating text

Overview

corpkit is a tool for doing corpus linguistics.

It does a lot of the usual things, like parsing, concordancing and keywording, but also extends their potential significantly: you can concordance by searching for combinations of lexical and grammatical features, and can do keywording of lemmas, of subcorpora compared to corpora, or of words in certain positions within clauses.
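
To make "keywording of subcorpora compared to corpora" concrete, here is a minimal sketch of one common keyness measure, log-likelihood. It is an illustration only: this page does not say which measure corpkit uses, and the function name, arguments and counts below are hypothetical.

```python
import math


def log_likelihood(target_freq, ref_freq, target_size, ref_size):
    """Log-likelihood keyness of one word (hypothetical helper, not corpkit's API).

    target_freq / ref_freq: occurrences of the word in the target and reference corpus.
    target_size / ref_size: total number of words in each corpus.
    """
    combined = target_freq + ref_freq
    total = target_size + ref_size
    # Expected frequencies if the word were spread evenly across both corpora
    expected_target = target_size * combined / total
    expected_ref = ref_size * combined / total

    score = 0.0
    # Terms with a zero observed frequency contribute nothing
    if target_freq > 0:
        score += target_freq * math.log(target_freq / expected_target)
    if ref_freq > 0:
        score += ref_freq * math.log(ref_freq / expected_ref)
    return 2 * score


# Invented counts: a word seen 50 times in a 1,000-word subcorpus
# and 20 times in a 2,000-word reference corpus.
keyness = log_likelihood(50, 20, 1000, 2000)
```

A word with the same relative frequency in both corpora scores zero, and the score grows as the two frequencies diverge.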

Corpus interrogations can be quickly edited and visualised in complex ways, or saved and loaded within projects, or exported to formats that can be handled by other tools.

corpkit accomplishes all of this by leveraging a number of sophisticated programming libraries, including pandas, matplotlib, scipy, Tkinter, tkintertable and Stanford CoreNLP.

Screenshots

Searching a corpus using constituency parses
Making relative frequencies, skipping subcorpora
Visualising results as a line graph
Concordancing with constituency queries, manually coding results
Defining pre-installed and custom wordlists
Building corpora, viewing parse tree output

Example figures: Risk Semantics project

Changing frequencies of risk processes
Nominalisation of risk
How often do certain social actors do risking?
Modal auxiliaries in the NYT
Sayers in verbal processes, sorted by increasing frequency
An ocean of modals in the NYT
Using keywording with a list of politicians' names and no external reference corpus
Using subplots to demonstrate the rise of "to put at risk" in U.S. news

Key features

The main difference from other tools is that corpkit is designed to look at combinations of lexical and grammatical features in structured corpora. You can easily count or concordance the subjects of passive clauses, or the verbal groups that occur when a participant is pronominal. Furthermore, you can do this for every subcorpus in your dataset in turn, in order to understand how language might be similar or different across the different parts of your dataset.
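
As a rough picture of what a per-subcorpus comparison looks like, the sketch below builds a small pandas table with one row per subcorpus and turns raw counts into relative frequencies. It does not use corpkit's own API: the helper name, the words counted and every number are invented for illustration.

```python
import pandas as pd


def relative_frequencies(counts, totals, per=1000):
    """Convert raw counts to frequencies per `per` words (hypothetical helper).

    counts: DataFrame with subcorpora as rows and counted items as columns.
    totals: Series of total word counts, indexed by the same subcorpora.
    """
    # Divide each row by the size of its subcorpus, then scale
    return counts.div(totals, axis=0) * per


# Invented counts: three subcorpora (years) and three counted words
counts = pd.DataFrame(
    {"risk": [12, 18, 30], "danger": [8, 6, 5], "threat": [4, 6, 15]},
    index=["1990", "2000", "2010"],
)
# Invented subcorpus sizes in words
totals = pd.Series([1000, 1200, 2000], index=counts.index)

relative = relative_frequencies(counts, totals)
```

Normalising by subcorpus size matters because subcorpora are rarely the same length, so raw counts alone can make a larger subcorpus look more distinctive than it is.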

Also unique to corpkit are:

Sophisticated editing and plotting tools (via pandas and matplotlib)
Immediately editable results (via tkintertable)
Thematic coding and colouring of concordance lines
Tools for building re-usable wordlists and for getting spelling variants and inflections
Simple integration of wordlists and corpus queries
Auto-storage of results from investigations, as well as all the options used to generate them

The final key difference between corpkit and most current corpus linguistic software (AntConc, WMatrix, Sketch Engine, UAM Corpus Tool, Wordsmith Tools, etc.) is that corpkit is free and open-source, hackable, and provides both graphical and command-line interfaces, so that it may be useful for geek and non-geek alike.

Note: A more detailed overview of features can be found on the Features page.

Download

To download the most recent OSX version, use the link in the menu bar, or just click here. See the Setup page for (very simple) installation instructions.

Linux users can run the graphical interface by installing corpkit with pip install corpkit and then opening the GUI with python -m corpkit.gui.

Windows users will need to get a Python interpreter and pip installed, and then run pip install corpkit and python -m corpkit.gui.

Cite

If you want to cite corpkit, please use:

McDonald, D. (2015). corpkit: a toolkit for corpus linguistics. Retrieved from
https://www.github.com/interrogator/corpkit. DOI: http://doi.org/10.5281/zenodo.28361