corpkit is a tool for doing corpus linguistics.
It does a lot of the usual things, like parsing, concordancing and keywording, but also extends their potential significantly: you can concordance by searching for combinations of lexical and grammatical features, and can do keywording of lemmas, of subcorpora compared to corpora, or of words in certain positions within clauses.
Corpus interrogations can be quickly edited and visualised in complex ways, or saved and loaded within projects, or exported to formats that can be handled by other tools.
corpkit accomplishes all of this by leveraging a number of sophisticated programming libraries, including pandas, matplotlib, scipy, Tkinter, tkintertable and Stanford CoreNLP.
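To make the idea of keywording a subcorpus against a reference corpus concrete, here is a minimal plain-Python sketch. It assumes log-likelihood (Dunning's G2) as the keyness measure, which is a common choice in corpus linguistics but is not confirmed here as corpkit's own method. The function names `log_likelihood` and `keywords` are hypothetical and are not part of corpkit's API.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in a target of c tokens
    and b times in a reference of d tokens (illustrative helper)."""
    # Expected counts if the word were spread evenly over both corpora
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def keywords(target_tokens, reference_tokens):
    """Rank the words of the target by signed keyness against the reference."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    n_target = sum(target.values())
    n_ref = sum(reference.values())
    scores = {}
    for word, a in target.items():
        b = reference.get(word, 0)
        g2 = log_likelihood(a, b, n_target, n_ref)
        # Make the score negative when the word is underused in the target
        if a / n_target < b / n_ref:
            g2 = -g2
        scores[word] = g2
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    subcorpus = "risk risk the".split()
    reference = "the the the".split()
    for word, score in keywords(subcorpus, reference):
        print(word, round(score, 2))
```

The same function works for lemmas or for words taken from a particular clause position: only the token lists passed in need to change.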
- Searching a corpus using constituency parses
- Making relative frequencies, skipping subcorpora
- Visualising results as a line graph
- Concordancing with constituency queries, manually coding results
- Defining pre-installed and custom wordlists
- Building corpora, viewing parse tree output
Example figures: Risk Semantics project
- Changing frequencies of risk processes
- Nominalisation of risk
- How often do certain social actors do risking?
- Modal auxiliaries in the NYT
- Sayers in verbal processes, sorted by increasing frequency
- An ocean of modals in the NYT
- Using keywording with a list of politicians' names and no external reference corpus
- Using subplots to demonstrate the rise of "to put at risk" in U.S. news
The main difference from other tools is that corpkit is designed to look at combinations of lexical and grammatical features in structured corpora. You can easily count or concordance the subjects of passive clauses, or the verbal groups that occur when a participant is pronominal. Furthermore, you can do this for every subcorpus in your dataset in turn, in order to understand how language might be similar or different across the different parts of your dataset.
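As an illustration of working across subcorpora, the sketch below uses pandas directly on a small table of counts. The layout (one row per subcorpus, one column per search result), the numbers, and the subcorpus names are all invented for the example. This is not corpkit's own interface, only an assumption about how such results could be handled in pandas.

```python
import pandas as pd

# Hypothetical counts: rows are subcorpora (here, years), columns are results
counts = pd.DataFrame(
    {"risk": [12, 18, 30], "danger": [8, 6, 10], "threat": [20, 16, 10]},
    index=["1990", "2000", "2010"],
)

# Skip a subcorpus before doing any further calculation
kept = counts.drop(index=["2000"])

# Relative frequency: each cell as a percentage of its subcorpus total
relative = kept.div(kept.sum(axis=1), axis=0) * 100

print(relative.round(1))
```

Dividing by row totals makes subcorpora of different sizes comparable, which matters when comparing how a feature behaves across the parts of a dataset.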
Also unique to corpkit are:
- Sophisticated editing and plotting tools (via pandas and matplotlib)
- Immediately editable results (via tkintertable)
- Thematic coding and colouring of concordance lines
- A tool for building re-usable wordlists and for getting spelling variants and inflections
- Simple integration of wordlists and corpus queries
- Auto-storage of results from investigations, as well as all the options used to generate them
The final key difference between corpkit and most current corpus linguistic software (AntConc, WMatrix, Sketch Engine, UAM Corpus Tool, Wordsmith Tools, etc.) is that corpkit is free and open-source, hackable, and provides both graphical and command-line interfaces, so that it may be useful for geek and non-geek alike.
A more detailed overview of features can be found on the Features page.
To download the most recent OSX version, use the link in the menu bar, or just click here. See the Setup page for (very simple) installation instructions.
Linux users can run the graphical interface by installing corpkit with
    pip install corpkit
and then opening the GUI with
    python -m corpkit.gui
Windows users will need to install a Python interpreter and pip, and then run the same two commands:
    pip install corpkit
    python -m corpkit.gui
If you want to cite corpkit, please use:
McDonald, D. (2015). corpkit: a toolkit for corpus linguistics. Retrieved from
https://www.github.com/interrogator/corpkit. DOI: http://doi.org/10.5281/zenodo.28361