Overview
corpkit is a tool for doing corpus linguistics.
It does a lot of the usual things, like parsing, concordancing and keywording, but also extends their potential significantly: you can concordance by searching for combinations of lexical and grammatical features, and can do keywording of lemmas, of subcorpora compared to corpora, or of words in certain positions within clauses.
Corpus interrogations can be quickly edited and visualised in complex ways, or saved and loaded within projects, or exported to formats that can be handled by other tools.
corpkit accomplishes all of this by leveraging a number of sophisticated programming libraries, including pandas, matplotlib, scipy, Tkinter, tkintertable and Stanford CoreNLP.
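As a rough illustration of what keywording a subcorpus against a whole corpus involves, the sketch below scores words with the log-likelihood statistic that is commonly used for keyness. This is not corpkit's own code or API: the function names `log_likelihood` and `keywords`, and the toy token lists, are assumptions made purely for the example.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Signed log-likelihood keyness score.

    a: frequency of the word in the target (e.g. a subcorpus)
    b: frequency of the word in the reference (e.g. the whole corpus)
    c: total tokens in the target
    d: total tokens in the reference
    """
    # Expected frequencies if the word were equally common in both
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_target)
    if b > 0:
        score += b * math.log(b / expected_reference)
    score *= 2
    # Negative scores mark words that are under-used in the target
    return score if a / c >= b / d else -score


def keywords(target_tokens, reference_tokens):
    """Return (word, score) pairs for the target, most key first."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    c = sum(target.values())
    d = sum(reference.values())
    scores = {w: log_likelihood(target[w], reference[w], c, d) for w in target}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: a subcorpus and the reference it is compared with
subcorpus = "risk risk risk the a".split()
reference = "the a the a the a risk the a the".split()
ranked = keywords(subcorpus, reference)
```

Passing lemmas instead of word forms, or only the words found in a given clause position, gives the other kinds of keywording described above without changing the scoring.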
Screenshots
- Searching a corpus using constituency parses
- Making relative frequencies, skipping subcorpora
- Visualising results as a line graph
- Concordancing with constituency queries, manually coding results
- Defining pre-installed and custom wordlists
- Building corpora, viewing parse tree output
- Changing frequencies of risk processes
- Nominalisation of risk
- How often do certain social actors do risking?
- Modal auxiliaries in the NYT
- Sayers in verbal processes, sorted by increasing frequency
- An ocean of modals in the NYT
- Using keywording with a list of politicians' names and no external reference corpus
- Using subplots to demonstrate the rise of "to put at risk" in U.S. news
Key features
The main difference from other tools is that corpkit is designed to look at combinations of lexical and grammatical features in structured corpora. You can easily count or concordance the subjects of passive clauses, or the verbal groups that occur when a participant is pronominal. Furthermore, you can do this for every subcorpus in your dataset in turn, in order to understand how language might be similar or different across the different parts of your dataset.
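To make the idea of combining lexical and grammatical features concrete, here is a minimal sketch that counts the lemmas of passive subjects in each subcorpus in turn. It does not use corpkit itself: the data layout, the dependency labels (`nsubjpass`, `auxpass`) and the helper names `count_lemmas`, `count_by_subcorpus` and `percent_of_tokens` are all assumptions for illustration. In practice the annotations would come from a parser such as Stanford CoreNLP.

```python
from collections import Counter

# Hypothetical toy corpus: one list of (word, lemma, dependency function)
# tuples per subcorpus. Real data would come from a parser.
corpus = {
    "1990": [
        ("risks", "risk", "nsubjpass"), ("were", "be", "auxpass"),
        ("taken", "take", "root"), ("she", "she", "nsubj"),
        ("ran", "run", "root"),
    ],
    "2000": [
        ("lives", "life", "nsubjpass"), ("were", "be", "auxpass"),
        ("risked", "risk", "root"), ("risks", "risk", "nsubjpass"),
        ("are", "be", "auxpass"), ("ignored", "ignore", "root"),
    ],
}


def count_lemmas(tokens, function):
    """Count lemmas (lexical feature) filling a given dependency
    function (grammatical feature)."""
    return Counter(lemma for _word, lemma, func in tokens if func == function)


def count_by_subcorpus(corpus, function):
    """Repeat the same count for every subcorpus in turn."""
    return {name: count_lemmas(tokens, function)
            for name, tokens in corpus.items()}


def percent_of_tokens(counts, corpus):
    """Express each subcorpus total as a percentage of its tokens."""
    return {name: 100 * sum(counts[name].values()) / len(corpus[name])
            for name in corpus}


passive_subjects = count_by_subcorpus(corpus, "nsubjpass")
relative = percent_of_tokens(passive_subjects, corpus)
```

Keeping the raw counts separate from the relative frequencies makes it easy to compare subcorpora of different sizes.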
Also unique to corpkit are:
- Sophisticated editing and plotting tools (via pandas and matplotlib)
- Immediately editable results (via tkintertable)
- Thematic coding and colouring of concordance lines
- Tool for building re-usable wordlists, getting spelling variants and inflections
- Simple integration of wordlists and corpus queries
- Auto-storage of results from investigations, as well as all the options used to generate them
The final key difference between corpkit and most current corpus linguistic software (AntConc, WMatrix, Sketch Engine, UAM Corpus Tool, Wordsmith Tools, etc.) is that corpkit is free and open-source, hackable, and provides both graphical and command-line interfaces, so that it may be useful for geek and non-geek alike.
Note: A more detailed overview of features can be found on the Features page.
Download
To download the most recent OSX version, use the link in the menu bar, or just click here. See the Setup page for (very simple) installation instructions.
Linux users can run the graphical interface by installing corpkit with `pip install corpkit` and then opening the GUI with `python -m corpkit.gui`.
Windows users will need to get a Python interpreter and pip installed, and then run `pip install corpkit` and `python -m corpkit.gui`.
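The install-then-launch steps above can also be scripted. The sketch below checks whether a package can be imported before starting the GUI with the same interpreter. It relies only on the Python standard library; the helper names `is_installed` and `launch_gui` are hypothetical and not part of corpkit.

```python
import importlib.util
import subprocess
import sys


def is_installed(package_name):
    """Return True if a top-level package can be found by this interpreter."""
    return importlib.util.find_spec(package_name) is not None


def launch_gui():
    """Run `python -m corpkit.gui` with the current interpreter.

    Assumes corpkit has already been installed, e.g. with
    `pip install corpkit`.
    """
    if not is_installed("corpkit"):
        raise RuntimeError("corpkit not found; run `pip install corpkit` first")
    subprocess.run([sys.executable, "-m", "corpkit.gui"], check=True)
```

Using `sys.executable` rather than a bare `python` means the GUI starts under the interpreter that pip installed corpkit into, which matters on systems with more than one Python.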
Cite
If you want to cite corpkit, please use:
McDonald, D. (2015). corpkit: a toolkit for corpus linguistics. Retrieved from https://www.github.com/interrogator/corpkit. DOI: http://doi.org/10.5281/zenodo.28361