The Edit tab allows you to manipulate interrogation results in complex ways.
Quickstart: relative frequencies
Before diving into the many features in the Edit
tab, we’ll start small. Let’s use Edit
to turn absolute frequencies to relative frequencies. Just follow these steps:
- Under
To edit
, select an interrogation that you performed previously. Its results and totals branches will be displayed on the right.
- Select
results
to edit its results, rather than its totals.
- Under
Operation
, select %
.
- For
Denominator
, select Self
, and then the totals
branch.
- Under
Edit name
, pick something memorable.
- Hit
Edit
.
This will generate a relative frequency version of your original interrogation, and store it to memory under the name you chose. You can then do all sorts of things with it:
- Select it as the
To edit
data for more complex editing
- Manually alter the values in the spreadsheets and hit
Update interrogation(s)
- Head to the
Visalise
tab to plot the results
- Go back to the
Interrogate
tab and use the Previous
and Next
buttons to see the results in a larger window
- Navigate to
Menu
→ Manage project
and:
- Save it permanently to
project/saved_interrogations
- View the options that generated it
- Rename it
- Export it to CSV
- Remove it from memory
- Delete it from
project/saved_interrogations
You’ll notice, however, that we just skipped over a lot of options and buttons. In the sections below, editing is outlined in more detail.
Selecting data to edit
You can select any interrogation or edited result to edit. After this, you need to select a branch
, either results
or totals
. These correspond to the two spreadsheets in the Interrogate
tab. Some kinds of searches do not generate both branches, however: Count tokens
, for example, produces only a totals branch.
Operations and denominators
A very common task is to turn absolute into relative frequencies. To do this, you simply select '%'
as the operation, and self
totals
as the denominator, as in the Quickstart example. To calculate a ratio, rather than a percentage, you could use the ‘÷’ operator. The subtraction operator may be a useful way of removing a false positive result.
The other operations are:
Combine
: join two interrogations, aggregating any column that appears in both.
Keywords
: perform log-likelihood keywording
%-diff
: perform percentage-difference keywording
rel. dist
.: To use on Interrogations produced with the Distance from root
option
Keywording is treated in more depth here.
Relative frequencies: a more detailed example
Perhaps you are interested in the most common plural nouns in each subcorpus of your corpus. Your subcorpora, however, vary quite a lot in size, so you think relative, rather than absolute, frequencies are more appropriate.
You decide you want to get the relative frequency of plural nouns compared to:
- All plural nouns
- All nouns
- All tokens
To do this, you need to go back to the Interrogate
tab and perform three interrogations, all using the Trees
search type. First, you use Get words
with a query that will match plural nouns. You give the interrogation a memorable name: list plural
:
/NNP?S/ < __
Second, you select Count
as the return value, and define a query to match any noun (name: count noun
):
/NN.?/ < __
Finally, still using the Count tokens
option, you select the preset query Any
, which will match any token (name: count token
).
Now, back in the Edit
pane, you perform three separate edits, creating three different spreadsheets:
Data |
Branch |
Operation |
Denominator |
Branch |
list plural |
results |
% |
Self /list plural |
totals |
list plural |
results |
% |
Count noun |
totals |
list plural |
results |
% |
Count token |
totals |
corpkit’s ability to perform these kinds of operations means that you can generate useful and appropriate statistics from your data. Why show passives as a percentage of all words, when you could show passives as a percentage of all clauses?
Results branch as denominator
So far, whenever we’ve picked a denominator, we’ve selected its totals
branch. If you use a results
branch as a denominator, things get a little more complex. Rather than being divided by the total from that subcorpus, each entry will be divided by the total occurrences of that particular entry in the denominator data. This is a hard thing to explain, though. It’s easier to understand this feature through an actual use-case:
Calculating risk and power
In an investigation of risk language in The New York Times, exploratory analysis suggested that people in positions of power are often the ones doing risking. Politicians risk votes, but shoppers don’t seem to risk fatigue. This seems like an interesting thing to measure … but there is a problem. Some common nouns, like person, are much more frequent than Obama or senator. So, if we try to tally all riskers, person keeps coming out on top.
To work around this, the investigator performs two searches. The first gets Participants in the corpus. The second gets just Participants who are the actors in risk processes:
# |
Data type |
Search option |
Query |
Function filter |
Lemmatise |
Name |
1 |
Dependencies |
Get tokens by role |
LIST:PARTICIPANT |
|
True |
participants |
2 |
Dependencies |
Get "role:dependent" |
\brisk.* |
LIST:ACTOR |
True |
riskers |
We also head to the Wordlists
feature and define a list of people of interest, called PEOP
:
obama
bush
clinton
politican
senator
man
woman
child
baby
The two lists can then be mashed together in the Edit
tab:
Data |
Branch |
Operation |
Denominator |
Branch |
Sort |
Just totals |
Just entries |
riskers |
results |
% |
participants |
results |
Total |
True |
LIST:PEOP |
The visualised output:
</div>
</center>
... Great! As expected, though *person* might be a much more common word than *politician*, politicans are far more likely to be doing risking.
## Combining results
You can use the `combine` operation to add two results together. By default, this happens on the y axis. If you have multiple columns with the same names, these will be aggregated.
## Sorting
When working with multiple subcorpora, sorting becomes a very powerful feature of *corpkit*. Aside from very normal kinds of sorting (by total, by name), you can also sort by *increase*, *decrease*, *static* or *turbulent*.
*corpkit* does this using [`Scipy`'s *linear regression* function](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html). Essentially, a straight trend line is calculated for each entry in the data. By measuring the angle of this line, we can find out which results increasingly or decreasingly occur across the dataset.
If you tick `Keep stats`, you'll be able to see the `slope`, `intercept`, `stderr` and `p value`, where the null hypothesis is that there is no upward or downward slope. `Remove above p` will automatically exclude results not meeting the standard.
## Keywording
If you select the `keywords` operation, you can get log-likelihood keyness scores for the the dataset of interest, compared with the denominator as a reference corpus. When doing keywording, the reference corpus can be either:
1. a results branch of any interrogation
2. A dictionary from the `dictionaries` folder: either the BNC (included), or one you made yourself using the `Save as dictionary` button in the `Interrogate` tab.
Using `Self`-`results` as denominator will determine which words are key in each subcorpus. Each subcorpus is dropped from the reference corpus in turn in order to calculate these values.
Tip: Negative keywords exist, too: try sorting by inverse total to find out which words are uncommon in the target data.
You can also use `%-diff` to calculate keywords via percentage difference, an alternative keyword algorithm.
## Skipping, keeping, merging
You can easily skip, keep or merge particular entries and subcorpora. For entries, you need to to write out some criteria. For subcorpora, you can select from the list.
When writing out entries to keep/remove/merge, you can supply either:
1. A regular expression to match: `^fr*iends?$` will match `fiend`, `fiends`, `friend` and `friends`.
2. A list: `[fiend,fiends,friend,friends]`
Note: Special queries work here, too. If you searched for process, you could keep only verbal processes by using LIST:VERBAL
.
If merging either subcorpora or entries, you may want to provide a new name for the merged item. If you leave this field blank, the first few entry names are joined together with slashes as the new name.
Merging entries is a powerful way to do thematic categorisation: you could merge the names of illnesses as `Illnesses`, and the names of treatments as `Treatments`. You could then edit the edited results, keeping only those two.
## Replacing names
You can change the names of entries after they appear. One way to do this is to click the spreadsheet, make changes, and click `Update interrogation(s)`. Another way is to use the `Replace names` boxes. The box on the left takes a regular expression as a search pattern. It will look in every entry name matching the expression, and replace it with whatever is in the box on the right (leave blank to simply delete the found pattern). Duplicate entry names are then merged.
## Other options
| Option | Function |
|---|---|
| `Keep stats` | When calculating slopes, keep these in the edited data, and show them in the spreadsheet windows |
| `Remove above p` | A null hypothesis is that entries are equally frequent in each subcorpus. When calculating slopes, remove any entries with `p` above `0.05` |
| `Just totals` | Combine every entry before processing |
| `Transpose` | Flip rows and columns (subcorpora become entries, etc.) |
| `Spelling` | Convert or normalise English |
| `Keep top results` | After all other editing and sorting, return only the top n results |
## Naming your edit
As with interrogations, it might be helpful to give your edit a name, so that it can be more easily identified. You don't have to do this, however.
## Performing an edit
Once all your options are set, just hit `Edit`.
Editing is generally very fast, but very large interrogations, combined with many different kinds of edits, may slow things down a bit. Using results, rather than totals, as a denominator will generally take a little longer.
Note: You can use Help
→ Save log
after performing an edit to get some information about what was performed.
If you'd like to see the results in a bigger window, head back to the `Interrogate` tab and use the `Previous` and `Next` buttons to bring up your edited results.
## Next steps
After you've created a set meaningful results, you can head over to the `Visualise` tab to display your findings in an engaging way.