In this section, we will go through one ‘real-life’ example of what you have learned so far. We will not cover the whole process but rather give you a guideline for what your approach should look like, so that you can solve the rest yourself. Note that we will mostly skip over linguistic details and focus on the technical side of the issues in this tutorial.
As you probably have guessed from the headline, we will take a look at the passive voice in the English language. Since the Penn Treebank and the SUSANNE Corpus encode their information differently, we will need different queries.
Just in case the Penn Treebank is still your favourite, be prepared to change your mind soon. As previously mentioned, it is always a good idea to start with one concrete sentence, because we do not yet know how the passive is annotated in the Penn Treebank. To begin, we look for any simple sentence containing a passive. Such a sentence could contain the words were taken, so we will simply look for that. There are many possibilities here and the exact choice does not really matter, but pick fairly common words so that you are sure to get at least a couple of good results. So, if we were to look for were taken, what would be a good query?
When you have found the correct query, you should get about 8 results, and a look at them quickly shows that were taken is always annotated like this:
(VP (VBD were)
(VP (VBN taken)
Thus, this is the structure we need to describe with our query. Our current query is far too specific: first of all, we know that any form of to be can appear instead of just were. So, what we are looking for is any verb-tag that immediately dominates a form of to be. In order to describe ‘any verb-tag’ in the Penn Treebank, we will use the regular expression /VB.?/. Since there is no separate tag for forms of to be, we will have to make sure to match them all. This can be achieved with a regular expression like this one: /^(be|is|am|are|was|were|been|being)$/. Combining these two with the correct basic link results in the following query: /VB.?/ < /^(be|is|am|are|was|were|been|being)$/.
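To get a feeling for what the two regular expressions accept before running them against the treebank, you can try them on a few strings. The following is only a minimal sketch in Python, not part of TGrep2; the helper names is_be_form and is_verb_tag are made up for this illustration, and Python's re.search is assumed to behave like an unanchored match, which is why the word pattern carries ^ and $.

```python
import re

# The two patterns from the query, written as Python regular expressions.
BE_FORM = re.compile(r"^(be|is|am|are|was|were|been|being)$")
VERB_TAG = re.compile(r"VB.?")  # unanchored, as in the query


def is_be_form(word):
    """True if the whole word is one of the listed forms of 'to be'."""
    return BE_FORM.search(word) is not None


def is_verb_tag(tag):
    """True if the tag contains VB, optionally followed by one character."""
    return VERB_TAG.search(tag) is not None


if __name__ == "__main__":
    for word in ["were", "being", "bee", "Were"]:
        print(word, is_be_form(word))
    for tag in ["VB", "VBD", "VBN", "NN"]:
        print(tag, is_verb_tag(tag))
```

Note that the pattern is case-sensitive in this sketch, so a sentence-initial Were is not accepted, and that the anchors keep words such as bee from slipping through.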
So, now that we have described our form of to be, there is still something missing: following the form of to be, there has to be a past participle! Or to speak in TGrep-terms: we need to look for a verb-tag that immediately dominates a form of to be and is a sister of as well as immediately precedes a VP which itself immediately dominates a VBN (which is the word-tag for past participles). Getting confused yet?
Since we already have a good query for the first part, we only have to add the rest. If you have memorized the most important basic links, you will know that there is one that exactly fits ‘is a sister of as well as immediately precedes’: the link $. does just that!
Accordingly, our next query should be /VB.?/ < /^(be|is|am|are|was|were|been|being)$/ $. /VP/. The only thing left for us to include in our query is the VBN. Once again, brackets come into play. Without brackets, adding < /VBN/ to our query would make this link refer back to our /VB.?/ at the very beginning, when it is supposed to refer to the VP. The solution, though, is quite simple and gives us our final query:
/VB.?/ < /^(be|is|am|are|was|were|been|being)$/ $. (/VP/ < /VBN/)
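If you want to convince yourself of what this query describes, the following rough sketch re-implements the same pattern in plain Python on two hand-written trees. It is not how TGrep2 works internally; the functions parse and find_passives and both example sentences are invented for this illustration. The sketch assumes simple, well-formed bracketed trees and treats $. as ‘the very next sister’.

```python
import re

BE_FORM = re.compile(r"^(be|is|am|are|was|were|been|being)$")


def parse(text):
    """Turn a bracketed tree into nested lists: [label, child, ...]; words are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # a word (leaf)
        node = [tokens[pos]]  # the label follows the opening bracket
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # skip the closing bracket
        return node

    return read()


def label(node):
    return node[0] if isinstance(node, list) else node


def children(node):
    return node[1:] if isinstance(node, list) else []


def find_passives(node):
    """Yield every verb-tag node matched by the final query."""
    kids = children(node)
    for left, right in zip(kids, kids[1:]):
        # /VB.?/ < /^(be|is|...)$/
        is_be = bool(re.search(r"VB.?", label(left))) and any(
            BE_FORM.search(label(c)) for c in children(left))
        # $. (/VP/ < /VBN/)
        is_participle_vp = bool(re.search(r"VP", label(right))) and any(
            re.search(r"VBN", label(c)) for c in children(right))
        if is_be and is_participle_vp:
            yield left
    for kid in kids:
        yield from find_passives(kid)


# Two invented example sentences, one passive and one not.
PASSIVE = "(S (NP (PRP They)) (VP (VBD were) (VP (VBN taken) (ADVP (RB away)))))"
ACTIVE = "(S (NP (PRP They)) (VP (VBD were) (ADJP (JJ happy))))"

if __name__ == "__main__":
    print(list(find_passives(parse(PASSIVE))))  # [['VBD', 'were']]
    print(list(find_passives(parse(ACTIVE))))   # []
```

The inner check on the right-hand sister plays the role of the brackets in the query: the test for VBN is applied to the VP, not to the verb-tag at the beginning.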
Once you think you have reached your final query, it is always a good idea to take a look at as many results as possible in order to be sure that your results are correct.
We have praised the SUSANNE Corpus quite often in our tutorial, and finally it is time for SUSANNE to shine! There might be faster ways of discovering this, but we will start with the same query as before and look for instances of were taken. The query /^were$/ . /^taken$/ gives us 6 results:
(Vwp (VBDR were)
Now, we might try to describe that relationship as we did before with the Penn Treebank, but because we know that the SUSANNE annotation scheme is very detailed, we will try to find out what Vwp actually stands for. The capital V means we are dealing with a verb group, which is not really a surprise to us. The subcategory symbol w tells us that the verb group begins with were. This already indicates that the tags of the SUSANNE scheme include much more information, but still, this does not really help us. The lowercase p, however, is our holy grail. Any tag that begins with a capital V and contains a lowercase p is a passive verb group! All we have to do is formulate a regular expression describing this pattern.
By now, this should not be too difficult for you. What would be a good regular expression?
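If you would like to check your own answer, here is one possible candidate, tried out in Python rather than in TGrep2. Treat it as a sketch: the pattern ^V.*p is just a direct reading of ‘begins with a capital V and contains a lowercase p’, the helper name is_passive_tag is made up, and apart from Vwp and VBDR the tag strings in the list are only illustrative, so you should still verify the results against the corpus itself.

```python
import re

# Candidate pattern: a capital V at the start, a lowercase p somewhere after it.
PASSIVE_VERB_GROUP = re.compile(r"^V.*p")


def is_passive_tag(tag):
    """True if the tag starts with V and contains a lowercase p later on."""
    return PASSIVE_VERB_GROUP.search(tag) is not None


if __name__ == "__main__":
    # Vwp and VBDR appear in the results above; the others are illustrative strings.
    for tag in ["Vwp", "Vsp", "Vp", "Vd", "VBDR", "Np"]:
        print(tag, is_passive_tag(tag))
```

Written as a TGrep2 pattern, the same candidate would be /^V.*p/.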
As you have seen, even a rather basic grammatical construction can be quite difficult to query for. As a linguist, you will probably be left unsatisfied with this tutorial. Sure, you formulated a good query and were able to extract passive sentences from a treebank, but why?
Well, our tutorial is only meant to be a starting point for your ventures into the world of treebanks. We have (hopefully) taught you the tools you need to discover facts on your own. One possible next step could be to take up our queries on the passive and find out how often the by-phrase, which can be used to realise the subject of the corresponding active clause, is left out. But there are many other exciting things for you to find out as well!