This part of the tutorial will introduce you to TGrep2, a tool for querying annotated corpora.
You will learn…
- how to use basic links, the most important topic here because every query needs them.
- how to use boolean expressions to form more complex queries.
Getting ready for TGrep2
Firstly, you should know how to form regular expressions, because you will need them both for your queries and to follow this chapter.
If you are not familiar with regular expressions, have a look at the chapter Regular Expressions.
Secondly, you’ll need to distinguish between nodes and links.
Nodes and node names
Nodes can be phrase-tags (e.g. NP, VP, NS:s, Vde), word-tags (e.g. NNP, VBZ, NN1u, VBDZ) or regular expressions that match specific words (e.g. /love/, /^wh(o|at|en|ere)$/) or tags.
Note that you cannot insert special characters like ;:,&|<>()$! in your node names.
If, for example, you are searching for a prepositional object like P:u or Tf:u in the SUSANNE Corpus, the colon is one of these special characters, so you need to put your node name into a regular expression: /:u/
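The behaviour of such patterns can be tried out outside TGrep2. The following is a minimal sketch in Python, assuming that a TGrep2 pattern written between slashes behaves like an unanchored search with Python's re module; the list of labels is only an illustration built from the tags mentioned above, not real corpus output.

```python
import re

# Illustrative node labels taken from the tags mentioned in this chapter.
labels = ["P:u", "Tf:u", "NP", "Vde", "NS:s"]

# /:u/ is unanchored, so it matches any label that contains ":u".
functional_u = [label for label in labels if re.search(r":u", label)]
print(functional_u)  # ['P:u', 'Tf:u']

# Alternation between whole strings needs a group with "|".
wh_word = re.compile(r"^wh(o|at|en|ere)$")
words = ["who", "what", "when", "where", "why", "whoa"]
wh_matches = [w for w in words if wh_word.search(w)]
print(wh_matches)  # ['who', 'what', 'when', 'where']

# Square brackets form a character class that matches exactly ONE character,
# so parentheses inside them do not group anything.
class_version = re.compile(r"^wh[o(at)(en)(ere)]$")
print(class_version.search("what") is None)  # True
```

Anchoring with ^ and $ is what keeps a pattern from matching longer labels or words that merely contain it.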
Links describe the relationship between nodes (e.g. A < B searches for a node A that is the parent of node B). There are three basic kinds of relations:
Hierarchical relation (A is the parent of B: A < B)
This image shows the structure of an example sentence. You can see that S dominates every other element and it is the parent node of NP and VP. Hierarchical structure is very easy to see in this picture.
Sister-relation (A is a sister of B: A $ B)
The example above would look like this in labelled bracketing (Penn annotation scheme):
(S (NP (DT The)
       (NN cat))
   (VP (VDZ sat)
       (PP (IN on)
           (NP (DT a)
               (NN mat)))))
You can see clearly which nodes are siblings. DT is a sister node of NN and vice versa. NP and VP are also sister nodes because both are children of S and appear in the same column.
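To make the parent and sister relations concrete, here is a small Python sketch that reads the labelled bracketing of the example sentence and lists the pairs that A < B and A $ B would relate. It is an illustration only, not how TGrep2 works internally; the function names parse, parent_child and sisters are made up for this example.

```python
SENTENCE = "(S (NP (DT The) (NN cat)) (VP (VDZ sat) (PP (IN on) (NP (DT a) (NN mat)))))"


def parse(text):
    """Parse a labelled bracketing into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append((tokens[pos], []))  # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def parent_child(tree):
    """Yield (A, B) label pairs where A is the parent of B (A < B)."""
    label, children = tree
    for child in children:
        yield (label, child[0])
        yield from parent_child(child)


def sisters(tree):
    """Yield (A, B) label pairs of two different nodes with the same parent (A $ B)."""
    label, children = tree
    for i, (a, _) in enumerate(children):
        for j, (b, _) in enumerate(children):
            if i != j:
                yield (a, b)
    for child in children:
        yield from sisters(child)


tree = parse(SENTENCE)
parents = set(parent_child(tree))
sister_pairs = set(sisters(tree))

print(("S", "NP") in parents)        # True: S is the parent of NP
print(("S", "DT") in parents)        # False: S dominates DT, but is not its parent
print(("NP", "VP") in sister_pairs)  # True: both are children of S
print(("DT", "VP") in sister_pairs)  # False: DT and VP have different parents
```

The pairs are compared by label only, which is enough for this short sentence; a real query tool keeps track of the individual nodes.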
Linear relation (A precedes B: A .. B)
To demonstrate how linear relations work, it is best to put the labelled sentence on a single line:
(S (NP (DT The) (NN cat)) (VP (VDZ sat) (PP (IN on) (NP (DT a) (NN mat)))))
Now you can see that The immediately precedes NN (query: /^The$/ . NN) and cat (query: /^The$/ . /^cat$/). It also precedes VP, a and every other element to its right in this view; for this looser relation, use two dots (query: /^The$/ .. VP).
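One way to picture the two linear relations is to number the words from left to right and record which words each node covers. The Python sketch below does this for the example sentence. It assumes that A precedes B when the last word of A comes before the first word of B, and that A immediately precedes B when no word lies between them; the function names are made up for this illustration.

```python
SENTENCE = "(S (NP (DT The) (NN cat)) (VP (VDZ sat) (PP (IN on) (NP (DT a) (NN mat)))))"


def spans(text):
    """Return (label, first, last) for every node and word.

    first and last are the positions of the first and last word covered.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    result, stack, word_index = [], [], 0
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            # Remember the label and the position of its first word.
            stack.append((tokens[i + 1], word_index))
            i += 2
        elif tok == ")":
            label, first = stack.pop()
            result.append((label, first, word_index - 1))
            i += 1
        else:
            # A word covers exactly one position.
            result.append((tok, word_index, word_index))
            word_index += 1
            i += 1
    return result


def precedes(a, b):
    """A .. B: the last word of A comes before the first word of B."""
    return a[2] < b[1]


def immediately_precedes(a, b):
    """A . B: the first word of B directly follows the last word of A."""
    return a[2] + 1 == b[1]


nodes = spans(SENTENCE)
the = next(n for n in nodes if n[0] == "The")

immediate = [n[0] for n in nodes if immediately_precedes(the, n)]
later = [n[0] for n in nodes if precedes(the, n)]

print(immediate)        # ['cat', 'NN']
print("VP" in later)    # True
print("S" in later)     # False: S contains The, so The does not precede it
```

Under these assumptions a node never precedes a node that contains it, which is why S and the first NP do not show up among the elements that The precedes.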
When formulating your queries, keep in mind that SUSANNE and Penn are structured differently: SUSANNE trees are rather flat, while Penn trees are deep. As a result, a hierarchical (parent-child) relation in the Penn Treebank can become a sister-relation in the SUSANNE Corpus.