{"title": "Convolution Kernels for Natural Language", "book": "Advances in Neural Information Processing Systems", "page_first": 625, "page_last": 632, "abstract": null, "full_text": "Convolution Kernels for Natural Language\n\nMichael Collins\n\nAT&T Labs\u2013Research\n\n180 Park Avenue, New Jersey, NJ 07932\nmcollins@research.att.com\n\nNigel Duffy\n\nDepartment of Computer Science\n\nUniversity of California at Santa Cruz\n\nnigeduff@cse.ucsc.edu\n\nAbstract\n\nWe describe the application of kernel methods to Natural Language Pro-\ncessing (NLP) problems. In many NLP tasks the objects being modeled\nare strings, trees, graphs or other discrete structures which require some\nmechanism to convert them into feature vectors. We describe kernels for\nvarious natural language structures, allowing rich, high dimensional rep-\nresentations of these structures. We show how a kernel over trees can\nbe applied to parsing using the voted perceptron algorithm, and we give\nexperimental results on the ATIS corpus of parse trees.\n\n1 Introduction\n\nKernel methods have been widely used to extend the applicability of many well-known al-\ngorithms, such as the Perceptron [1], Support Vector Machines [6], or Principal Component\nAnalysis [15]. A key property of these algorithms is that the only operation they require\nis the evaluation of dot products between pairs of examples. One may therefore replace\nthe dot product with a Mercer kernel, implicitly mapping feature vectors in \u0002\u0001\ninto a new\nspace \u0004\u0003\n, and applying the original algorithm in this new feature space. Kernels provide\nan ef\ufb01cient way to carry out these calculations when\n\nis large or even in\ufb01nite.\n\nThis paper describes the application of kernel methods to Natural Language Processing\n(NLP) problems. In many NLP tasks the input domain cannot be neatly formulated as a sub-\n\u0001 . 
Instead, the objects being modeled are strings, trees or other discrete structures\nset of \nwhich require some mechanism to convert them into feature vectors. We describe kernels\nfor various NLP structures, and show that they allow computationally feasible representa-\ntions in very high dimensional feature spaces, for example a parse tree representation that\ntracks all subtrees. We show how a tree kernel can be applied to parsing using the percep-\ntron algorithm, giving experimental results on the ATIS corpus of parses. The kernels we\ndescribe are instances of \u201cConvolution Kernels\u201d, which were introduced by Haussler [10]\nand Watkins [16], and which involve a recursive calculation over the \u201cparts\u201d of a discrete\nstructure. Although we concentrate on NLP tasks in this paper, the kernels should also be\nuseful in computational biology, which shares similar problems and structures.\n\n1.1 Natural Language Tasks\n\nFigure 1 shows some typical structures from NLP tasks. Each structure involves an \u201cob-\nserved\u201d string (a sentence), and some hidden structure (an underlying state sequence or\ntree). We assume that there is some training set of structures, and that the task is to learn\n\n\u0005\n\fa) Lou Gerstner is chairman of IBM\n\n[S [NP Lou Gerstner ] [VP is [NP chairman [PP of [NP IBM ] ] ] ] ]\n\nb) Lou Gerstner is chairman of IBM\nc) Lou/N Gerstner/N is/V chairman/N of/P IBM/N\n\nLou/SP Gerstner/CP is/N chairman/N of/N IBM/SC\n\nFigure 1: Three NLP tasks where a function is learned from a string to some hidden struc-\nture. In (a), the hidden structure is a parse tree. In (b), the hidden structure is an under-\nlying sequence of states representing named entity boundaries (SP = Start person, CP =\nContinue person, SC = Start company, N= No entity). In (c), the hidden states represent\npart-of-speech tags (N = noun, V = verb, P = preposition,).\n\nthe mapping from an input string to its hidden structure. 
We refer to tasks that involve trees as parsing problems, and tasks that involve hidden state sequences as tagging problems.

In many of these problems ambiguity is the key issue: although only one analysis is plausible, there may be very many possible analyses. A common way to deal with ambiguity is to use a stochastic grammar, for example a Probabilistic Context Free Grammar (PCFG) for parsing, or a Hidden Markov Model (HMM) for tagging. Probabilities are attached to rules in the grammar – context-free rules in the case of PCFGs, state transition probabilities and state emission probabilities for HMMs. Rule probabilities are typically estimated using maximum likelihood estimation, which gives simple relative frequency estimates. Competing analyses for the same sentence are ranked using these probabilities. See [3] for an introduction to these methods.

This paper proposes an alternative to generative models such as PCFGs and HMMs. Instead of identifying parameters with rules of the grammar, we show how kernels can be used to form representations that are sensitive to larger sub-structures of trees or state sequences. The parameter estimation methods we describe are discriminative, optimizing a criterion that is directly related to error rate.

While we use the parsing problem as a running example in this paper, kernels over NLP structures could be used in many ways: for example, in PCA over discrete structures, or in classification and regression problems. Structured objects such as parse trees are so prevalent in NLP that convolution kernels should have many applications.

2 A Tree Kernel

The previous section introduced PCFGs as a parsing method. This approach essentially counts the relative number of occurrences of a given rule in the training data and uses these counts to represent its learned knowledge. PCFGs make some fairly strong independence assumptions, disregarding substantial amounts of structural information.
In particular, it does not appear reasonable to assume that the rules applied at level i in the parse tree are unrelated to those applied at level i + 1.

As an alternative we attempt to capture considerably more structural information by considering all tree fragments that occur in a parse tree. This allows us to capture higher order dependencies between grammar rules. See figure 2 for an example. As in a PCFG the new representation tracks the counts of single rules, but it is also sensitive to larger sub-trees.

Conceptually we begin by enumerating all tree fragments that occur in the training data 1, 2, ..., n. Note that this is done only implicitly. Each tree is represented by an n-dimensional vector where the i'th component counts the number of occurrences of the i'th tree fragment. Let us define the function h_i(T) to be the number of occurrences of the i'th tree fragment in tree T, so that T is now represented as h(T) = (h_1(T), h_2(T), ..., h_n(T)).

a) [S [NP [N Jeff ] ] [VP [V ate ] [NP [D the ] [N apple ] ] ] ]
b) The six sub-trees of [NP [D the ] [N apple ] ]:
   [NP D N ], [NP [D the ] N ], [NP D [N apple ] ], [NP [D the ] [N apple ] ], [D the ], [N apple ]

Figure 2: a) An example tree. b) The sub-trees of the NP covering the apple. The tree in (a) contains all of these sub-trees, and many others. We define a sub-tree to be any sub-graph which includes more than one node, with the restriction that entire (not partial) rule productions must be included.
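To make the implicit feature space concrete, the tree fragments of figure 2 can be enumerated by brute force. The nested-tuple tree encoding and helper names below are illustrative assumptions, not from the paper; the code assumes well-formed parse trees in which a production's children are either all non-terminals or a single word.

```python
# Brute-force enumeration of tree fragments: a fragment keeps whole
# productions, so at each child we either stop at its non-terminal
# label or expand it with any fragment rooted at that child.
from itertools import product

def fragments(node):
    """All fragments rooted at `node` (each contains >= 1 full production)."""
    if isinstance(node, str) or all(isinstance(c, str) for c in node[1:]):
        return [node]  # a pre-terminal's only fragment is its own production
    child_options = []
    for child in node[1:]:
        # stop at the child's label, or continue with any of its fragments
        child_options.append([child[0]] + fragments(child))
    return [(node[0],) + combo for combo in product(*child_options)]

def all_fragments(tree):
    """Fragments rooted at every node of the tree."""
    result = []
    if not isinstance(tree, str):
        result.extend(fragments(tree))
        for child in tree[1:]:
            result.extend(all_fragments(child))
    return result

np_tree = ("NP", ("D", "the"), ("N", "apple"))
print(len(all_fragments(np_tree)))  # 6, matching figure 2b
```

The count grows exponentially with tree size, which is exactly why the kernel computes inner products in this space without ever enumerating it.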
For example, the fragment [NP [D the ]] is excluded because it contains only part of the production NP → D N.

Note that n will be huge (a given tree will have a number of subtrees that is exponential in its size). Because of this we would like to design algorithms whose computational complexity does not depend on n.

Representations of this kind have been studied extensively by Bod [2]. However, the work in [2] involves training and decoding algorithms that depend computationally on the number of subtrees involved.(1) The parameter estimation techniques described in [2] do not correspond to maximum-likelihood estimation or a discriminative criterion: see [11] for discussion. The methods we propose show that the score for a parse can be calculated in polynomial time in spite of an exponentially large number of subtrees, and that efficient parameter estimation techniques exist which optimize discriminative criteria that have been well-studied theoretically.

Goodman [9] gives an ingenious conversion of the model in [2] to an equivalent PCFG whose number of rules is linear in the size of the training data, thus solving many of the computational issues. An exact implementation of Bod's parsing method is still infeasible, but Goodman gives an approximation that can be implemented efficiently. However, the method still suffers from the lack of justification of the parameter estimation techniques.

The key to our efficient use of this high dimensional representation is the definition of an appropriate kernel. We begin by examining the inner product between two trees T_1 and T_2 under this representation, K(T_1, T_2) = h(T_1) · h(T_2). To compute K we first define N_1 and N_2 to be the sets of nodes in trees T_1 and T_2 respectively. We define the indicator function I_i(n) to be 1 if sub-tree i is seen rooted at node n, and 0 otherwise. It follows that h_i(T_1) = Σ_{n_1 ∈ N_1} I_i(n_1) and h_i(T_2) = Σ_{n_2 ∈ N_2} I_i(n_2).
The first step to efficient computation of the inner product is the following property (which can be proved with some simple algebra):

K(T_1, T_2) = h(T_1) · h(T_2) = Σ_i h_i(T_1) h_i(T_2) = Σ_{n_1 ∈ N_1} Σ_{n_2 ∈ N_2} C(n_1, n_2)

where we define C(n_1, n_2) = Σ_i I_i(n_1) I_i(n_2). Next, we note that C(n_1, n_2) can be computed in polynomial time, due to the following recursive definition:

- If the productions at n_1 and n_2 are different, C(n_1, n_2) = 0.
- If the productions at n_1 and n_2 are the same, and n_1 and n_2 are pre-terminals,(2) then C(n_1, n_2) = 1.
- Else if the productions at n_1 and n_2 are the same, and n_1 and n_2 are not pre-terminals,

  C(n_1, n_2) = Π_{j=1}^{nc(n_1)} (1 + C(ch(n_1, j), ch(n_2, j)))

  where nc(n_1) is the number of children of n_1 in the tree; because the productions at n_1 and n_2 are the same, we have nc(n_1) = nc(n_2). The j'th child-node of n_1 is ch(n_1, j).

To see that this recursive definition is correct, note that C(n_1, n_2) simply counts the number of common subtrees that are found rooted at both n_1 and n_2. The first two cases are trivially correct.

(1) In training, a parameter is explicitly estimated for each sub-tree. In searching for the best parse, calculating the score for a parse in principle requires summing over an exponential number of derivations underlying a tree, and in practice is approximated using Monte-Carlo techniques.
(2) Pre-terminals are nodes directly above words in the surface string, for example the N, V, and D symbols in Figure 2.
The last, recursive, definition follows because a common subtree rooted at n_1 and n_2 can be formed by taking the production at n_1, together with a choice at each child of simply taking the non-terminal at that child, or any one of the common sub-trees rooted at that child. Thus there are (1 + C(ch(n_1, j), ch(n_2, j))) possible choices at the j'th child. (Note that a similar recursion is described by Goodman [9], Goodman's application being the conversion of Bod's model [2] to an equivalent PCFG.)

It is clear from the identity K(T_1, T_2) = Σ_{n_1} Σ_{n_2} C(n_1, n_2), and the recursive definition of C(n_1, n_2), that K(T_1, T_2) can be calculated in O(|N_1| |N_2|) time: the matrix of C(n_1, n_2) values can be filled in, then summed. This can be a pessimistic estimate of the runtime. A more useful characterization is that it runs in time linear in the number of pairs (n_1, n_2) ∈ N_1 × N_2 such that the productions at n_1 and n_2 are the same. In our data we have found a typically linear number of nodes with identical productions, so that most values of C(n_1, n_2) are 0, and the running time is close to linear in the size of the trees.

This recursive kernel structure, where a kernel between two objects is defined in terms of kernels between its parts, is quite a general idea. Haussler [10] goes into some detail describing which construction operations are valid in this context, i.e. which operations maintain the essential Mercer conditions.
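The recursion can be sketched in a few lines of Python over trees written as nested tuples. The encoding and helper names are illustrative assumptions, not from the paper; for completeness the sketch also includes the two refinements discussed later in this section, the normalized kernel K' and a size-downweighting parameter λ (lam below, with lam = 1 giving the plain kernel).

```python
# Sketch of the tree kernel: K(T1, T2) = sum over node pairs of C(n1, n2).
# A tree is a nested tuple (label, child, ...); a leaf is a bare string.
import math

def nodes(t):
    """All internal nodes of a tree (leaves carry no production)."""
    return [] if isinstance(t, str) else [t] + [n for c in t[1:] for n in nodes(c)]

def production(n):
    """The rule at a node, e.g. ('NP', 'D', 'N')."""
    return (n[0],) + tuple(c if isinstance(c, str) else c[0] for c in n[1:])

def C(n1, n2, lam=1.0):
    """(Weighted) number of common subtrees rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    if all(isinstance(c, str) for c in n1[1:]):   # pre-terminal base case
        return lam
    out = lam
    for c1, c2 in zip(n1[1:], n2[1:]):            # same production, so the
        out *= 1.0 + C(c1, c2, lam)               # children line up
    return out

def K(t1, t2, lam=1.0):
    return sum(C(a, b, lam) for a in nodes(t1) for b in nodes(t2))

def K_normalized(t1, t2, lam=1.0):
    """K'(T1,T2) = K(T1,T2) / sqrt(K(T1,T1) K(T2,T2))."""
    return K(t1, t2, lam) / math.sqrt(K(t1, t1, lam) * K(t2, t2, lam))

np_tree = ("NP", ("D", "the"), ("N", "apple"))
print(K(np_tree, np_tree))              # 6.0: the six sub-trees of figure 2b
print(K_normalized(np_tree, np_tree))   # 1.0 by construction
```

Memoizing C over node pairs gives the O(|N_1||N_2|) tabular computation described above; the plain recursion is kept here for readability.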
This paper and previous work by Lodhi et al. [12] examining the application of convolution kernels to strings provide some evidence that convolution kernels may provide an extremely useful tool for applying modern machine learning techniques to highly structured objects. The key idea here is that one may take a structured object and split it up into parts. If one can construct kernels over the parts then one can combine these into a kernel over the whole object. Clearly, this idea can be extended recursively so that one only needs to construct kernels over the "atomic" parts of a structured object. The recursive combination of the kernels over parts of an object retains information regarding the structure of that object.

Several issues remain with the kernel we describe over trees and convolution kernels in general. First, the value of K(T_1, T_2) will depend greatly on the size of the trees T_1 and T_2. One may normalize the kernel by using

K'(T_1, T_2) = K(T_1, T_2) / sqrt( K(T_1, T_1) K(T_2, T_2) )

which also satisfies the essential Mercer conditions. Second, the value of the kernel when applied to two copies of the same tree can be extremely large (in our experiments on the order of 10^6) while the value of the kernel between two different trees is typically much smaller (in our experiments the typical pairwise comparison is of order 100). By analogy with a Gaussian kernel we say that the kernel is very peaked. If one constructs a model which is a linear combination of trees, as one would with an SVM [6] or the perceptron, the output will be dominated by the most similar tree and so the model will behave like a nearest neighbor rule. There are several possible solutions to this problem.
Following Haussler [10] we may radialize the kernel; however, it is not always clear that the result is still a valid kernel. Radializing did not appear to help in our experiments.

These problems motivate two simple modifications to the tree kernel. Since there will be many more tree fragments of larger size – say depth four versus depth three – and consequently less training data, it makes sense to downweight the contribution of larger tree fragments to the kernel. The first method for doing this is to simply restrict the depth of the tree fragments we consider. The second method is to scale the relative importance of tree fragments with their size.
This can be achieved by introducing a parameter 0 < λ ≤ 1, and modifying the base case and recursive case of the definitions of C(n_1, n_2) to be respectively

C(n_1, n_2) = λ, and
C(n_1, n_2) = λ Π_{j=1}^{nc(n_1)} (1 + C(ch(n_1, j), ch(n_2, j))).

This corresponds to a modified kernel K(T_1, T_2) = Σ_i λ^{size_i} h_i(T_1) h_i(T_2), where size_i is the number of rules in the i'th fragment. This kernel downweights the contribution of tree fragments exponentially with their size.

It is straightforward to design similar kernels for tagging problems (see figure 1) and for another common structure found in NLP, dependency structures. See [5] for details. In the tagging kernel, the implicit feature representation tracks all features consisting of a subsequence of state labels, each with or without an underlying word. For example, the paired sequence Lou/SP Gerstner/CP is/N chairman/N of/N IBM/SC would include features such as SP CP, SP Gerstner/CP N, SP CP is/N N of/N, and so on.

3 Linear Models for Parsing and Tagging

This section formalizes the use of kernels for parsing and tagging problems. The method is derived by the transformation from ranking problems to a margin-based classification problem in [8]. It is also related to the Markov Random Field methods for parsing suggested in [13], and the boosting methods for parsing in [4].
We consider the following set-up:

- Training data is a set of example input/output pairs. In parsing we would have training examples {(s_i, t_i)} where each s_i is a sentence and each t_i is the correct tree for that sentence.

- We assume some way of enumerating a set of candidates for a particular sentence. We use x_{ij} to denote the j'th candidate for the i'th sentence in training data, and C(s_i) = {x_{i1}, x_{i2}, ...} to denote the set of candidates for s_i. (A context-free grammar – perhaps taken straight from the training examples – is one way of enumerating candidates. Another choice is to use a hand-crafted grammar, such as the LFG grammar in [13], or to take the most probable parses from an existing probabilistic parser, as in [4].)

- Without loss of generality we take x_{i1} to be the correct parse for s_i (i.e., x_{i1} = t_i).

- Each candidate x_{ij} is represented by a feature vector h(x_{ij}) in the space R^d. The parameters of the model are also a vector w in R^d. We then define the "ranking score" of each example as w · h(x_{ij}). This score is interpreted as an indication of the plausibility of the candidate. The output of the model on a training or test example s is the highest scoring candidate in C(s).

When considering approaches to training the parameter vector w, note that a ranking function that correctly ranked the correct parse above all competing candidates would satisfy the conditions w · h(x_{i1}) > w · h(x_{ij}) for all i, for j = 2 ... n_i. It is simple to modify the Perceptron and Support Vector Machine algorithms to treat this problem. For example, the SVM optimization problem (hard margin version) is to find the w which minimizes ||w||^2 subject to the constraints w · h(x_{i1}) − w · h(x_{ij}) ≥ 1. Rather than explicitly calculating w, the perceptron algorithm and Support Vector Machines can be formulated as a search for "dual parameters" α_{ij} which determine the optimal weights

w* = Σ_{i, j≥2} α_{ij} (h(x_{i1}) − h(x_{ij}))   (1)

(we use Σ_{i, j≥2} as shorthand for Σ_i Σ_{j=2}^{n_i}). It follows that the score of a parse can be calculated using the dual parameters, and inner products between feature vectors, without having to explicitly deal with feature or parameter vectors in the space R^d:

w* · h(x) = Σ_{i, j≥2} α_{ij} (h(x_{i1}) · h(x) − h(x_{ij}) · h(x)).

For example, see figure 3 for the perceptron algorithm applied to this problem.

Define: F(x) = Σ_{i, j≥2} α_{ij} (h(x_{i1}) · h(x) − h(x_{ij}) · h(x))
Initialization: Set dual parameters α_{ij} = 0
For i = 1 ... n, j = 2 ... n_i:
  If F(x_{i1}) > F(x_{ij}) do nothing, Else α_{ij} = α_{ij} + 1

Figure 3: The perceptron algorithm for ranking problems.

4 Experimental Results

To demonstrate the utility of convolution kernels for natural language we applied our tree kernel to the problem of parsing the Penn treebank ATIS corpus [14]. We split the treebank randomly into a training set of size 800, a development set of size 200 and a test set of size 336.

Depth: 1 2 3 4 5 6

Table 1: Score shows how the parse score varies with the maximum depth of sub-tree considered by the perceptron. Improvement is the relative reduction in error in comparison to the PCFG, which scored 74%. The numbers reported are the mean and standard deviation over the 10 development sets. (Restricting fragment depth can be implemented using a modified dynamic programming table where C(n_1, n_2, d) stores the number of common subtrees at nodes n_1, n_2 of depth d or less; the recursive definition of C can be modified appropriately.)
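Because the ranking perceptron of figure 3 touches candidates only through inner products, any kernel can stand in for the explicit dot product. A minimal sketch, in which the data layout and function names are illustrative assumptions rather than the paper's implementation:

```python
# Sketch of the ranking perceptron (figure 3) in dual form.
# candidates[i] lists the candidates for sentence i, correct parse first;
# K is any kernel (e.g. the tree kernel) between candidates.

def train_ranking_perceptron(candidates, K, epochs=1):
    alpha = {}  # dual parameters alpha[i, j], implicitly 0

    def F(x):
        # F(x) = sum_{i,j} alpha[i,j] * (K(x_i1, x) - K(x_ij, x))
        return sum(a * (K(candidates[i][0], x) - K(candidates[i][j], x))
                   for (i, j), a in alpha.items())

    for _ in range(epochs):
        for i, cands in enumerate(candidates):
            for j in range(1, len(cands)):
                if F(cands[0]) > F(cands[j]):
                    continue            # correct parse already ranked higher
                alpha[i, j] = alpha.get((i, j), 0) + 1
    return alpha, F

# Toy run with a plain dot-product kernel over explicit feature vectors.
dot = lambda u, v: sum(a * b for a, b in zip(u, v))
cands = [[(1.0, 0.0), (0.0, 1.0)],     # sentence 0: correct candidate first
         [(1.0, 1.0), (0.0, 2.0)]]     # sentence 1: correct candidate first
alpha, F = train_ranking_perceptron(cands, dot, epochs=3)
assert all(F(c[0]) > F(c[j]) for c in cands for j in range(1, len(c)))
```

The voted variant used in the experiments additionally averages the predictions of the intermediate models, but the kernelized update is the same.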
This was done 10 different ways to obtain statistically significant results. A PCFG was trained on the training set, and a beam search was used to give a set of parses, with PCFG probabilities, for each of the sentences. We applied a variant of the voted perceptron algorithm [7], which is a more robust version of the original perceptron algorithm with performance similar to that of SVMs. The voted perceptron can be kernelized in the same way that SVMs can, but it can be considerably more computationally efficient.

We generated a ranking problem by having the PCFG generate its top 100 candidate parse trees for each sentence. The voted perceptron was applied, using the tree kernel described previously, to this re-ranking problem. It was trained on 20 trees selected randomly from the top 100 for each sentence and had to choose the best candidate from the top 100 on the test set. We tested the sensitivity to two parameter settings: first, the maximum depth of sub-tree examined, and second, the scaling factor used to down-weight deeper trees. For each value of the parameters we trained on the training set and tested on the development set. We report the results averaged over the development sets in Tables 1 and 2.

We report a parse score which combines precision and recall.
Define c_i to be the number of correctly placed constituents in the i'th test tree, p_i to be the number of constituents proposed, and g_i to be the number of constituents in the true parse tree. A constituent is defined by a non-terminal label and its span. The precision and recall on the i'th parse are c_i / p_i and c_i / g_i respectively. The score is then the average precision and recall, weighted by the size of the trees. We also give relative improvements over the PCFG scores: if the PCFG score is x% and the perceptron score is y%, the relative improvement is 100 × (y − x) / (100 − x), i.e., the relative reduction in error.

Scale: 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Table 2: Score shows how the parse score varies as the scaling factor for deeper sub-trees is varied. Imp. is the relative reduction in error in comparison to the PCFG, which scored 74%. The numbers reported are the mean and standard deviation over the 10 development sets.

We finally used the development set for cross-validation to choose the best parameter settings for each split. We used the best parameter settings (on the development sets) for each split to train on both the training and development sets, then tested on the test set. This gave a relative goodness score of … with the best choice of maximum depth and a score of … with the best choice of scaling factor. The PCFG scored … on the test data. All of these results were obtained by running the perceptron through the training data only once. As has been noted previously by Freund and Schapire [7], the voted perceptron often obtains better results when run multiple times through the training data. Running through the data twice with a maximum depth of 3 yielded a relative goodness score of …, while using a larger number of iterations did not improve the results significantly.

In summary we observe that in these simple experiments the voted perceptron and an appropriate convolution kernel can obtain promising results.
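The evaluation arithmetic can be sketched as follows. Averaging the size-weighted precision and recall is one natural reading of the score described above; that aggregation, and all names here, are assumptions for illustration.

```python
# Sketch of the parse score and relative error reduction.
# c, p, g hold, per test tree, the counts of correctly placed,
# proposed, and true constituents respectively.

def parse_score(c, p, g):
    precision = sum(c) / sum(p)   # summing first weights by tree size
    recall = sum(c) / sum(g)
    return 100.0 * (precision + recall) / 2.0

def relative_improvement(pcfg_score, new_score):
    """Relative reduction in error: 100 * (y - x) / (100 - x)."""
    return 100.0 * (new_score - pcfg_score) / (100.0 - pcfg_score)

c, p, g = [8, 6], [10, 8], [10, 7]
print(round(parse_score(c, p, g), 1))              # precision 14/18, recall 14/17
print(round(relative_improvement(74.0, 80.0), 1))  # 23.1% of the error removed
```

Note how a modest-looking absolute gain (74% to 80%) corresponds to removing nearly a quarter of the PCFG's errors, which is why the relative measure is reported.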
However, there are other methods which perform considerably better than a PCFG for NLP parsing – see [3] for an overview – and future work will investigate whether the kernels in this paper give performance gains over these methods.

5 A Compressed Representation

When used with algorithms such as the perceptron, convolution kernels may be even more computationally attractive than the traditional radial basis or polynomial kernels. The linear combination of parse trees constructed by the perceptron algorithm can be viewed as a weighted forest. One may then search for subtrees in this weighted forest that occur more than once. Given a linear combination of two trees a_1 T_1 + a_2 T_2 which contain a common subtree, we may construct a smaller weighted acyclic graph, in which the common subtree occurs only once and has weight a_1 + a_2. This process may be repeated until an arbitrary linear combination of trees is collapsed into a weighted acyclic graph in which no subtree occurs more than once. The perceptron may now be evaluated on a new tree by a straightforward generalization of the tree kernel to weighted acyclic graphs of the form produced by this procedure.

Given the nature of our data – the parse trees have a high branching factor, the words are chosen from a dictionary that is relatively small in comparison to the size of the training data and are drawn from a very skewed distribution, and the ancestors of leaves are part-of-speech tags – there are a relatively small number of subtrees in the lower levels of the parse trees that occur frequently and make up the majority of the data.
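The collapsing step can be sketched by interning structurally identical subtrees and summing their weights; a dictionary keyed on tuple structure is an illustrative stand-in for the shared acyclic graph, and the data layout is an assumption, not the paper's implementation.

```python
# Sketch of collapsing a weighted forest so that each distinct subtree
# is stored once, with the weights of its occurrences summed.
# Trees are nested tuples; identical tuples hash and compare equal,
# which is what lets a dict play the role of the shared graph.

def collapse(forest):
    """forest: list of (weight, tree). Returns total weight per subtree."""
    weights = {}

    def visit(node, w):
        if isinstance(node, str):
            return
        weights[node] = weights.get(node, 0.0) + w
        for child in node[1:]:
            visit(child, w)

    for w, tree in forest:
        visit(tree, w)
    return weights

np1 = ("NP", ("D", "the"), ("N", "apple"))
t1 = ("S", ("NP", ("N", "Jeff")), ("VP", ("V", "ate"), np1))
t2 = ("S", np1, ("VP", ("V", "fell")))
shared = collapse([(0.5, t1), (2.0, t2)])
print(shared[np1])  # 2.5: the common NP is stored once with summed weight
```

Evaluating the kernel against this collapsed structure then visits each distinct subtree once, rather than once per tree in the linear combination.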
It appears that the approach we have described above should save a considerable amount of computation. This is something we intend to explore further in future work.

6 Conclusions

In this paper we described how convolution kernels can be used to apply standard kernel-based algorithms to problems in natural language. Tree structures are ubiquitous in natural language problems, and we illustrated the approach by constructing a convolution kernel over tree structures. The problem of parsing English sentences provides an appealing example domain, and our experiments demonstrate the effectiveness of kernel-based approaches to these problems. Convolution kernels combined with such techniques as kernel PCA and spectral clustering may provide a computationally attractive approach to many other problems in natural language processing. Unfortunately, we are unable to expand on the many potential applications in this short note; however, many of these issues are spelled out in a longer technical report [5].

References

[1] Aizerman, M., Braverman, E., and Rozonoer, L. (1964). Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning. Automation and Remote Control, 25:821–837.

[2] Bod, R. (1998). Beyond Grammar: An Experience-Based Theory of Language. CSLI Publications/Cambridge University Press.

[3] Charniak, E. (1997). Statistical techniques for natural language parsing. AI Magazine, 18(4).

[4] Collins, M. (2000). Discriminative Reranking for Natural Language Parsing. In Proceedings of the Seventeenth International Conference on Machine Learning. San Francisco: Morgan Kaufmann.

[5] Collins, M. and Duffy, N. (2001). Parsing with a Single Neuron: Convolution Kernels for Natural Language Problems. Technical Report UCSC-CRL-01-01, University of California at Santa Cruz.

[6] Cortes, C. and Vapnik, V. (1995). Support-Vector Networks. Machine Learning, 20(3):273–297.

[7] Freund, Y.
and Schapire, R. (1999). Large Margin Classification using the Perceptron Algorithm. Machine Learning, 37(3):277–296.

[8] Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (1998). An efficient boosting algorithm for combining preferences. In Machine Learning: Proceedings of the Fifteenth International Conference. San Francisco: Morgan Kaufmann.

[9] Goodman, J. (1996). Efficient algorithms for parsing the DOP model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 96), pages 143–152.

[10] Haussler, D. (1999). Convolution Kernels on Discrete Structures. Technical report, University of California at Santa Cruz.

[11] Johnson, M. The DOP estimation method is biased and inconsistent. To appear in Computational Linguistics.

[12] Lodhi, H., Cristianini, N., Shawe-Taylor, J., and Watkins, C. (2001). Text Classification using String Kernels. To appear in Advances in Neural Information Processing Systems 13, MIT Press.

[13] Johnson, M., Geman, S., Canon, S., Chi, S., and Riezler, S. (1999). Estimators for stochastic "unification-based" grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. San Francisco: Morgan Kaufmann.

[14] Marcus, M., Santorini, B., and Marcinkiewicz, M. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313–330.

[15] Schölkopf, B., Smola, A., and Müller, K.-R. (1999). Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 327–352. MIT Press, Cambridge, MA.

[16] Watkins, C. (2000). Dynamic alignment kernels. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D.
Schuurmans, editors, Advances in Large Margin Classifiers, pages 39–50. MIT Press.