{"title": "Latent Attention For If-Then Program Synthesis", "book": "Advances in Neural Information Processing Systems", "page_first": 4574, "page_last": 4582, "abstract": "Automatic translation from natural language descriptions into programs is a long-standing challenging problem. In this work, we consider a simple yet important sub-problem: translation from textual  descriptions to If-Then programs. We devise a novel neural network architecture for this task which we train end-to-end. Specifically, we introduce Latent Attention, which computes multiplicative weights for the words in the description in a two-stage process with the goal of better leveraging the natural language structures that indicate the relevant parts for predicting program elements. Our architecture reduces the error rate by 28.57% compared to prior art. We also propose a one-shot learning scenario of If-Then program synthesis and simulate it with our existing dataset. We demonstrate a variation on the training procedure for this scenario that outperforms the original procedure, significantly closing the gap to the model trained with all data.", "full_text": "Latent Attention For If-Then Program Synthesis\n\nXinyun Chen\u2217\n\nShanghai Jiao Tong University\n\nChang Liu Richard Shin Dawn Song\n\nUC Berkeley\n\nMingcheng Chen\u2020\n\nUIUC\n\nAbstract\n\nAutomatic translation from natural language descriptions into programs is a long-\nstanding challenging problem.\nIn this work, we consider a simple yet impor-\ntant sub-problem: translation from textual descriptions to If-Then programs. We\ndevise a novel neural network architecture for this task which we train end-to-\nend. Speci\ufb01cally, we introduce Latent Attention, which computes multiplicative\nweights for the words in the description in a two-stage process with the goal of\nbetter leveraging the natural language structures that indicate the relevant parts for\npredicting program elements. 
Our architecture reduces the error rate by 28.57%\ncompared to prior art [3]. We also propose a one-shot learning scenario of If-Then\nprogram synthesis and simulate it with our existing dataset. We demonstrate a\nvariation on the training procedure for this scenario that outperforms the original\nprocedure, signi\ufb01cantly closing the gap to the model trained with all data.\n\n1\n\nIntroduction\n\nA touchstone problem for computational linguistics is to translate natural language descriptions into\nexecutable programs. Over the past decade, there has been an increasing number of attempts to\naddress this problem from both the natural language processing community and the programming\nlanguage community. In this paper, we focus on a simple but important subset of programs contain-\ning only one If-Then statement.\nAn If-Then program, which is also called a recipe, speci\ufb01es a trigger and an action function, repre-\nsenting a program which will take the action when the trigger condition is met. On websites, such\nas IFTTT.com, a user often provides a natural language description of the recipe\u2019s functionality as\nwell. Recent work [16, 3, 7] studied the problem of automatically synthesizing If-Then programs\nfrom their descriptions. In particular, LSTM-based sequence-to-sequence approaches [7] and an\napproach of ensembling a neural network and logistic regression [3] were proposed to deal with this\nproblem. In [3], however, the authors claim that the diversity of vocabulary and sentence structures\nmakes it dif\ufb01cult for an RNN to learn useful representations, and their ensemble approach indeed\nshows better performance than the LSTM-based approach [7] on the function prediction task (see\nSection 2).\nIn this paper, we introduce a new attention architecture, called Latent Attention, to overcome this\ndif\ufb01culty. With Latent Attention, a weight is learned on each token to determine its importance for\nprediction of the trigger or the action. 
Unlike standard attention methods, Latent Attention computes\nthe token weights in a two-step process, which aims to better capture the sentence structure. We show\nthat by employing Latent Attention over outputs of a bi-directional LSTM, our new Latent Attention\nmodel can improve over the best prior result [3] by 5 percentage points from 82.5% to 87.5% when\npredicting the trigger and action functions together, reducing the error rate of [3] by 28.57%.\nBesides the If-Then program synthesis task proposed by [16], we are also interested in a new sce-\nnario. When a new trigger or action is released, the training data will contain few corresponding\nexamples. We refer to this case as a one-shot learning problem. We show that our Latent Atten-\ntion model on top of dictionary embedding, combined with a new training algorithm, can achieve\nreasonably good performance on the one-shot learning task.\n\n\u2217Part of the work was done while visiting UC Berkeley.\n\u2020Work was done while visiting UC Berkeley. Mingcheng Chen is currently working at Google [X].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\f2\n\nIf-Then Program Synthesis\n\nIf-Then Recipes.\nIn this work, we consider an important class of simple programs called If-Then\n\u201crecipes\u201d (or recipes for short), which are very small programs for event-driven automation\nof tasks. Speci\ufb01cally, a recipe consists of a trigger and an action, indicating that the action will be\nexecuted when the trigger is ful\ufb01lled.\nThe simplicity of If-Then recipes makes them a great tool for users who may not know how to code.\nEven non-technical users can specify their goals using recipes, instead of writing code in a more\nfull-\ufb02edged programming language. A number of websites have embraced the If-Then program-\nming paradigm and have been hugely successful with tens of thousands of personal recipes created,\nincluding IFTTT.com and Zapier.com. 
In this paper, we focus on data crawled from IFTTT.com.\nIFTTT.com allows users to share their recipes publicly, along with short natural language descrip-\ntions to explain the recipes\u2019 functionality. A recipe on IFTTT.com consists of a trigger channel, a\ntrigger function, an action channel, an action function, and arguments for the functions. There are a\nwide range of channels, which can represent entities such as devices, web applications, and IFTTT-\nprovided services. Each channel has a set of functions representing events (i.e., trigger functions) or\naction executions (i.e., action functions).\nFor example, an IFTTT recipe with the following description\n\nAutosave your Instagram photos to Dropbox\n\nhas the trigger channel Instagram,\ntrigger function Any new photo by you, action channel\nDropbox, and action function Add file from URL. Some functions may take arguments. For ex-\nample, the Add file from URL function takes three arguments: the source URL, the name for the\nsaved \ufb01le, and the path to the destination folder.\n\nProblem Setup. Our task is similar to that in [16]. In particular, for each description, we focus on\npredicting the channel and function for trigger and action respectively. Synthesizing a valid recipe\nalso requires generating the arguments. As argued by [3], however, the arguments are not crucial for\nrepresenting an If-Then program. Therefore, we defer our treatment for arguments generation to the\nsupplementary material, where we show that a simple frequency-based method can outperform all\nexisting approaches. In this way, our task turns into two classi\ufb01cation problems for predicting the\ntrigger and action functions (or channels).\nBesides the problem setup in [16], we also introduce a new variation of the problem, a one-shot\nlearning scenario: when some new channels or functions are initially available, there are very few\nrecipes using these channels and functions in the training set. 
We explore techniques to still achieve\na reasonable prediction accuracy on labels with very few training examples.\n\n3 Related Work\n\nRecently there has been increasing interest in executable code generation. Existing works have\nstudied generating domain-speci\ufb01c code, such as regular expressions [12], code for parsing input\ndocuments [14], database queries [22, 4], commands to robots [10], operating systems [5], smart-\nphone automation [13], and spreadsheets [8]. A recent effort considers translating a mixed natural\nlanguage and structured speci\ufb01cation into programming code [15]. Most of these approaches rely on\nsemantic parsing [19, 9, 1, 16]. In particular, [16] introduces the problem of translating IFTTT de-\nscriptions into executable code, and provides a semantic parsing-based approach. Two recent works\nstudied approaches using a sequence-to-sequence model [7] and an ensemble of a neural network and\na logistic regression model [3] to deal with this problem, and showed better performance than [16].\nWe show that our Latent Attention method outperforms all prior approaches. Recurrent neural net-\nworks [21, 6] along with attention [2] have demonstrated impressive results on tasks such as machine\ntranslation [2], generating image captions [20], syntactic parsing [18] and question answering [17].\n\n\fFigure 1: Network Architecture\n\n4 Latent Attention Model\n\n4.1 Motivation\n\nTo translate a natural language description into a program, we would like to locate the words in\nthe description that are the most relevant for predicting desired labels (trigger/action channels/func-\ntions). For example, in the following description\n\nAutosave Instagram photos to your Dropbox folder\n\nthe phrase \u201cInstagram photos\u201d is the most relevant for predicting the trigger. 
To capture this infor-\nmation, we can adapt the attention mechanism [2, 17] \u2014\ufb01rst compute a weight of the importance of\neach token in the sentence, and then output a weighted sum of the embeddings of these tokens.\nHowever, our intuition suggests that the weight for each token depends not only on the token itself,\nbut also the overall sentence structure. For example, in\n\nPost photos in your Dropbox folder to Instagram\n\n\u201cDropbox\u201d determines the trigger, even though in the previous example, which contains almost the\nsame set of tokens, \u201cInstagram\u201d should play this role. In this example, the prepositions such as\n\u201cto\u201d hint that the trigger channel is speci\ufb01ed in the middle of the description rather than at the end.\nTaking this into account allows us to select \u201cDropbox\u201d over \u201cInstagram\u201d.\nLatent Attention is designed to exploit such clues. We use the usual attention mechanism for com-\nputing a latent weight for each token to determine which tokens in the sequence are more relevant to\nthe trigger or the action. These latent weights determine the \ufb01nal attention weights, which we call\nactive weights. As an example, given the presence of the token \u201cto\u201d, we might look at the tokens\nbefore \u201cto\u201d to determine the trigger.\n\n4.2 The network\n\nThe Latent Attention architecture is presented in Figure 1. We follow the convention of using lower-\ncase letters to indicate column vectors, and capital letters for matrices. Our model takes as input\na sequence of symbols x1, ..., xJ, with each coming from a dictionary of N words. We denote\nX = [x1, ..., xJ ]. Here, J is the maximal length of a description. We illustrate each layer of the\nnetwork below.\n\nLatent attention layer. We assume each symbol xi is encoded as a one-hot vector of N di-\nmensions. 
We can embed the input sequence X into a d-dimensional embedding sequence using\nE = Embed\u03b81(X), where \u03b81 is a set of parameters. We will discuss different embedding methods in\nSection 4.3. Here E is of size d \u00d7 J.\nThe latent attention layer\u2019s output is computed as a standard softmax on top of E. Speci\ufb01cally,\nlet l be the J-dimensional output vector and u be a d-dimensional trainable vector; then\n\nl = softmax(u^T Embed\u03b81(X))\n\nActive attention layer. The active attention layer computes each token\u2019s weight based on its im-\nportance for the \ufb01nal prediction. We call these weights active weights. We \ufb01rst embed X into D\nusing another set of parameters \u03b82, i.e., D = Embed\u03b82(X) is of size d \u00d7 J. Next, for each token D_i,\nwe compute its active attention input A_i through a softmax:\n\nA_i = softmax(V D_i)\n\nHere, A_i and D_i denote the i-th column vectors of A and D respectively, and V is a trainable\nparameter matrix of size J \u00d7 d. Since V D_i = (V D)_i, we can compute A by performing a\ncolumn-wise softmax over V D. Here, A is of size J \u00d7 J.\nThe active weights are computed as the sum of the A_i, weighted by the latent attention weights:\n\nw = \u2211_{i=1}^{J} l_i A_i = A l\n\nOutput representation. We use a third set of parameters \u03b83 to embed X into a d \u00d7 J embedding\nmatrix, and the \ufb01nal output o, a d-dimensional vector, is the sum of the embeddings weighted by the\nactive weights:\n\no = Embed\u03b83(X) w\n\nPrediction. 
We use a softmax to make the \ufb01nal prediction: f\u0302 = softmax(P o), where P is an\nM \u00d7 d parameter matrix, and M is the number of classes.\n\n4.3 Details\n\nEmbeddings. We consider two embedding methods for representing words in the vector space.\nThe \ufb01rst is a straightforward word embedding, i.e., Embed\u03b8(X) = \u03b8X, where \u03b8 is a d \u00d7 N matrix\nand the columns of X are one-hot vectors over the vocabulary of size N. We refer to this as \u201cdictionary\nembedding\u201d later in the paper. \u03b8 is not pretrained with a different dataset or objective, but initialized\nrandomly and learned at the same time as all other parameters. We observe that when using Latent\nAttention, this simple method is effective enough to outperform some recent results [16, 7].\nThe other approach is to take the word embeddings, run them through a bi-directional LSTM (BDL-\nSTM) [21], and then use the concatenation of the two LSTMs\u2019 outputs at each time step as the em-\nbedding. This can take into account the context around a token, and thus the embeddings should\ncontain more information from the sequence than from a single token. We refer to such an approach\nas \u201cBDLSTM embedding\u201d. The details are deferred to the supplementary material. In our experi-\nments, we observe that with the help of this embedding method, Latent Attention can outperform\nthe prior state-of-the-art.\nIn Latent Attention, we have three sets of embedding parameters, i.e., \u03b81, \u03b82, \u03b83. In practice, we \ufb01nd\nthat we can set the three to be equal without loss of performance. Later, we will show that keeping them\nseparate is helpful for our one-shot learning setting.\n\nNormalizing active weights. We \ufb01nd that normalizing the active weights w before computing the\noutput helps to improve the performance. Speci\ufb01cally, we compute the output as\n\no = Embed\u03b8(X) normalized(w) = Embed\u03b8(X) w / ||w||\n\nwhere ||w|| is the L2-norm of w. 
In our experiments, we observe that this normalization can improve\nthe performance by 1 to 2 points.\n\nPadding and clipping. Latent Attention requires a \ufb01xed-length input sequence. To handle inputs\nof variable lengths, we perform padding and clipping. If an input\u2019s length is smaller than J, then we\npad it with null tokens at the end of the sequence. If an input\u2019s length is greater than J (which is 25\nin our experiments), we keep the \ufb01rst 12 and the last 13 tokens, and discard the rest.\n\nVocabulary. We tokenize each sentence by splitting on whitespace and punctuation (e.g., . , ! ? \u201d ' : ; ( )),\nand convert all characters into lowercase. We keep all punctuation symbols as tokens too. We\nmap each of the top 4,000 most frequent tokens to itself, and all the rest to a special token\n\u27e8UNK\u27e9. Therefore our vocabulary size is 4,001. Our implementation has no special handling for\ntypos.\n\n5\n\nIf-Then Program Synthesis Task Evaluation\n\nIn this section, we evaluate our approaches against several baselines and previous work [16, 3, 7].\nWe use the same crawler from Quirk et al. [16] to crawl recipes from IFTTT.com. Unfortunately,\nmany recipes are no longer available. We crawled all remaining recipes, ultimately obtaining 68,083\nrecipes for the training set. [16] also provides a list of 5,171 recipes for validation, and 4,294 recipes\nfor test. All test recipes come with labels from Amazon Mechanical Turk workers. We found that\nonly 4,220 validation recipes and 3,868 test recipes remain available. [16] de\ufb01nes a subset of test\nrecipes, where each recipe has at least 3 workers agreeing on its labels from IFTTT.com, as the gold\ntest set. We \ufb01nd that 584 out of the 758 gold test recipes used in [16] remain available. We refer to\nthese recipes as the gold test set. We present the data statistics in the supplementary material.\n\nEvaluated methods. 
We evaluate two embedding methods as well as the effectiveness of different\nattention mechanisms. In particular, we compare no attention, standard attention, and Latent Atten-\ntion. Therefore, we evaluate six architectures in total. When using dictionary embedding with no\nattention, for each sentence, we sum the embedding of each word, then pass it through a softmax\nlayer for prediction. For convenience, we refer to such a process as standard softmax. For BDL-\nSTM with no attention, we concatenate \ufb01nal states of forward and backward LSTMs, then pass the\nconcatenation through a softmax layer for prediction. The two embedding methods with standard\nattention mechanism [17] are described in the supplementary material. The Latent Attention models\nhave been presented in Section 4.\n\nTraining details. For architectures with no attention, they were trained using a learning rate of\n0.01 initially, which is multiplied by 0.9 every 1,000 time steps. Gradients with L2 norm greater\nthan 5 were scaled down to have norm 5. For architectures with either standard attention mechanism\nor Latent Attention, they were trained using a learning rate of 0.001 without decay, and gradients\nwith L2 norm greater than 40 were scaled down to have norm 40. All models were trained using\nAdam [11]. All weights were initialized uniformly randomly in [\u22120.1, 0.1]. Mini-batches were\nrandomly shuf\ufb02ed during training. The mini-batch size is 32 and the embedding vector size d is 50.\n\nResults. Figure 2 and Figure 3 present the results of prediction accuracy on channel and function\nrespectively. Three previous works\u2019 results are presented as well. In particular, [16] is the \ufb01rst work\nintroducing the If-Then program synthesis task. [7] investigates the approaches using sequence-to-\nsequence models, while [3] proposes an approach to ensemble a feed-forward neural network and a\nlogistic regression model. 
The numerical values for all data points can be found in the supplementary\nmaterial.\nFor our six architectures, we use 10 different random initializations to train 10 different models. To\nensemble k models, we choose the best k models on the validation set among the 10 models, and\naverage their softmax outputs as the ensembled output. For the three existing approaches [16, 7, 3],\nwe choose the best results from these papers.\nWe train the model to optimize for function prediction accuracy. The channel accuracy in Figure 2\nis computed in the following way: to predict the channel, we \ufb01rst predict the function (from a list of\nall functions in all channels), and the channel that the function belongs to is returned as the predicted\nchannel.\n\nFigure 2: Accuracy for Channel\n\nFigure 3: Accuracy for Channel+Function\n\nWe observe that\n\n\u2022 Latent Attention steadily improves over standard attention architectures and no attention\nones using either embedding method.\n\n\u2022 In our six evaluated architectures, ensembling improves signi\ufb01cantly upon using only one model.\n\n\u2022 When ensembling more than one model, BDLSTM embeddings perform better than dictionary\nembeddings. We attribute this to the fact that, for each token, BDLSTM can encode the\ninformation of its surrounding tokens, e.g., phrases, into its embedding, which is thus more\neffective.\n\n\u2022 For the channel prediction task in Figure 2, all architectures except dictionary embedding\nwith no attention (i.e., Dict) can outperform [16]. Ensembling only 2 BDLSTM models\nwith either standard attention or Latent Attention is enough to achieve better performance\nthan prior art [7]. By ensembling 10 BDLSTM+LA models, we can improve the latest\nresults [7] and [3] by 1.9 points and 2.5 points respectively.\n\n\u2022 For the function prediction task in Figure 3, all our six models (including Dict) outper-\nform [16]. 
Further, ensembling 9 BDLSTM+LA models can improve the previous best result [3]\nby 5 points. In other words, our approach reduces the error rate of [3] by 28.57%.\n\n6 One-Shot Learning\n\nWe consider the scenario when websites such as IFTTT.com release new channels and functions.\nIn such a scenario, for a period of time, there will be very few recipes using the newly available\nchannels and functions; however, we would still like to enable synthesizing If-Then programs using\nthese new functions. The rarity of such recipes in the training set creates a challenge similar to\nthe one-shot learning setting. In this scenario, we want to leverage the large number of recipes\nfor existing functions, and the goal is to achieve a good prediction accuracy for the new functions\nwithout signi\ufb01cantly compromising the overall accuracy.\n\n6.1 Datasets to simulate one-shot learning\n\nTo simulate this scenario with our existing dataset, we build two one-shot variants of it as follows.\nWe \ufb01rst split the set of trigger functions into two sets, based on their frequency. The top100 set\ncontains the top 100 most frequently used trigger functions, while the non-top100 set contains the\nrest.\nGiven a set of trigger functions S, we can build a skewed training set to include all recipes using\nfunctions in S, and 10 randomly chosen recipes for each function not in S. We denote this skewed\ntraining set created based on S as (S, S\u0304), where S\u0304 is the set of functions not in S, and refer to functions in S as majority functions and\nfunctions in S\u0304 as minority functions. In our experiments, we construct two new training sets by\nchoosing S to be the top100 set and non-top100 set respectively. We refer to these two training sets\nas SkewTop100 and SkewNonTop100.\nThe motivation for creating these datasets is to mimic two different scenarios. On one hand, Skew-\nTop100 simulates the case that at the startup phase of a service, popular recipes are \ufb01rst published,\nwhile less frequently used recipes are introduced later. 
On the other hand, SkewNonTop100 captures\nthe opposite situation. The statistics for these two training sets are presented in the supplementary\nmaterial. While the SkewTop100 scenario is more common in real life, the SkewNonTop100 training set is only\n15.73% of the entire training set, and thus is more challenging.\n\n\f(a) Trigger Function Accuracy (SkewTop100)\n\n(b) Trigger Function Accuracy (SkewNonTop100)\n\nFigure 4: One-shot learning experiments. For each column XY-Z, X from {B, D} represents whether\nthe embedding is BDLSTM or Dictionary; Y is either empty, or is from {A, L}, meaning that either\nno attention is used, or standard attention or Latent Attention is used; and Z is from {S, 2N, 2},\ndenoting standard training, na\u00efve two-step training or two-step training.\n\n6.2 Training\n\nWe evaluate three training methods as follows, where the last one is speci\ufb01cally designed for at-\ntention mechanisms. In all methods, the training data is either SkewTop100 or SkewNonTop100.\nStandard training. We do not modify the training process.\nNa\u00efve two-step training. We do standard training \ufb01rst. Since the data is heavily skewed, the model\nmay behave poorly on the minority functions. From a training set (S, S\u0304), we create a rebalanced\ndataset, by randomly choosing 10 recipes for each majority function in S and all recipes using the minority functions in\nS\u0304. Therefore, the numbers of recipes using each function are similar in this rebalanced dataset. We\nresume training on this rebalanced training dataset in the second step.\nTwo-step training. We still do standard training \ufb01rst, and then create the rebalanced dataset in\nthe same way as in na\u00efve two-step training. However, in the second step, instead of training\nthe entire network, we keep the attention parameters \ufb01xed, and train only the parameters in the\nremaining part of the model. Take the Latent Attention model depicted in Figure 1 as an example. 
In\nthe second step, we keep parameters \u03b81, \u03b82, u, and V \ufb01xed, and only update \u03b83 and P while training\non the rebalanced dataset. We based this procedure on the intuition that since the rebalanced dataset\nis very small, fewer trainable parameters enable easier training.\n\n6.3 Results\n\nWe compare the three training strategies using our proposed models. We omit the no attention mod-\nels, which do not perform better than attention models and cannot be trained using two-step training.\nWe only train one model per strategy, so the results are without ensembling. The results are pre-\nsented in Figure 4. The concrete values can be found in the supplementary material. For reference,\nthe best single BDLSTM+LA model can achieve 89.38% trigger function accuracy: 91.11% on\ntop100 functions, and 85.12% on non-top100 functions. We observe that\n\n\u2022 Using two-step training, both the overall accuracy and the accuracy on the minority functions\nare generally better than using standard training and na\u00efve two-step training.\n\u2022 Latent Attention outperforms standard attention when using the same training method.\n\u2022 The best Latent Attention model (Dict+LA) with two-step training can achieve 82.71% and\n64.84% accuracy for trigger function on the gold test set, when trained on the SkewTop100\nand SkewNonTop100 datasets respectively. For comparison, when using the entire training\ndataset, trigger function accuracy of Dict+LA is 89.38%. Note that the SkewNonTop100\ndataset accounts for only 15.73% of the entire training dataset.\n\u2022 For the SkewTop100 training set, the Dict+LA model can achieve 78.57% accuracy on minority\nfunctions in the gold test set. 
For comparison, training on the full dataset yields 85.12%; the non-top100 recipes in SkewTop100 make up only 30.54% of those in the full\ntraining set.\n\n\fFigure 5: Examples of attention weights output by Dict+LA. latent, trigger, and action indi-\ncate the latent weights and active weights for the trigger and the action respectively. Low values less\nthan 0.1 are omitted.\n\n7 Empirical Analysis of Latent Attention\n\nWe show some correctly classi\ufb01ed and misclassi\ufb01ed examples in Figure 5 along with their attention\nweights. The weights are computed from a Dict+LA model. We choose Dict+LA instead of BDL-\nSTM+LA, because the BDLSTM embedding of each token does not correspond to the token itself\nonly; it also contains information passed from previous and subsequent tokens in the sequence.\nTherefore, the attention of BDLSTM+LA is not as easy to interpret as that of Dict+LA.\nThe latent weights are those used to predict the action functions. In correctly classi\ufb01ed examples,\nwe observe that the latent weights are assigned to the prepositions that determine which parts of the\nsentence are associated with the trigger or the action. An interesting example is (b), where a high\nlatent weight is assigned to \u201c,\u201d. This indicates that LA considers \u201c,\u201d as informative as other English\nwords such as \u201cto\u201d. We observe a similar phenomenon in Example (c), where token \u201c>\u201d has the\nhighest latent weight.\nIn several misclassi\ufb01ed examples, we observe that some attention weights may not be assigned\ncorrectly. 
In Example (e), although there is nowhere explicitly showing the trigger should be us-\ning a Facebook channel, the phrase \u201cphoto of me\u201d hints that \u201cme\u201d should be tagged in the photo.\nTherefore, a human can infer that this should use a function from the Facebook channel, called\n\u201cYou are tagged in a photo\u201d. The Dict+LA model does not learn this association from the train-\ning data. In this example, we expect that the model should assign high weights onto the phrase\n\u201cof me\u201d, but this is not the case, i.e., the weights assigned to \u201cof\u201d and \u201cme\u201d are 0.01 and 0.007\nrespectively. This shows that the Dict+LA model does not correlate these two words with the\nYou are tagged in a photo function. BDLSTM+LA, on the other hand, can jointly consider the\ntwo tokens, and make the correct prediction.\nExample (h) is another example where outside knowledge might help: Dict+LA predicts the trigger\nfunction to be Create a post since it does not learn that Instagram only consists of photos (and\nlow weight was placed on \u201cInstagram\u201d when predicting the trigger anyway). Again, BDLSTM+LA\ncan predict this case correctly.\n\nAcknowledgements. We thank the anonymous reviewers for their valuable comments. This ma-\nterial is based upon work partially supported by the National Science Foundation under Grant No.\nTWC-1409915, and a DARPA grant FA8750-15-2-0104. 
Any opinions, \ufb01ndings, and conclusions or\nrecommendations expressed in this material are those of the author(s) and do not necessarily re\ufb02ect\nthe views of the National Science Foundation and DARPA.\n\n\fReferences\n\n[1] Y. Artzi. Broad-coverage ccg semantic parsing with amr. In EMNLP, 2015.\n\n[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.\n\n[3] I. Beltagy and C. Quirk. Improved semantic parsers for if-then statements. In ACL, 2016.\n\n[4] J. Berant, A. Chou, R. Frostig, and P. Liang. Semantic parsing on freebase from question-answer pairs. In EMNLP, 2013.\n\n[5] S. R. Branavan, H. Chen, L. S. Zettlemoyer, and R. Barzilay. Reinforcement learning for mapping instructions to actions. In ACL, 2009.\n\n[6] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. 
arXiv preprint arXiv:1412.3555, 2014.\n\n[7] L. Dong and M. Lapata. Language to logical form with neural attention. In ACL, 2016.\n\n[8] S. Gulwani and M. Marron. Nlyze: Interactive programming by natural language for spreadsheet data analysis and manipulation. In SIGMOD, 2014.\n\n[9] B. K. Jones, M. Johnson, and S. Goldwater. Semantic parsing with bayesian tree transducers. In ACL, 2012.\n\n[10] R. J. Kate, Y. W. Wong, and R. J. Mooney. Learning to transform natural to formal languages. In AAAI, 2005.\n\n[11] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[12] N. Kushman and R. Barzilay. Using semantic uni\ufb01cation to generate regular expressions from natural language. In NAACL, 2013.\n\n[13] V. Le, S. Gulwani, and Z. Su. Smartsynth: Synthesizing smartphone automation scripts from natural language. In MobiSys, 2013.\n\n[14] T. Lei, F. Long, R. Barzilay, and M. C. Rinard. From natural language speci\ufb01cations to program input parsers. In ACL, 2013.\n\n[15] W. Ling, E. Grefenstette, K. M. Hermann, T. Ko\u010disk\u00fd, A. Senior, F. Wang, and P. Blunsom. Latent predictor networks for code generation. CoRR, 2016.\n\n[16] C. Quirk, R. Mooney, and M. Galley. Language to code: Learning semantic parsers for if-this-then-that recipes. In ACL, 2015.\n\n[17] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In NIPS, 2015.\n\n[18] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. In NIPS, 2015.\n\n[19] Y. W. Wong and R. J. Mooney. Learning for semantic parsing with statistical machine translation. In NAACL, 2006.\n\n[20] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.\n\n[21] W. Zaremba, I. Sutskever, and O. Vinyals. 
Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.\n\n[22] J. M. Zelle. Learning to parse database queries using inductive logic programming. In AAAI, 1996.\n\n", "award": [], "sourceid": 2283, "authors": [{"given_name": "Chang", "family_name": "Liu", "institution": "University of Maryland"}, {"given_name": "Xinyun", "family_name": "Chen", "institution": "Shanghai Jiaotong University"}, {"given_name": "Eui Chul", "family_name": "Shin", "institution": "UC Berkeley"}, {"given_name": "Mingcheng", "family_name": "Chen", "institution": "University of Illinois"}, {"given_name": "Dawn", "family_name": "Song", "institution": "UC Berkeley"}]}