Posts Tagged ‘mt eval’

The standard way of doing human evaluations of machine translation (MT) quality for the past few years has been to have human judges grade each sentence of MT output against a reference translation on measures of adequacy and fluency.  Adequacy is the level at which the translation conveys the information contained in the original (source language) sentence.  Fluency is the level at which the translation conforms to the standards of the target language (in most cases, English).  The judges give each sentence a score for both in the range of 1-5, similar to a movie rating.   It became apparent early on that not even humans correlate well with each other.  One judge may be sparing with the number of 5’s he gives out, while another may give them freely.  The same problem crops up in recommender systems, which I have talked about in the past.

It matters how well judges can score MT output, because that is the evaluation standard by which automatic metrics for MT evaluation are judged.  The better an MT metric correlates with how human judges would rate sentences, the better.  This not only helps properly gauge the quality of one MT system against another, but also drives improvements in MT systems.  If judges don’t correlate well with each other, how can we expect automatic methods to correlate well with them?  The standard practice now is to normalize the judges’ scores in order to help remove some of the bias in the way each judge uses the rating scale.

Vilar et al. (2007) propose a new way of handling human assessments of MT quality:  binary system comparisons.  Instead of giving a rating on a scale of 1-5, judges compare the output from two MT systems and simply state which is better.  The definition of what constitutes “better” is left vague, but judges are instructed not to look specifically for adequacy or fluency.  By shuffling the sentences so that a judge is not always judging output from the same system (which could introduce additional bias), this method should simplify the task of evaluating MT quality while leading to better intercoder agreement.

The results were favorable, and the advantages of this method seem to outweigh the fact that it requires more comparisons than the previous method required ratings.  Under the previous method, the total number of ratings was two per sentence per system (one for adequacy, one for fluency):  O(n), where n is the number of systems (the number of sentences is constant).  Binary system comparison requires more judgments because the systems have to be ordered:  O(log n!), the number of pairwise comparisons needed to rank n systems.  In most MT comparison campaigns the difference is negligible, but it becomes increasingly pronounced as n grows.
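
To get a feel for the sizes involved (a back-of-the-envelope note of my own, not from the paper):

$$\log_2 n! = \sum_{k=1}^{n} \log_2 k \approx n \log_2 n - 1.44\,n$$

is the minimum number of pairwise comparisons needed to fully order n systems per sentence, versus 2n ratings (adequacy and fluency) per sentence under the old scheme.  For n = 8 the two are roughly even (log₂ 8! ≈ 15 comparisons versus 16 ratings), but for n = 16 it is already about 44 comparisons versus 32 ratings.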

What would be interesting to me is a movie recommendation system that asks you a similar question:  which do you like better?  Of course, this means more work for you.  The standard approaches for collaborative filtering would have to change.  For example, doing singular value decomposition on a matrix of ratings would no longer be possible when all you have are comparisons between movies.  Also, people will still be inconsistent with themselves.  You might say National Treasure was better than Star Trek VI, which was better than Indiana Jones and the Last Crusade, which was better than National Treasure.  You’d have to find some way to deal with cycles like this (ignoring them is one way).
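
As a rough sketch (my own illustration, not from any of the papers discussed here), spotting a preference cycle like the one above amounts to detecting a cycle in a directed “better-than” graph:

```python
# Minimal sketch: detect inconsistencies (cycles) in pairwise "A is better than B" judgments.
from collections import defaultdict

def has_cycle(preferences):
    """preferences: iterable of (winner, loser) pairs; True if they cannot be totally ordered."""
    graph = defaultdict(list)
    for winner, loser in preferences:
        graph[winner].append(loser)

    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on the current DFS path / finished
    color = defaultdict(int)

    def dfs(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:              # back edge: a cycle exists
                return True
            if color[nxt] == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in list(graph))

# The three judgments from the example above form a cycle:
prefs = [("National Treasure", "Star Trek VI"),
         ("Star Trek VI", "Indiana Jones and the Last Crusade"),
         ("Indiana Jones and the Last Crusade", "National Treasure")]
print(has_cycle(prefs))  # True
```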

References

David Vilar, Gregor Leusch, Hermann Ney, and Rafael E. Banchs. 2007. Human Evaluation of Machine Translation Through Binary System Comparisons. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 96-103, Prague, June 2007.


Stepping back in time in MT Eval from my last post, Liu and Gildea (2005) were among the first to really bring syntactic information to evaluating machine translation output. They proposed three metrics for evaluating machine hypotheses: the subtree metric (STM), the tree kernel metric (TKM), and the headword chain metric (HWCM). STM and TKM also had variants for dependency trees, which HWCM relies on. Owczarzak et al. (2007) extended HWCM from dependency parses to LFG parses. HWCM has attracted more attention since it showed better sentence-level correlation than either STM or TKM (in both versions) and outperformed BLEU on longer n-grams. It’s interesting to note, though, that the dependency-based tree kernel metric performed best of all at the corpus level. Sentence-level granularity is typically more important for helping you tune your MT system.

The subtree metric is a fairly straightforward idea. You begin by parsing both the hypothesis and the reference sentences with a parser like Charniak’s or Collins’s to get Penn Treebank-style phrase structure trees. You then compare all subtrees in the hypothesis to the reference trees, clipping the count of each subtree by the maximum number of times it appears in any reference tree. The formula is given below:

$$\mathrm{STM} = \frac{1}{D} \sum_{n=1}^{D} \frac{\sum_{t \in \mathrm{subtrees}_n(\mathrm{hyp})} \mathrm{count}_{\mathrm{clip}}(t)}{\sum_{t \in \mathrm{subtrees}_n(\mathrm{hyp})} \mathrm{count}(t)}$$

where D is the maximum subtree depth considered, subtrees_n(hyp) is the set of depth-n subtrees of the hypothesis parse, and count_clip(t) is the count of t in the hypothesis clipped by the maximum number of times t appears in any reference tree.
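
In code, the clipped matching works much like BLEU’s clipped n-gram counts. Here is a minimal sketch of my own (not from the paper), assuming the subtrees at one fixed depth have already been extracted and hashed to identifiers:

```python
# Minimal sketch: clipped subtree precision at a single depth.
from collections import Counter

def clipped_subtree_precision(hyp_subtrees, ref_subtree_lists):
    """hyp_subtrees: list of subtree identifiers from the hypothesis parse.
    ref_subtree_lists: one list of subtree identifiers per reference parse."""
    hyp_counts = Counter(hyp_subtrees)
    ref_counts = [Counter(ref) for ref in ref_subtree_lists]
    matched = sum(min(count, max(rc[t] for rc in ref_counts))  # clip by best reference
                  for t, count in hyp_counts.items())
    return matched / len(hyp_subtrees) if hyp_subtrees else 0.0
```

The full STM then averages this quantity over subtree depths 1 through D.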

The tree kernel metric uses the convolution kernels discussed by Collins and Duffy (2001). For the specifics of this method, I refer you to the respective papers (and I may post on it at a later date), but the general idea is that you can transform structured data (a tree) into a feature vector by using the kernel trick. Finding all subtrees of a tree can be exponential in the size of the sentence, which would make computation infeasible for long sentences. The kernel trick lets us operate in this exponentially-high-dimensional space with a polynomial-time algorithm. Once we have constructed the feature vectors for the hypothesis and reference trees, we can score them with their cosine similarity:

$$\cos(T_1, T_2) = \frac{H(T_1) \cdot H(T_2)}{\sqrt{\big(H(T_1) \cdot H(T_1)\big)\,\big(H(T_2) \cdot H(T_2)\big)}}$$

H(T1) and H(T2) are vectors with non-zero values for subtrees (dimensions) that appear in each tree, so the dot product of the two is the number of subtrees in common. The score is computed as the maximum cosine similarity between the hypothesis and the references.
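
A minimal sketch of this scoring step (my own illustration; it represents the subtree feature vectors as explicit sparse count dictionaries rather than computing the dot products implicitly through the kernel):

```python
# Minimal sketch: cosine similarity between sparse subtree-count vectors.
import math

def cosine(h1, h2):
    """h1, h2: dicts mapping subtree identifiers to counts (sparse feature vectors)."""
    dot = sum(c * h2.get(t, 0) for t, c in h1.items())
    norm1 = math.sqrt(sum(c * c for c in h1.values()))
    norm2 = math.sqrt(sum(c * c for c in h2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def tkm_score(hyp_vec, ref_vecs):
    """The score is the best cosine similarity against any reference."""
    return max(cosine(hyp_vec, ref) for ref in ref_vecs)
```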

Finally, the headword chain metric (HWCM) relies on dependency parses, which I touched on in my previous post.

In dependency grammars, a tree is built by linking a word to its head. So a determiner would be linked to the noun it modifies, the direct object would be linked to the verb, etc. Each link of this sort is a headword chain of length 2. As you build up the tree, you can construct longer and longer headword chains.

The HWCM score is calculated just like the STM, except that it compares headword chains. The difference between HWCM and the dependency version of STM is that STM considers all subtrees, whereas HWCM only follows direct mother-daughter relations (no cousins or sisters).
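
To make that concrete, here is a rough sketch of my own (not code from the paper) that extracts headword chains from a dependency parse given as word-to-head links:

```python
# Minimal sketch: extract headword chains (a head followed by successive modifiers)
# from a dependency parse given as a dict mapping each word to its head (root maps to None).

def headword_chains(heads, max_len=4):
    children = {}
    for word, head in heads.items():
        if head is not None:
            children.setdefault(head, []).append(word)

    chains = []

    def extend(chain):
        chains.append(tuple(chain))
        if len(chain) < max_len:
            for child in children.get(chain[-1], []):
                extend(chain + [child])

    for word in heads:          # start a chain at every word
        extend([word])
    return chains

# Hypothetical example: "the boy read a book"
heads = {"read": None, "boy": "read", "the": "boy", "book": "read", "a": "book"}
print(headword_chains(heads, max_len=3))
# Includes chains such as ('read', 'boy', 'the') and ('read', 'book', 'a')
```

The chains from the hypothesis are then matched against the chains from the references with the same clipped-count scheme described above.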

References

Michael Collins and Nigel Duffy. 2001. Convolution kernels for natural language. In Advances in Neural Information Processing Systems.

Ding Liu and Daniel Gildea. 2005. Syntactic Features for Evaluation of Machine Translation. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization at the Association for Computational Linguistics Conference 2005, Ann Arbor, Michigan.

Karolina Owczarzak, Josef van Genabith, and Andy Way. 2007. Labelled Dependencies in Machine Translation Evaluation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 104-111, Prague, June 2007.

Since Papineni et al. (2002) introduced the BLEU metric for machine translation evaluation, string matching functions have dominated the field. These metrics work well enough, but there are cases where they break down and more and more research is revealing their biases. Also, BLEU does not correlate especially well with human judgments, so the quality of MT would benefit from a metric that better captures what makes a good translation.

A recent trend in this direction has been to introduce linguistic information into MT eval. Liu and Gildea (2005) used unlabeled dependency trees to extract headword chains from machine and reference translations to evaluate MT output. To define a few terms, reference translations are human translations that machine translations are compared to during evaluation. In dependency grammars, a tree is built by linking a word to its head. So a determiner would be linked to the noun it modifies, the direct object would be linked to the verb, etc. Each link of this sort is a headword chain of length 2. As you build up the tree, you can construct longer and longer headword chains. Liu and Gildea constructed headword chains for both the machine and reference translations and based their metric on comparing the two sets of chains. These chains were not annotated with any sort of grammatical relation (subject, object, etc.), so they are unlabeled dependencies.

Owczarzak et al. (2007) have extended the work by Liu and Gildea (2005) using labeled dependencies. They parsed the pairs of sentences with a Lexical Functional Grammar (LFG) parser by Cahill et al. (2004). In LFG, there are two components to every parse: a c-structure (i.e. a parse tree) and an f-structure, which encodes the grammatical functions and features of the lexical items. An example of an LFG parse from their paper is given below. F-structures are recursive structures in which a head contains all of its constituents. From the f-structure it is easy to construct dependency trees. The bonus is that the f-structure provides the grammatical relations between items in the dependency trees. In the example below, the dependency subj(resign, john) has the grammatical relation of subject. That is, John is the subject of the sentence headed by the verb resigned.

[Figure: c-structure and f-structure of two sentences with the same meaning, from Owczarzak et al. (2007)]

Their metric is then simply a comparison of these labeled dependencies using precision and recall to compute the f-score (harmonic mean). One of the coolest things in the paper is how they handle parser noise. Statistical parsers are not perfect. They estimate probabilities for rules from labeled data, and in natural language variation is pretty much unlimited, so no matter how big the training corpus, there will always be things the parser has never seen before. Also, we are dealing with imperfect input (from the MT systems or humans), so the problem of noise could be even greater. They address this by running 100 sentences through the various MT metrics they are comparing (including their own), using each sentence as both the reference and the machine translation. This produces the “perfect score” for each metric, since the two are identical. Next, adjuncts are rearranged in the sentence so that the meaning has not changed, but the structure has. Each MT metric now evaluates the new sentence against the original and computes a score. For the LFG parse, the f-structure should remain the same in both cases, so any divergence can be attributed to parser noise. To reduce this noise, they used the n-best parses and were able to increase the f-score, bringing it closer to the baseline (the ideal). So instead of just comparing the best parse for the reference and the machine translation, they combine the n-best parses to compute the f-score.
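
A minimal sketch of the comparison step (my own illustration, assuming the dependencies have already been extracted as (label, head, dependent) triples; the triples other than subj(resign, john) are made up for the example):

```python
# Minimal sketch: precision/recall/F1 over labeled dependency triples.
from collections import Counter

def dependency_fscore(hyp_deps, ref_deps):
    """hyp_deps, ref_deps: lists of (label, head, dependent) triples."""
    hyp_counts, ref_counts = Counter(hyp_deps), Counter(ref_deps)
    matched = sum((hyp_counts & ref_counts).values())   # multiset intersection
    precision = matched / len(hyp_deps) if hyp_deps else 0.0
    recall = matched / len(ref_deps) if ref_deps else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)  # unweighted harmonic mean (F1)

hyp = [("subj", "resign", "john"), ("adjunct", "resign", "yesterday")]
ref = [("subj", "resign", "john"), ("tense", "resign", "past")]
print(dependency_fscore(hyp, ref))  # 0.5: one of two dependencies matches on each side
```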

The result is that they get correlations with human judgments competitive with the best system they compare themselves to (METEOR, Banerjee and Lavie, 2005), beating it for fluency and coming in a close second overall. As far as future work goes, there are quite a few extensions they mention in the paper. The LFG parser produces 32 different types of grammatical relations. In the current setup, they are all weighted the same, but they would like to try tuning the weights to see how that affects the score. Another extension they propose is using paraphrases derived from a parallel corpus. There has been other work done on paraphrasing for MT evaluation (notably Russo-Lassner et al., 2005). One thing I am curious about is whether changing the weight on the harmonic mean would have an impact on correlation. METEOR uses the F9-score while the typical thing to do is F1. It’s not clear that weighting precision and recall equally is the best thing to do.
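
For reference (a standard definition, not something taken from these papers), the weighted harmonic mean in question can be written as

$$F_\alpha = \frac{P\,R}{\alpha P + (1-\alpha) R},$$

which gives the usual F1 = 2PR/(P+R) at α = 0.5, while METEOR’s recall-heavy Fmean = 10PR/(9P+R) corresponds to α = 0.9, weighting recall nine times as much as precision. Tuning that weight is exactly the kind of knob I am curious about.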

Interesting stuff, though. I hope they continue the work and maybe we’ll see something in this year’s ACL.

Update

Karolina Owczarzak has confirmed they were using the F1 score and that different F-scores did not lead to significant improvements. I also added the image I forgot to include in the original post.

References

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the Association for Computational Linguistics Conference 2005, pages 65-73, Ann Arbor, Michigan.

Aoife Cahill, Michael Burke, Ruth O’Donovan, Josef van Genabith, and Andy Way. 2004. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), July 21-26, pages 320-327, Barcelona, Spain.

Ding Liu and Daniel Gildea. 2005. Syntactic Features for Evaluation of Machine Translation. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization at the Association for Computational Linguistics Conference 2005, Ann Arbor, Michigan.

Karolina Owczarzak, Josef van Genabith, and Andy Way. 2007. Labelled Dependencies in Machine Translation Evaluation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 104-111, Prague, June 2007.

Grazia Russo-Lassner, Jimmy Lin and Philip Resnik. 2005. A Paraphrase-based Approach to Machine Translation Evaluation. Technical Report LAMP-TR-125/CS-TR-4754/UMIACS-TR-2005-57, University of Maryland, College Park, Maryland.

At ACL this year, the Third Workshop on Statistical Machine Translation will be held, and it features a shared task on MT evaluation. The shared task will involve evaluating output from the shared translation task, which will be released on March 24th, with short papers and rankings due on April 4th. I created an MT evaluation system (pdf) last year for a class (on MT, no less), though I doubt it would do particularly well: it outperformed BLEU but fell short of METEOR. In any case, it might be interesting to play with the data, and it will certainly be interesting to read the papers. My system performs sentence-level ranking as one of its primary goals, which is also a stated goal of the shared task.

This is the question I will have to answer over the next few weeks.

One of my classes this semester is the Advanced Machine Translation Seminar (and I hope that link works outside of CMU). Each of us registered for the class will present a certain topic in MT and then do a literature review on it by the end of the semester. Originally I had wanted to cover how word sense disambiguation (WSD) has been applied to statistical machine translation, but that overlapped with another topic on bringing context into MT. In simple terms, WSD is the task of figuring out which of a word’s many senses applies in the given circumstances. WSD systems use the context around the word to determine its sense, so WSD is just another way of bringing context into MT. We determined there was no clear way of separating the two topics, and since mine was the more specific, it seemed reasonable for me to change topics. No one else is presenting on machine translation evaluation (MT Eval), so I opted for that.

MT Eval is actually a pretty vibrant topic at the moment. For some quick background, machine translation systems produce woefully inadequate translations much of the time. If you have any doubt of this, try to translate a random web page using any of the many free online services. You will get many disfluencies, untranslated words, downright gibberish, and much worse. Not all of it will be bad, of course, but much of it will be. It is a hard problem, and many MT researchers believe it to be AI-complete (the Wikipedia article mentions MT explicitly). In order to improve machine translation, you need some way to automatically evaluate how well you are doing. Currently this is done using automatic metrics that compare machine output to (usually multiple) human translations (aka reference translations). The most commonly used metric is BLEU (pdf), but a rising star is METEOR, developed in part by one of my professors. I won’t go into these metrics any further here at the moment, and I recommend interested parties check out the papers. What these metrics aim to do is gauge how similar the machine output is to the reference translation(s).

The problem with MT Eval is that in order to be able to automatically tell whether something is a good translation, we would have to know exactly what goes into making a good translation (and by good I mean human-level). If we could do that, we would have solved MT!

More to come.