\documentclass[letterpaper, 11pt]{article}
%\usepackage[round]{natbib}
\usepackage{mathtools}
\usepackage{setspace}
\usepackage{dsfont}
\usepackage{amsfonts}
\usepackage{amsmath}
\usepackage{subcaption}
\usepackage{paralist}
%\usepackage{subfig}
\usepackage{times}
\usepackage{latexsym}
\usepackage{graphicx}
\usepackage[T1]{fontenc}
\usepackage{tikz}
\usepackage{url}
\usepackage{pgfplotstable}
\usepackage{titlesec}
\usepackage{color}
\usepackage{lipsum,adjustbox}
\usepackage[font={small}]{caption}
\usetikzlibrary{positioning}
\usepackage{bbm}
\makeatletter
\newcommand{\@BIBLABEL}{\@emptybiblabel}
\newcommand{\@emptybiblabel}[1]{}
%\makeatother
\usepackage[hidelinks]{hyperref}
\usepackage{acl2012}
\graphicspath{{./plots/}}
\newcommand{\com}[1]{}
%\newcommand{\oa\part{title}}[1]{}
%\newcommand{\lc}[1]{}
\newcommand{\oa}[1]{\footnote{\color{red}OA: #1}}
\newcommand{\oamod}[1]{{\color{red}#1}}
\newcommand{\lc}[1]{\footnote{\color{blue}LC: #1}}
\newcommand{\lcmod}[1]{{\color{blue}#1}}
\newenvironment{myequation}{
\vspace{-1em}
\begin{equation}
}{
\end{equation}
\vspace{-1.2em}
}
\newenvironment{myequation*}{
\vspace{-1em}
\begin{equation*}
}{
\end{equation*}
\vspace{-1.2em}
}
\begin{document}
\title{Conservatism and Over-conservatism in Grammatical Error Correction}
%\author{
% Leshem Choshen\textsuperscript{1} and Omri Abend\textsuperscript{2} \\
% \textsuperscript{1}School of Computer Science and Engineering,
% \textsuperscript{2} Department of Cognitive Sciences \\
% The Hebrew University of Jerusalem \\
% \texttt{leshem.choshen@mail.huji.ac.il, oabend@cs.huji.ac.il}\\
%}
\maketitle
\begin{abstract}
%Evaluation in Grammatical Error Correction (GEC) is generally carried out
%by comparison to references. Previous work discussed the necessary low
% coverage of such protocols given the multitude of different ways to correct a sentence,
%and proposed
%In this paper we discuss the impact of using
%discusses the implications of such reference-based evaluation on
Grammatical Error Correction systems (henceforth, {\it correctors}) aim to
correct ungrammatical text, while changing it as little as possible.
However, whereas such conservatism is a virtue for correctors,
we find that state-of-the-art systems make substantially fewer changes to the source sentences
than needed.
Analyzing the distribution of possible corrections for a given sentence,
we show that this over-conservatism likely stems from
the inability of a handful of reference corrections to account for the full variation of valid
corrections for a given sentence. This results in undue penalization of valid corrections,
thus disincentivizing correctors to make changes.
We also show that simply increasing the number of references is unlikely to resolve this problem,
and conclude by presenting an alternative reference-less approach based on semantic similarity.
%one by , and the other by using semantic evaluation.
%Does grammatical error correction systems learn not to learn?
%We show that state-of-the-art systems are over conservative and are reluctant to correct. We analyze the distributions of
% corrections showing that a single ungrammatical sentence tends to have hundreds of valid corrections, a problem for
%current evaluation methods which are based on a reference or two. We suspect it causes correctors to avoid correcting and proceed to analyze the effect of
%increasing amount of references in the gold standard on different evaluation measures. Discovering that more references are helpful but only to a
%certain point, we also find that semantic structures are promising as a measure that is not reference based.
\end{abstract}
\section{Introduction}
% Error correction
% evaluation in error correction and its centrality
% faithfulness to the source meaning is important, and this has been noted but prev work, and evaluation is geared towards it
% gap in evaluation: however, steps taken to ensure conservativeness in fact push towards formal conservativism by their definition (theoretical claim about the measure)
% this may result in systems that make few changes. indeed we find that this is the case (empirical claim about systems)
%
% we pursue two approaches to overcome this bias.
%
% 1. increasing the number of references. this has been proposed before and pursued with m=2, but no assessment of its sufficiency or its added value over m=1 has been made. In order to address this gap we first charachterize the distribution of possible corrections for a sentence. We leverage this characterization to characterize the distribution of the scores as a function of $m$, and consequently assess the biases introduced by taking $m=1,2$ as with previous approaches.
% We find that taking these values of $m$ drammatically under-estimate the system scores.
% We back our analysis of these biases with an analysis of the variance of these estimators.
% We analyze the two commonly used scores, the M2 score often used for evalauted, and the accuracy score commonly used in training.
%
% 2. we note that in fact the important factor is semantic conservativism and explore means to directly assess how semantically conservative systems here through the use of semantic annotation.
% We use the UCCA scheme as a test case, motivated by HUME.
% First question: is it well-defined on learner language. it is.
% Second question: are corrections in fact semantically conservate? to show that, we need to verify that the corrections make few (if any) semantic changes. our results indicate that this is the case: we show that the corrections are similar in (UCCA) structure to the source.
%
% conclusion (not in intro): we tried to use semantic similarity to improve systems.
% this is difficult due to semantic conservatism. we expect this will be in issue once evaluation is improved.
% future work.
% also future work: use multiple references in training (did people do that?)
%
% sections:
% 1. Introduction
% 2. Formal conservativism in GEC
% 3. First approach: Multiple References
% 3.1. A Distribution of Corrections
% 3.2. Scores (M2, accuracy index, accuracy exact)
% 3.3. Data
% 3.4. Bias of the Scores (setup + results)
% 3.5. Variance of the Scores (setup + results)
% 4. Second approach: Semantic Similarity
% 4.1. Semantic Annotation of Learner Language (prev work)
% 4.2. UCCA Scheme (see HUME)
% 4.3. Similarity Measures (including prev work of elior)
% 4.4. Empirical Validation: IAA, semantic conservativism vs. gold std
% 5. Conclusion
%
% is a challenging research field, which interfaces with many
%other areas of linguistics and NLP. The field
Grammatical Error Correction (GEC) is receiving considerable
interest recently, notably through the GEC-HOO \cite{dale2011helping,dale2012hoo} and
CoNLL shared tasks \cite{kao2013conll,ng2014conll}.
Within GEC, considerable effort has been placed on evaluation
\cite{tetreault2008native,madnani2011they,felice2015towards,napoles2015ground},
a notoriously difficult challenge, in part due to the many valid corrections a learner's language (LL) sentence may
have \cite{chodorow2012problems}.
An important criterion in the evaluation of correctors
is their ability to generate corrections that are faithful to the meaning of the source.
In fact, it has been argued that many would prefer a somewhat cumbersome
or even an occasionally ungrammatical correction over one that alters the meaning of the source \cite{brockett2006correcting}.
Consequently, annotators are often instructed to be conservative when compiling gold standard corrections for the task
(e.g., in the Treebank of Learner English \cite{nicholls2003cambridge}).
Several attempts have been made to formally capture this precision/recall asymmetry, such as the standardized use of $F_{0.5}$ rather than $F_{1}$ \cite{dahlmeier2012better} and the choice of weights in the I-measure \cite{felice2015towards}.
However, penalizing over-correction more harshly than under-correction
during development and training
may make correctors reluctant to make any changes (henceforth, {\it over-conservatism}).
Using only one or two reference corrections, a common practice in GEC,
compounds this problem, as correctors are not only harshly penalized for making incorrect changes,
but are often penalized for making {\bf correct} changes not found in the reference.
Indeed, we show that current state-of-the-art systems exhibit over-conservatism.
Evaluating the output of 12 recent correctors, we find that all of them
substantially under-predict corrections relative to the gold standard
(\S\ref{sec:formal_conservatism}).
As the gap in the prevalence of corrections between the references and the
correctors' output is often as large as an order of magnitude, this effect is unlikely to
be desirable (see discussion in \S\ref{sec:increase-reference}).
We first assess whether the undue penalization of valid corrections can be
resolved by increasing the number of references,
which we denote with $M$ (\S \ref{sec:increase-reference}).
We start by estimating the number and frequency distribution of the valid corrections per sentence,
arriving at an estimate of over 1000 corrections for sentences of no more than 15 tokens.
We then consider two representative reference-based measures (henceforth, {\it RBMs}) for
assessing the validity of a proposed correction relative to a set of references,
and characterize the distribution of their scores as a function of $M$.
Our results show that both measures substantially under-estimate the true performance of
the correctors. Moreover, they show that increasing $M$ only partially addresses
the incurred bias, as both RBMs approach saturation for $M$ values of 10--20,
indicating that a prohibitively large $M$ may be required for reliable estimation.
Our findings echo the results of \newcite{bryant2015far}, who study the effect of $M$
on $F$-score, the most commonly used measure for GEC. Their work focused on
obtaining a more reliable estimate of correctors' performance and proposed to do so
by normalizing corrector's estimated performance with the performance of a human corrector.
However, while such normalization may yield more realistic performance estimates,
it has no effect on the training and tuning of correctors.
%{\color{red} In fact, in \S\ref{sec:increase-reference} we show that
% current correctors may already surpass
% human correctors in terms of their single-reference F-score,
% suggesting that more substantial changes to the common evaluation protocol are in order.}
We conclude by proposing an alternative reference-less semantic evaluation approach which assesses the extent to which
a correction faithfully represents the semantics of the source, by measuring the similarity of their semantic structures (\S \ref{sec:Semantics}).
This approach can be combined with a reference-less measure of grammaticality, based on automatic error detection, as
proposed by \newcite{napoles-sakaguchi-tetreault:2016:EMNLP2016}.
Our experiments support the feasibility of the proposed approach,
by showing (1) that semantic structural annotation can be consistently and automatically applied to LL, (2) that the proposed measure is less prone to unduly penalize valid corrections, and (3) that the measure does penalize corrections that significantly alter the semantic structure.
%
%
%We define a measure, using the UCCA scheme \cite{abend2013universal} as a
%test case, motivated by its recent use for machine translation
%evaluation \cite{birch2016hume}.
%We annotate a section of the NUCLE parallel corpus \cite{dahlmeier2013building},
%
%The two approaches address the insufficiency of using too few references from
%complementary angles. The first attempts to cover more of the probability
%mass of valid corrections by taking a larger $M$,
%while the second uses semantic instead of string similarity, in order
%to abstract away from some of the formal variation between different valid corrections.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Over-Conservatism in GEC Systems}\label{sec:formal_conservatism}
%The field of GEC was always thriving or conservatism in its corrections, with the prominent example of using
%$F_{0.5}$ emphasizing precision over recall(\cite{ng2014conll}). we wish to highlight the problem that
%arises from pursuing this conservatism as done today.
%Then, we wished to be conservative, and we achieved that, why shouldn't we rejoice just yet? Theoretically, we might be progressing towards not correcting at all, instead of progressing towards correcting more accurately.
%
%Manual analysis showed excessive formal conservatism and under correction.
%Albeit important, manual analysis is not enough and we aimed for generating some quantitative measures.
%
%We demonstrate that current correctors
%suffer from over-conservatism: they tend to make too few changes to the source, relative to human correctors.
%{ \color{red} This is likely an indication of some hidden, cross-systems, widely spread bias.}
\subsection{Notation}
We assume each source sentence $x$ has a set of valid corrections $Correct_x$,
and a discrete distribution $\mathcal{D}_x$ over them, where $P_{\mathcal{D}_x}(y)$
for $y \in Correct_x$ is the probability a human annotator would correct $x$ as $y$.
Let $X$ be the evaluated set of source LL sentences, consisting of the sentences $x_{1},\ldots, x_{N}$, each independently sampled from some distribution $\mathcal{L}$ over LL sentences, and denote $\mathcal{D}_{i}\coloneqq \mathcal{D}_{x_i}$.
Each $x_i$ is paired with $M$ corrections $Y_i = \left\{y_{i}^{1},\ldots, y_{i}^{M}\right\}$,
which are independently sampled from $\mathcal{D}_{i}$.\footnote{Our analysis assumes $M$
is fixed across source sentences. Generalizing the analysis to sentence-dependent $M$
values is straightforward.}
We define the {\it coverage} of $M$ references for a sentence $x_i$ to be
$P(y \in Y_i|y \in Correct_i)$ for $Y_i$ of size $M$, and $y$ sampled
according to $\mathcal{D}_i$.
A corrector $C$ is a function from LL sentences to proposed corrections (strings).
An assessment measure is a function from $X$, $Y$ and $C$ to
a real number. We use the term ``true measure'' to refer to the measure's output where the references include all possible corrections, i.e., $Y_i=Correct_i$ for every $i$.
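As an illustration, the coverage of $M$ references can be estimated by Monte Carlo simulation. The sketch below is purely illustrative: the distribution \texttt{toy} and all function names are invented for this example rather than drawn from any GEC resource.

```python
import random

def coverage(dist, M, trials=20000, seed=0):
    """Monte Carlo estimate of P(y in Y_i | y in Correct_i): the
    probability that a correction y ~ D_x also appears among M
    references sampled independently from D_x."""
    rng = random.Random(seed)
    corrections = list(dist)
    weights = [dist[c] for c in corrections]
    hits = 0
    for _ in range(trials):
        y = rng.choices(corrections, weights=weights)[0]
        refs = set(rng.choices(corrections, weights=weights, k=M))
        hits += y in refs
    return hits / trials

# A toy D_x: one frequent correction and ten rare ones.
toy = {"freq": 0.5}
toy.update({f"rare{i}": 0.05 for i in range(10)})
```

On such a long-tailed toy distribution, coverage rises steeply for small $M$ and then saturates well below 1, because most of the remaining probability mass sits on rare corrections.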
\paragraph{Experimental Setup.}\label{par:experimental_setup}
We conduct all experiments on the NUCLE test dataset,
a parallel corpus of LL essays and their corrected versions,
which is the de facto standard in GEC.
The corpus contains 1414 essays in LL and 50 test essays, each of about 500 words.
We evaluate all participating systems in the CoNLL 2014 shared task,
in addition to three of the best performing systems on this dataset: a hybrid corrector, a phrase-based machine translation system, and a neural network based corrector.
The participating systems and their abbreviations: Adam Mickiewicz University (AMU),
University of Cambridge (CAMB), Columbia University and the University of Illinois at Urbana-Champaign (CUUI),
Indian Institute of Technology, Bombay (IITB), Instituto Politecnico Nacional (IPN),
National Tsing Hua University (NTHU), Peking University (PKU), Pohang University of Science and Technology (POST),
Research Institute for Artificial Intelligence, Romanian Academy (RAC), Shanghai Jiao Tong University (SJTU),
University of Franche Comt\'{e} (UFC), University of Macau (UMC), \newcite[RoRo]{rozovskaya2016grammatical}, \newcite[JMGR]{junczysdowmunt-grundkiewicz:2016:EMNLP2016} and \newcite[Char]{xie2016neural}.
All are trained and tested on the NUCLE corpus.
We compare the prevalence of changes made to the source by the correctors,
relative to their prevalence in the NUCLE references (the corrections of one of the annotators were arbitrarily selected). \newcite{bryant2015far} observed that the NUCLE references are more conservative than the ones they collected, which means that our results may even underestimate the correctors' conservatism relative to other reference sets.
In order to focus on the more substantial changes, we exclude from our evaluation
all non-alphanumeric characters, both within tokens and as tokens of their own.
\paragraph{Measures of Conservatism.}
We consider three types of divergences between the source and the reference.
First, we measure to what extent \emph{words} were changed: altered, deleted or added.
To do so, we compute word alignment between the source and the reference, casting it
as a weighted bipartite matching problem, between the source's words and the correction's.
Edge weights are assigned to be the edit distances
between the tokens.
We note that aligning words in GEC is much simpler than in machine translation,
as most of the words are kept unchanged, deleted fully, added, or changed slightly.
Following word alignment, we define the {\sc WordChange} measure
as the number of unaligned words and aligned words that were changed in any way.
Second, we quantify word \emph{order} differences using
Spearman's $\rho$ between the order of the words in the source sentence,
and the order of their corresponding words in the correction according to the word alignment.
$\rho=0$ where the word order is uncorrelated, and $\rho=1$ where the orders exactly match. We report the average $\rho$ over all source-correction sentence pairs.
Third, we report how many source sentences were split and how many were concatenated by the reference and by the correctors.
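The first two measures can be sketched as follows. This is an illustrative stand-in rather than our exact implementation: it replaces the exact weighted bipartite matching with a greedy pairing by edit distance, which coincides with the exact solution in the common case where most words are unchanged, and all function names are ours.

```python
def edit_distance(a, b):
    # Standard Levenshtein distance by dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def align(src, cor):
    """Greedily link the cheapest remaining (source word, correction
    word) pair; a stand-in for exact weighted bipartite matching."""
    pairs = sorted((edit_distance(s, c), i, j)
                   for i, s in enumerate(src) for j, c in enumerate(cor))
    used_s, used_c, links = set(), set(), []
    for d, i, j in pairs:
        if i not in used_s and j not in used_c:
            links.append((i, j, d))
            used_s.add(i)
            used_c.add(j)
    return links

def word_change(src, cor):
    # Unaligned words plus aligned words that were changed in any way.
    links = align(src, cor)
    changed = sum(d > 0 for _, _, d in links)
    unaligned = (len(src) - len(links)) + (len(cor) - len(links))
    return changed + unaligned

def spearman_rho(links):
    # Spearman's rho between source positions and the positions of
    # their aligned counterparts (all ranks are distinct).
    n = len(links)
    if n < 2:
        return 1.0
    js = [j for _, j, _ in sorted(links)]       # counterparts in source order
    rank = {j: r for r, j in enumerate(sorted(js))}
    d2 = sum((r - rank[j]) ** 2 for r, j in enumerate(js))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

For example, aligning ``He go to school yesterday .'' against ``He went to school yesterday .'' links every word to its unchanged counterpart, pairs ``go'' with ``went'', and yields a {\sc WordChange} of 1 with $\rho=1$.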
\com{
\begin{figure}
\centering
\begin{subfigure}[]{0.4\textwidth}
\includegraphics[width = \textwidth]{aligned}
\caption{Number of source sentences (y-axis) split
(right bars) or concatenated (left bars) in the correction, according to the gold standard (striped column) and different correctors (colored columns). The gold standard makes about an order of magnitude more splits and concatenations than the correctors.\label{fig:split}}
\end{subfigure}
\begin{subfigure}[]{0.4\textwidth}
\com{\caption{\label{fig:rho}}}
\includegraphics[width = \textwidth]{spearman_ecdf}
\caption{Empirical cumulative probability (y-axis) of a sentence to get Spearman's rho values (x-axis) of word alignment. The gold standard(dotted line) makes word change alterations to more sentences than the correctors, and within these sentences, it changes order more substantially.\label{fig:rho}}
\end{subfigure}
\begin{subfigure}[]{0.4\textwidth}
%\caption{\label{fig:words_changed}}
\includegraphics[width = \textwidth]{words_differences_heat}
\caption{Amount of sentences(heat) by number of words changed(x-axis) per system(y-label). The gold standard(bottom) corrects more words per sentences and more sentences relative to other systems.\label{fig:words_changed}}
\end{subfigure}
\com{\caption{(a) Number of source sentences (y-axis) split
(right bars) or concatenated (left bars) in the correction, according to the gold standard (striped column) and different correctors (colored columns). The gold standard makes about an order of magnitude more splits and concatenations than the correctors.\\
(b) Empirical cumulative probability (y-axis) of a sentence to get Spearman's rho values (x-axis) of word alignment. The gold standard(dotted line) makes word change alterations to more sentences than the correctors, and within these sentences, it changes order more substantially.\\
(c) Amount of sentences(heat) by number of words changed(x-axis) per system(y-label). The gold standard(bottom) corrects more words per sentences and more sentences relative to other systems.\\
See \S\ref{par:experimental_setup} for a legend
of the systems.}\label{fig:over-conservatism}}
\caption{\label{fig:over-conservatism}
See \S\ref{par:experimental_setup} for a legend
of the systems.}
\end{figure}
}
\begin{figure}
\centering
\begin{subfigure}[]{0.4\textwidth}
\includegraphics[width = \textwidth]{words_differences_heat}
\end{subfigure}
\begin{subfigure}[]{0.4\textwidth}
\includegraphics[width = \textwidth]{aligned}
\end{subfigure}
\begin{subfigure}[]{0.4\textwidth}
\includegraphics[width = \textwidth]{spearman_ecdf}
\end{subfigure}
\caption{\label{fig:over-conservatism}
The prevalence of changes of different types in correctors' output and in the NUCLE references.
The top figure presents the number of sentence pairs (heat) for each number of word changes
(x-axis; measured by {\sc WordChange}) for each of the different systems and the references (y-axis).
The middle figure presents the number of source sentences (y-axis) concatenated (right bars) or split (left bars) in the references (striped column) and in the correctors' output (colored columns).
The bottom figure presents the percentage of sentence pairs (y-axis) where the
Spearman $\rho$ values do not exceed a certain threshold (x-axis).
See \S \ref{par:experimental_setup} for a legend of the correctors.
The three figures show that under all measures, the gold standard references make
substantially more changes to the source sentences than any of the correctors,
in some cases an order of magnitude more.
}
\end{figure}
\vspace{-.2cm}
\paragraph{Results.}
% presents the outcome of the three measures.
%In \ref{fig:split} the amount of sentences each corrector has done is presented. In \ref{fig:words_changed} the accumulated sum of sentences by the words changed in each sentence of each of the correctors is presented. In \ref{fig:rho} the cumulative probability distribution of rho values out of all the sentences.
Results (Figure \ref{fig:over-conservatism}) show that the reference corrections alter considerably more source sentences than any of the correctors, and that within each changed sentence they change more words and make more word order changes, often an order of magnitude more. For example, the references contain 36 sentences with 6 word changes, whereas no corrector produced more than 5 such sentences.
A similar amount of correction is observed in the references of the Treebank of Learner English \cite{yannakoudakis2011new}.
%While $89.6\%$ of NUCLE sentences need corrections,
%The prevalence of FCE consists only of ungrammatical sentences.
%As expected, FCE is a bit less conservative than NUCLE by our measures.
%
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vspace{-.1cm}
\section{Multi-Reference Measures}\label{sec:increase-reference}
%
In this section we argue that the observed over-conservatism of correctors likely stems
from their being developed to optimize RBMs that suffer from low coverage.
We begin with a motivating analysis of the relation between low-coverage and over-conservatism (\S \ref{subsec:motivating_analysis}). We then continue with an empirical assessment of the distribution of corrections for a given sentence (\S \ref{subsec:corrections_distribution})
and the effect of $M$ on commonly used RBMs (\S \ref{subsec:Assessment-values}).
We discuss the implications of our results, concluding that multi-reference RBMs may only partially address over-conservatism (\S \ref{subsec:mult_discussion}).
%
\vspace{-.2cm}
\subsection{Motivating Analysis}\label{subsec:motivating_analysis}
%
The relation between coverage and over-conservatism requires some explanation.
We abstract away from the details of the training procedure and assume that correctors attempt to maximize an objective function over some training or development data; for simplicity of the argument, we assume that improvement is achieved by iterating over the samples, as with the Perceptron algorithm.
Suppose the corrector is faced with a phrase which it predicts to be ungrammatical. Let $p_{detect}$ be the probability that this prediction is correct,
and let $p_{correct}$ be the probability that the corrector is able to predict
a valid correction for this phrase (including correctly identifying it as erroneous).
Finally, assume that the corrector is evaluated
against $M$ references for which the coverage of the phrase is $p_{coverage}$,
namely the probability that
a valid correction will be found among $M$ randomly sampled references.
We now assume that the corrector may either propose the correction it finds most likely or leave the phrase unchanged. If it does not correct, its probability of being rewarded (i.e., of its output being in the reference set $Y$) is $1-p_{detect}$. Otherwise, its probability
of being rewarded is $p_{correct} \cdot p_{coverage}$.
In cases where
\vspace{.1cm}
\begin{small}
\begin{myequation}
\label{eq:reward}
p_{correct} \cdot p_{coverage} < 1-p_{detect}
\end{myequation}
\vspace{-.1cm}
\end{small}
a corrector is disincentivized from altering the phrase.
We expect Condition (\ref{eq:reward}) to frequently hold in cases that
require non-trivial changes, which are characterized both by low $p_{coverage}$ (as non-trivial
changes can often be made in numerous ways), and by lower expected performance by the corrector.
Moreover, asymmetric measures (e.g., $F_{0.5}$) penalize invalidly correcting more
harshly than not correcting an ungrammatical sentence.
In these cases, Condition (\ref{eq:reward}) should be rephrased as
\begin{small}
\vspace{-.1cm}
\begin{myequation*}
p_{correct} \cdot p_{coverage} - \left(1-p_{correct}p_{coverage}\right) \alpha < 1-p_{detect}
\end{myequation*}
\vspace{-.1cm}
\end{small}
where $\alpha$ is the ratio between the penalty for introducing a wrong correction and the reward for a valid correction. Condition (\ref{eq:reward}) is much more likely to hold in these cases.
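To make the disincentive concrete, the following sketch (with illustrative probabilities and function names of our own choosing) compares the expected reward for correcting against that of leaving the phrase unchanged; $\alpha=0$ recovers Condition (\ref{eq:reward}).

```python
def reward_if_correcting(p_correct, p_coverage, alpha=0.0):
    # Probability of being rewarded, minus the asymmetric penalty
    # (weighted by alpha) when the proposed correction is not rewarded.
    p = p_correct * p_coverage
    return p - (1 - p) * alpha

def should_correct(p_detect, p_correct, p_coverage, alpha=0.0):
    # Correct only if the expected reward beats leaving the phrase
    # unchanged, which is rewarded with probability 1 - p_detect.
    return reward_if_correcting(p_correct, p_coverage, alpha) > 1 - p_detect

# With a confident detector (p_detect = 0.9) but modest coverage
# (p_coverage = 0.3), correcting pays off under the symmetric
# condition (alpha = 0) but not once invalid corrections are
# penalized more harshly (alpha = 0.5).
```

Plugging in $p_{detect}=0.9$, $p_{correct}=0.6$, $p_{coverage}=0.3$: with $\alpha=0$ the expected reward for correcting is $0.18 > 0.1$, while with $\alpha=0.5$ it drops to $-0.23$, flipping the corrector's incentive toward leaving the phrase untouched.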
In order to validate this analysis empirically, we conduct an experiment to determine whether increasing
the number of references available for training indeed reduces conservatism. As there is no multiple-reference
corpus available which is large enough for re-training a corrector, we take an oracle reranking approach
as a simulation, and test whether the availability of increasingly many references to train against reduces
the corrector's conservatism.
Concretely, given a set of sentences, each paired with $M$ references, and given
a $k$-best list produced by a corrector, we define an oracle re-ranker that selects the highest
scoring correction of the $k$-best list, according to a given evaluation measure.
As a test case, we use the RoRo system, with $k$=100, and apply it to the
largest available LL corpus which is paired with a substantial amount of GEC references,
namely the NUCLE-test corpus, which has 12 references \cite{bryant2015far}. We use
the common F-score as an evaluation measure.
We examine the conservatism of the oracle reranker for different $M$ values, averaging
over 1312 samples of $M$ references from the available set of 12.
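The oracle re-ranking step itself is simple; the sketch below shows its core, with a toy token-overlap score standing in for the $F$-score computation over extracted edits that we actually use. The sentences and function names are illustrative only.

```python
def overlap_score(candidate, references):
    """Toy proxy for the F-score: best token-set Dice overlap with
    any available reference."""
    c = set(candidate.split())
    return max(2 * len(c & set(r.split())) / (len(c) + len(set(r.split())))
               for r in references)

def oracle_rerank(kbest, references, score=overlap_score):
    """Select from the k-best list the candidate that scores highest
    against the available references."""
    return max(kbest, key=lambda cand: score(cand, references))

# A hypothetical 3-best list and a single reference.
kbest = ["We discussed about the problem .",
         "We discussed the problem .",
         "We talked about the problem ."]
refs = ["We discussed the problem ."]
```

With more references, more of the non-conservative candidates in the $k$-best list can be matched and therefore selected, which is the mechanism behind the trend in Figure \ref{fig:reranking_word_change}.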
\com{
As this might be insightful to the way those M references are being used in practice at the training and test sets, we rerank over NUCLE references for M=1,2 and sample by random otherwise. Additionally we report the results with \newcite{bryant2015far} 10 references.}
Our results show that word changes increase with $M$ (Figure \ref{fig:reranking_word_change}). No significant difference is found in word order.
%\footnote{As we only rerank individual sentences,
% there is clearly no change in the number of sentences split or concatenated.}
This indicates that conservatism is indeed related to the number of references available
to the learner.\footnote{We do not see a reason why this reduction in conservatism may
result from the setup itself. In fact, oracle reranking may in some cases result
in additional conservatism, because, as more references are used, some of them may be more
similar to the source, thus increasing conservatism.}
%\com{Ideally, in order to validate this empirically, we should re-train correctors using multiple references, and re-examine their conservatism. However, corpora annotated with more than one correction are scarce.
%As a proof of concept, we simulate a re-ranking procedure over the corpora of \newcite{bryant2015far}, which provide additional 10 references for each of sentence in the NUCLE test set. In order to abstract away from implementation details and from artefacts that may result from the small dataset available, we explore an oracle re-ranking setting, where the correction with the best Micro $F$-score taken from the 100-best list of the RoRo state-of-the-art corrector (see \S 2) is selected. The $F$-score is computed with varying numbers of references ($M$).}
\begin{figure}
\vspace{-1em}
\includegraphics[width=8cm]{words_differences_hist_reranking}
\caption{The number of sentences (y-axis) with a given number of words changed (x-axis), following oracle reranking with different $M$ values (column colors). $M$ references are sampled at random from all the available references; the BN column represents the oracle reranking results against the 10 references of \newcite{bryant2015far} ($M=10$).
The figure shows that tuning against a larger number of references indeed reduces conservatism.
\label{fig:reranking_word_change}
}
\vspace{-0.5cm}
\end{figure}
\subsection{Data}
%
Our analysis assumes that we have a reliable estimate for the distribution of corrections
$\mathcal{D}_x$ of the source sentences we evaluate.
Our experiments in the following section are run on a random sample of 52 sentences with a
maximum length of 15 from the NUCLE test data.
The length restriction avoids introducing too many independent
errors, which may drastically increase the number of annotation variants (as every combination of corrections for these errors is possible), resulting in an unreliable estimate of $\mathcal{D}_x$.
Sentences with fewer than 6 words were discarded, as they were mostly the result of sentence segmentation errors.
Crowdsourcing has proven effective in GEC evaluation \cite{madnani2011they,napoles2015ground} and in
related tasks such as machine translation \cite{zaidan2011crowdsourcing,post2012constructing}. We thus
use crowdsourcing for obtaining a sample from $\mathcal{D}_x$. Specifically, for each of the 52 source
sentences, we elicited 50 corrections from Amazon Mechanical Turk workers.
%allowing for a reliable estimation of the distributions.
Aiming to judge grammaticality rather than fluency, we asked the workers to
correct only where necessary, and not for style.
Four sentences required no correction according to almost half the workers and were hence discarded.
%
\subsection{Estimating the Distribution of Corrections}\label{subsec:corrections_distribution}
%
We begin by estimating $\mathcal{D}_x$ for each sentence, using the crowdsourced corrections.
We use {\sc UnseenEst} \cite{zou2015quantifying}, a non-parametric algorithm that
estimates a multinomial distribution in which the individual values do not matter,
only the distribution of probabilities
across values. {\sc UnseenEst} aims to minimize the ``earthmover distance''
between the estimated histogram and the histogram of the true distribution, and has obtained excellent empirical
results in simulations. Intuitively, if histograms are viewed as piles of dirt, the earthmover distance is the minimal amount of dirt moved, times the distance by which it is moved.\footnote{An implementation of {\sc UnseenEst} can be found in <to be disclosed upon publication>.}
{\sc UnseenEst} was originally developed for assessing how many
variants a gene might have, including undiscovered ones,
and their relative frequencies.
This is a similar setting to the one tackled here.
Our manual tests of {\sc UnseenEst} with small artificially created datasets
showed satisfactory results.\footnote{All data we collected, along with the estimated
distributions can be found in <to be disclosed upon publication>}
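For intuition, for one-dimensional histograms over the same unit-spaced bins the earthmover distance reduces to a cumulative-difference sum. A minimal sketch (this is the distance itself, not the UnseenEst estimator):

```python
def earthmover_1d(p, q):
    """Earthmover distance between two 1-D histograms over the same
    unit-spaced bins: total mass moved times distance moved."""
    assert len(p) == len(q)
    emd, carry = 0.0, 0.0
    for pi, qi in zip(p, q):
        carry += pi - qi   # surplus "dirt" carried over to the next bin
        emd += abs(carry)
    return emd
```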
According to the {\sc UnseenEst} estimates, most source sentences have a large number of
low-probability corrections, which together account for the bulk of the probability mass,
and a rather small number of frequent corrections.
%The estimated distributions tend to have steps, with many corrections with the same (low) frequency.
Table \ref{tab:corrections_dist} presents the mean number of different corrections with frequency at least
$\gamma$ (for different values of $\gamma$), and their total probability mass.
For instance, on average 74.34 corrections per sentence, each occurring with a frequency of
0.1\% or higher, account for 75\% of the total probability mass of the corrections.
\begin{table}[h!]
\vspace{-0.5cm}
\centering
\small
\singlespacing
\begin{tabular}{c|c|c|c|c|}
%\cline{2-5}
& \multicolumn{4}{c|}{Frequency Threshold ($\gamma$)}\\
%\cline{2-5}
& \multicolumn{1}{c}{0} & \multicolumn{1}{c}{0.001} & \multicolumn{1}{c}{0.01} & \multicolumn{1}{c|}{0.1}
\\
\hline
Variants & 1351.24 & 74.34 & 8.72 & 1.35
\\
Mass & 1 & 0.75 & 0.58 & 0.37\\
\hline
\end{tabular}
\caption{\label{tab:corrections_dist}
Estimating the distribution of corrections $\mathcal{D}_x$.
The table presents the mean number of corrections per sentence with probability of more than
$\gamma$ (top row), as well as their total probability mass (bottom row).
}
\vspace{-0.3cm}
\end{table}
The overwhelming number of rare corrections raises the question of whether these can be regarded as noise.
To test this, we conducted another crowdsourcing experiment, in which 3 annotators were asked to
judge whether a correction produced in the first experiment is indeed a valid correction.
Figure \ref{fig:validity_judgements} presents the
frequency with which annotators judged a correction to be valid, where corrections are grouped by the number of times they appear in the data.
Results show that the original frequency of a correction has little effect on how often it was deemed
valid: even the rarest corrections were judged valid 78\% of the time.
\begin{figure}[h!]
\vspace{-.3cm}
\includegraphics[width=8cm]{IAA_confirmation_frequency}
\caption{The mean frequency ($y$-axis) with which a correction produced
a given number of times ($x$-axis) was judged to be valid.
} \label{fig:validity_judgements}
\vspace{-0.3cm}
\end{figure}
\subsection{Under-estimation as a function of M} \label{subsec:Assessment-values}
In the previous section we presented an empirical assessment of the corrections distribution of a sentence. We turn to estimating the resulting bias, i.e., the under-estimation of RBMs, for different $M$ values.
We discuss two similarity measures: sentence-level accuracy
(or ``Exact Match'') and the GEC $F$-score.
\paragraph{Sentence-level Accuracy.}
Sentence-level accuracy is the percentage of corrections that
exactly match one of the references.
Accuracy is a basic, interpretable measure, used in GEC by, e.g., \newcite{rozovskaya2010annotating}.
It is also closely related to the 0-1 loss function commonly used
for training statistical correctors \cite{chodorow2012problems,rozovskaya2013joint}.
Formally, given test sentences $X=\{x_1,\ldots,x_N\}$,
their references $Y_1,\ldots,Y_N$, and a corrector $C$,
we define $C$'s accuracy to be
\begin{small}
\vspace{-0.2cm}
\centering
\begin{myequation}\label{eq:acc_def}
Acc\left(C;X,Y\right) = \frac{1}{N} \sum_{i=1}^N \mathds{1}_{C(x_i) \in Y_i}.
\end{myequation}
\end{small}
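The definition above translates directly into code; a minimal sketch, where the corrector is any function from a source sentence to a single correction:

```python
def accuracy(corrector, sentences, references):
    """Sentence-level accuracy: the fraction of sentences whose
    correction exactly matches one of the references."""
    hits = sum(1 for x, Y in zip(sentences, references) if corrector(x) in Y)
    return hits / len(sentences)
```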
Note that $C$'s accuracy is, in fact, an estimate of $C$'s probability to produce
a valid correction for a sentence, or $C$'s {\it true accuracy}. Formally:
\begin{small}
\centering
\vspace{-0.2cm}
\begin{myequation*}
TrueAcc\left(C\right) = P_{x\sim\mathcal{L}}\left(C\left(x\right)\in Correct_x\right).
\end{myequation*}
\vspace{-0.15cm}
\end{small}
The bias of $Acc\left(C;X,Y\right)$ for a sample of $N$ sentences, each paired with $M$ references
is then
\vspace{-0.6cm}
\begin{small}
\centering
\begin{flalign}
&TrueAcc\left(C\right) - \mathbb{E}_{X,Y}\left[Acc\left(C;X,Y\right)\right] = &\\
&TrueAcc\left(C\right) - P\left(C\left(x\right) \in Y\right) = &\\
&Pr\left(C\left(x\right) \in Correct_x\right) \cdot &\\
&\label{eq:bias} \left(1 - Pr\left(C\left(x\right) \in Y \vert C\left(x\right) \in Correct_x\right) \right) &
\end{flalign}
\end{small}
\vspace{-1.5em}
We observe that the bias, denoted $b_M$, is not affected by $N$, only by $M$.
As $M$ grows, $Y$ approximates $Correct_x$ better, and $b_M$ tends to 0.
In order to gain insight into the evaluation measure and the GEC task
(and not the idiosyncrasies of specific systems), we consider an idealized learner,
which, when correct, produces a valid correction with the same
distribution as a human annotator (i.e., according to $\mathcal{D}_x$).
Formally, we assume that, if $C(x) \in Correct_x$ then $C(x) \sim \mathcal{D}_x$.
Hence the bias $b_M$ (Equation \ref{eq:bias}) can be re-written as
\begin{small}
\vspace{-0.2cm}
\begin{myequation*}
\centering
P(C(x) \in Correct_x) \cdot (1 - P_{Y \sim \mathcal{D}_x^M, y\sim \mathcal{D}_x}(y \in Y)).
\end{myequation*}
\end{small}
We will henceforth assume that $C$ is perfect (i.e., its true accuracy $Pr\left(C(x) \in Correct_x\right)$ is 1).
Note that assuming any other value for $C$'s true accuracy
would simply scale $b_M$ by that accuracy.
Similarly, assuming only a fraction $p$ of the sentences require correction scales $b_M$ by $p$.
We estimate $b_M$ using its empirical mean on our experimental corpus:
\begin{small}
\vspace{-1em}
\begin{myequation*}
\hat{b}_M = 1 - \frac{1}{N}\sum_{i=1}^N P_{Y \sim \mathcal{D}_i^M, y \sim \mathcal{D}_i}\left(y \in Y\right).
\end{myequation*}
\end{small}
Using the {\sc UnseenEst} estimations of $\mathcal{D}_i$, we can compute $\hat{b}_M$
for any size of $Y_i$ (value of $M$).
However, as this is highly computationally demanding, we estimate it using
sampling. Specifically, for every $M = 1,...,20$ and $x_i$, we sample $Y_i$ 1000 times
(with replacement), and estimate $P\left(y \in Y_i\right)$ as the covered probability mass
$P_{\mathcal{D}_i}\{y: y \in Y_i\}$.
We repeated all our experiments with $Y_i$ sampled without replacement,
in order to simulate the case where reference corrections are collected by a single
annotator and are thus not repeated. We find similar trends, with a faster increase
in accuracy, which reaches over $0.47$ at $M=10$.
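The sampling procedure can be sketched as follows, with each correction distribution $\mathcal{D}_i$ represented as a dictionary from corrections to probabilities. This is an illustrative sketch, not the exact experimental code:

```python
import random

def estimate_bias(distributions, M, samples=1000, rng=random.Random(0)):
    """Monte-Carlo estimate of b_M: for each sentence's correction
    distribution D_i (dict: correction -> probability), sample a
    reference set Y of size M with replacement and average the
    probability mass of D_i NOT covered by Y."""
    total_covered = 0.0
    for D in distributions:
        corrections, probs = zip(*D.items())
        covered = 0.0
        for _ in range(samples):
            Y = set(rng.choices(corrections, weights=probs, k=M))
            covered += sum(D[y] for y in Y)   # mass of D_i inside Y
        total_covered += covered / samples
    return 1.0 - total_covered / len(distributions)
```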
\begin{figure}
\vspace{-1em}
\includegraphics[width=8cm]{noSig_repeat_1000_accuracy}
\caption{Accuracy and Exact Index Match values for a perfect corrector (y-axis)
as a function of the number of references $M$ (x-axis).
%Each data point is paired with a confidence interval ($p=.95$).
} \label{fig:accuracy_vals}
\vspace{-0.5cm}
\end{figure}
Figure \ref{fig:accuracy_vals} presents the expected accuracy values for our perfect
corrector (i.e., 1-$\hat{b}_M$) for different values of $M$.
Results show that even for values of $M$ much larger than those normally used (e.g., $M=20$),
the expected accuracy is only about 0.5. As $M$ increases, the contribution of each additional reference
diminishes, to the point where it adds little to the accuracy (the slope is about 0.004 around $M=20$).
We also experiment with a more relaxed measure, {\it Exact Index Match}, which is only sensitive
to the identity of the changed words and not to what they were changed to.
Formally, two corrections $c$ and $c'$ over a source sentence $x$ match
if for their word alignments with the source (computed as above) $a:\{1,...,\left|x\right|\} \rightarrow \{1,...,\left|c\right|,Null\}$
and $a':\{1,...,\left|x\right|\} \rightarrow \{1,...,\left|c'\right|,Null\}$, it holds that $c_{a\left(i\right)} \neq x_{i}$ iff $c'_{a'\left(i\right)} \neq x_{i}$, where $c_{Null}=c'_{Null}$.
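Exact Index Match can be sketched as follows, using difflib's longest-match alignment as a stand-in for the word aligner used in the paper. Note that pure insertions mark no source index as changed, consistent with the definition above, which is stated over source positions:

```python
import difflib

def changed_indices(source, correction):
    """Indices of source tokens that are changed in the correction,
    under a simple difflib word alignment (a stand-in for the
    paper's aligner)."""
    sm = difflib.SequenceMatcher(a=source, b=correction)
    changed = set()
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':           # 'replace' or 'delete' spans
            changed.update(range(i1, i2))
    return changed

def exact_index_match(source, c1, c2):
    """Two corrections match iff they change the same source positions,
    regardless of what they change them to."""
    return changed_indices(source, c1) == changed_indices(source, c2)
```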
Figure \ref{fig:accuracy_vals} also presents the expected accuracy in this case
for different values of $M$, which indicate that while scores of a perfect corrector are somewhat higher,
still with $M=10$, it is 0.54.
As Exact Index Match can be interpreted as an accuracy measure for error detection (rather than correction),
this indicates that error detection evaluation suffers from similar difficulties.
%
The analytic tools we have developed support the computation of the entire distribution of the accuracy,
and not only its expected value. From Equation \ref{eq:acc_def} we see that accuracy has a Poisson Binomial distribution (i.e., it is a sum of independent Bernoulli variables with different success probabilities), whose success probabilities are $P_{y,Y \sim \mathcal{D}_i}(y \in Y)$; these can be computed, as before, using {\sc UnseenEst}'s estimate for $\mathcal{D}_i$. Estimating the density function allows for a straightforward definition of significance tests for the measure, and can be performed efficiently \cite{hong2013computing}.\footnote{An implementation of this method and the estimated density functions will be released upon publication.}
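The Poisson Binomial density can be computed exactly with a simple $O(n^2)$ dynamic program; the method of \cite{hong2013computing} is more efficient, but the sketch below illustrates the distribution:

```python
def poisson_binomial_pmf(probs):
    """Exact PMF of a sum of independent Bernoulli(p_i) variables,
    built by folding in one variable at a time (O(n^2))."""
    pmf = [1.0]                            # distribution of an empty sum
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)       # this Bernoulli fails
            nxt[k + 1] += mass * p         # this Bernoulli succeeds
        pmf = nxt
    return pmf
```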
\paragraph{$F$-Score.}
While accuracy is commonly used as a loss function for training GEC systems,
the $F_\alpha$ score is standard when reporting system performance (and consequently in hyper-parameter
tuning).
Computing $F$-score for GEC is not at all straightforward.
The score is computed in terms of {\it edit} matches between a correction and the references,
where edits are sub-strings of the source that are replaced in the correction/reference.
The HOO shared task used an earlier version of $F$-score, which required that the proposed corrections include edits explicitly.
Later on, relieving correctors of the need to produce edits, the $F$-score was redefined optimistically, maximizing
over all possible annotations that generate the correction from the source.\footnote{Since our crowdsourced corrections
do not include an explicit annotation of edits, we produce edits heuristically.}
$M^2$ \cite{dahlmeier2012better} is the standard tool for computing the $F$-score in GEC.
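The edit-based $F_\beta$ at the core of such tools can be sketched over fixed edit sets, e.g., tuples of (start, end, replacement). $M^2$ additionally maximizes this over all annotations that generate the correction, which is omitted in this sketch:

```python
def edit_f(sys_edits, gold_edits, beta=0.5):
    """F_beta over sets of edits, each e.g. a (start, end, replacement)
    tuple. A fixed gold edit set is assumed; M^2 would maximize over
    possible gold annotations."""
    tp = len(sys_edits & gold_edits)       # edits matching the gold set
    if tp == 0:
        return 0.0
    p = tp / len(sys_edits)                # precision
    r = tp / len(gold_edits)               # recall
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```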
The complexity of the measure prohibits an analytic approach.
We instead use bootstrapping to estimate the bias incurred
by not being able to exhaustively enumerate the set of valid corrections.
As with accuracy,
in order to avoid confounding our results with system-specific biases,
we assume the evaluated corrector is perfect and sample its corrections from the human distribution of corrections $\mathcal{D}_x$.
Concretely, given values for $M$ and $N$, we uniformly sample source sentences $x_1,...,x_N$ from our experimental
corpus, and $M$ corrections (with replacement) for each, yielding $Y_1,...,Y_N$.
Setting a realistic value for $N$ in our experiments is important
for obtaining results comparable to those obtained on the NUCLE corpus (see below),
as the expected value of the $F$-score may depend on $N$ (unlike accuracy, it is not additive).
In accordance with the NUCLE test set,
we set $N=1312$ and assume that 136 of the sentences require no correction.
Including the latter is important for obtaining realistic results, as such sentences reduce the overall bias in proportion to their frequency in the corpus.
We use the accelerated bootstrap procedure \cite{efron1987better}, with 1000 iterations.
We also report confidence intervals ($p=.95$), computed using the same procedure.\footnote{We
use the Python scikits.bootstrap implementation.}
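The bootstrap idea can be sketched with the simpler percentile variant; the accelerated (BCa) variant used here additionally corrects for bias and skew in the resampled statistic:

```python
import random

def bootstrap_ci(values, stat, iters=1000, alpha=0.05, rng=random.Random(0)):
    """Percentile bootstrap confidence interval for a statistic over
    per-sentence scores. (The paper uses the bias-corrected
    accelerated, BCa, variant instead.)"""
    stats = []
    for _ in range(iters):
        resample = [rng.choice(values) for _ in values]  # with replacement
        stats.append(stat(resample))
    stats.sort()
    lo = stats[int((alpha / 2) * iters)]
    hi = stats[int((1 - alpha / 2) * iters) - 1]
    return lo, hi
```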
Figure \ref{fig:F_Ms} presents the results of this procedure, which
further indicate the insufficiency of commonly used $M$ values for training and development (1 or 2)
for obtaining a reliable estimation of a corrector's performance.
For instance, the $F_{0.5}$ score for our perfect corrector, whose true $F$-score is 1,
is only 0.42 with $M=2$.
Moreover, the saturation effect observed for accuracy is even more
pronounced in this setting.
The $F$-score coverage experiment is very similar to that of \newcite{bryant2015far},
who also compared the $F$-score of a human correction against an increasing number of references,
and indeed obtained similar results.
Their experiments differ from those reported in this section in that
they did not attempt to estimate the distribution of corrections, and focused exclusively on the $F$-score
measure.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Significance of Real-World Correctors}\label{sec:real_world}
The bootstrapping method for computing the significance of the $F$-score can also
be useful for assessing the significance of the differences in correctors' performance
reported in the literature.
We apply the bootstrapping protocol (\S \ref{subsec:Assessment-values})
to compute confidence intervals for different correctors on the current NUCLE
test data ($M=2$).
\begin{figure}
\includegraphics[width=8cm]{$F_{0.5}$_Ms_significance}
\caption{
$F_{0.5}$ values for a perfect corrector (y-axis) as a function of the number of references $M$ (x-axis).
Each data point is paired with a confidence interval ($p=.95$).\label{fig:F_Ms}}
\vspace{-0.5cm}
\end{figure}
\begin{figure}
\includegraphics[width=8cm]{$F_{0.5}$_significance}
\caption{$F_{0.5}$ values for different correctors, including confidence interval ($p=.95$).
The left-most column (``source'') presents the $F$-score of a corrector that does not make any
changes to the source sentences.
See \S \ref{par:experimental_setup} for a legend of the correctors.\label{fig:F_correctors}}
\vspace{-0.5cm}
\end{figure}
Our results (Figure \ref{fig:F_correctors}) present a mixed picture: some
of the differences between previously reported $F$-scores are indeed significant and some are not.
For example, the best performing corrector is significantly better than the second, but the latter
is not significantly better than the third and fourth.
\subsection{Discussion}\label{subsec:mult_discussion}
Our empirical results show that the number of reference corrections needed for reliable reference-based measures may
be prohibitively large in practice.
Results suggest that there are hundreds of valid corrections with low probability, whose total probability mass
is substantial. RBMs such as accuracy and the $F$-score thus show diminishing returns from increasing $M$ beyond values of about 10.
Returning to condition (\ref{eq:reward}) (\S \ref{subsec:motivating_analysis}), we find that the coverage
(which is equal to the accuracy depicted in Figure \ref{fig:accuracy_vals})
is lower than 0.5 for $M=2$ on average (for short sentences). For cases of non-trivial
changes, we expect it might be even lower, suggesting that condition (\ref{eq:reward}) often
holds in practice, incentivizing over-conservatism.
Considering the $F$-scores of the best-performing systems in Figure \ref{fig:F_correctors}, and
comparing them to the $F$-score of a perfect corrector with $M=2$, we find that their scores are comparable;
RoRo, in fact, surpasses the perfect corrector's $F$-score.
While it is possible that these correctors outperform the perfect corrector by learning to
correct a sentence in the same way as one of the NUCLE annotators, we view this possibility
as unlikely, since our results (\S\ref{sec:formal_conservatism}) show that
the output of these systems diverges considerably from NUCLE's references.
A more likely explanation for these systems' high performance relative to a perfect corrector
is that they have learned to predict when not to correct.
Two RBMs have recently been proposed.
One is {\sc I-measure} \cite{felice2015towards},
which introduces novel features to GEC evaluation, such as distinguishing
different quality levels of ungrammatical corrections (e.g., some improve the quality of
the source, while others degrade it), and restricting edits to only consist of single words,
rather than phrases. The other is GLEU \cite{napoles2015ground}, an adaptation of BLEU that
was shown to correlate well with human rankings. We expect our finding that RBMs substantially under-estimate the
performance of correctors to generalize to these measures as well, as they all
apply string similarity relative to a fairly small number of references.
These measures thus address gaps in GEC evaluation that are orthogonal to the ones presented here.
Adopting the proposal of \newcite{sakaguchi2016reassessing} to emphasize fluency over grammaticality
in reference corrections would only compound this problem, as it results in a larger number of valid corrections.
Finally, note that addressing under-estimation by comparing to
a human expected score (in our terms, a perfect corrector) with the same $M$ \cite{bryant2015far},
does not address over-conservatism, as it only
scales the original measure. Moreover, as seen above, a human correction's score
is not necessarily an upper bound, as an over-conservative corrector may surpass a perfect corrector in performance.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Semantic Faithfulness Measure}\label{sec:Semantics}
In this section we propose a measure that eschews the use
of reference corrections, instead measuring the semantic faithfulness of the proposed
correction to the source.
Concretely, we propose to measure the semantic similarity of the source and the proposed correction
through the graph similarity of their representations.
Such a measure has to be complemented with an
error detection procedure, as it only captures faithfulness, i.e., the extent to which
the meaning of the source is preserved in the correction,
and not grammaticality.
See \newcite{napoles-sakaguchi-tetreault:2016:EMNLP2016}
for a proposal of a complementary measure based
on automatic error detection.
A similar decomposition of output quality into adequacy (similar to faithfulness)
and fluency (related to grammaticality) has
been used in machine translation (MT) evaluation (e.g., \cite{banchs2015adequacy}).
Another related line of work studies MT reference-less measures
\cite{reeder2006measuring,albrecht2007regression,specia2009estimating,specia2010machine}.
Of these, perhaps the approach most closely related to ours is MEANT \cite{lo2011meant}, which is based on comparing the semantic role labeling structures of the source and the output.
As a test case, we use the UCCA scheme \cite{abend2013universal} as the semantic representation,
motivated by its recent use in semantic MT evaluation \cite{birch2016hume}.
We choose UCCA for its wider coverage of predicate types, as opposed to MEANT's focus on verbal structures; see \cite{birch2016hume} for discussion. Future work will experiment with a wider variety of semantic representations, both symbolic and distributional (e.g., \cite{cheng2015syntax}).
We conduct two experiments supporting the feasibility of our approach.
Through IAA experiments, we show that semantic annotation can be consistently applied to LL,
and that a perfect corrector scores high on this measure.
We conclude by showing that the measure is sensitive to changes in meaning, by comparing
the semantic structures of the source to corrections of fairly poor quality.
\subsection{Structural Representation in LL}
%
%The usefulness of syntactic parsing in NLP has encouraged a number of previous
%projects to define syntactic annotation for LL.
While linguistic theories propose that each learner makes consistent use of syntax \cite{huebner1985system,tarone1983variability}, this use may not conform to the syntax of the learned language, or to that of any other known language. This entails difficulties in defining syntactic annotation for LL, as, on the face of it, the language of each learner has to be annotated in its own terms.
LL resources annotate syntactic errors in different ways.
\newcite{berzak2016universal} and \newcite{ragheb2012defining}
annotate according to the syntax used
by the learner, even if this use is not grammatical.
Such annotation may be unreliable as a source of semantic information, as semantically similar sentences, formulated by different learners, may use considerably different structures. \newcite{nagataphrase} take the opposite approach and try to be faithful to the syntax intended by the learner; this is also the case in work on parser robustness, which assumes that grammar should convey meaning and remain robust to errors \cite{bigert2005unsupervised,foster2004parsing}. However, such an approach faces difficulties due to the multitude of different syntactic structures that can be used to express a similar meaning.
%
%Syntactic representation is very popular and useful in many NLP tasks
%\cite{mesfar2007named,ng2002improving,zollmann2006syntax}.
%Thus, one thought that comes to mind is to use grammar annotation
%for LL.
%While not useless, grammatical approach is not well
%defined, and unclear both practically and theoretically.
In this section, we use semantic annotation to structurally
represent LL text. Semantic structures are faithful to the intended
meaning of the sentence rather than to its formal realization, and thus face
fewer conflicts in cases where the syntactic structure used diverges from
the intended one. We are not aware of any previous attempt to semantically
annotate LL text.