udapi.block.eval package¶
Submodules¶
udapi.block.eval.conll17 module¶
Block and script eval.Conll17 for evaluating LAS, UAS, etc., as in the CoNLL 2017 UD shared task.
This is a reimplementation of the official CoNLL 2017 shared task evaluation script: http://universaldependencies.org/conll17/evaluation.html
The gold trees and predicted (system-output) trees need to be sentence-aligned, e.g. using util.ResegmentGold. Unlike in eval.Parsing, the gold and predicted trees can have different tokenization.
An example usage and output:
$ udapy read.Conllu zone=gold files=gold.conllu \
read.Conllu zone=pred files=pred.conllu ignore_sent_id=1 \
eval.Conll17
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Words      |     27.91 |     52.17 |     36.36 |    100.00
UPOS       |     27.91 |     52.17 |     36.36 |    100.00
XPOS       |     27.91 |     52.17 |     36.36 |    100.00
Feats      |     27.91 |     52.17 |     36.36 |    100.00
Lemma      |     27.91 |     52.17 |     36.36 |    100.00
UAS        |     16.28 |     30.43 |     21.21 |     58.33
LAS        |     16.28 |     30.43 |     21.21 |     58.33
CLAS       |     10.34 |     16.67 |     12.77 |     37.50
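Here precision = correct / predicted, recall = correct / gold, F1 is their harmonic mean, and AligndAcc is the accuracy restricted to word pairs that were successfully aligned (correct / aligned). As a sanity check of the UAS row above, using hypothetical raw counts inferred from the percentages (43 predicted words, 23 gold words, 12 aligned pairs, 7 correct heads; the block prints only the percentages):

# Hypothetical raw counts inferred from the UAS row above.
predicted, gold, aligned, correct = 43, 23, 12, 7
precision = correct / predicted                      # 16.28 %
recall = correct / gold                              # 30.43 %
f1 = 2 * precision * recall / (precision + recall)   # 21.21 %
aligned_acc = correct / aligned                      # 58.33 %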
For evaluating multiple systems and testsets (as in CoNLL 2017) stored in systems/system_name/testset_name.conllu you can use:
#!/bin/bash
SYSTEMS=`ls systems`
[[ $# -ne 0 ]] && SYSTEMS=$@
set -x
set -e
for sys in $SYSTEMS; do
  mkdir -p results/$sys
  for testset in `ls systems/$sys`; do
    udapy read.Conllu zone=gold files=gold/$testset \
          read.Conllu zone=pred files=systems/$sys/$testset ignore_sent_id=1 \
          util.ResegmentGold \
          eval.Conll17 print_results=0 print_raw=1 \
          > results/$sys/${testset%.conllu}
  done
done
python3 `python3 -c 'import udapi.block.eval.conll17 as x; print(x.__file__)'` -r 100
The last line executes this module as a script and computes bootstrap resampling with 100 resamples (the default is 1000; keep the default or a higher value unless you are just testing the interface). It prints the ranking with confidence intervals (95% by default) and a p-value for each pair of systems with neighboring ranks. If the difference in LAS between two such systems is significant according to a paired bootstrap test (by default, if p < 0.05), a separator line is printed between them.
The output looks like:
1. Stanford 76.17 ± 0.12 (76.06 .. 76.30) p=0.001
------------------------------------------------------------
2. C2L2 74.88 ± 0.12 (74.77 .. 75.01) p=0.001
------------------------------------------------------------
3. IMS 74.29 ± 0.13 (74.16 .. 74.43) p=0.001
------------------------------------------------------------
4. HIT-SCIR 71.99 ± 0.14 (71.84 .. 72.12) p=0.001
------------------------------------------------------------
5. LATTICE 70.81 ± 0.13 (70.67 .. 70.94) p=0.001
------------------------------------------------------------
6. NAIST-SATO 70.02 ± 0.13 (69.89 .. 70.16) p=0.001
------------------------------------------------------------
7. Koc-University 69.66 ± 0.13 (69.52 .. 69.79) p=0.002
------------------------------------------------------------
8. UFAL-UDPipe-1-2 69.36 ± 0.13 (69.22 .. 69.49) p=0.001
------------------------------------------------------------
9. UParse 68.75 ± 0.14 (68.62 .. 68.89) p=0.003
------------------------------------------------------------
10. Orange-Deskin 68.50 ± 0.13 (68.37 .. 68.62) p=0.448
11. TurkuNLP 68.48 ± 0.14 (68.34 .. 68.62) p=0.029
------------------------------------------------------------
12. darc 68.29 ± 0.13 (68.16 .. 68.42) p=0.334
13. conll17-baseline 68.25 ± 0.14 (68.11 .. 68.38) p=0.003
------------------------------------------------------------
14. MQuni 67.93 ± 0.13 (67.80 .. 68.06) p=0.062
15. fbaml 67.78 ± 0.13 (67.65 .. 67.91) p=0.283
16. LyS-FASTPARSE 67.73 ± 0.13 (67.59 .. 67.85) p=0.121
17. LIMSI-LIPN 67.61 ± 0.14 (67.47 .. 67.75) p=0.445
18. RACAI 67.60 ± 0.13 (67.46 .. 67.72) p=0.166
19. IIT-Kharagpur 67.50 ± 0.14 (67.36 .. 67.64) p=0.447
20. naistCL 67.49 ± 0.15 (67.34 .. 67.63)
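The p-values come from a paired bootstrap: test sentences are resampled with replacement, LAS is recomputed for both systems on each resample, and the p-value is estimated as the fraction of resamples in which the lower-ranked system does not lose. A minimal sketch of one common variant of this test (not the module's actual code; per-sentence counts of correctly attached words and of gold words are assumed to be available):

import random

def paired_bootstrap_p(correct_a, correct_b, gold, n_resamples=1000):
    """Estimate p = P(LAS_B >= LAS_A) by resampling sentences.

    correct_a, correct_b: per-sentence counts of correctly
    attached+labeled words for systems A (higher-ranked) and B;
    gold: per-sentence gold word counts.
    """
    n = len(gold)
    wins_b = 0
    for _ in range(n_resamples):
        sample = random.choices(range(n), k=n)  # resample sentence indices
        gold_total = sum(gold[i] for i in sample)
        las_a = sum(correct_a[i] for i in sample) / gold_total
        las_b = sum(correct_b[i] for i in sample) / gold_total
        if las_b >= las_a:
            wins_b += 1
    return wins_b / n_resamples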
TODO: Bootstrap currently reports only LAS, but all the other measures could be added as well.
class udapi.block.eval.conll17.Conll17(gold_zone='gold', print_raw=False, print_results=True, **kwargs)¶
Bases: udapi.core.basewriter.BaseWriter
Evaluate labeled and unlabeled attachment score (LAS and UAS).
udapi.block.eval.f1 module¶
Block eval.F1 for evaluating differences between sentences with P/R/F1.
eval.F1 zones=en_pred gold_zone=en_gold details=0
prints something like:
predicted = 210
gold = 213
correct = 210
precision = 100.00%
recall = 98.59%
F1 = 99.29%
eval.F1 gold_zone=y attributes=form,upos focus='(?i:an?|the)_DET' details=4
prints something like:
=== Details ===
token     pred  gold  corr    prec     rec      F1
the_DET    711   213   188  26.44%  88.26%  40.69%
The_DET     82    25    19  23.17%  76.00%  35.51%
a_DET        0    62     0   0.00%   0.00%   0.00%
an_DET       0    16     0   0.00%   0.00%   0.00%
=== Totals ===
predicted = 793
gold = 319
correct = 207
precision = 26.10%
recall = 64.89%
F1 = 37.23%
This block finds differences between nodes of trees in two zones
and reports the overall precision, recall and F1.
The two zones are “predicted” (on which this block is applied)
and “gold” (which needs to be specified with the parameter gold_zone).
This block also reports the total number of nodes in the predicted zone
and in the gold zone, and the number of “correct” nodes,
that is, predicted nodes which are also in the gold zone.
By default two nodes are considered “the same” if they have the same form,
but it is possible to check other node attributes as well
(with the parameter attributes).
As usual:
precision = correct / predicted
recall = correct / gold
F1 = 2 * precision * recall / (precision + recall)
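Plugging the totals from the attributes=form,upos example above into these formulas reproduces the reported numbers:

predicted, gold, correct = 793, 319, 207
precision = correct / predicted                      # 0.2610 -> 26.10 %
recall = correct / gold                              # 0.6489 -> 64.89 %
f1 = 2 * precision * recall / (precision + recall)   # 0.3723 -> 37.23 %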
The implementation is based on finding the longest common subsequence (LCS) between the nodes in the two trees. This means that the two zones do not need to be explicitly word-aligned.
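The LCS idea can be illustrated with a minimal dynamic-programming sketch (the block's actual implementation may differ): the number of “correct” nodes is the length of the LCS of the two sequences of attribute values.

def lcs_length(pred, gold):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(gold) + 1)   # LCS lengths for the previous pred prefix
    for p in pred:
        curr = [0]
        for j, g in enumerate(gold, 1):
            curr.append(prev[j - 1] + 1 if p == g else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

pred = ['the_DET', 'cat_NOUN', 'sat_VERB']
gold = ['a_DET', 'cat_NOUN', 'sat_VERB']
correct = lcs_length(pred, gold)  # 2 -- only the article differs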
class udapi.block.eval.f1.F1(gold_zone, attributes='form', focus=None, details=4, **kwargs)¶
Bases: udapi.core.basewriter.BaseWriter
Evaluate differences between sentences (in different zones) with P/R/F1.
Args:
- zones: Which zone contains the “predicted” trees? Make sure that you specify just one zone. If you leave the default value “all” and the document contains more zones, the results will be mixed, which is most likely not what you wanted. Exception: if the document contains just two zones (predicted and gold trees), you can keep the default value “all”, because this block will skip comparison of the gold zone with itself.
- gold_zone: Which zone contains the gold-standard trees?
- attributes: comma-separated list of attributes which should be checked when deciding whether two nodes are equivalent in LCS.
- focus: regular expression constraining the tokens we are interested in. If more attributes were specified in the attributes parameter, their values are concatenated with underscores, so focus should reflect that, e.g. attributes=form,upos focus='(a|the)_DET'. For a case-insensitive focus use e.g. focus='(?i)the' (which is equivalent to focus='[Tt][Hh][Ee]'); see the sketch below.
- details: print also detailed statistics for each token (matching the focus). The value of this parameter specifies the number of tokens to include. The tokens are sorted according to the sum of their predicted and gold counts.
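For illustration, the focus pattern is matched against the (possibly underscore-concatenated) attribute values; fullmatch is used here purely for the demonstration, since anchoring is an implementation detail of the block:

import re

focus = re.compile(r'(?i:an?|the)_DET')   # pattern from the example above
for token in ('The_DET', 'a_DET', 'cat_NOUN'):
    print(token, bool(focus.fullmatch(token)))
# The_DET True
# a_DET True
# cat_NOUN False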
udapi.block.eval.parsing module¶
Block eval.Parsing for evaluating UAS and LAS; the gold and predicted trees must have the same tokenization.
class udapi.block.eval.parsing.Parsing(gold_zone, **kwargs)¶
Bases: udapi.core.basewriter.BaseWriter
Evaluate labeled and unlabeled attachment score (LAS and UAS).
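Since the gold and predicted trees share tokenization, the scores reduce to a parallel walk over the two node sequences. A minimal sketch assuming udapi's node API (not the block's actual code):

def uas_las(gold_tree, pred_tree):
    """Compute (UAS, LAS) for two trees over the same tokens."""
    gold_nodes = gold_tree.descendants   # nodes in word order
    pred_nodes = pred_tree.descendants
    assert len(gold_nodes) == len(pred_nodes), "same tokenization required"
    uas = las = 0
    for g, p in zip(gold_nodes, pred_nodes):
        if g.parent.ord == p.parent.ord:  # correct head -> UAS
            uas += 1
            if g.deprel == p.deprel:      # correct head and label -> LAS
                las += 1
    n = len(gold_nodes)
    return uas / n, las / n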