udapi.core.coref module

Classes for handling coreference.

# CorefUD 1.0 format implementation details

## Rules for ordering “chunks” within node.misc[‘Entity’]

Entity mentions are annotated using “chunks” stored in misc[‘Entity’]. Chunks are of three types:

1. opening bracket, e.g. (e1-person
2. closing bracket, e.g. e1-person)
3. single-word span (both opening and closing), e.g. (e1-person)

The Entity MISC attribute contains a sequence of chunks without any separators, e.g. Entity=(e1-person(e2-place) means opening e1 mention and single-word e2 mention starting on a given node.
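The chunk sequence described above can be split with a small regular expression. The following is only an illustrative sketch, not udapi's own parser; the names `CHUNK_RE` and `split_chunks` are hypothetical, and the sketch assumes that chunk bodies never contain parentheses.

```python
import re

# Hypothetical helper, not part of udapi: one chunk is "(body)", "(body" or "body)".
# Assumption: the body itself never contains "(" or ")".
CHUNK_RE = re.compile(r"\([^()]+\)|\([^()]+|[^()]+\)")


def split_chunks(entity_value):
    """Split the value of the Entity MISC attribute into (kind, body) pairs."""
    chunks = []
    for match in CHUNK_RE.finditer(entity_value):
        text = match.group(0)
        if text.startswith("(") and text.endswith(")"):
            kind = "single"   # single-word span
        elif text.startswith("("):
            kind = "open"     # opening bracket
        else:
            kind = "close"    # closing bracket
        chunks.append((kind, text.strip("()")))
    return chunks


print(split_chunks("(e1-person(e2-place)"))
print(split_chunks("e1)(e2"))
```

The first call yields an opening chunk for e1-person followed by a single-word chunk for e2-place; the second yields a closing chunk for e1 followed by an opening chunk for e2.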

### Crossing mentions

Two mentions are crossing iff their spans have a non-empty intersection, but neither is a subset of the other, e.g. e1 spanning nodes 1-3 and e2 spanning 2-4 would be represented as:

    1 ... Entity=(e1
    2 ... Entity=(e2
    3 ... Entity=e1)
    4 ... Entity=e2)

This may be an annotation error and we may forbid such cases in future annotation guidelines, but in CorefUD 0.2, there are thousands of such cases (see https://github.com/ufal/corefUD/issues/23).
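The definition of crossing mentions translates directly into a set comparison. This is a minimal sketch under the assumption that each span is given as a collection of node ords; `are_crossing` is a hypothetical name, not a udapi function.

```python
def are_crossing(span_a, span_b):
    """Return True iff the spans overlap but neither contains the other."""
    a, b = set(span_a), set(span_b)
    # Non-empty intersection, and neither set is a subset of the other.
    return bool(a & b) and not a <= b and not b <= a


print(are_crossing(range(1, 4), range(2, 5)))  # e1 = nodes 1-3, e2 = nodes 2-4
print(are_crossing([1, 2, 3], [2]))            # nested, so not crossing
```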

It can even happen that one entity ends and another starts at the same node: Entity=e1)(e2. For this reason, we need the following rule:

Rule1: closing brackets MUST always precede opening brackets. Otherwise, we would get Entity=(e2e1), which could not be parsed.

Note that we cannot have same-entity crossing mentions in the CorefUD 1.0 format, so e.g. if we substitute e2 with e1 in the example above, we’ll get the chunk sequence “(e1”, “e1)”, “(e1”, “e1)”, which will be interpreted as two non-overlapping mentions of the same entity.

### Nested mentions

One mention (span) can often be embedded within another mention (span). It can happen that both these mentions correspond to the same entity (i.e. are in the same cluster), for example, “<the man <who> sold the world>”. It can even happen that both mentions start at the same node, e.g. “<<w1 w2> w3>” (TODO: find nice real-world examples). In such cases, we need to make sure the brackets are well-nested:

Rule2: when opening multiple brackets at the same node, longer mentions MUST be opened first.

This is important because

- The closing bracket has the same form for both mentions of the same entity - it includes just the entity ID (eid).
- The opening-bracket annotation contains other mention attributes, e.g. head index.
- The two mentions may differ in these attributes, e.g. the “<w1 w2 w3>” mention’s head may be w3.
- When breaking Rule2, we would get

      1 w1 ... Entity=(e1-person-1(e1-person-3
      2 w2 ... Entity=e1)
      3 w3 ... Entity=e1)

  which would be interpreted as if the head of the “<w1 w2>” mention is its third word, which is invalid.
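The misinterpretation can be reproduced with a simple stack per entity, where each closing bracket is paired with the most recently opened mention of the same eid. This is a sketch of one plausible pairing strategy, not necessarily how udapi pairs brackets; `pair_brackets` and the tuple layout are hypothetical.

```python
def pair_brackets(rows):
    """Pair closing brackets with the most recently opened mention of the same eid.

    rows: list of (ord, chunks), where each chunk is (kind, eid, head_index)
    and head_index is None for closing brackets.
    Returns a list of (eid, start_ord, end_ord, head_index).
    """
    open_stacks = {}
    mentions = []
    for ord_, chunks in rows:
        for kind, eid, head_index in chunks:
            if kind == "open":
                open_stacks.setdefault(eid, []).append((ord_, head_index))
            else:  # kind == "close"
                start, opened_head = open_stacks[eid].pop()
                mentions.append((eid, start, ord_, opened_head))
    return mentions


closes = [(2, [("close", "e1", None)]), (3, [("close", "e1", None)])]
# Breaking Rule2: the shorter mention (head index 1) is opened first.
broken = [(1, [("open", "e1", 1), ("open", "e1", 3)])] + closes
# Following Rule2: the longer mention (head index 3) is opened first.
valid = [(1, [("open", "e1", 3), ("open", "e1", 1)])] + closes

print(pair_brackets(broken))
print(pair_brackets(valid))
```

In the Rule2-breaking input, the mention spanning nodes 1-2 ends up with head index 3, which exceeds its length of two words.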

### Other rules

Rule3: when closing multiple brackets at the same node, shorter mentions SHOULD be closed first. See Rule4 for a single exception to this rule regarding crossing mentions. I’m not aware of any problems when breaking this rule, but it seems intuitive (to make the annotation well-nested if possible) and we want to define some canonical ordering anyway. The API should be able to load even files breaking Rule3.

Rule4: single-word chunks SHOULD follow all opening brackets and precede all closing brackets if possible. When considering single-word chunks as a subtype of both opening and closing brackets, this rule follows from the well-nestedness (and Rule2). So we should have Entity=(e1(e2) and Entity=(e3)e1), but the API should be able to load even Entity=(e2)(e1 and Entity=e1)(e3).

In case of crossing mentions (annotated following Rule1), we cannot follow Rule4. If we want to add a single-word mention e2 to a node with Entity=e1)(e3, it seems intuitive to prefer Rule2 over Rule3, which results in Entity=e1)(e3(e2). So the canonical ordering will be achieved by placing single-word chunks after all opening brackets. The API should be able to load even Entity=(e2)e1)(e3 and Entity=e1)(e2)(e3.
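One way to read Rules 1-4 together is as an ordering of the chunks that belong to a single node. The sketch below is an interpretation of the rules as stated here, not udapi's serialization code; `canonical_order`, `serialize` and the `(kind, eid, length)` tuples are hypothetical, and ties (Rule5 and Rule6) simply keep their input order.

```python
def canonical_order(chunks):
    """Order the chunks of one node.

    chunks: list of (kind, eid, length), with kind in {"open", "close", "single"}
    and length being the number of words in the mention.
    """
    # Rule3: shorter mentions are closed first.
    closing = sorted((c for c in chunks if c[0] == "close"), key=lambda c: c[2])
    # Rule2: longer mentions are opened first.
    opening = sorted((c for c in chunks if c[0] == "open"), key=lambda c: -c[2])
    single = [c for c in chunks if c[0] == "single"]
    if opening:
        # Rule1 (closing before opening); single-word chunks go after all opening brackets.
        return closing + opening + single
    # Rule4 without opening brackets: single-word chunks precede closing brackets.
    return single + closing


TEMPLATES = {"open": "({}", "close": "{})", "single": "({})"}


def serialize(chunks):
    return "Entity=" + "".join(TEMPLATES[kind].format(eid) for kind, eid, _ in chunks)


node = [("single", "e2", 1), ("open", "e3", 4), ("close", "e1", 3)]
print(serialize(canonical_order(node)))
```

For the crossing example discussed above, this produces Entity=e1)(e3(e2).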

Rule5: ordering of same-span single-word mentions

TODO: I am not sure here. We may want to forbid such cases or define canonical ordering even for them. E.g. Entity=(e1)(e2) vs. Entity=(e2)(e1).

Rule6: ordering of same-start same-end multiword mentions

TODO: I am not sure here. These can be either same-span multiword mentions (which may be forbidden) or something like

    1 w1 ... Entity=(e1(e2[1/2])
    2 w2 ...
    3 w3 ... Entity=(e2[2/2])e1)

where both e1 and e2 start at w1 and end at w3, but e2 is discontinuous and does not contain w2. If we interpret “shorter” and “longer” in Rule2 and Rule3 as len(mention.words) (and not as mention.words[-1].ord - mention.words[0].ord), we get the canonical ordering as in the example above.
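The difference between the two interpretations of “length” can be checked on the example above. In this sketch, mentions are represented only by the ords of their words; the helper names are hypothetical.

```python
e1_ords = [1, 2, 3]  # continuous mention w1 w2 w3
e2_ords = [1, 3]     # discontinuous mention w1 ... w3 (w2 is not included)


def length_by_words(ords):
    # Corresponds to len(mention.words).
    return len(ords)


def length_by_extent(ords):
    # Corresponds to mention.words[-1].ord - mention.words[0].ord.
    return ords[-1] - ords[0]


print(length_by_words(e1_ords), length_by_words(e2_ords))
print(length_by_extent(e1_ords), length_by_extent(e2_ords))
```

Only the word-count interpretation distinguishes the two mentions, which makes e1 the longer mention to be opened first.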

Bases: object

relation
target

Bases: MutableSequence

BridgingLinks class serves as a list of BridgingLink tuples with additional methods.

Example usage:

>>> bl = BridgingLinks(src_mention) # empty links
>>> bl = BridgingLinks(src_mention, [(c12, 'part'), (c56, 'subset')]) # from a list of tuples
>>> (bl8, bl9) = BridgingLinks.from_string('c12<c8:part,c56<c8:subset,c5<c9', entities)
>>> for entity, relation in bl:
...     print(f"{bl.src_mention} ->{relation}-> {entity.eid}")
>>> print(str(bl)) # c12<c8:part,c56<c8:subset
>>> bl('part').targets == [c12]
>>> bl('part|subset').targets == [c12, c56]
>>> bl.append((c57, 'funct'))

classmethod from_string(string, entities, node, strict=True, tree2docid=None)[source]

Return a sequence of BridgingLink objects representing a given string serialization. The bridging links are also added to the mentions (mention.bridging) in the supplied entities, so the returned sequence can be usually ignored. If tree2docid parameter is provided (mapping trees to document IDs used as prefixes in eid), the entity IDs in the provided string are interpreted as “GRP”, i.e. as document-wide IDs, which need to be prefixed by the document IDs, to get corpus-wide unique “eid”.
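The serialization used in the example (target<source:relation, separated by commas) can be taken apart with plain string operations. The following sketch only illustrates the string layout; it does not resolve entity IDs to objects or attach links to mentions as from_string is described to do. The function name is hypothetical, and representing a missing relation as None is an assumption.

```python
def parse_bridging_string(string):
    """Group 'target<source:relation' items by their source ID.

    Returns a dict mapping source ID -> list of (target ID, relation).
    """
    links_by_source = {}
    for item in string.split(","):
        pair, _, relation = item.partition(":")
        target, _, source = pair.partition("<")
        # Assumption: an item without ":relation" gets relation None.
        links_by_source.setdefault(source, []).append((target, relation or None))
    return links_by_source


print(parse_bridging_string("c12<c8:part,c56<c8:subset,c5<c9"))
```

The result has one list of links for c8 and one for c9, which mirrors the two objects (bl8, bl9) in the usage example.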

insert(key, new_value)[source]

S.insert(index, value) – insert value before index

property targets

Return a list of the target entities (without relations).

class udapi.core.coref.CorefEntity(eid, etype=None)[source]

Bases: object

Class for representing all mentions of a given entity.

all_bridging()[source]
create_mention(head=None, words=None, span=None)[source]

Create a new CorefMention object within this CorefEntity.

Args:

head: a node where the annotation about this CorefMention will be stored in MISC. The head is supposed to be the linguistic head of the mention, i.e. the highest node in the dependency tree, but if such information is not available (yet), it can be any node within the words. If no head is specified, the first word from words will be used instead.

words: a list of nodes of the mention. This argument is optional, but if provided, it must contain the head. The nodes can be either normal nodes or empty nodes.

span: an alternative way to specify words, using a string such as “3-5,6,7.1-7.2” (which means there is an empty node 5.1 and a normal node 7, which are not part of the mention). At most one of the args words and span can be specified.
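To make the span notation concrete, the sketch below expands a span string against the ordered list of all ords in a tree. It is an illustration only, not the implementation of span_to_nodes; `expand_span` is a hypothetical name and ords are represented as strings.

```python
def expand_span(all_ords, span):
    """Expand a span string into the list of ords it covers.

    all_ords: every ord in the tree, in order, as strings (empty nodes such as "5.1" included).
    """
    selected = []
    for part in span.split(","):
        first, _, last = part.partition("-")
        last = last or first  # a single ord such as "6" is a one-node range
        start, end = all_ords.index(first), all_ords.index(last)
        selected.extend(all_ords[start:end + 1])
    return selected


tree_ords = ["1", "2", "3", "4", "5", "5.1", "6", "7", "7.1", "7.2"]
print(expand_span(tree_ords, "3-5,6,7.1-7.2"))
```

The empty node 5.1 and the normal node 7 are not in the result, as described above.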

property eid
property eid_or_grp
etype
property mentions
split_ante
class udapi.core.coref.CorefMention(words, head=None, entity=None, add_word_backlinks=True)[source]

Bases: object

Class for representing a mention (instance of an entity).

property bridging
property entity
property head
property other
property span
property words
class udapi.core.coref.CorefMentionSubspan(words, mention, subspan_id)[source]

Bases: object

Helper class for representing a continuous subspan of a mention.

mention
property subspan_eid
subspan_id
words
class udapi.core.coref.OtherDualDict(value=None, **kwargs)[source]

Bases: MutableMapping

OtherDualDict class serves as dict with lazily synchronized string representation.

>>> ddict = OtherDualDict('anacata:anaphoric,antetype:entity,nptype:np')
>>> ddict['mention'] = 'np'
>>> str(ddict)
'anacata:anaphoric,antetype:entity,mention:np,nptype:np'
>>> ddict['NonExistent']
''

This class provides access to both

  • a structured (dict-based, deserialized) representation, e.g. {‘anacata’: ‘anaphoric’, ‘antetype’: ‘entity’}, and
  • a string (serialized) representation of the mapping, e.g. anacata:anaphoric,antetype:entity.

There is a clever mechanism that makes sure that users can read and write both of the representations which are always kept synchronized. Moreover, the synchronization is lazy, so the serialization and deserialization is done only when needed. This speeds up scenarios where access to dict is not needed.

A value can be deleted in any of the following three ways:

>>> del ddict['nptype']
>>> ddict['nptype'] = None
>>> ddict['nptype'] = ''

and it works even if the value was already missing.
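The lazy synchronization idea can be sketched with two private fields, each of which may be temporarily stale. This is a simplified illustration under stated assumptions, not the OtherDualDict source: the class name `LazyDualDict` is hypothetical, serialization sorts keys (which matches the example output above), and values are assumed not to contain “,” or “:”.

```python
from collections.abc import MutableMapping


class LazyDualDict(MutableMapping):
    """Sketch of a mapping with a lazily synchronized string form."""

    def __init__(self, string=""):
        self._string = string  # serialized form, or None when it is stale
        self._dict = None      # deserialized form, or None when not built yet

    def _ensure_dict(self):
        # Deserialize only on first dict-style access.
        if self._dict is None:
            self._dict = {}
            if self._string:
                for item in self._string.split(","):
                    key, _, value = item.partition(":")
                    self._dict[key] = value
        return self._dict

    def __str__(self):
        # Serialize only if the dict was modified since the last serialization.
        if self._string is None:
            self._string = ",".join(f"{k}:{v}" for k, v in sorted(self._dict.items()))
        return self._string

    def __getitem__(self, key):
        return self._ensure_dict().get(key, "")  # missing keys read as ''

    def __setitem__(self, key, value):
        mapping = self._ensure_dict()
        if value is None or value == "":
            mapping.pop(key, None)  # assigning None or '' deletes the key
        else:
            mapping[key] = value
        self._string = None  # mark the serialized form as stale

    def __delitem__(self, key):
        self._ensure_dict().pop(key, None)  # no error if already missing
        self._string = None

    def __iter__(self):
        return iter(self._ensure_dict())

    def __len__(self):
        return len(self._ensure_dict())


ddict = LazyDualDict("anacata:anaphoric,antetype:entity,nptype:np")
ddict["mention"] = "np"
print(str(ddict))
```

If only str() is ever called on such an object, the string is returned as stored and no parsing takes place, which is the scenario the lazy design is meant to speed up.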

clear() → None. Remove all items from D.[source]
copy()[source]

Return a deep copy of this instance.

set_mapping(value)[source]

Set the mapping from a dict or string.

If the value is None, it is converted to storing an empty string. If the value is a string, it is stored as is. If the value is a dict (or any instance of collections.abc.Mapping), its copy is stored. Other types of value raise a ValueError exception.
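The type dispatch described above can be written as a small standalone function. This is a sketch of the described behaviour, not the method's source; `normalize_mapping_value` is a hypothetical name and the error message is invented for illustration.

```python
from collections.abc import Mapping


def normalize_mapping_value(value):
    """Return what would be stored for a given input value."""
    if value is None:
        return ""           # None is stored as an empty string
    if isinstance(value, str):
        return value        # strings are stored as is
    if isinstance(value, Mapping):
        return dict(value)  # mappings are copied
    raise ValueError(f"Unsupported value type: {type(value).__name__}")


print(repr(normalize_mapping_value(None)))
print(normalize_mapping_value({"nptype": "np"}))
```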

udapi.core.coref.load_coref_from_misc(doc, strict=True)[source]
udapi.core.coref.nodes_to_span(nodes)[source]

Converts a list of nodes into a string specifying ranges of their ords.

For example, nodes with ords 3, 4, 5 and 7 will be converted to “3-5,7”. The function also handles empty nodes, so e.g. 3.1, 3.2 and 3.3 will be converted to “3.1-3.3”. Note that empty nodes may form gaps in the span, so if a given tree contains an empty node with ord 5.1, but only nodes with ords 3, 4, 5, 6, 7.1 and 7.2 are provided as nodes, the resulting string will be “3-5,6,7.1-7.2”. This means that the implementation needs to iterate over all nodes in a given tree (root.descendants_and_empty) to check for such gaps.
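The gap-checking logic can be sketched by walking over all ords of the tree and closing the current range whenever a node outside the selection is met. This is an illustration under simplifying assumptions (ords as strings, a plain list instead of root.descendants_and_empty), not the library implementation; `ords_to_span` is a hypothetical name.

```python
def ords_to_span(all_ords, selected_ords):
    """Convert selected ords into a span string such as "3-5,7".

    all_ords: every ord in the tree, in order (empty nodes such as "5.1" included).
    selected_ords: the ords of the nodes that belong to the mention.
    """
    selected = set(selected_ords)
    ranges = []
    start = prev = None
    for ord_ in all_ords:
        if ord_ in selected:
            if start is None:
                start = ord_  # a new range begins here
            prev = ord_
        elif start is not None:
            ranges.append((start, prev))  # an unselected node closes the range
            start = None
    if start is not None:
        ranges.append((start, prev))
    return ",".join(a if a == b else f"{a}-{b}" for a, b in ranges)


tree_ords = ["1", "2", "3", "4", "5", "5.1", "6", "7", "7.1", "7.2"]
print(ords_to_span(tree_ords, ["3", "4", "5", "6", "7.1", "7.2"]))
```

Because the empty node 5.1 sits between 5 and 6 in the tree but is not selected, the range is split into 3-5 and 6 rather than being written as 3-6.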

udapi.core.coref.span_to_nodes(root, span)[source]
udapi.core.coref.store_coref_to_misc(doc)[source]