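I need the ACE 2005 corpus for relation extraction, so I am recording the relevant information here, including the content and format of each file, so that newcomers can get to know the corpus more quickly. The material below is mainly translated from William Cohen's group at CMU (a group I am not familiar with) and from the documentation of ACE 2007 Spanish DevTest - Pilot Evaluation. I did not read the LDC website carefully at first and found that document by searching Google for keywords, but it turned out to be quite useful. Here is also the page for the ACE 2005 Multilingual Training Corpus, which links to its Online Documentation. At the end I attach a GitHub project for preprocessing the corpus: ACE 2005 Corpus Preprocessing.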
The ACE 2005 dataset is annotated for the following basic tasks: the recognition of entities, values, temporal expressions, relations and events. For more detail on the ACE05 evaluation, see The ACE 2005 (ACE05) Evaluation Plan.
Detailed statistics for the training portion of the ACE 2005 corpus are shown in the figure below:
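The adj, fp1, fp2 and timex2norm folders above correspond to different stages of the annotation process. Every task in the ACE corpus is annotated by two annotators working independently. The first-pass annotation is called 1P, and the independent second first-pass annotation is called DUAL. For both 1P and DUAL, a single annotator completes all tasks for a given file. Files are assigned through the Annotation Work-flow System (AWS), and the assignment is double-blind. (My translation of this paragraph is rough, so treat it with some caution.)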
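Discrepancies between the 1P and DUAL versions of each file are adjudicated by a senior annotator or team leader, which yields a high-quality gold standard file. The adjudicated gold standard files are called ADJ (the adj folder mentioned above). After adjudication, the TIMEX2 values are normalized to produce NORM. All data in this corpus has gone through NORM annotation.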
1P:   entities              DUAL: entities
      TIMEX2 extents              TIMEX2 extents
          |                           |
          |                           |
          +-------------+-------------+
                        |
                ADJ:  entities
                      TIMEX2 extents
                        |
                NORM: TIMEX2 normalization
Source Text (.sgm) Files
- These files contain the source text data in an SGML format; they
use UTF-8 encoding and UNIX-style line termination.
AG (.ag.xml) Files
- These are annotation files created with the LDC's annotation
toolkit. These files have been converted to the corresponding
.apf.xml files.
ACE Program Format (APF) (.apf.xml) Files
- These files are in the official ACE annotation file format. ACE
format is derived from the AG format by means of a routine format
conversion process, so that the underlying annotation content of the
two files (.ag.xml and .apf.xml) is equivalent. See section 8 for
more details.
ID table (.tab) Files
- These files store mapping tables between the IDs used in the
ag.xml files and their corresponding apf.xml files.
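
To make the file types above concrete, here is a minimal Python sketch that groups the files belonging to one document. It assumes (this is not stated in the README) that all files for a document share the same base name and differ only in suffix; the file names in the example are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Suffixes named in the file-type list above. ".apf.xml" and ".ag.xml"
# are checked as whole suffixes so they are not confused with each other.
SUFFIXES = (".apf.xml", ".ag.xml", ".sgm", ".tab")


def group_by_document(filenames):
    """Map each document ID to {suffix: filename} for the known suffixes."""
    groups = defaultdict(dict)
    for name in filenames:
        base = Path(name).name
        for suffix in SUFFIXES:
            if base.endswith(suffix):
                doc_id = base[: -len(suffix)]
                groups[doc_id][suffix] = name
                break  # files with other suffixes are silently ignored
    return dict(groups)


# Hypothetical file names, for illustration only.
example = group_by_document([
    "adj/DOC_0001.sgm",
    "adj/DOC_0001.apf.xml",
    "adj/DOC_0001.ag.xml",
    "adj/DOC_0001.tab",
    "adj/README.txt",
])
```

For relation extraction, the .sgm and .apf.xml pair is usually what one needs; the .ag.xml and .tab files matter mainly when tracing an annotation back to the LDC toolkit IDs.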
- Offsets
APF uses the offset counting method traditionally used in previous
ACE evaluation programs:
1) Each (UTF-8) character, not byte, is counted as one.
2) Each newline character is counted as one. (The .sgm files
use the UNIX-style end of line characters.)
3) SGML tags are *not* counted towards offsets. (Please note
that the AG files included in this release do count SGML tags in
their offsets.)
4) SGML entities are counted in terms of each character in the
entities. For example, "&amp;" is counted as five
characters, not as one character.
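
The four offset rules can be sketched in Python as follows. This is an illustration under assumptions, not the official conversion tool: the tag-stripping regular expression is a simplification, the function names are my own, and END is assumed to be inclusive. The .sgm file should be read with `encoding="utf-8"` so that Python indexes characters rather than bytes.

```python
import re

# Simplified SGML tag pattern (assumption: tags never contain ">").
TAG_RE = re.compile(r"<[^>]*>")


def apf_text(sgm_source: str) -> str:
    """Return the character stream that APF offsets index into.

    Tags are removed (rule 3). Newlines are kept (rule 2) and SGML
    entities such as "&amp;" are left unexpanded (rule 4).
    """
    return TAG_RE.sub("", sgm_source)


def span(sgm_source: str, start: int, end: int) -> str:
    """Slice by character (rule 1); END is assumed to be inclusive."""
    return apf_text(sgm_source)[start : end + 1]


# Hypothetical document used only to exercise the functions.
sample = "<DOC>\n<TEXT>\nAT&amp;T said\n</TEXT>\n</DOC>\n"
company = span(sample, 2, 9)
```

In this sample the two newlines after the stripped tags occupy offsets 0 and 1, so the company name starts at offset 2, and the unexpanded entity makes it eight characters long.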
- timex2
The timex2 element represents TIMEX2 time expression annotations.
Its optional attributes, such as "VAL" and "MOD", represent the
TIMEX2 normalization values.
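
As a rough sketch, the optional normalization attributes can be read with Python's standard `xml.etree.ElementTree`. Only the element name `timex2` and the attribute names `VAL` and `MOD` come from the text above; the XML fragment, the `ID` attribute, the wrapper element and the attribute values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: real APF files contain much more structure.
SAMPLE = (
    '<document>'
    '<timex2 ID="T1" VAL="2005-03-01"/>'
    '<timex2 ID="T2" VAL="2005" MOD="START"/>'
    '<timex2 ID="T3"/>'
    '</document>'
)


def timex2_values(apf_xml: str):
    """Collect VAL and MOD for every timex2 element; absent ones are None."""
    root = ET.fromstring(apf_xml)
    return [
        {"VAL": t.get("VAL"), "MOD": t.get("MOD")}
        for t in root.iter("timex2")
    ]


values = timex2_values(SAMPLE)
```

Because the attributes are optional, `Element.get` is used so that a missing attribute yields `None` instead of raising an error.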
- TYPE, LDCTYPE and LDCATR in entity_mention
The TYPE attribute in entity_mention stores the official ACE entity
mention type, and the LDCTYPE and LDCATR attributes store the
attributes used in the LDC's annotation process.
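
A minimal sketch of reading these three attributes with `xml.etree.ElementTree` follows. The fragment is hypothetical: the nesting, the `ID` attributes and all attribute values are assumptions for illustration, and only the names `entity_mention`, `TYPE`, `LDCTYPE` and `LDCATR` come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical APF-like fragment; values are invented for illustration.
SAMPLE = """
<document>
  <entity ID="E1">
    <entity_mention ID="E1-1" TYPE="NAM" LDCTYPE="NAM" LDCATR="FALSE"/>
    <entity_mention ID="E1-2" TYPE="PRO" LDCTYPE="PRO" LDCATR="FALSE"/>
  </entity>
</document>
"""


def mention_types(apf_xml: str):
    """Return (ID, TYPE, LDCTYPE, LDCATR) for each entity_mention."""
    root = ET.fromstring(apf_xml)
    return [
        (m.get("ID"), m.get("TYPE"), m.get("LDCTYPE"), m.get("LDCATR"))
        for m in root.iter("entity_mention")  # iter() ignores nesting depth
    ]


rows = mention_types(SAMPLE)
```

For most downstream work the official `TYPE` value is the one to use; `LDCTYPE` and `LDCATR` record how the LDC annotators labelled the mention internally.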
- Name in entity_attributes
The "name" element in entity_attributes stores the heads of
"NAM"-type mentions as in the previous years. In response to
George Doddington's request, we have added the NAME attribute to
the "name" element. The NAME attribute stores slightly normalized
versions of the names where:
- \n is replaced with a space
- multiple spaces are reduced to one space
- " (double quote) is removed
- Example:
<name NAME="United States">
<charseq START="4242" END="4254">United
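
The three normalization rules for the NAME attribute can be sketched in Python as below. The function name is my own, and applying the rules in the order they are listed is an assumption; the README does not say which order the LDC used.

```python
import re


def normalize_name(raw: str) -> str:
    """Apply the three NAME normalization rules in the listed order."""
    text = raw.replace("\n", " ")        # \n is replaced with a space
    text = re.sub(r" {2,}", " ", text)   # multiple spaces become one
    return text.replace('"', "")         # double quotes are removed


first = normalize_name("United\nStates")
second = normalize_name('The  "Big"\nApple')
```

A head that is split across two lines in the source text, such as "United" and "States", therefore ends up as a single-line name.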
- Nickname metonymy
Nickname metonyms are indicated with METONYMY_MENTION="TRUE" in
entity_mentions. "NAM"-type entity mentions marked as nickname
metonymy do not give rise to name elements.
- Cross-type metonymy
"Cross-type" metonyms are represented with relations of the type
METONYMY. The METONYMY type relations do not have
relation_mentions. The METONYMY type relations are automatically
generated after the annotation process, and are the only kind of
relation annotations that appear in this corpus.
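
Both kinds of metonymy can be picked out of an APF file with a few lines of Python. This is a sketch under assumptions: the fragment is made up, the `ID` attributes are hypothetical, and I assume the relation type is stored in a `TYPE` attribute on the `relation` element. Only `METONYMY_MENTION="TRUE"` and the relation type name METONYMY come from the text above.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration only.
SAMPLE = """
<document>
  <entity ID="E1">
    <entity_mention ID="E1-1" TYPE="NAM" METONYMY_MENTION="TRUE"/>
    <entity_mention ID="E1-2" TYPE="NAM"/>
  </entity>
  <relation ID="R1" TYPE="METONYMY"/>
</document>
"""


def metonymy_ids(apf_xml: str):
    """Return (nickname-metonymy mention IDs, cross-type relation IDs)."""
    root = ET.fromstring(apf_xml)
    nickname = [
        m.get("ID")
        for m in root.iter("entity_mention")
        if m.get("METONYMY_MENTION") == "TRUE"
    ]
    cross_type = [
        r.get("ID")
        for r in root.iter("relation")
        if r.get("TYPE") == "METONYMY"  # assumed attribute name
    ]
    return nickname, cross_type


nickname_ids, cross_type_ids = metonymy_ids(SAMPLE)
```

Since METONYMY relations have no relation_mentions, code that expects every relation to carry mentions should skip or special-case them.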
- For more details, please refer to the APF V5.1.2 DTD.
