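http://www.livac.org/index.php?lang=sc Since 1995, LIVAC (Linguistic Variation in Chinese Speech Communities) has processed an exceptionally large volume of Chinese text using a "synchronous" approach, accumulating a large body of precise statistics to build the LIVAC synchronous corpus. Its defining feature is a "synchronous" window model: fixed quantities of comparable texts are collected rigorously from multiple regions at regular intervals. This supports objective comparative studies and facilitates the development and application of related information technology. The corpus also has a "diachronic" dimension, so that users can objectively observe and study representative trends in language development within each window.
http://www.chineseldc.org/ Chinese Linguistic Data Consortium (ChineseLDC). ChineseLDC is an open language-resource consortium whose members include Chinese universities, research institutes, and companies. Its aim is to build a general-purpose knowledge base of Chinese language information that represents the current state of Chinese information processing. ChineseLDC builds and collects the language resources that Chinese information processing needs, including dictionaries, corpora, data, and tools. On that basis it distributes resources, promotes unified standards and specifications and recommends them to users, and sets up evaluation mechanisms for key technologies in the field, supporting both basic research and application development. (It ranks this low on the list because it is a state-funded project, yet it offers hardly any free resources.)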
-- The Babel English-Chinese Parallel Corpus
http://www.lancs.ac.uk/fass/projects...abel/babel.htm The Babel English-Chinese Parallel Corpus, which was created as part of the research project Contrasting English and Chinese (ESRC Award Reference RES-000-23-0553), consists of 327 English articles and their translations in Mandarin Chinese. Of these, 115 texts (121,493 English tokens plus 135,493 Chinese tokens) were collected from the World of English between October 2000 and February 2001, while the remaining 212 texts (132,140 English tokens plus 151,969 Chinese tokens) were collected from Time from September 2000 to January 2001. The corpus contains a total of 541,095 tokens (253,633 English tokens and 287,462 Chinese tokens).
The corpus is tagged for part of speech and aligned at the sentence level. The English texts were tagged using the CLAWS C7 tagset, while the Chinese texts were tagged using the Peking University tagset. Sentence alignment was done automatically and corrected by hand. The corpus is also marked up for paragraphs and sentences, but the two subcorpora use different markup systems: in the World of English component, sentences were marked consecutively throughout, whereas in the Time component, sentences were marked within each paragraph.
The Babel parallel corpus can be accessed via the ParaConc Web or MySQL interface (both hosted at The Institute of Education, Singapore). Users can search in either the English or the Chinese texts. The concordancer returns the matched whole sentences and their translations, as well as their locations. At the bottom of the resulting concordance page is a query report that indicates the query strings and the distribution of matches. Users can also specify the format of the output concordances, as POS-tagged or plain text.
-- English Chinese Parallel Concordancer (E-C Concord)
The Hong Kong Institute of Education.
Project leader: Dr. Wang Lixun. Program designers: Chris Greaves, Wang Lixun
Corpus tools developed by group members
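Chi-square and loglikelihood Calculator (calculates chi-square tests and log-likelihood ratios)
TreeTagger for Windows (a Windows interface for the TreeTagger part-of-speech tagger)
Colligator 1.0 & 2.0 (colligation analysis tool for corpora)
PatternBuilder 1.0 (search aid for tagged corpora)
The Edinburgh Associative Thesaurus (EAT) for Windows (Windows lookup tool for the Edinburgh Associative Thesaurus)
Wordlist Tools 1.0 Beta (word list analysis tool)
My Good Old Blackboard (my electronic blackboard)
BFSU Stanford Parser 1.0 (automatic syntactic parser for English)
BFSU Stanford POS Tagger 1.0 (automatic part-of-speech tagger for English)
BFSU Sentence Collector 1.0 (example sentence extraction tool)
BFSU NewWord Marker 1.0 (new-word marking tool)
BFSU Sentence Segmenter 1.0 (automatic sentence splitter for English)
Web Colligator
Collocator 1.0: A collocation extraction tool (collocation analysis tool)
Log-likelihood ratio calculator (calculates log-likelihood ratios)
Readability Analyzer 1.0 (readability analysis tool for English texts)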
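Several of the tools listed above compute chi-square and log-likelihood statistics for comparing a word's frequency in two corpora. As a rough illustration of what such a calculator does, here is a minimal Python sketch using the textbook formulas on a 2x2 contingency table. The function names and the table layout are assumptions made for this example; the listed tools may use different formulas or corrections.

```python
import math


def contingency(freq1, size1, freq2, size2):
    """Return (observed, expected) 2x2 tables for one word in two corpora.

    Rows: the word / all other tokens. Columns: corpus 1 / corpus 2.
    Assumes the word occurs at least once overall and is not the only
    token type, so that no expected count is zero.
    """
    observed = [[freq1, freq2], [size1 - freq1, size2 - freq2]]
    total = size1 + size2
    row_sums = [freq1 + freq2, total - freq1 - freq2]
    col_sums = [size1, size2]
    expected = [[row_sums[i] * col_sums[j] / total for j in range(2)]
                for i in range(2)]
    return observed, expected


def _cells(observed, expected):
    """Yield (observed, expected) pairs for the four cells of the table."""
    for o_row, e_row in zip(observed, expected):
        for o, e in zip(o_row, e_row):
            yield o, e


def chi_square(freq1, size1, freq2, size2):
    """Pearson chi-square statistic, without continuity correction."""
    observed, expected = contingency(freq1, size1, freq2, size2)
    return sum((o - e) ** 2 / e for o, e in _cells(observed, expected))


def log_likelihood(freq1, size1, freq2, size2):
    """Log-likelihood ratio (G2); cells with zero observations add 0."""
    observed, expected = contingency(freq1, size1, freq2, size2)
    return 2 * sum(o * math.log(o / e)
                   for o, e in _cells(observed, expected) if o > 0)


# Hypothetical counts: a word seen 10 times in a 100-token sample
# and never in a second 100-token sample.
print(chi_square(10, 100, 0, 100))
print(log_likelihood(10, 100, 0, 100))
```

With one degree of freedom, a value above 3.84 is conventionally read as significant at p < 0.05. When a word has the same relative frequency in both corpora, both statistics are zero.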
Other free corpus tools
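AntConc: A free concordancer (a corpus concordancing tool whose main functions are close to those of WordSmith)
Range: Vocabulary coverage tools (measures graded vocabulary levels against base word lists)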
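Corpus search software: ParaConc and MultiConcord:
MultiConcord is also a program that runs under Windows. Its search functions are similar to ParaConc's, but it presents the search results differently. In addition, ParaConc can search plain-text files, whereas MultiConcord needs a Minimark program to add minimal markup to the text, such as <p> (paragraph) and <s> (sentence).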
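To give a sense of what minimal paragraph and sentence markup involves, here is a small Python sketch that wraps paragraphs in <p> and sentences in <s>. It is not the Minimark program and does not claim to reproduce its output format. The function name, the closing tags, the blank-line paragraph rule, and the naive punctuation-based sentence splitter are all assumptions made for illustration.

```python
import re


def minimal_markup(text):
    """Wrap paragraphs in <p> and sentences in <s> (hypothetical format).

    Assumptions: paragraphs are separated by blank lines, and a sentence
    ends at '.', '!' or '?' followed by whitespace. Real texts with
    abbreviations or Chinese punctuation would need a better splitter.
    """
    lines = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = " ".join(para.split())  # collapse line breaks and extra spaces
        if not para:
            continue
        lines.append("<p>")
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            lines.append(f"<s>{sentence}</s>")
        lines.append("</p>")
    return "\n".join(lines)


print(minimal_markup("Hello world. How are you?\n\nFine."))
```

Running this on two short paragraphs prints one tag or one sentence per line, which is roughly the kind of input a concordancer can align sentence by sentence.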