(Aug. 20 2004)
Call For Participation
Cross-Lingual Information Retrieval Task in NTCIR Workshop 5
[1.Intro][2.Subtasks][MLIR][BLIR][PLIR][SLIR][3.Interests][4.Resources][5.Test Collection][Doc][Topic]
[6.Type of Runs][7.Evaluation][8.Schedule][9.Organizers][10.Contact][Appendix]
The cross-lingual information retrieval (CLIR) task of NTCIR Workshop 5 consists of four subtasks, which promote research on information retrieval in East Asian languages (Chinese, Japanese, and Korean).
Note:
a) See also the official web site (http://research.nii.ac.jp/ntcir/)
for further information.
b) Online registration is available at the web site (http://research.nii.ac.jp/ntcir-ws5/application-en.html).
The CLIR task provides four subtasks. Participants may take part in any combination of them, from a single subtask up to all four.
Multilingual CLIR (MLIR)
The topic set and document set of the MLIR subtask involve more than two languages. At NTCIR Workshop 5, participants may submit runs only against the combined CJKE multilingual document collection. For the topic set, participants may use whichever of the four languages they like. The following depicts the MLIR subtask.
Topic set |    | Document set
C         | -> | CJKE
J         | -> | CJKE
K         | -> | CJKE
E         | -> | CJKE
Bilingual CLIR (BLIR)
The topic set and document set of the BLIR subtask are in two different languages. For example, for a K -> J run (from Korean topics to Japanese documents), the topics need to be translated into Japanese (or the documents into Korean).
Note:
In the case of BLIR at NTCIR Workshop 5, runs using topics written in English will not be officially evaluated, except for comparison with the pivot-language approach (trec_eval results will still be delivered to the participants).
The following depicts the BLIR subtask.
Topic set |    | Document set
C         | -> | J
C         | -> | K
C         | -> | E
J         | -> | C
J         | -> | K
J         | -> | E
K         | -> | C
K         | -> | J
K         | -> | E
Pivot Bilingual CLIR (PLIR)
The pivot-language approach (also called the trans-lingual approach) is a special form of BLIR that proceeds in two steps: for example, a C -> J search is carried out as C -> E followed by E -> J (i.e., C -> E -> J), with English serving as an intermediate, or pivot, language.
Note:
Participants submitting runs for this subtask may also submit BLIR runs using English topics (i.e., E -> C, E -> J, or E -> K) in order to analyze the performance of the approach.
Topic set |    | Document set
C         | -> | J
C         | -> | K
C         | -> | E
J         | -> | C
J         | -> | K
J         | -> | E
K         | -> | C
K         | -> | J
K         | -> | E
E         | -> | C
E         | -> | J
E         | -> | K
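As an illustration, the two-step pivot retrieval described above can be sketched as follows. Note that `translate`, `retrieve`, and the dictionary layout are hypothetical placeholders for this sketch, not part of the NTCIR task infrastructure:

```python
# Two-step pivot retrieval sketch for a C -> E -> J run.
# NOTE: translate(), retrieve(), and the dictionary layout are
# hypothetical placeholders, not NTCIR task tooling.

def translate(terms, src, tgt, dictionary):
    """Dictionary-based, term-by-term translation (a common baseline)."""
    out = []
    for term in terms:
        out.extend(dictionary.get((src, tgt), {}).get(term, []))
    return out

def pivot_search(topic_terms, dictionary, retrieve):
    """Run a C -> J search via English as the pivot language."""
    en_terms = translate(topic_terms, "CH", "EN", dictionary)  # step 1: C -> E
    ja_terms = translate(en_terms, "EN", "JA", dictionary)     # step 2: E -> J
    # Monolingual retrieval against the Japanese collection
    return retrieve(ja_terms)
```

In practice, each translation step multiplies ambiguity, which is why term disambiguation (discussed below under research interests) matters particularly for pivot runs.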
Single Language IR (SLIR)
The topic set and document set of the SLIR subtask are in a single language. The following depicts the SLIR subtask.
Topic set |    | Document set
C         | -> | C
J         | -> | J
K         | -> | K
E         | -> | E
At NTCIR Workshop 5, participants are encouraged to explore the following special issues.
Pivot Language Approach
Since this approach is realistic for practical systems, participants are strongly encouraged to try it for BLIR, using English as the pivot language.
More MLIR!
MLIR should be studied more intensively in order to develop search engines that work in realistic multilingual settings such as the Internet. The submission of MLIR runs is strongly recommended.
Term Disambiguation in More Real Situation
Needless to say, an important issue for enhancing CLIR performance is term disambiguation. In order to promote the development of term-disambiguation techniques, a "title-only run" has been mandatory since the previous NTCIR-4. Links to relevant language resources will be released as soon as possible.
Note:
The task organizers will continue their efforts to provide useful resources such as bilingual term lists, parallel corpora, and translation probability tables. If you have a language resource that can be shared among all participants, please let us know.
The test collection used in the CLIR task is composed of a document set and a topic set. The following gives a brief description of each set. It should be noted that a new document set may be added later.
(a) Document collection for evaluation
Doc language | Files | No. of docs (2000) | No. of docs (2001) | Total
Chinese (CIRB040r, revised; 581.7 MB)
  | United Daily News (udn)     | 244,038 | 222,526 | 466,564
  | United Express (ude)        |  40,445 |  51,851 |  92,296
  | Ming Hseng News (mhn)       |  84,437 |  85,302 | 169,739
  | Economic Daily News (edn)   |  79,380 |  93,467 | 172,847
  | Total                       | 448,300 | 453,146 | 901,446
Japanese
  | Mainichi Newspaper 2000-2001 (118.8 MB) |  99,207 | 100,474 | 199,681
  | Yomiuri Newspaper 2000-2001 (343.3 MB)  | 306,709 | 352,010 | 658,719
  | Total                                   | 405,916 | 452,484 | 858,400
Korean
  | Hankookilbo 2000-2001 (52.1 MB) |  40,306 |  44,944 |  85,250
  | Chosunilbo 2000-2001 (88.7 MB)  |  67,711 |  67,413 | 135,124
  | Total                           | 108,017 | 112,357 | 220,374
English
  | Mainichi Daily News 2000-2001 (9.9 MB) |   6,608 |   5,547 |  12,155
  | Korea Times 2000-2001 (25.3 MB)        |  16,461 |  14,069 |  30,530
  | Daily Yomiuri 2000-2001 (22.9 MB)      |   9,081 |   8,660 |  17,741
  | Xinhua 2000-2001 (from LDC)            | 107,956 |  90,668 | 198,624
  | Total                                  | 140,106 | 118,944 | 259,050
*The numbers of documents in the Daily Yomiuri collection were revised at 19:00 on July 07 (Japan Time).
NOTE:
Please use these document sets when executing runs for submission. The document sets included in the NTCIR-4 test collections (1998-99) should be used only for training your system. Please do NOT include those documents (1998-99) in the final results submitted to the organizers.
(b) Document collections for training
(b-1) NTCIR-4 test collection
Collection | Language | Sub-collection | No. of docs | Note
NTCIR-4 CLIR
  | Chinese 1998-99  | CIRB020 (United Daily News)                | 249,203 | Used in NTCIR-3
  |                  | CIRB011 (China Times, China Times Express, Commercial Times, China Daily News, Central Daily News) | 132,172 | Used in NTCIR-3
  |                  | Total                                      | 381,375 |
  | Japanese 1998-99 | Mainichi                                   | 220,078 | Used in NTCIR-3
  |                  | Yomiuri                                    | 373,558 | New
  |                  | Total                                      | 593,636 |
  | Korean 1998-99   | Hankookilbo                                | 149,921 | New
  |                  | Chosunilbo                                 | 104,517 | New
  |                  | Total                                      | 254,438 |
  | English 1998-99  | EIRB010: Taiwan News                       |   7,489 | Used in NTCIR-3
  |                  | EIRB010: China Times English News (Taiwan) |   2,715 | Used in NTCIR-3
  |                  | EIRB010: Mainichi Daily News (Japan)       |  12,723 | Used in NTCIR-3
  |                  | Korea Times                                |  19,599 | New
  |                  | Xinhua (AQUAINT)                           | 208,167 | New
  |                  | Hong Kong Standard                         |  96,683 | New
  |                  | Total                                      | 347,376 |
(b-2) NTCIR-3 test collection
Collection | Language | Sub-collection | No. of docs
NTCIR-3 CLIR
  | Chinese 1998-99  | CIRB020 (United Daily News)                | 249,508
  |                  | CIRB011 (China Times, China Times Express, Commercial Times, China Daily News, Central Daily News) | 132,173
  | Japanese 1998-99 | Mainichi                                   | 220,078
  | Korean 1994      | Korea Economic Daily (1994)                |  66,146
  | English 1998-99  | EIRB010: Taiwan News                       |   7,489
  |                  | EIRB010: China Times English News (Taiwan) |   2,715
  |                  | EIRB010: Mainichi Daily News (Japan)       |  12,723
(c) Format
Each news article is formatted consistently using a set of tags. Sample documents are shown in the Appendix.
Mandatory tags | ||
<DOC> | </DOC> | The tag for each document |
<DOCNO> | </DOCNO> | Document identifier |
<LANG> | </LANG> | Language code: CH, EN, JA, KR |
<HEADLINE> | </HEADLINE> | Title of this news article |
<DATE> | </DATE> | Issue date |
<TEXT> | </TEXT> | Text of news article |
Optional tags | ||
<P> | </P> | Paragraph marker |
<SECTION> | </SECTION> | Section identifier in original newspapers |
<AE> | </AE> | Indicates whether the article contains figures |
<WORDS> | </WORDS> | Number of words, counted in 2-byte characters (for Mainichi Newspaper) |
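A minimal parser for records in this tagged format might look like the following sketch. It is regex-based, assumes well-formed and non-nested mandatory tags, and is not official NTCIR tooling:

```python
import re

# Minimal regex-based parser for one <DOC>...</DOC> record in the tagged
# format above. A sketch only: assumes well-formed, non-nested mandatory
# tags; not official NTCIR tooling.
MANDATORY_TAGS = ["DOCNO", "LANG", "HEADLINE", "DATE", "TEXT"]

def parse_doc(record):
    """Return a dict mapping each mandatory tag to its (stripped) content."""
    fields = {}
    for tag in MANDATORY_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", record, re.DOTALL)
        if match:
            fields[tag] = match.group(1).strip()
    return fields
```

Optional tags such as <P> or <SECTION> are left inside the <TEXT> content here; a real indexer would decide per tag whether to strip or exploit them.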
Each topic has four fields: 'T' (TITLE), 'D' (DESC), 'N' (NARR), and 'C' (CONC). The following shows a sample topic.
<TOPIC>
  <NUM>013</NUM>
  <SLANG>CH</SLANG>
  <TLANG>EN</TLANG>
  <TITLE>NBA labor dispute</TITLE>
  <DESC>To retrieve the labor dispute between the two parties of the US National Basketball Association at the end of 1998 and the agreement that they reached.</DESC>
  <NARR>
    <REL>The content of the related documents should include the causes of NBA labor dispute, the relations between the players and the management, main controversial issues of both sides, compromises after negotiation and content of the new agreement, etc. The document will be regarded as irrelevant if it only touched upon the influences of closing the court on each game of the season.</REL>
  </NARR>
  <CONC>NBA (National Basketball Association), union, team, league, labor dispute, league and union, negotiation, to sign an agreement, salary, lockout, Stern, Bird Regulation.</CONC>
</TOPIC>
The tags used in topics are as follows.
<TOPIC> | </TOPIC> | The tag for each topic |
<NUM> | </NUM> | Topic identifier |
<SLANG> | </SLANG> | Source language code: CH, EN, JA, KR |
<TLANG> | </TLANG> | Target language code: CH, EN, JA, KR |
<TITLE> | </TITLE> | A concise representation of the information request, composed of a noun or noun phrase. |
<DESC> | </DESC> | A brief description of the information need, composed of one or two sentences. |
<NARR> | </NARR> | A longer description of the topic. <NARR> may have three parts: (1) <BACK>...</BACK>: background information about the topic; (2) <REL>...</REL>: further interpretation of the request and proper nouns, lists of relevant or irrelevant items, specific requirements or limitations of relevant documents, and so on; (3) <TERM>...</TERM>: definitions or explanations of proper nouns, scientific terms, and so on. |
<CONC> | </CONC> | Keywords relevant to the whole topic. |
It should be noted that three sub-fields, <BACK>, <REL>, and <TERM>, have been added to the <NARR> field since the previous NTCIR-4.
Mandatory Runs: T-run and D-run
Each participant must submit two types of run for each combination of topic language and document language(s): a T-run, which uses only the <TITLE> field, and a D-run, which uses only the <DESC> field.
The purpose of these mandatory runs is to make research findings clear by comparing systems and methods under a unified condition.
Recommended Runs: DN-run
Also, the task organizers strongly recommend submitting a DN-run, which uses the <DESC> and <NARR> fields.
Optional Runs
Any other combination of fields may be submitted as optional runs according to each participant's research interests, e.g., TDN-runs, DC-runs, TDNC-runs, and so on.
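As an illustration of how run-type letters map onto topic fields, the following sketch assembles the query text for a given run type. The helper itself is hypothetical, not part of the task tools; only the letter-to-field mapping (T = <TITLE>, D = <DESC>, N = <NARR>, C = <CONC>) comes from the task definition:

```python
import re

# Sketch: build the query text for a given run type (T, D, DN, TDNC, ...)
# from a topic record. The letter-to-field mapping follows the task
# definition; the helper itself is hypothetical, not official tooling.
FIELD_TAG = {"T": "TITLE", "D": "DESC", "N": "NARR", "C": "CONC"}

def query_for_run(topic_record, run_type):
    parts = []
    for letter in run_type:
        tag = FIELD_TAG[letter]
        match = re.search(rf"<{tag}>(.*?)</{tag}>", topic_record, re.DOTALL)
        if match:
            # Drop nested tags such as <REL> inside <NARR>, normalize spaces
            text = re.sub(r"</?\w+>", " ", match.group(1))
            parts.append(" ".join(text.split()))
    return " ".join(parts)
```

For example, `query_for_run(topic, "DN")` concatenates the <DESC> and <NARR> contents, matching the recommended DN-run above.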
Number of Runs
Each participant can submit up to 5 runs in total for each language pair, regardless of run type, and the 5 runs may include at most two T-runs and at most two D-runs. A language pair means a combination of topic language and document language(s). For example:
Language combination -> Topic: C and Docs: CJE (C -> CJE)
Submission -> two T-runs, a D-run, a DN-run, and a TDNC-run (5 runs in total)
Identification and Priority of Runs
Each run must be given a RunID, a unique identifier formatted (as the examples below illustrate) as
<GroupID>-<TopicLanguage>-<DocumentLanguage(s)>-<RunType>-<pp>
where 'pp' is a two-digit number representing the priority of the run, used as a parameter for pooling. Participants must assign a priority to each submitted run separately for each language pair; "01" means the highest priority. For example, suppose a participating group, LIPS, submits 3 runs for C -> CJE: first a T-run, second a D-run, and third a DN-run. The RunIDs are then LIPS-C-CJE-T-01, LIPS-C-CJE-D-02, and LIPS-C-CJE-DN-03, respectively. Alternatively, if the group uses different ranking techniques in two T-runs for C -> CJE, the RunIDs are LIPS-C-CJE-T-01, LIPS-C-CJE-T-02, and LIPS-C-CJE-D-03.
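A small, unofficial checker for this RunID pattern, inferred from the examples above, could be sketched as follows (the component names are our own labels, not official terminology):

```python
import re

# Unofficial checker for the RunID pattern inferred from the examples
# above (<GroupID>-<TopicLang>-<DocLang(s)>-<RunType>-<pp>).
# A convenience sketch for participants, not official validation code.
RUNID_RE = re.compile(
    r"^(?P<group>\w+)-(?P<topic>[CJKE])-(?P<docs>[CJKE]{1,4})-"
    r"(?P<fields>[TDNC]{1,4})-(?P<pp>\d{2})$"
)

def check_runid(runid):
    """Return the RunID's components as a dict, or None if malformed."""
    match = RUNID_RE.match(runid)
    return match.groupdict() if match else None
```

Checking RunIDs before submission avoids runs being misattributed during pooling.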
Relevance judgments will be made in four grades: Highly Relevant, Relevant, Partially Relevant, and Irrelevant. Evaluation will be done using trec_eval, run at two different thresholds of relevance level. In addition, newly proposed metrics for multi-grade relevance judgments, weighted R-precision and weighted average precision, which reward systems that retrieve more relevant documents at higher ranks, may be employed.
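To make the two-threshold evaluation concrete, the following sketch computes average precision under the two common interpretations of the thresholds: "rigid" counts Highly Relevant and Relevant as relevant, while "relaxed" also counts Partially Relevant. The grade labels and function are illustrative, not the official qrels format or trec_eval itself:

```python
# Sketch: average precision at two relevance thresholds.
# "Rigid" counts Highly Relevant + Relevant as relevant; "relaxed" also
# counts Partially Relevant. Grade labels are illustrative, not the
# official qrels encoding; this is not trec_eval itself.
RIGID = {"highly_relevant", "relevant"}
RELAXED = RIGID | {"partially_relevant"}

def average_precision(ranking, grades, relevant_grades):
    """ranking: system output, best first; grades: doc id -> grade label."""
    total_relevant = sum(1 for g in grades.values() if g in relevant_grades)
    if total_relevant == 0:
        return 0.0
    hits, score = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if grades.get(doc) in relevant_grades:
            hits += 1
            score += hits / rank  # precision at this relevant document
    return score / total_relevant
```

Running the same ranking under both thresholds shows how systems that rank partially relevant documents highly gain under the relaxed judgment.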
2004/09/30 | Deadline of Registrations for Participation |
2004/11/20 | Release of Data (Document sets) |
2005/05/01 | Distribution of Search Topics |
2005/06/01 | Submission of Search Results |
2005/09/01 | Delivery of Evaluation Results |
2005/10/01 | Deadline for Papers (Working Notes) |
2005/12/06-09 | NTCIR Workshop 5 (Conference) |
2006 | Deadline for Papers (Formal Proceedings) |
Hsin-Hsi Chen, Taiwan
Kuang-hua Chen, Taiwan (co-chair)
Noriko Kando, Japan
Kazuaki Kishida, Japan (co-chair)
Kazuko Kuriyama, Japan
Suk-Hoon Lee, Korea (co-chair)
Sung Hyon Myaeng, Korea
(in alphabetical order of family names)
If you have a question, please contact the task organizers.
@nii.ac.jp
1. Sample of Chinese Document Record
2. Sample of Japanese Document Record
3. Sample of Korean Document Record