(Aug. 20 2004)


Call For Participation

Cross-Lingual Information Retrieval Task in NTCIR Workshop 5

[1.Intro][2.Subtasks][MLIR][BLIR][PLIR][SLIR][3.Interests][4.Resources][5.Test Collection][Doc][Topic]
[6.Type of Runs][7.Evaluation][8.Schedule][9.Organizers][10.Contact][Appendix]

1. Introduction

The cross-lingual information retrieval (CLIR) task of NTCIR Workshop 5 includes four subtasks (MLIR, BLIR, PLIR, and SLIR) for promoting research on East Asian languages (Chinese, Japanese, and Korean).

a) See also the official web site (http://research.nii.ac.jp/ntcir/) for further information.
b) Online registration is available at the web site (http://research.nii.ac.jp/ntcir-ws5/application-en.html).

2. Subtasks

The CLIR task provides four subtasks. Participants can choose to take part in any one or more of the four subtasks.

Multilingual CLIR (MLIR)
The topic set and document set of the MLIR subtask involve more than two languages. In NTCIR Workshop 5, participants may submit runs only against the CJKE multilingual document collection.

Regarding the topic set, participants may use any of the four languages. The following depicts the MLIR subtask.

Topic set          Document set
C, J, K, or E  ->  CJKE (multilingual collection)

Bilingual CLIR (BLIR)
The topic set and document set of the BLIR subtask involve two languages. For example, for a K -> J run (from Korean topics to Japanese documents), the topics need to be translated into Japanese (or the documents into Korean).
In BLIR at NTCIR Workshop 5, runs using topics written in English will not be officially evaluated, except for comparison with the pivot language approach (trec_eval results will be delivered to the participants).
The following depicts the BLIR subtask.

Topic set Document set
C -> J
C -> K
C -> E
J -> C
J -> K
J -> E
K -> C
K -> J
K -> E

Pivot Bilingual CLIR (PLIR)
The pivot language approach (or trans-lingual approach) is a special form of BLIR consisting of two steps: e.g., a C -> E step followed by an E -> J step (i.e., C -> E -> J) for performing a C -> J search; in this case, English serves as an intermediate, or pivot, language.
Participants submitting runs for this subtask may also submit BLIR runs using English topics (i.e., E -> C, E -> J, or E -> K) in order to analyze the performance of the approach.

Topic set Document set
C -> J
C -> K
C -> E
J -> C
J -> K
J -> E
K -> C
K -> J
K -> E
E -> C
E -> J
E -> K
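The two-step translation above can be sketched in a few lines. This is a toy illustration only: the term lists below are hypothetical stand-ins for the bilingual term lists or translation probability tables mentioned in Section 4, and a real system would also handle segmentation and term disambiguation.

```python
# Pivot bilingual CLIR sketch: a C -> J search realized as C -> E -> J,
# with English as the intermediate (pivot) language.

C_TO_E = {"篮球": "basketball", "劳资": "labor"}                 # hypothetical Chinese->English term list
E_TO_J = {"basketball": "バスケットボール", "labor": "労働"}      # hypothetical English->Japanese term list

def translate(terms, table):
    """Translate each query term via the table, keeping untranslatable terms as-is."""
    return [table.get(t, t) for t in terms]

def pivot_translate(terms, first_table, second_table):
    """Two-step (pivot) translation: source -> pivot -> target."""
    return translate(translate(terms, first_table), second_table)

# Chinese topic terms become a Japanese query via English.
japanese_query = pivot_translate(["篮球", "劳资"], C_TO_E, E_TO_J)
```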

Single Language IR (SLIR)
The topic set and document set of the SLIR subtask are in a single language. The following depicts the SLIR subtask.

Topic set Document set
C -> C
J -> J
K -> K
E -> E

3. Special Interests in NTCIR Workshop 5

In NTCIR Workshop 5, participants are encouraged to explore the following special issues.
Pivot Language Approach
Since the approach seems realistic, participants are strongly encouraged to try it for BLIR using English as the pivot language.
More MLIR!
MLIR should be studied more intensively in order to develop search engines that work in real situations on the Internet. The submission of MLIR runs is strongly recommended.
Term Disambiguation in More Realistic Situations
Needless to say, an important issue for enhancing CLIR performance is term disambiguation. In order to promote the development of term disambiguation techniques, the "title-only run" has been mandatory since the previous workshop, NTCIR-4.

4. Language Resources

Links to language resources will be released as soon as possible.
The task organizers will continue their efforts to provide useful resources such as bilingual term lists, parallel corpora, and translation probability tables. If you have a language resource that can be shared among all participants, please let us know.

5. Test Collection

5.1 Document set

The test collection used in the CLIR task is composed of a document set and a topic set. The following gives a brief description of each. Note that a new document set may be added later.

(a) Document collection for evaluation

Language  Collection (files)                          2000     2001     Total
Chinese   CIRB040r (revised) (581.7 MB):
            United Daily News (udn)                244,038  222,526  466,564
            United Express (ude)                    40,445   51,851   92,296
            Ming Hseng News (mhn)                   84,437   85,302  169,739
            Economic Daily News (edn)               79,380   93,467  172,847
            Total                                  448,300  453,146  901,446
Japanese  Mainichi Newspaper 2000-2001 (118.8 MB)   99,207  100,474  199,681
          Yomiuri Newspaper 2000-2001 (343.3 MB)   306,709  352,010  658,719
          Total                                    405,916  452,484  858,400
Korean    Hankookilbo 2000-2001 (52.1 MB)           40,306   44,944   85,250
          Chosunilbo 2000-2001 (88.7 MB)            67,711   67,413  135,124
          Total                                    108,017  112,357  220,374
English   Mainichi Daily News 2000-2001 (9.9 MB)     6,608    5,547   12,155
          Korea Times 2000-2001 (25.3 MB)           16,461   14,069   30,530
          Daily Yomiuri 2000-2001 (22.9 MB)          9,081    8,660   17,741
          Xinhua 2000-2001 (from LDC)              107,956   90,668  198,624
          Total                                    140,106  118,944  259,050

*The numbers of documents in Daily Yomiuri were revised at 19:00 on July 07 (Japan time).

Please use these document sets when executing runs for submission.
The document sets included in the NTCIR-4 test collections (1998-99) should be used only for training your system. Please do NOT include these documents (1998-99) in the final results submitted to the organizers.

(b) Document collections for training

(b-1) NTCIR-4 test collection

Language  Period   Collection                                    No. of Docs  Note
Chinese   1998-99  CIRB020 (United Daily News)                       249,203  Used in NTCIR-3
                   CIRB011 (China Times, China Times Express,
                     Commercial Times, China Daily News,
                     Central Daily News)                             132,172  Used in NTCIR-3
                   Total                                             381,375
Japanese  1998-99  Mainichi                                          220,078  Used in NTCIR-3
                   Yomiuri                                           373,558  New
                   Total                                             593,636
Korean    1998-99  Hankookilbo                                       149,921  New
                   Chosunilbo                                        104,517  New
                   Total                                             254,438
English   1998-99  EIRB010 (Taiwan News)                               7,489  Used in NTCIR-3
                   China Times English News (Taiwan)                   2,715  Used in NTCIR-3
                   Mainichi Daily News (Japan)                        12,723  Used in NTCIR-3
                   Korea Times                                        19,599  New
                   Xinhua (AQUAINT)                                  208,167  New
                   Hong Kong Standard                                 96,683  New
                   Total                                             347,376

(b-2) NTCIR-3 test collection

Language  Period   Collection                                    No. of Docs
Chinese   1998-99  CIRB020 (United Daily News)                       249,508
                   CIRB011 (China Times, China Times Express,
                     Commercial Times, China Daily News,
                     Central Daily News)                             132,173
Japanese  1998-99  Mainichi                                          220,078
Korean    1994     Korea Economic Daily                                66,146
English   1998-99  EIRB010 (Taiwan News)                               7,489
                   China Times English News (Taiwan)                   2,715
                   Mainichi Daily News (Japan)                        12,723

(c) Format
The format of each news article is standardized with a set of tags. Sample documents are shown in the Appendix.

Mandatory tags
<DOC> </DOC> The tag for each document
<DOCNO> </DOCNO> Document identifier
<LANG> </LANG> Language code: CH, EN, JA, KR
<HEADLINE> </HEADLINE> Title of this news article
<DATE> </DATE> Issue date
<TEXT> </TEXT> Text of news article
Optional tags
<P> </P> Paragraph marker
<SECTION> </SECTION> Section identifier in original newspapers
<AE> </AE> Whether the article contains figures
<WORDS> </WORDS> Number of words, counted in 2-byte units (for Mainichi Newspaper)
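A record in this format can be extracted with simple pattern matching. The sketch below is a minimal illustration, not part of the official tools; the DOCNO, date, and article text are hypothetical placeholders, and a production system would use a proper SGML/XML parser.

```python
import re

# A document record using the mandatory tags described above
# (field values are hypothetical placeholders).
DOC = """<DOC>
<DOCNO>udn-20000101-0001</DOCNO>
<LANG>CH</LANG>
<HEADLINE>Sample headline</HEADLINE>
<DATE>2000-01-01</DATE>
<TEXT>
<P>First paragraph.</P>
<P>Second paragraph.</P>
</TEXT>
</DOC>"""

MANDATORY = ["DOCNO", "LANG", "HEADLINE", "DATE", "TEXT"]

def parse_record(record):
    """Extract the mandatory fields of one <DOC> record into a dict (None if a tag is missing)."""
    fields = {}
    for tag in MANDATORY:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", record, re.DOTALL)
        fields[tag] = m.group(1).strip() if m else None
    return fields

fields = parse_record(DOC)
```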

5.2 Topics

Each topic has four fields: 'T' (TITLE), 'D' (DESC), 'N' (NARR), and 'C' (CONC). The following shows a sample topic.

<TITLE>NBA labor dispute</TITLE>
<DESC>To retrieve the labor dispute between the two parties of the US National Basketball Association at the end of 1998 and the agreement that they reached. </DESC>
<REL>The content of the related documents should include the causes of NBA labor dispute, the relations between the players and the management, main controversial issues of both sides, compromises after negotiation and content of the new agreement, etc. The document will be regarded as irrelevant if it only touched upon the influences of closing the court on each game of the season.</REL>
<CONC>NBA (National Basketball Association), union, team, league, labor dispute, league and union, negotiation, to sign an agreement, salary, lockout, Stern, Bird Regulation.</CONC>

The tags used in topics are as follows.

<TOPIC> </TOPIC> The tag for each topic
<NUM> </NUM> Topic identifier
<SLANG> </SLANG> Source language code: CH, EN, JA, KR
<TLANG> </TLANG> Target language code: CH, EN, JA, KR
<TITLE> </TITLE> A concise representation of the information request, composed of a noun or noun phrase.
<DESC> </DESC> A brief description of the information need, composed of one or two sentences.
<NARR> </NARR> A much longer description of the topic. The <NARR> may have three parts:
(1) <BACK>...</BACK>: background information about the topic.
(2) <REL>...</REL>: further interpretation of the request and proper nouns, lists of relevant or irrelevant items, specific requirements or limitations on relevant documents, and so on.
(3) <TERM>...</TERM>: definitions or explanations of proper nouns, scientific terms, and so on.
<CONC> </CONC> Keywords relevant to the whole topic.

It should be noted that the three sub-fields <BACK>, <REL>, and <TERM> have been added to the <NARR> field since the previous NTCIR-4.
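The nested topic structure above, including the optional <NARR> sub-fields, can be read with the same kind of pattern matching. This is an illustrative sketch only; the topic number, language codes, and field contents below are hypothetical.

```python
import re

# A topic record using the tags listed above (contents are hypothetical).
TOPIC = """<TOPIC>
<NUM>001</NUM>
<SLANG>CH</SLANG>
<TLANG>JA</TLANG>
<TITLE>NBA labor dispute</TITLE>
<DESC>A short description of the information need.</DESC>
<NARR>
<BACK>Background of the dispute.</BACK>
<REL>What counts as a relevant document.</REL>
</NARR>
<CONC>NBA, lockout</CONC>
</TOPIC>"""

def field(tag, text):
    """Return the stripped contents of <tag>...</tag>, or None if the tag is absent."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return m.group(1).strip() if m else None

# The sub-fields of <NARR> are optional: <TERM> is absent here.
narr = field("NARR", TOPIC)
back, rel, term = field("BACK", narr), field("REL", narr), field("TERM", narr)
```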

6. Types of Runs

Mandatory Runs: T-run and D-run
Each participant must submit two types of runs for each combination of topic language and document language(s): a T-run, which uses only the <TITLE> field, and a D-run, which uses only the <DESC> field.

The purpose of asking participants to submit these mandatory runs is to make research findings clear by comparing systems or methods under a unified condition.

Recommended Runs: DN-run
The task organizers also strongly recommend the DN-run, which uses the <DESC> and <NARR> fields.

Optional Runs
Any other combination of fields may be submitted as an optional run according to each participant's research interests, e.g., TDN-run, DC-run, TDNC-run, and so on.

Number of Runs
Each participant can submit up to 5 runs in total for each language pair, regardless of run type; at most two T-runs and at most two D-runs may be included among the 5. A language pair is a combination of topic language and document language(s). For example:
Language combination -> Topic: C and Docs: CJE (C -> CJE)
Submission -> two T-runs, a D-run, a DN-run, and a TDNC-run (5 runs in total).

Identification and Priority of Runs
Each run must be associated with a RunID, an identifier for the run. The RunID format is:

<GroupID>-<TopicLanguage>-<DocumentLanguage(s)>-<RunType>-<pp>

The 'pp' is two digits representing the priority of the run; it will be used as a parameter for pooling. Participants must decide the priority of each submitted run on a per-language-pair basis, with "01" meaning the highest priority. For example, suppose a participating group, LIPS, submits 3 runs for C -> CJE: a T-run, a D-run, and a DN-run. The RunIDs are LIPS-C-CJE-T-01, LIPS-C-CJE-D-02, and LIPS-C-CJE-DN-03, respectively. If the group instead uses two different ranking techniques in T-runs for C -> CJE, the RunIDs would be LIPS-C-CJE-T-01, LIPS-C-CJE-T-02, and LIPS-C-CJE-D-03.
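The naming rule above can be captured in a small helper; this sketch reproduces the LIPS example from the text (the group name and run list are taken directly from that example).

```python
# RunID construction sketch, following the pattern recoverable from the
# examples in the text: <GroupID>-<TopicLang>-<DocLang>-<RunType>-<pp>.

def run_id(group, topic_lang, doc_langs, run_type, priority):
    """Build a RunID; priority is a 1-based rank rendered as two zero-padded digits."""
    return f"{group}-{topic_lang}-{doc_langs}-{run_type}-{priority:02d}"

# The LIPS example: a T-run, a D-run, and a DN-run for C -> CJE,
# with priority assigned in submission order.
ids = [run_id("LIPS", "C", "CJE", t, p)
       for p, t in enumerate(["T", "D", "DN"], start=1)]
# ids == ["LIPS-C-CJE-T-01", "LIPS-C-CJE-D-02", "LIPS-C-CJE-DN-03"]
```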

7. Evaluation

Relevance judgments will be made in four grades: Highly Relevant, Relevant, Partially Relevant, and Irrelevant. Evaluation will be done using trec_eval, run at two different relevance thresholds. In addition, newly proposed metrics for multi-grade relevance judgments, weighted R-precision and weighted average precision, which reward systems that retrieve more relevant documents at higher ranks, may be employed.
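Since trec_eval works on binary judgments, each of the two thresholds collapses the four grades into relevant/irrelevant. The sketch below illustrates one such binarization; the threshold names ("rigid" and "relaxed") and exact cutoffs are assumptions for illustration, since the text only says trec_eval is run at two thresholds.

```python
# Binarizing the four relevance grades for trec_eval-style evaluation.
# Threshold names and cutoffs are illustrative assumptions:
#   "rigid"   counts Highly Relevant and Relevant as relevant;
#   "relaxed" additionally counts Partially Relevant.
RELEVANT_AT = {
    "rigid":   {"Highly Relevant", "Relevant"},
    "relaxed": {"Highly Relevant", "Relevant", "Partially Relevant"},
}

def binarize(grade, threshold):
    """Map a graded judgment to 1 (relevant) or 0 (irrelevant) at the given threshold."""
    return 1 if grade in RELEVANT_AT[threshold] else 0
```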

8. Schedule

2004/09/30 Deadline of Registrations for Participation
2004/11/20 Release of Data (Document sets)
2005/05/01 Distribution of Search Topics
2005/06/01 Submission of Search Results
2005/09/01 Delivery of Evaluation Results
2005/10/01 Deadline for Papers (Working Notes)
2005/12/06-09 NTCIR Workshop 5 (Conference)
2006 Deadline for Papers (Formal Proceedings)

9. CLIR Task Executive Committee (Task Organizers)

Hsin-Hsi Chen, Taiwan
Kuang-hua Chen, Taiwan (co-chair)
Noriko Kando, Japan
Kazuaki Kishida, Japan (co-chair)
Kazuko Kuriyama, Japan
Suk-Hoon Lee, Korea (co-chair)
Sung Hyon Myaeng, Korea
(in alphabetical order of family names)

10. Contact Information

If you have a question, please contact the task organizers.



Appendix

1. Sample of Chinese Document Record

2. Sample of Japanese Document Record

3. Sample of Korean Document Record