Submission of search results - NTCIR-6 CLIR Task

Submission Guideline for NTCIR-6 CLIR Task
how to submit search results

July/15/2006

[CLIR Task Home]

1. Files to Be Submitted

All participants must submit (a) one document-list file for each run and (b) one file of system descriptions. Please use XML-style tags to describe your system characteristics, following the instructions in Section 5. Examples are as follows.

(a) example of document list (search results): See Section 6 for details.
030 0    cts_cec_19991118596    1                     4238     LIPS-C-CJE-T-01
030 0    cts_cec_19991118596    2                     3211     LIPS-C-CJE-T-01
...........
030 0    cts_cec_19991118596    1000                1116     LIPS-C-CJE-T-01

(b) example of system description: See Section 5 for details.
<TECHDESC>
<RUN>
<ID>LIPS-C-CJE-T-01</ID>
<INDEXUNIT>word</INDEXUNIT>
<INDEXTECH>morphology</INDEXTECH>
<INDEXSTRUC>inverted file</INDEXSTRUC>
<QUERYUNIT>word</QUERYUNIT>
<MODEL>vector space</MODEL>
<RANK>tf-idf</RANK>
<TRANS>dictionary-based query translation</TRANS>
<QEXP>pre- and post-translation expansion by Rocchio</QEXP>
<CORPUS>using NTCIR-1 Japanese document collections for expansion</CORPUS>
<PIVOT>none</PIVOT>
<SPTECH>searching translations of unknown terms automatically from Web pages</SPTECH>
<COMMENT>none</COMMENT>
</RUN>
<RUN>
<ID>LIPS-C-CJE-D-02</ID>
......
</RUN>
</TECHDESC>

2. Type of Runs

Mandatory Runs: T-run and D-run
Each participant must submit two types of runs for each combination of topic language and document language(s):

T-run, for which only the TITLE field is used
D-run, for which only the DESC field is used

The purpose of asking participants to submit these mandatory runs is to make research findings clear by comparing systems or methods under a unified condition.

Recommended Runs: DN-run
The task organizers also strongly recommend submitting a DN-run, which is a run that uses the <DESC> and <NARR> fields.

Optional Runs
Any other combination of fields may be submitted as an optional run, according to each participant's research interests, e.g., TDN-run, DC-run, TDNC-run, and so on.
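
The run-type names above can be read as one letter per topic field. The following is a minimal Python sketch of that reading. It assumes that T, D, N, and C stand for the TITLE, DESC, NARR, and CONC fields; the mapping for C is an assumption, since this guideline does not name that field. `fields_for_run_type` and `run_category` are hypothetical helper names.

```python
# Assumed letter-to-field mapping; this guideline confirms only T, D, and N.
FIELD_BY_LETTER = {"T": "TITLE", "D": "DESC", "N": "NARR", "C": "CONC"}


def fields_for_run_type(run_type):
    """Return the topic fields used by a run type such as 'T', 'DN', or 'TDNC'."""
    fields = []
    for letter in run_type.upper():
        if letter not in FIELD_BY_LETTER:
            raise ValueError(f"unknown field letter {letter!r} in run type {run_type!r}")
        fields.append(FIELD_BY_LETTER[letter])
    return fields


def run_category(run_type):
    """Classify a run type as mandatory, recommended, or optional."""
    run_type = run_type.upper()
    if run_type in ("T", "D"):
        return "mandatory"
    if run_type == "DN":
        return "recommended"
    return "optional"
```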

3. Number of Runs

Each participant can submit up to 5 runs in total for each language pair, regardless of run type. Of these 5 runs, at most two may be T-runs and at most two may be D-runs. A language pair is a combination of topic language and document language(s). For example,
Language combination -> Topic: C and Docs: CJE (C->CJK)
Submission -> two T-runs, a D-run, a DN-run and a TDNC run (5 runs in total).
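
The limits above, together with the mandatory runs from Section 2, can be checked before submission. The sketch below is only an illustration: `check_run_counts` is a hypothetical helper, and it takes the list of run types planned for a single language pair.

```python
from collections import Counter

MAX_RUNS_PER_PAIR = 5  # total runs allowed per language pair
MAX_T_RUNS = 2         # at most two T-runs among them
MAX_D_RUNS = 2         # at most two D-runs among them


def check_run_counts(run_types):
    """Return a list of problems for the run types planned for ONE language pair."""
    counts = Counter(run_type.upper() for run_type in run_types)
    problems = []
    if len(run_types) > MAX_RUNS_PER_PAIR:
        problems.append(f"{len(run_types)} runs exceed the limit of {MAX_RUNS_PER_PAIR}")
    if counts["T"] > MAX_T_RUNS:
        problems.append(f"{counts['T']} T-runs exceed the limit of {MAX_T_RUNS}")
    if counts["D"] > MAX_D_RUNS:
        problems.append(f"{counts['D']} D-runs exceed the limit of {MAX_D_RUNS}")
    # T-run and D-run are mandatory for every language pair.
    if counts["T"] == 0:
        problems.append("missing mandatory T-run")
    if counts["D"] == 0:
        problems.append("missing mandatory D-run")
    return problems
```

For the example submission above, `check_run_counts(["T", "T", "D", "DN", "TDNC"])` reports no problems.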

4. Identification and Priority of Runs

Each run must be associated with a RunID, which identifies the run. The format of the RunID is as follows.

Group's ID-Topic Language-Document Language-Run Type-pp

The 'pp' is a two-digit number representing the priority of the run. It is used as a parameter for pooling (see the note below). Participants must assign a priority to each submitted run separately for each language pair; "01" is the highest priority. For example, suppose a participating group, LIPS, submits three runs for C-->CJK: first a T-run, second a D-run, and third a DN-run. The RunIDs are then LIPS-C-CJK-T-01, LIPS-C-CJK-D-02, and LIPS-C-CJK-DN-03, respectively. Alternatively, if the group submits two T-runs that use different ranking techniques, followed by a D-run, for C-->CJK, the RunIDs must be LIPS-C-CJK-T-01, LIPS-C-CJK-T-02, and LIPS-C-CJK-D-03.
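
The RunID format lends itself to a small parser and builder. The sketch below is an illustration only. It assumes that the group ID contains no hyphen and that language codes and run types are upper-case letters, as in the examples above; the pattern and the names `parse_run_id` and `build_run_id` are hypothetical.

```python
import re

# Group's ID-Topic Language-Document Language-Run Type-pp
# Assumption: no hyphen inside the group ID; languages and run type are upper-case letters.
RUN_ID_PATTERN = re.compile(
    r"^(?P<group>[A-Za-z0-9]+)"
    r"-(?P<topic_lang>[A-Z])"
    r"-(?P<doc_langs>[A-Z]+)"
    r"-(?P<run_type>[A-Z]+)"
    r"-(?P<priority>\d{2})$"
)


def parse_run_id(run_id):
    """Split a RunID into its five parts; the priority is returned as an int."""
    match = RUN_ID_PATTERN.match(run_id)
    if match is None:
        raise ValueError(f"malformed RunID: {run_id!r}")
    parts = match.groupdict()
    parts["priority"] = int(parts["priority"])
    return parts


def build_run_id(group, topic_lang, doc_langs, run_type, priority):
    """Assemble a RunID, zero-padding the priority to two digits."""
    return f"{group}-{topic_lang}-{doc_langs}-{run_type}-{priority:02d}"
```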

Note: The top X documents in each submitted run will be collected into the document pool. Only documents in the pool will be judged by human assessors. If the number of submitted runs is too large, the runs to be pooled may be selected according to the priority that you assign to each run.
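
To illustrate why the priority matters, the pooling in the note can be sketched roughly as follows. This is a simplified illustration, not the organizers' procedure: the pool depth X and the number of runs admitted are decided by the organizers, so `depth` and `max_runs` are placeholder parameters, and the input layout (a dict from RunID to per-topic ranked document lists) is an assumption.

```python
def build_pool(runs, depth, max_runs=None):
    """Collect the top `depth` documents per topic from the highest-priority runs.

    `runs` maps a RunID to {topic_id: [doc_id, ...]}, with documents in rank order.
    """
    # The priority is the trailing two digits of the RunID; "01" is the highest.
    selected = sorted(runs, key=lambda run_id: int(run_id.rsplit("-", 1)[1]))
    if max_runs is not None:
        selected = selected[:max_runs]
    pool = {}
    for run_id in selected:
        for topic_id, ranked_docs in runs[run_id].items():
            pool.setdefault(topic_id, set()).update(ranked_docs[:depth])
    return pool
```

With `max_runs=1`, only the run whose RunID ends in 01 contributes documents, so a document that appears only in lower-priority runs would not be judged.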

5. System Description

5.1 Descriptive Information

In addition to search results, every participating group has to give us a concise description of each run. This description should contain the following information.

<INDEXUNIT>: unit of indexing, e.g., character, bi-character, bi-word, phrase, etc.
<INDEXTECH>: techniques for indexing, e.g., morphology, stemming, POS, etc.
<INDEXSTRUC>: index structure, e.g., inverted file, signature file, PAT, etc.
<QUERYUNIT>: unit of query terms, e.g., character, word, phrase, etc.
<MODEL>: retrieval model, e.g., vector space model, probabilistic model (Okapi, INQUERY, logistic regression), etc.
<RANK>: ranking factor for weighting each term, e.g., tf, tf/idf, mutual information, word association, document length, etc.
<TRANS>: translation technique used for cross-lingual information retrieval, e.g., dictionary-based, corpus-based, MT, etc. Detailed information is welcome, e.g., select-all, select-top-N, translation disambiguation, etc.
<QEXP>: techniques used to expand the query, or no query expansion.
<CORPUS>: information about any special corpus used for translation, expansion, etc. If you used the NTCIR-3 or NTCIR-4 test collection for training your system, please describe how you used the test collection in this field.
<PIVOT>: language used for the pivot approach, e.g., English.
<SPTECH>: special techniques for improving the performance of CLIR runs.
<COMMENT>: any other comments.

5.2 Root tags

Please pack the system descriptions for all runs into a single file using two root tags, <TECHDESC> and <RUN>, as follows:

<TECHDESC>
<RUN>
...description of the run1...
</RUN>
<RUN>
...description of the run2...
</RUN>
...
</TECHDESC>

5.3 Template

Please copy and use the template for writing your description.

5.4 File name and format

Please store the system descriptions in a single plain-text file (.txt) with your group name as its file name, e.g., LIPS.txt.
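
The description file can also be generated from per-run data rather than typed by hand. The sketch below is one possible way to do so. The helper names, the choice to fill missing fields with "none" (as the example in Section 1 does for PIVOT and COMMENT), and the UTF-8 encoding are all assumptions; the guideline does not prescribe them.

```python
# Tags in the order used by the example in Section 1.
TAGS = ["ID", "INDEXUNIT", "INDEXTECH", "INDEXSTRUC", "QUERYUNIT", "MODEL",
        "RANK", "TRANS", "QEXP", "CORPUS", "PIVOT", "SPTECH", "COMMENT"]


def render_techdesc(run_descriptions):
    """Render a list of per-run dicts (tag name -> text) as one <TECHDESC> block."""
    lines = ["<TECHDESC>"]
    for description in run_descriptions:
        if "ID" not in description:
            raise ValueError("every run description needs an ID (its RunID)")
        lines.append("<RUN>")
        for tag in TAGS:
            # Assumption: a missing field is written as "none".
            lines.append(f"<{tag}>{description.get(tag, 'none')}</{tag}>")
        lines.append("</RUN>")
    lines.append("</TECHDESC>")
    return "\n".join(lines) + "\n"


def write_techdesc(group_id, run_descriptions, encoding="utf-8"):
    """Write the descriptions to <group_id>.txt and return the file name."""
    path = f"{group_id}.txt"
    with open(path, "w", encoding=encoding) as handle:
        handle.write(render_techdesc(run_descriptions))
    return path
```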

6. Document List

6.1 Format

Since TREC's evaluation program is used to compute the evaluation metrics, each participating group must submit its retrieval results in the designated format. The result file is a list of tuples in the following form:

001 0    cts_cec_19991118596    1                     9999     LIPS-C-CJE-T-01
001 0    cts_cec_19991118596    2                     9998     LIPS-C-CJE-T-01
...........
001 0    cts_cec_19991118596    1000                1116     LIPS-C-CJE-T-01
002 0    cts_cec_19991118596    1                     9997     LIPS-C-CJE-T-01
002 0    cts_cec_19991118596    2                     9994     LIPS-C-CJE-T-01
...........
140 0    cts_cec_19991118596    1000                1994     LIPS-C-CJE-T-01

Each line of the submitted search result file should follow the format below:

Topic-ID  Dummy-field  Document-ID  Rank  Similarity-value  Run-ID

Elements in a line are separated by a TAB.
The list must be sorted numerically by Topic-ID in ascending order. The Topic-ID is the sequence of digits in the <NUM> field of each topic file.
The Dummy-field is not used for evaluation; it is usually set to '0'.
Document-ID is the string in the <DOCNO> field of each document record.
Rank is an integer, but the evaluation program ignores this field.
Similarity-value, or brief-value, is the score that the system produces to represent the similarity of the document to the topic. Documents retrieved at higher ranks are assumed to have higher Similarity-values. The evaluation program will sort and rank the retrieved documents by Similarity-value, then by Document-ID. Similarity-value is mandatory for evaluation. If your system does not produce such a score, please assign some values to the field. Note that the values must be positive.
Run-ID is the tag given to the run by the participating group (see Section 4 for details).
The Run-ID must be the same on every line of a result file.
Please include at most 1000 documents per topic.
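
The rules above can be applied when writing out a run. The following sketch formats in-memory results into the six tab-separated fields. `format_run` and its input layout (a dict from Topic-ID string to a list of `(doc_id, score)` pairs) are hypothetical, and breaking score ties by ascending Document-ID is an arbitrary choice made here for determinism.

```python
def format_run(results, run_id, max_docs=1000):
    """Format results as lines of: Topic-ID, Dummy, Document-ID, Rank, Score, Run-ID.

    `results` maps a Topic-ID string (e.g. "001") to a list of (doc_id, score) pairs.
    """
    lines = []
    for topic_id in sorted(results, key=int):  # numeric ascending Topic-ID
        for doc_id, score in results[topic_id]:
            if not score > 0:
                raise ValueError(f"score for {doc_id} in topic {topic_id} must be positive")
        # Highest score first; ties broken by Document-ID (a choice of this sketch).
        ranked = sorted(results[topic_id], key=lambda pair: (-pair[1], pair[0]))
        for rank, (doc_id, score) in enumerate(ranked[:max_docs], start=1):
            lines.append("\t".join([topic_id, "0", doc_id, str(rank), str(score), run_id]))
    return "".join(line + "\n" for line in lines)
```

The returned string can be written directly to a file named after the RunID.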

6.2 File name and format

Please store the document list for each run in a single plain-text file with the RunID as its file name, e.g., LIPS-C-CJE-T-01 (with no file extension).
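
Before submitting, a result file can be given a quick local sanity check against the format rules in Section 6.1. The sketch below is not the organizers' format checker and does not replace it; `check_run_lines` is a hypothetical helper that covers only the rules stated in this section.

```python
def check_run_lines(lines, expected_run_id, max_docs=1000):
    """Return a list of problems found in the lines of one result file."""
    problems = []
    docs_per_topic = {}
    previous_topic = None
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            problems.append(f"line {number}: expected 6 tab-separated fields, found {len(fields)}")
            continue
        topic_id, _dummy, _doc_id, rank, score, run_id = fields
        if not topic_id.isdigit():
            problems.append(f"line {number}: Topic-ID is not a sequence of digits")
            continue
        if previous_topic is not None and int(topic_id) < previous_topic:
            problems.append(f"line {number}: Topic-ID is out of ascending order")
        previous_topic = int(topic_id)
        if not rank.isdigit():
            problems.append(f"line {number}: Rank is not an integer")
        try:
            if not float(score) > 0:
                problems.append(f"line {number}: Similarity-value is not positive")
        except ValueError:
            problems.append(f"line {number}: Similarity-value is not a number")
        if run_id != expected_run_id:
            problems.append(f"line {number}: Run-ID differs from {expected_run_id}")
        docs_per_topic[topic_id] = docs_per_topic.get(topic_id, 0) + 1
    for topic_id, count in docs_per_topic.items():
        if count > max_docs:
            problems.append(f"topic {topic_id}: {count} documents exceed the limit of {max_docs}")
    return problems
```

Since the file name is the RunID, the name of the file being checked can be passed as `expected_run_id`.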

7. How to Submit Files

Please send your search results to us according to the following procedure by the deadline.

(1) Make the document-list files. Create a single file for each run; the file name must be the run-ID without a file extension (e.g., "NII-J-J-T-01", if your group's ID is "NII").
    Please make sure to use the group ID that you specified in your application form for the NTCIR-6 CLIR task.
(2) Make a single file of system descriptions. The file name must be your group's ID followed by .txt (e.g., NII.txt).
(3) Make a text file (list.txt) listing the names of all files that you created in (1) and (2). An example of the list is as follows.
```
NII-J-C-T-01
NII-J-C-D-02
....
NII-C-J-DN-03
NII.txt
```
    Please prefix the file name with your group's ID (e.g., NII.list.txt).
(4) Pack the run files, the system description file, and the file list into a single .tgz or .zip file (e.g., NII.tgz).
(5) Send the compressed file to us via the WWW page (not by e-mail).
    1. Check the URL that we assigned to you for downloading the document data sets:
       http://rcir.nii.ac.jp/ntcir/access/ntcir6-ws/***/+++/clir/index.html
       (*** and +++ differ for each group)
    2. Access this page and click "NTCIR-6 CLIR Submission Form".
    3. Send your file from the page. *Please be sure to run our format checker before sending your search results!
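
The file-list and packing steps of this procedure can be scripted. The sketch below assumes that the run files and the description file already exist in `directory`, and that a flat archive with no directory prefixes is acceptable; the guideline does not specify the archive layout. `package_submission` is a hypothetical helper.

```python
import os
import tarfile


def package_submission(group_id, run_files, directory="."):
    """Write <group_id>.list.txt and pack all submission files into <group_id>.tgz."""
    description_file = f"{group_id}.txt"
    list_file = f"{group_id}.list.txt"
    # The list names the run files and the description file, in that order.
    names = list(run_files) + [description_file]
    with open(os.path.join(directory, list_file), "w", encoding="ascii") as handle:
        for name in names:
            handle.write(name + "\n")
    archive_path = os.path.join(directory, f"{group_id}.tgz")
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in names + [list_file]:
            # arcname keeps the archive flat (an assumption of this sketch).
            archive.add(os.path.join(directory, name), arcname=name)
    return archive_path
```

The resulting .tgz file still has to be uploaded through the submission form, after the organizers' format checker has been run.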

Deadline

August 01 2006 23:59 Japanese Time (for Stage 1)

8. Others

We would like to remind you that you must return the document data if you do NOT submit any results.

9. Contact Information

If you have any questions, please contact the task organizers: @nii.ac.jp
