Submission Guideline for NTCIR-5 CLIR Task
How to Submit Search Results


May/18/2005

1. Files to Be Submitted

Every participant must submit (a) one document list file for each run and (b) one file of system descriptions. Please describe your system with XML-style tags, following the instructions in Section 5. Examples of both files follow.

(a) example of document list (search results): See Section 6 for details.
030 0    cts_cec_19991118596    1                     4238     LIPS-C-CJE-T-01
030 0    cts_cec_19991118596    2                     3211     LIPS-C-CJE-T-01
...........
030 0    cts_cec_19991118596    1000                1116     LIPS-C-CJE-T-01


(b) example of system description: See Section 5 for details.
<TECHDESC>
<RUN>
<ID>LIPS-C-CJE-T-01</ID>
<INDEXUNIT>word</INDEXUNIT>
<INDEXTECH>morphology</INDEXTECH>
<INDEXSTRUC>inverted file</INDEXSTRUC>
<QUERYUNIT>word</QUERYUNIT>
<MODEL>vector space</MODEL>
<RANK>tf-idf</RANK>
<TRANS>dictionary-based query translation</TRANS>
<QEXP>pre- and post-translation expansion by Rocchio</QEXP>
<CORPUS>using NTCIR-1 Japanese document collections for expansion</CORPUS>
<PIVOT>none</PIVOT>
<SPTECH>searching translations of unknown terms automatically from Web pages</SPTECH>
<COMMENT>none</COMMENT>
</RUN>
<RUN>
<ID>LIPS-C-CJE-D-02</ID>
......
</RUN>
</TECHDESC>

2. Type of Runs

Mandatory Runs: T-run and D-run
Each participant must submit two types of run, a T-run and a D-run, for each combination of topic language and document language(s).

These mandatory runs make research findings clearer by allowing systems and methods to be compared under the same conditions.

Recommended Runs: DN-run
The task organizers also strongly recommend submitting a DN-run, that is, a run that uses the <DESC> and <NARR> fields.

Optional Runs
Any other combination of fields may be submitted as an optional run according to each participant's research interests, e.g., TDN-run, DC-run, TDNC-run, and so on.

3. Number of Runs

Each participant can submit up to 5 runs in total for each language pair, regardless of the type of run. These 5 runs may include at most two T-runs and at most two D-runs. A language pair is a combination of topic language and document language(s). For example:
Language combination -> Topic: C and Docs: CJE (C->CJE)
Submission -> two T-runs, a D-run, a DN-run and a TDNC-run (5 runs in total).
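
The limits above can be checked mechanically before submission. The following is a minimal sketch, not an official tool: the function name `check_run_quota` and the representation of each run as a run-type string are assumptions made for illustration.

```python
from collections import Counter

MAX_RUNS_PER_PAIR = 5        # total runs allowed per language pair
MAX_PER_MANDATORY_TYPE = 2   # at most two T-runs and at most two D-runs


def check_run_quota(run_types):
    """Return a list of problems for the runs of ONE language pair.

    run_types is a list of run-type strings, e.g. ["T", "T", "D", "DN", "TDNC"].
    An empty list means the set of runs satisfies the limits.
    """
    problems = []
    counts = Counter(run_types)
    if len(run_types) > MAX_RUNS_PER_PAIR:
        problems.append(f"too many runs: {len(run_types)} > {MAX_RUNS_PER_PAIR}")
    for mandatory in ("T", "D"):
        if counts[mandatory] == 0:
            problems.append(f"missing mandatory {mandatory}-run")
        elif counts[mandatory] > MAX_PER_MANDATORY_TYPE:
            problems.append(f"too many {mandatory}-runs: {counts[mandatory]}")
    return problems
```

Calling `check_run_quota(["T", "T", "D", "DN", "TDNC"])` on the example submission above yields an empty list.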

4. Identification and Priority of Runs

Each run must be associated with a RunID, which uniquely identifies the run. The format of a RunID is as follows.

GroupID-TopicLanguage-DocumentLanguage(s)-RunType-pp

The 'pp' is a two-digit number representing the priority of the run. It is used as a parameter for pooling (see the note below). Participants must assign a priority to each submitted run separately for each language pair; "01" is the highest priority. For example, suppose a participating group, LIPS, submits 3 runs for C->CJE: the first is a T-run, the second a D-run and the third a DN-run. The RunIDs are then LIPS-C-CJE-T-01, LIPS-C-CJE-D-02, and LIPS-C-CJE-DN-03, respectively. Alternatively, if the group submits two T-runs that use different ranking techniques for C->CJE, followed by a D-run, the RunIDs are LIPS-C-CJE-T-01, LIPS-C-CJE-T-02, and LIPS-C-CJE-D-03.
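
The example RunIDs above suggest a five-part layout: group ID, topic language, document language(s), run type, and two-digit priority. The sketch below builds and parses RunIDs under that assumption. The regular expression is inferred from the examples only (for instance, it assumes group IDs contain no hyphen), and the function names are hypothetical.

```python
import re

# Layout inferred from the example RunIDs (an assumption, not an official grammar):
# group ID, topic language, document language(s), run type, two-digit priority.
RUN_ID_PATTERN = re.compile(
    r"^(?P<group>[A-Za-z0-9]+)"
    r"-(?P<topic>[A-Z]+)"
    r"-(?P<docs>[A-Z]+)"
    r"-(?P<run_type>[A-Z]+)"
    r"-(?P<priority>[0-9]{2})$"
)


def make_run_id(group, topic, docs, run_type, priority):
    """Build a RunID such as LIPS-C-CJE-T-01 (priority is zero-padded)."""
    # Assumption: priorities start at 1, since "01" is the highest priority.
    if not 1 <= priority <= 99:
        raise ValueError("priority must be between 1 and 99")
    return f"{group}-{topic}-{docs}-{run_type}-{priority:02d}"


def parse_run_id(run_id):
    """Split a RunID into its parts; raise ValueError if it does not match."""
    match = RUN_ID_PATTERN.match(run_id)
    if match is None:
        raise ValueError(f"not a well-formed RunID: {run_id!r}")
    fields = match.groupdict()
    fields["priority"] = int(fields["priority"])
    return fields
```

For example, `parse_run_id("LIPS-C-CJE-DN-03")` gives run type `"DN"` and priority `3`, and `make_run_id("LIPS", "C", "CJE", "T", 1)` rebuilds `"LIPS-C-CJE-T-01"`.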

Note: The top X documents in each of the submitted runs will be collected and put into the document pool. Only documents in the pool will be judged by human assessors. If the number of submitted runs is too large, the runs to be put in the pool may be selected based on the priority that you assign to each run.

5. System Description

5.1 Descriptive Information

In addition to search results, every participating group has to give us a concise description of each run. This description should contain the following information.

<INDEXUNIT>: Unit of indexing, e.g., character, bi-character, bi-word, phrase, etc.
<INDEXTECH>: Techniques for indexing, e.g., morphology, stemming, POS, etc.
<INDEXSTRUC>: Index structure, e.g., inverted file, signature file, PAT, etc.
<QUERYUNIT>: Unit of query terms, e.g., character, word, phrase, etc.
<MODEL>: Retrieval model, e.g., vector space model, probabilistic model (Okapi, INQUERY, logistic regression), etc.
<RANK>: Ranking factors used for weighting each term, e.g., tf, tf/idf, mutual information, word association, document length, etc.
<TRANS>: Translation technique used for cross-lingual information retrieval, e.g., dictionary-based, corpus-based, MT, etc. Detailed information is welcome, e.g., select-all, select-top-N, translation disambiguation, etc.
<QEXP>: Techniques used to expand the query, or a statement that no query expansion is used.
<CORPUS>: Information about any special corpus used for translation, expansion, etc.
<PIVOT>: language used for pivot approach, e.g., English.
<SPTECH>: special techniques for improving performance of CLIR runs.
<COMMENT>: any other comments.

5.2 Root tags

Please pack the system descriptions for all runs into a single file using two root tags, <TECHDESC> and <RUN>, as follows:

<TECHDESC>
<RUN>
...description of run 1...
</RUN>
<RUN>
...description of run 2...
</RUN>
...
</TECHDESC>
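
A description file with this structure can also be generated by a script. The following is a minimal sketch under stated assumptions: the function names are hypothetical, missing fields are filled with "none" in the same way the example in Section 1 does for <PIVOT> and <COMMENT>, and special characters are escaped as in XML, which this guideline does not explicitly require.

```python
from xml.sax.saxutils import escape

# Tags listed in Section 5.1, in the order used by the example in Section 1.
DESCRIPTION_TAGS = [
    "INDEXUNIT", "INDEXTECH", "INDEXSTRUC", "QUERYUNIT", "MODEL", "RANK",
    "TRANS", "QEXP", "CORPUS", "PIVOT", "SPTECH", "COMMENT",
]


def format_run(run_id, description):
    """Render one <RUN> block; description maps tag names to text."""
    lines = ["<RUN>", f"<ID>{escape(run_id)}</ID>"]
    for tag in DESCRIPTION_TAGS:
        # Assumption: a field that does not apply is written as "none".
        value = description.get(tag, "none")
        lines.append(f"<{tag}>{escape(value)}</{tag}>")
    lines.append("</RUN>")
    return "\n".join(lines)


def format_techdesc(runs):
    """Render the whole file; runs is a list of (run_id, description) pairs."""
    body = "\n".join(format_run(run_id, desc) for run_id, desc in runs)
    return f"<TECHDESC>\n{body}\n</TECHDESC>\n"
```

The returned string can then be written to the group's description file, for example with `open("LIPS.txt", "w", encoding="utf-8")`; the choice of encoding here is also an assumption.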

5.3 Template

Please copy and use the template for writing your description.

5.4 File name and format

Please store the system descriptions in a single plain-text file (.txt) with your group name as its file name, e.g., LIPS.txt.

6. Document List

6.1 Format

Because TREC's evaluation program is used for the evaluation, each participating group must submit its retrieval results in the designated format. The result file is a list of tuples in the following form:

001 0    cts_cec_19991118596    1                     9999     LIPS-C-CJE-T-01
001 0    cts_cec_19991118596    2                     9998     LIPS-C-CJE-T-01
...........
001 0    cts_cec_19991118596    1000                1116     LIPS-C-CJE-T-01

002 0    cts_cec_19991118596    1                     9997     LIPS-C-CJE-T-01
002 0    cts_cec_19991118596    2                     9994     LIPS-C-CJE-T-01
...........
050 0    cts_cec_19991118596    1000                1994     LIPS-C-CJE-T-01

Each line of the submitted search result file must follow the format below:

Topic-ID  Dummy-field  Document-ID  Rank  Similarity-value  Run-ID
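
The six-field layout above can be produced and checked with a few lines of code. This is a minimal sketch with hypothetical function names; it assumes that fields may be separated by any amount of whitespace, as the differing column widths in the examples suggest, and it writes the dummy field as 0, as in the examples.

```python
def format_result_lines(topic_id, ranked_docs, run_id):
    """Format the results for one topic.

    ranked_docs is a list of (document_id, similarity) pairs, best first.
    Ranks are assigned in list order, starting at 1.
    """
    lines = []
    for rank, (doc_id, score) in enumerate(ranked_docs, start=1):
        # The dummy field is always written as 0, as in the examples.
        lines.append(f"{topic_id} 0 {doc_id} {rank} {score} {run_id}")
    return lines


def parse_result_line(line):
    """Split one result line into its six fields (raises ValueError otherwise)."""
    topic_id, dummy, doc_id, rank, score, run_id = line.split()
    return {
        "topic_id": topic_id,
        "dummy": dummy,
        "doc_id": doc_id,
        "rank": int(rank),
        "score": float(score),
        "run_id": run_id,
    }
```

Parsing each line of a finished file with `parse_result_line` is a cheap way to catch lines with missing or extra fields before submission.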


6.2 File name and format

Please store the document list for each run in a single plain-text file with the RunID as its file name, e.g., LIPS-C-CJE-T-01 (with no file extension).

7. How to Submit Files

Please send us your search results by the deadline, following the procedure below.

  1. Make the document list files. You must make one file per run, and the file name must be the RunID with no file extension (e.g., "NII-J-J-T-01", if your group's ID is "NII").
    Please make sure to use the group ID that you specified in your application form for the NTCIR-5 CLIR task.

  2. Make a single file of system descriptions. The file name must be your group's ID followed by .txt (e.g., NII.txt).

  3. Make a text file (list.txt) containing the names of all files that you created in (1) and (2). An example of the list is as follows.
    NII-J-C-T-01
    NII-J-C-D-02
    ....
    NII-C-J-DN-03
    NII.txt

    Please prefix the file name with your group's ID (e.g., NII.list.txt).

  4. Pack the set of run files, the system description file and the file list into a single .tgz or .zip file (e.g., NII.tgz).

  5. Send the compressed file to us through the Web page (not by e-mail).
    1. Confirm the URL that we assigned to your group for downloading the document data sets:
      http://rcir.nii.ac.jp/ntcir/access/***/+++/ntcir5/clir/index.html
      (*** and +++ differ for each group)

    2. Access the fup.html file in the same directory, i.e.,
      http://rcir.nii.ac.jp/ntcir/access/***/+++/ntcir5/clir/fup.html
      (the user ID and password are the same as for index.html).

    3. Send your file from the page.
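
Steps 3 and 4 of the procedure above (writing the file list and packing everything into one archive) can be scripted. The following is a minimal sketch using only the Python standard library; the function name `pack_submission` and its arguments are hypothetical, and it assumes that the run files and the description file already exist in one directory.

```python
import tarfile
from pathlib import Path


def pack_submission(directory, group_id, run_ids):
    """Write <group>.list.txt and pack all files into <group>.tgz.

    directory must already contain one file per RunID (no extension)
    and the system description file <group>.txt.
    """
    directory = Path(directory)
    names = list(run_ids) + [f"{group_id}.txt"]
    missing = [name for name in names if not (directory / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing files: {missing}")

    # Step 3: the list names the run files and the description file.
    list_name = f"{group_id}.list.txt"
    (directory / list_name).write_text("\n".join(names) + "\n", encoding="utf-8")

    # Step 4: pack the run files, the description file and the list together.
    archive = directory / f"{group_id}.tgz"
    with tarfile.open(archive, "w:gz") as tar:
        for name in names + [list_name]:
            tar.add(directory / name, arcname=name)
    return archive
```

The archive returned by `pack_submission` is the file to upload in step 5; the upload itself is done by hand from the fup.html page.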

Deadline

June 01 2005 23:59 Japanese Time
(except runs searching English document sets)

8. Others

We would like to remind you that you must return the document data if you do NOT submit any results.

9. Contact Information


If you have any questions, please contact the task organizers: ****@nii.ac.jp