Ten groups participated in the TREC-2001 cross-language information retrieval track, which focused on retrieving Arabic language documents based on 25 queries that were originally prepared in English. French and Arabic translations of the queries were also available. This was the first year in which a large Arabic test collection was available, so a variety of approaches were tried and a rich set of experiments was performed using resources such as machine translation, parallel corpora, several approaches to stemming and/or morphology, and both pre-translation and post-translation blind relevance feedback. On average, forty percent of the relevant documents discovered by a participating team were found by no other team, a higher rate than normally observed at TREC. This raises some concern that the relevance judgment pools may be less complete than has historically been the case.
About this Dataset
| Title | TREC 2001 CROSS LANGUAGE DATASET |
|---|---|
| Description | Ten groups participated in the TREC-2001 cross-language information retrieval track, which focused on retrieving Arabic language documents based on 25 queries that were originally prepared in English. French and Arabic translations of the queries were also available. This was the first year in which a large Arabic test collection was available, so a variety of approaches were tried and a rich set of experiments was performed using resources such as machine translation, parallel corpora, several approaches to stemming and/or morphology, and both pre-translation and post-translation blind relevance feedback. On average, forty percent of the relevant documents discovered by a participating team were found by no other team, a higher rate than normally observed at TREC. This raises some concern that the relevance judgment pools may be less complete than has historically been the case. |
| Modified | 2024-10-02 00:00:00 |
| Publisher Name | National Institute of Standards and Technology |
| Contact | mailto:[email protected] |
| Keywords | TREC text retrieval conference |
```json
{
  "identifier": "ark:/88434/mds2-3588",
  "accessLevel": "public",
  "contactPoint": {
    "hasEmail": "mailto:[email protected]",
    "fn": "Ian Soboroff"
  },
  "programCode": ["006:045"],
  "landingPage": "https://data.nist.gov/od/id/mds2-3588",
  "title": "TREC 2001 CROSS LANGUAGE DATASET",
  "description": "Ten groups participated in the TREC-2001 cross-language information retrieval track, which focussed on retrieving Arabic language documents based on 25 queries that were originally prepared in English. French and Arabic translations of the queries were also available. This was the first year in which a large Arabic test collection was available, so a variety of approaches were tried and a rich set of experiments performed using resources such as machine translation, parallel corpora, several approaches to stemming and/or morphology, and both pre-translation and post-translation blind relevance feedback. On average, forty percent of the relevant documents discovered by a participating team were found by no other team, a higher rate than normally observed at TREC. This raises some concern that the relevance judgment pools may be less complete than has historically been the case.",
  "language": ["en"],
  "distribution": [
    {
      "accessURL": "https://catalog.ldc.upenn.edu/LDC2001T55",
      "description": "These are the documents used in this dataset. You must obtain them from the LDC at this URL.",
      "title": "LDC2001T55 document collection"
    },
    {
      "downloadURL": "https://trec.nist.gov/data/topics_noneng/arabic_topics.txt",
      "format": "Traditional TREC SGML topic format",
      "description": "The Arabic search topics, for monolingual search.",
      "mediaType": "text/SGML",
      "title": "TREC 2001 cross language topics in Arabic"
    },
    {
      "downloadURL": "https://trec.nist.gov/data/topics_noneng/english_topics.txt",
      "format": "Traditional TREC SGML topic format",
      "description": "English topics for the 2001 CLIR track.",
      "mediaType": "text/SGML",
      "title": "TREC 2001 cross language topics in English"
    },
    {
      "downloadURL": "https://ir.nist.gov/trec.nist.gov/data/topics_noneng/french_topics.txt",
      "format": "Traditional TREC SGML topic format",
      "description": "The French topics for the TREC 2001 cross-language track",
      "mediaType": "text/SGML",
      "title": "TREC 2001 cross language topics in French"
    },
    {
      "downloadURL": "https://ir.nist.gov/trec.nist.gov/data/qrels_noneng/xlingual_t10qrels.txt",
      "format": "Whitespace-separated: Topic, \"0\", document, relevance level",
      "description": "This file indicates the documents judged relevant for each of the topics.",
      "mediaType": "text/plain",
      "title": "TREC 2001 CLIR Relevance judgments"
    }
  ],
  "bureauCode": ["006:55"],
  "modified": "2024-10-02 00:00:00",
  "publisher": {
    "@type": "org:Organization",
    "name": "National Institute of Standards and Technology"
  },
  "theme": ["Information Technology"],
  "keyword": ["TREC text retrieval conference"]
}
```
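The relevance judgments distribution above uses the standard whitespace-separated qrels layout: topic, an unused "0" column, document ID, relevance level. A minimal parsing sketch, using illustrative inline sample lines rather than the actual downloaded file (the topic and document IDs below are hypothetical):

```python
# Each qrels line is "topic 0 docid relevance", whitespace-separated.
# SAMPLE_QRELS is illustrative only, not drawn from the real file.
SAMPLE_QRELS = """\
AR1 0 19940101_AFP_ARB.0001 1
AR1 0 19940101_AFP_ARB.0002 0
AR2 0 19940102_AFP_ARB.0003 1
"""

def parse_qrels(text):
    """Return {topic: {docid: relevance}} parsed from qrels text."""
    qrels = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic, _unused, docid, rel = line.split()
        qrels.setdefault(topic, {})[docid] = int(rel)
    return qrels

judgments = parse_qrels(SAMPLE_QRELS)
# Keep only documents judged relevant (relevance level > 0) per topic.
relevant = {t: [d for d, r in docs.items() if r > 0]
            for t, docs in judgments.items()}
```

A relevance level of 0 marks a judged non-relevant document; positive values mark relevant ones, so evaluation code should filter on the level rather than on a document's mere presence in the file.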