

TREC 2022 Deep Learning test collection

This is a test collection for passage and document retrieval, produced in the TREC 2022 Deep Learning track. The Deep Learning Track studies information retrieval in a large-training-data regime: the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training on click logs or on labels from shallow pools (such as the pooling in the TREC Million Query Track, or the evaluation of search engines based on early precision).

Certain machine learning methods, such as those based on deep learning, are known to require very large datasets for training. The lack of such large-scale datasets has been a limitation for developing these methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in previous years aimed at providing large-scale datasets to TREC and at creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.

As in previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?

The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
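The relevance judgments ("qrels") distributed by TREC use the standard four-column text format (query ID, iteration, document ID, relevance grade). As a minimal sketch of how such judgments might be loaded and used to score one ranked result list, the snippet below computes reciprocal rank; the function names and the grade-2 relevance cutoff are illustrative assumptions, not part of this record.

```python
from collections import defaultdict

def load_qrels(path):
    """Parse a TREC qrels file: each line is 'qid iteration docid grade'."""
    qrels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            qid, _, docid, grade = line.split()
            qrels[qid][docid] = int(grade)
    return qrels

def reciprocal_rank(ranked_docids, judged, min_grade=2):
    """1/rank of the first document judged relevant (grade >= min_grade)."""
    for rank, docid in enumerate(ranked_docids, start=1):
        if judged.get(docid, 0) >= min_grade:
            return 1.0 / rank
    return 0.0
```

A system's run would then be scored by averaging `reciprocal_rank` over all queries that have judgments in the qrels file.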

About this Dataset

Updated: 2024-02-22
Metadata Last Updated: 2023-03-01 00:00:00
Date Created: N/A
Data Provided by: information retrieval
Dataset Owner: N/A

Access this data

Contact dataset owner: [email protected]
Landing Page URL: https://data.nist.gov/od/id/mds2-2974
Download URLs:
  - https://microsoft.github.io/msmarco/TREC-Deep-Learning-2022 (document and passage collections, training data)
  - https://trec.nist.gov/data/deep2022.html (queries and relevance judgments)
Table representation of structured data
Title TREC 2022 Deep Learning test collection
Description This is a test collection for passage and document retrieval, produced in the TREC 2022 Deep Learning track. The Deep Learning Track studies information retrieval in a large-training-data regime: the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training on click logs or on labels from shallow pools (such as the pooling in the TREC Million Query Track, or the evaluation of search engines based on early precision). Certain machine learning methods, such as those based on deep learning, are known to require very large datasets for training. The lack of such large-scale datasets has been a limitation for developing these methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in previous years aimed at providing large-scale datasets to TREC and at creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks. As in previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision? The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
Modified 2023-03-01 00:00:00
Publisher Name National Institute of Standards and Technology
Contact [email protected]
Keywords information retrieval, search, trec, deep learning
{
    "identifier": "ark:\/88434\/mds2-2974",
    "accessLevel": "public",
    "references": [
        "https:\/\/trec.nist.gov\/pubs\/trec31\/papers\/Overview_deep.pdf"
    ],
    "contactPoint": {
        "hasEmail": "mailto:[email protected]",
        "fn": "Ian Soboroff"
    },
    "programCode": [
        "006:045"
    ],
    "@type": "dcat:Dataset",
    "landingPage": "https:\/\/data.nist.gov\/od\/id\/mds2-2974",
    "description": "This is a test collection for passage and document retrieval, produced in the TREC 2022 Deep Learning track. The Deep Learning Track studies information retrieval in a large-training-data regime: the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training on click logs or on labels from shallow pools (such as the pooling in the TREC Million Query Track, or the evaluation of search engines based on early precision). Certain machine learning methods, such as those based on deep learning, are known to require very large datasets for training. The lack of such large-scale datasets has been a limitation for developing these methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in previous years aimed at providing large-scale datasets to TREC and at creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks. As in previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision? The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.",
    "language": [
        "en"
    ],
    "title": "TREC 2022 Deep Learning test collection",
    "distribution": [
        {
            "downloadURL": "https:\/\/microsoft.github.io\/msmarco\/TREC-Deep-Learning-2022",
            "format": "HTML",
            "description": "This is where the document and passage collections are hosted, along with training data used in the track.",
            "mediaType": "text\/html",
            "title": "The Deep Learning Track homepage"
        },
        {
            "downloadURL": "https:\/\/trec.nist.gov\/data\/deep2022.html",
            "format": "HTML",
            "description": "These are the search queries and relevance judgments created in the 2022 Deep Learning track.",
            "mediaType": "text\/html",
            "title": "The queries and relevance judgments."
        }
    ],
    "license": "https:\/\/www.nist.gov\/open\/license",
    "bureauCode": [
        "006:55"
    ],
    "modified": "2023-03-01 00:00:00",
    "publisher": {
        "@type": "org:Organization",
        "name": "National Institute of Standards and Technology"
    },
    "accrualPeriodicity": "irregular",
    "theme": [
        "Information Technology:Data and informatics"
    ],
    "issued": "2023-04-07",
    "keyword": [
        "information retrieval",
        "search",
        "trec",
        "deep learning"
    ]
}
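The record above is a DCAT dataset entry of the kind published on data.gov. As a minimal sketch, the snippet below shows how the download locations could be pulled out of such a record with the standard `json` module; the excerpt is inlined here for self-containment, but in practice the full record would be fetched from the landing page or saved locally first.

```python
import json

# Minimal excerpt of the DCAT record shown above (only the fields we read).
record_json = """{
    "title": "TREC 2022 Deep Learning test collection",
    "distribution": [
        {"downloadURL": "https://microsoft.github.io/msmarco/TREC-Deep-Learning-2022",
         "title": "The Deep Learning Track homepage"},
        {"downloadURL": "https://trec.nist.gov/data/deep2022.html",
         "title": "The queries and relevance judgments."}
    ]
}"""

record = json.loads(record_json)
# Map each distribution's title to its download URL.
urls = {d["title"]: d["downloadURL"] for d in record["distribution"]}
for title, url in urls.items():
    print(f"{title}: {url}")
```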
