
Introduction to datasets


[block:callout]
{
  "type": "warning",
  "title": "On this page:",
  "body": "* [Public datasets on Cavatica](#datasets)\n * [The Cancer Genome Atlas (TCGA)](#tcga)\n * [Cancer Cell Line Encyclopedia (CCLE)](#ccle)\n* [Metadata for public datasets on Cavatica](#metadata)\n* [Explore datasets using the visual interface](#visual-interface)\n* [Related pages](#related)"
}
[/block]
Cavatica hosts both [The Cancer Genome Atlas (TCGA)](#tcga) and the [Cancer Cell Line Encyclopedia (CCLE)](#ccle), two datasets you can use in your genomics analyses. On this page, learn more about these datasets as well as their underlying metadata structure.

<a name="datasets"></a>
##Public datasets on Cavatica

The following public datasets are made available on Cavatica. Note that these public datasets are different from [Cavatica datasets](datasets-overview).

<a name="tcga"></a>
###The Cancer Genome Atlas (TCGA)

TCGA is one of the world’s largest cancer genomics data collections, covering more than eleven thousand patients, 33 cancer types, and over half a million files in total. Data collection for TCGA began in 2006 as a joint effort by the [National Cancer Institute (NCI)](https://www.cancer.gov/), the [National Human Genome Research Institute (NHGRI)](https://www.genome.gov/), the [National Institutes of Health (NIH)](https://www.nih.gov/), and the [U.S. Department of Health and Human Services](http://www.hhs.gov/). Cavatica provides powerful methods to query and reproducibly analyze TCGA data by itself or in conjunction with your own data.

TCGA data is made available on Cavatica through an integration with the Seven Bridges [Cancer Genomics Cloud (CGC)](http://www.cancergenomicscloud.org/). TCGA on Cavatica includes both [Open and Controlled Data](https://wiki.nci.nih.gov/display/TCGA/Open+Access+and+Controlled+Access+Data). While all data in TCGA is stripped of direct identifiers, DNA information is inherently unique to an individual.
Two types of data access ‘tiers’ (Open Data and Controlled Data) have been put in place to balance making TCGA data as widely available as possible with protecting the rights of study participants. You can access TCGA Open Data on Cavatica after you are [authenticated](tcga-data-access#section-authenticate) and agree to the data use policies. In addition, you can obtain access to Controlled Data through the NIH via the [Database of Genotypes and Phenotypes (dbGaP)](https://www.ncbi.nlm.nih.gov/gap) site.

Learn more about [TCGA Data](http://docs.cancergenomicscloud.org/docs/tcga-data) on Cavatica, [permissions required to access TCGA data](tcga-data-access#section-authenticate), and the [TCGA metadata schema](http://docs.cancergenomicscloud.org/docs/tcga-metadata-on-the-cgc).

<a name="ccle"></a>
###Cancer Cell Line Encyclopedia (CCLE)

The Cancer Cell Line Encyclopedia (CCLE) is a project performing detailed genetic and pharmacologic characterization of a large number of human cancer cell lines. Cancer cell lines are permanently established cell cultures derived from patients that proliferate indefinitely when given appropriate fresh medium and space. The CCLE is the result of a collaboration between the [Broad Institute](http://www.broadinstitute.org/), the [Novartis Institutes for Biomedical Research](https://www.nibr.com/), and the [Genomics Institute of the Novartis Research Foundation](https://www.gnf.nibr.com/).

CCLE contains Open Access sequencing data (in the form of reads aligned to the hg19 reference genome) for nearly 1,000 cancer cell line samples. Cavatica hosts the CCLE dataset in the form of a [read-only public project](http://docs.sevenbridges.com/v1.0/docs/ccle), which contains cell line samples as available from [cgHub](https://cghub.ucsc.edu/) on May 11, 2016. You have automatic access to all CCLE data on Cavatica.

Learn more about the [CCLE public project](http://docs.sevenbridges.com/v1.0/docs/ccle) and the [CCLE metadata schema](ccle-metadata).

### GDC Datasets Update Policy
Seven Bridges is committed to providing Cavatica users with up-to-date versions of the datasets that are available from the NCI Genomic Data Commons (GDC). Therefore, the following rules apply to updates of GDC datasets that are available through Cavatica:

* We aim to align the datasets available through Cavatica with the current GDC data release within 30 days of its release by the GDC.
* If a GDC data release includes redaction of files from a dataset, the affected files will remain available on Cavatica for an additional 30 days. After that, you will need to contact the GDC for information on how to retain access to redacted files.
* Re-running queries executed in the past may return slightly different results due to updates in the datasets from the GDC. This is expected: datasets are dynamic, version updates can introduce file updates or redactions, and queries return the most up-to-date version of files. This applies both to queries made through the Data Browser and to queries made through the Datasets API.

<a name="metadata"></a>
##Metadata for public datasets on Cavatica

Metadata is data about the genomic information carried by files. It describes the time, place, and manner in which the genomic data was obtained, as well as the data's source and type. You can use metadata on Cavatica to browse and query datasets. Metadata describing datasets on Cavatica consists of **properties**, which describe the **entities** of each dataset.

**Entities** are particular resources with UUIDs, such as files, cases, samples, and cell lines. These can be the subject of your query.

**Properties** can either describe an entity or relate that entity to another entity.
For instance, properties include an entity's vital status, gender, data format, or experimental strategy.

View the metadata schema, which includes a list of entities and their related properties, for the following datasets:
  * [TCGA Metadata](http://docs.cancergenomicscloud.org/docs/tcga-metadata-on-the-cgc)
  * [CCLE Metadata](ccle-metadata)

Below, learn how to start working with datasets via their metadata on the visual interface.

<a name="visual-interface"></a>
##Explore datasets using the visual interface

[The Data Browser](doc:the-data-browser) allows you to explore datasets using an interactive graphical interface. Start by building queries to filter data using various metadata attributes. Then, access the resulting files for further analysis.

To access the Data Browser, click **Data** on the top navigation bar and select **Data Browser**. You'll see the screen below. Here, you can select the dataset to query.
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/c293f84-image2016-9-2_18-21-6.png",
        "image2016-9-2 18-21-6.png",
        2270,
        1226,
        "#f7f7f8"
      ]
    }
  ]
}
[/block]
Take advantage of pre-built example queries or build your own from scratch using metadata entities and properties. Learn more about [queries in the Data Browser](the-data-browser#section-queries-on-the-data-browser).

Once you've located specific files using a Data Browser query, you can access this data for further analysis.
[block:callout]
{
  "type": "danger",
  "body": "Remember, TCGA contains both Open and Controlled Data. When you attempt to access TCGA data from Cavatica, you will be asked to authenticate with dbGaP. You will only be able to access the data for which you are approved. Learn more about [TCGA data access](doc:tcga-data-access) on Cavatica.",
  "title": "TCGA Controlled Data"
}
[/block]
<a name="related"></a>
###Related pages
  * [TCGA data access](doc:tcga-data-access)
  * [TCGA Data](http://docs.cancergenomicscloud.org/docs/tcga-data)
  * [TCGA Metadata](http://docs.cancergenomicscloud.org/docs/tcga-metadata-on-the-cgc)
  * [Cancer Cell Line Encyclopedia (CCLE)](ccle)
  * [CCLE Metadata](doc:ccle-metadata)
  * [The Data Browser](doc:the-data-browser)
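
To make the entity and property model from the metadata section more concrete, here is a minimal Python sketch. It is only an illustration and does not use the Cavatica Datasets API: the `Entity` class, the `query` helper, the property names, and the records are all hypothetical and made up for this example. It shows entities identified by UUIDs, properties that describe an entity, and a property that relates one entity to another.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Entity:
    """Hypothetical stand-in for a dataset entity (file, case, sample, ...)."""
    kind: str                                        # e.g. "case" or "file"
    properties: dict = field(default_factory=dict)   # properties describing the entity
    relations: dict = field(default_factory=dict)    # property name -> UUID of a related entity
    uuid: str = field(default_factory=lambda: str(uuid4()))


def query(entities, kind, **wanted):
    """Return entities of the given kind whose properties equal every wanted value."""
    return [
        e for e in entities
        if e.kind == kind
        and all(e.properties.get(name) == value for name, value in wanted.items())
    ]


# Made-up records for illustration only; not real TCGA or CCLE data.
case = Entity("case", {"vital_status": "Alive", "gender": "FEMALE"})
bam = Entity("file", {"data_format": "BAM", "experimental_strategy": "WXS"},
             relations={"case": case.uuid})
vcf = Entity("file", {"data_format": "VCF", "experimental_strategy": "WXS"},
             relations={"case": case.uuid})

entities = [case, bam, vcf]

# Filter files by a descriptive property, then follow the relation to the case.
bam_files = query(entities, "file", data_format="BAM")
print(len(bam_files), bam_files[0].relations["case"] == case.uuid)  # prints "1 True"
```

In this sketch the filter plays the role of a query on an entity type, and the `relations` entry plays the role of a property that links a file to its case. The actual entities and properties available for each dataset are defined by the metadata schemas linked above.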