Metadata manifest file format
CAVATICA uses two different metadata manifest file formats. Which one you should use depends on what you are using the manifest for.
- Command Line Uploader - use this format of metadata manifest file in order to apply metadata when uploading files to CAVATICA.
- Editing metadata via the visual interface - use this format of the metadata manifest file in order to edit metadata for the files that are already in one of your projects on CAVATICA.
Metadata manifest format for the Command Line Uploader
When uploading files to CAVATICA using the Command Line Uploader, a metadata manifest file can be used to define metadata values for those files. The supported file format for this manifest file is CSV, i.e. comma separated values. A CSV file contains a number of rows with columns which are separated with a comma.
The following rules apply:
Rule | Description |
---|---|
Line separation | The lines are separated with a line break, while the columns are separated using a comma. |
First row | The first row (manifest file header) has to contain column names which are treated as metadata keys (e.g. “sample”, “library”). Within the header, the first column must be File name, while the remaining columns are metadata keys. Keep in mind that metadata keys can be either metadata schema keys or custom metadata keys. |
Must include File name | It's mandatory to have File name as the first column in the header. |
First column | The first column has to contain the names of the files which will be uploaded. In case the files are not in the same directory as the metadata manifest file, you should also include a path to the files (e.g. ../filename.fastq ). |
Subsequent columns | All subsequent columns should contain metadata fields which will be assigned to the specified files. |
Case sensitivity | Keys and values are case sensitive. |
Escape character | To use a comma in a metadata key or value, enclose that field in double quotation marks (e.g. to set a key to md_value1,2 use "md_value1,2"). To use a double quote character in a metadata key or value, enclose that field in double quotation marks and write each double quote as two double quotation marks (e.g. to set a key to md_value"1" use "md_value""1"""). |
Maximum size | The maximum size for the metadata manifest file is 5 GiB. |
Maximum number of key-value pairs | Maximum number of key-value pairs per file is 1000, including null-value keys. |
Keys and values encoding | Keys and values are UTF-8 encoded strings. |
Maximum key length | 100 bytes (UTF-8 encoding) |
Maximum value length | 300 bytes (UTF-8 encoding) |
Metadata for files only | There is no metadata for folders on CAVATICA. Any folder specified in the manifest file will be skipped and the metadata will not be set for a folder. |
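The limits in the table above are expressed in bytes of UTF-8, not in characters, so a value containing non-ASCII characters can exceed the limit sooner than its visible length suggests. The following is a minimal sketch of a pre-upload check, not part of any CAVATICA tool; the function name `check_metadata_limits` and the constants are hypothetical and only mirror the numbers in the table.

```python
# Hypothetical helper: the limits are copied from the rules table above.
MAX_KEY_BYTES = 100
MAX_VALUE_BYTES = 300
MAX_PAIRS_PER_FILE = 1000


def check_metadata_limits(metadata):
    """Return a list of human-readable problems for one file's metadata dict."""
    problems = []
    if len(metadata) > MAX_PAIRS_PER_FILE:
        problems.append(f"too many key-value pairs: {len(metadata)}")
    for key, value in metadata.items():
        # Measure the UTF-8 encoding, not the number of characters.
        if len(key.encode("utf-8")) > MAX_KEY_BYTES:
            problems.append(f"key too long: {key!r}")
        # Null-value keys count as pairs but have no value length to check.
        if value is not None and len(value.encode("utf-8")) > MAX_VALUE_BYTES:
            problems.append(f"value too long for key {key!r}")
    return problems
```

For example, a value made of 151 copies of "é" is only 151 characters long but 302 bytes in UTF-8, so this check reports it.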
The following example shows the content of the metadata manifest for three files with three metadata fields.
File name | sample | library | paired_end |
---|---|---|---|
file1.fastq | sample1 | examplelibrary1 | 1 |
file2.fastq | sample1 | examplelibrary1 | 2 |
file3.fastq | sample2 | examplelibrary2 | 1 |
Below is the same example in a comma separated format.
File name,sample,library,paired_end
file1.fastq,sample1,examplelibrary1,1
file2.fastq,sample1,examplelibrary1,2
file3.fastq,sample2,examplelibrary2,1
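A manifest like this can also be generated with a script, which avoids hand-written quoting mistakes. The sketch below uses Python's standard `csv` module, whose default quoting matches the escape rules described above (fields with commas are quoted, and double quotes are doubled). The `note` column is a hypothetical custom metadata key added only to show the escaping.

```python
import csv
import io

# "note" is a hypothetical custom metadata key used to demonstrate escaping.
rows = [
    ["File name", "sample", "library", "note"],
    ["file1.fastq", "sample1", "examplelibrary1", "md_value1,2"],
    ["file2.fastq", "sample1", "examplelibrary1", 'md_value"1"'],
]

buffer = io.StringIO()
# Default csv.writer settings quote only the fields that need it
# and double any embedded double quote characters.
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)
manifest_text = buffer.getvalue()
print(manifest_text)
```

To write to a file instead of a string, open it with `open(path, "w", newline="", encoding="utf-8")` and pass the file object to `csv.writer`.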
Metadata manifest format for modifying metadata via the visual interface
This manifest file is used when you are modifying metadata for your project files via the visual interface.
The supported file formats for this manifest file are CSV and TSV. A CSV file contains a number of rows with columns which are separated with a comma, while the TSV file separates them with a tab.
The following rules apply:
Rule | Description |
---|---|
Line separation | The lines are separated with a line break. |
Columns separator | The columns are separated using a comma (CSV) or a tab (TSV). |
First row | The first row (manifest file header) has to contain column names which are treated as metadata keys (e.g. “sample”, “library”). Within the header, the first column must be either id or name. Next, there can be project and size columns, which are system metadata fields and are treated as read-only (these fields are present if the manifest file is generated using the Export metadata to a manifest action). The remaining columns are metadata keys, which can be metadata schema or custom metadata keys. Keep in mind that the order of columns is important, e.g. the id column (if present) must be the first one. |
Must include id or name (along with project path) | 1. It’s mandatory to have either the id or the name column, and it’s allowed to have both of them. 2. If both id and name columns are present, then id will be used for identifying the file, while name will be ignored. |
Name field | This field should also include the file path within the project for files which are not in the project root, so it’s actually path + name. If the "id" column is not present in the metadata manifest file, this field is used for identifying the files whose metadata should be edited. If "id" is present in the metadata manifest file, this field is ignored, meaning it's not possible to change the file name or file path using the "Import metadata manifest" feature (this is the current behavior, although it might change in the future). This makes it possible to edit metadata using a manifest file that provides only file names (along with the path, when files are stored in a folder instead of the project root), without having to fetch file IDs. |
Project field | It’s possible to have a “project” column (e.g. if you have used the Export metadata manifest feature, then edited the generated metadata manifest file and submitted it for import). If present, it is treated as a read-only field (i.e. it is not possible to either move or copy files from one project to another using this feature), but it is validated: if the specified “project” value is different from the current project (i.e. the project in which the Import metadata manifest feature has been used), the file is not edited and is counted among the files that failed as part of this action. Keep in mind that the “project” field is not mandatory in the manifest file. |
Size field | It is possible to have a “size” column (e.g. if you have used the Export metadata manifest feature, edited the generated metadata manifest file and submitted it for import). If present, it is treated as a read-only field (i.e. it is not possible to change the file size using this feature, or any other feature). No validation is performed on this field. |
Metadata schema fields and custom metadata fields | Following the aforementioned fields (there can be at a minimum one, and at the maximum four, such fields) you can specify the metadata fields which should be edited. There must be at least one metadata column specified in the manifest file (otherwise the action will fail because there is no metadata field to be edited). The metadata schema fields are specified according to the documented metadata schema. |
Empty rows | Empty rows are allowed in the manifest file and are skipped during manifest file processing. |
Case sensitivity | The manifest file is case-sensitive. |
Escape character | To use a comma in a metadata key or value, enclose that field in double quotation marks (e.g. to set a key to md_value1,2 use "md_value1,2"). To use a double quote character in a metadata key or value, enclose that field in double quotation marks and write each double quote as two double quotation marks (e.g. to set a key to md_value"1" use "md_value""1"""). |
Maximum size | The maximum size for the metadata manifest file is 5 GiB. |
Maximum number of rows | The maximum number of rows for the metadata manifest file is 40,001, which corresponds to the maximum number of files in a single project on the Platform (plus the header row). |
Maximum number of key-value pairs | Maximum number of key-value pairs per file is 1000, including null-value keys. |
Keys and values encoding | Keys and values are UTF-8 encoded strings. |
Maximum key length | 100 bytes (UTF-8 encoding) |
Maximum value length | 300 bytes (UTF-8 encoding) |
Columns which are specified after id, name, size and project will be treated as metadata fields.
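The column-order rules can be checked locally before submitting a manifest. The sketch below is an illustration only, not CAVATICA's own validation; the function name `validate_header` is hypothetical, and it assumes that when both identifier columns are present, name directly follows id, and that project and size come before any metadata column.

```python
# System columns that may follow the identifier column(s) and are read-only.
READ_ONLY_COLUMNS = ("project", "size")


def validate_header(header):
    """Check a visual-interface manifest header; return the metadata keys."""
    if not header or header[0] not in ("id", "name"):
        raise ValueError("first column must be 'id' or 'name'")
    position = 1
    # Assumption: if both identifiers are present, 'name' directly follows 'id'.
    if header[0] == "id" and len(header) > 1 and header[1] == "name":
        position = 2
    # Optional read-only system columns come next.
    while position < len(header) and header[position] in READ_ONLY_COLUMNS:
        position += 1
    # Everything after id, name, project and size is a metadata field.
    metadata_keys = header[position:]
    if not metadata_keys:
        raise ValueError("at least one metadata column is required")
    return metadata_keys
```

Header names are compared exactly as written, because the manifest file is case-sensitive.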
The following example shows a metadata manifest file which contains both id and name columns along with two metadata schema fields (quality_scale, paired_end) and one custom metadata field (Read length). Please note that in this case the id field will be used to uniquely identify the file within the project, while the name field will be ignored.
id | name | quality_scale | paired_end | Read length |
---|---|---|---|---|
581b298d20946e087b2ce503 | file1.fastq | illumina18 | 1 | 26 |
581b298d20946e087b2ce51f | file2.fastq | illumina18 | 2 | 26 |
581b298d20946e087b2ce50b | file3.fastq | solexa | 1 | 98 |
Below is the same example in a comma separated format.
id,name,quality_scale,paired_end,Read length
581b298d20946e087b2ce503,file1.fastq,illumina18,1,26
581b298d20946e087b2ce51f,file2.fastq,illumina18,2,26
581b298d20946e087b2ce50b,file3.fastq,solexa,1,98
The following example shows a metadata manifest file which contains the name column, one metadata schema field (case_id) and one custom metadata field (Donor ID). Please note that in this case the name field (containing file path along with the name of the file) will be used to uniquely identify the file within the project.
name | case_id | Donor ID |
---|---|---|
StudyYYZ/file1.bam | cid173 | 1197 |
StudyYYZ/file2.bam | cid365 | 1198 |
StudyYYZ/file3.bam | cid882 | 1199 |
Below is the same example in a comma separated format.
name,case_id,Donor ID
StudyYYZ/file1.bam,cid173,1197
StudyYYZ/file2.bam,cid365,1198
StudyYYZ/file3.bam,cid882,1199
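To preview which files a manifest would touch, it can be parsed with a few lines of Python. The sketch below is a rough illustration under the rules described above, not the Platform's implementation: it prefers id over name when both are present, skips empty rows, and ignores the read-only project and size columns. The function name `read_manifest` is hypothetical.

```python
import csv
import io


def read_manifest(text, delimiter=","):
    """Map each file identifier to the metadata given in its row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    # 'id' takes precedence over 'name' when both columns are present.
    id_column = "id" if "id" in header else "name"
    non_metadata = {"id", "name", "project", "size"}
    result = {}
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # empty rows are skipped
        record = dict(zip(header, row))
        result[record[id_column]] = {
            key: value for key, value in record.items() if key not in non_metadata
        }
    return result


example = (
    "name,case_id,Donor ID\n"
    "StudyYYZ/file1.bam,cid173,1197\n"
    "\n"
    "StudyYYZ/file2.bam,cid365,1198\n"
)
edits = read_manifest(example)
```

For a TSV manifest, pass `delimiter="\t"`.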