Data Studio environments and libraries
At the moment, Data Studio offers a set of predefined libraries curated by Seven Bridges bioinformaticians, which are automatically available every time an analysis is started.
The list of available libraries depends on the environment you are using (JupyterLab or RStudio) and on the selected environment setup (the set of preinstalled libraries that is available each time an analysis is started).
Both of these settings are selected in the analysis creation wizard and cannot be changed once the analysis has been created.
JupyterLab
Depending on the purpose of your JupyterLab analysis, you can select the environment setup that best suits it.
The following table lists the available JupyterLab environment setups and the tools and libraries included in each:
Environment setup | Details |
---|---|
SB Data Science - Python 3.11, R 4.3.1 (default) | This environment setup contains Python version 3.11, R version 4.3.1 and Julia 1.9.3. The setup also includes libraries that are available in datascience-notebook, with the addition of the tabix library. |
SB Data Science - Python 3.9, R 4.1 | This environment setup contains Python version 3.9, R version 4.1 and Julia 1.6.2. The setup also includes libraries that are available in datascience-notebook, with the addition of the tabix library. |
SB Data Science - Python 3.6, R 3.4 (legacy) | This environment setup contains Python version 3.6.3, R version 3.4.1 and Julia 0.6.2. The setup also includes libraries that are available in datascience-notebook, with the addition of the following libraries. Python 2 / Python 3: path.py, biopython, pymongo, cytoolz, pysam, pyvcf, ipywidgets, beautifulsoup4, cigar, bioservices, intervaltree, appdirs, cssselect, bokeh, scikit-allel, cairo, lxml, cairosvg, rpy2. R: r-ggfortify, r, r-stringi, r-pheatmap, r-gplots, bioconductor-ballgown, bioconductor-deseq2, bioconductor-metagenomeseq, bioconductor-biomformat, bioconductor-biocinstaller, r-xml |
SB Data Science - Spark 3.5.1, Python 3.11 (beta) | This environment setup contains Python version 3.11 and Spark version 3.5.1. The setup also includes libraries that are available in allspark-notebook, with the addition of the glow (version 2.0.0), tabix, hail and bkzep libraries. To initialize Spark and learn how to load Parquet or VCF files, follow the instructions below. An analysis using this environment will initialize a six-instance cluster with the following configuration: a driver m5.xlarge instance with 1000 GB of storage space, and five m5.4xlarge worker instances with 1000 GB of storage space each. Note that this cluster of 6 instances counts towards the total parallel instance limit that applies to your account. For an analysis that uses this environment to start without delays, you need to be able to initialize 6 more instances in parallel before reaching the parallel instance limit for your account. Otherwise, the analysis environment will remain in the initialization state until it is able to start all 6 required instances. |
SB Data Science - Spark 3.1.2, Python 3.9 (beta) | This environment setup contains Python version 3.9 and Spark version 3.1.2. The setup also includes libraries that are available in allspark-notebook, with the addition of the glow (version 1.1.2), tabix, hail and bkzep libraries. To initialize Spark and learn how to load Parquet or VCF files, follow the instructions below. An analysis using this environment will initialize a six-instance cluster with the following configuration: a master m5.xlarge instance with 1000 GB of storage space, and five m5.4xlarge worker instances with 1000 GB of storage space each. Note that this cluster of 6 instances counts towards the total parallel instance limit that applies to your account. For an analysis that uses this environment to start without delays, you need to be able to initialize 6 more instances in parallel before reaching the parallel instance limit for your account. Otherwise, the analysis environment will remain in the initialization state until it is able to start all 6 required instances. |
SB Machine Learning - TensorFlow 2.0, Python 3.7 | This environment setup is optimized for machine learning and execution on GPU instances. It is based on the jupyter/tensorflow-notebook image, which extends jupyter/scipy-notebook (popular packages from the scientific Python ecosystem) with popular Python deep learning libraries. Learn more about available libraries. |
All available environment setups also contain the sevenbridges-python and sevenbridges-r API libraries, as well as htop and openvpn as general-purpose tools. The libraries are installed using conda, as JupyterLab supports multiple programming languages and conda is a language-agnostic package manager. You can also install libraries directly from the notebook and use them during the execution of your analysis.
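For example, the sevenbridges-python library can be used directly from a notebook cell to query the Platform API. The following is a minimal sketch; the API endpoint and the token placeholder are illustrative and must be replaced with values valid for your environment:
import sevenbridges as sbg

# Authenticate against the Platform API (endpoint and token are placeholders)
api = sbg.Api(url='https://api.sbgenomics.com/v2', token='<YOUR_AUTH_TOKEN>')

# List a few projects the authenticated user can access
for project in api.projects.query(limit=5):
    print(project.id, project.name)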
For optimal performance and to avoid potential conflicts, we recommend using conda when installing libraries within your analyses. However, unlike the default libraries, libraries installed this way will not be automatically available the next time the analysis is started.
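As a minimal sketch, installing an additional package from a notebook cell with conda might look like this (the package name and channel are illustrative; substitute the ones you need):
# Install an additional package into the current environment using the
# IPython %conda magic; the package and channel here are only examples
%conda install -y -c bioconda pysam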
Spark initialization and loading of Parquet/VCF files for the SB Data Science - Spark 3.5.1, Python 3.11 environment setup
To initialize Spark in the Spark 3.5.1, Python 3.11 environment, use the following code:
# Create a Spark session configured with the Glow package for genomic data
from pyspark.sql import SparkSession
import glow

spark = SparkSession \
    .builder \
    .appName("PythonPi") \
    .config("spark.jars.packages", "io.projectglow:glow-spark3_2.12:2.0.0") \
    .config("spark.hadoop.io.compression.codecs", "io.projectglow.sql.util.BGZFCodec") \
    .getOrCreate()

# Register Glow so its functions and data sources are available on the session
spark = glow.register(spark)
When loading Parquet or VCF files, use the following pattern:
# Read a Parquet file into a Spark DataFrame
df = spark.read.parquet('/path/to/example.parquet')

# Read a VCF file using the Glow-provided 'vcf' data source
df_vcf = spark.read.format('vcf').load('/path/to/file.vcf')
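Once loaded, the DataFrame can be processed with standard Spark operations and Glow transformations. The following is a hedged sketch (the selected columns reflect Glow's standard VCF schema, and split_multiallelics is one of Glow's built-in transformers):
# Inspect the schema Glow derives from the VCF header
df_vcf.printSchema()

# Split multiallelic variants into biallelic rows using a Glow transformer
df_split = glow.transform('split_multiallelics', df_vcf)
df_split.select('contigName', 'start', 'referenceAllele', 'alternateAlleles').show(5)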
Spark initialization and loading of Parquet/VCF files for the SB Data Science - Spark 3.1.2, Python 3.9 environment setup
To initialize Spark in the Spark 3.1.2, Python 3.9 environment, use the following code:
# Create a Spark session configured with the Glow package for genomic data
from pyspark.sql import SparkSession
import glow

spark = SparkSession \
    .builder \
    .appName("PythonPi") \
    .config("spark.jars.packages", "io.projectglow:glow-spark3_2.12:1.1.2") \
    .config("spark.hadoop.io.compression.codecs", "io.projectglow.sql.util.BGZFCodec") \
    .getOrCreate()

# Register Glow so its functions and data sources are available on the session
spark = glow.register(spark)
When loading Parquet or VCF files, use the following pattern:
# Read a Parquet file into a Spark DataFrame
df = spark.read.parquet('/path/to/example.parquet')

# Read a VCF file using the Glow-provided 'vcf' data source
df_vcf = spark.read.format('vcf').load('/path/to/file.vcf')
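A common follow-up is to persist a loaded VCF as Parquet so that subsequent reads are faster. A minimal sketch, assuming the illustrative paths above:
# Write the VCF-derived DataFrame out as Parquet for faster downstream access
df_vcf.write.mode('overwrite').parquet('/path/to/example_vcf.parquet')

# Later sessions can then read the Parquet copy directly
df_back = spark.read.parquet('/path/to/example_vcf.parquet')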
RStudio
If you select RStudio as the analysis environment, you can likewise select one of the available environment setups depending on the purpose of your analysis. Because the needed libraries come preinstalled in the selected setup, this shortens the time to a fully functional environment. Here are the available options:
Environment setup | Details |
---|---|
SB Bioinformatics - R 4.3.2 - BioC 3.18 (default) | This environment setup is based on the official Bioconductor image bioconductor/bioconductor_docker:RELEASE_3_18. For more information about the image, please see its Docker Hub repository. Here is a list of libraries that are installed by default. CRAN - BiocManager, devtools, doSNOW, ggfortify, gplots, pheatmap, Seurat, tidyverse, OlinkAnalyze, ordinal, heatmaply, renv, markdown. Bioconductor - AnnotationDbi, AnnotationHub, arrayQualityMetrics, ballgown, Biobase, BiocParallel, biomaRt, biomformat, Biostrings, DelayedArray, DESeq2, edgeR, genefilter, GenomeInfoDb, GenomicAlignments, GenomicFeatures, GenomicRanges, GEOquery, IRanges, limma, metagenomeSeq, oligo, Rsamtools, rtracklayer, sevenbridges, SummarizedExperiment, XVector |
SB Bioinformatics - R 4.3 - BioC 3.17 | This environment setup is based on the official Bioconductor image bioconductor/bioconductor_docker:RELEASE_3_17. For more information about the image, please see its Docker Hub repository. Here is a list of libraries that are installed by default. CRAN - BiocManager, devtools, doSNOW, ggfortify, gplots, pheatmap, Seurat, tidyverse, OlinkAnalyze, ordinal, heatmaply, renv, markdown. Bioconductor - AnnotationDbi, AnnotationHub, arrayQualityMetrics, ballgown, Biobase, BiocParallel, biomaRt, biomformat, Biostrings, DelayedArray, DESeq2, edgeR, genefilter, GenomeInfoDb, GenomicAlignments, GenomicFeatures, GenomicRanges, GEOquery, IRanges, limma, metagenomeSeq, oligo, Rsamtools, rtracklayer, sevenbridges, SummarizedExperiment, XVector |
SB Bioinformatics - R 4.2 - BioC 3.15 | This environment setup is based on the official Bioconductor image bioconductor/bioconductor_docker:RELEASE_3_15. For more information about the image, please see its Docker Hub repository. Here is a list of libraries that are installed by default. CRAN - BiocManager, devtools, doSNOW, ggfortify, gplots, pheatmap, Seurat, tidyverse, OlinkAnalyze, ordinal, heatmaply. Bioconductor - AnnotationDbi, AnnotationHub, arrayQualityMetrics, ballgown, Biobase, BiocParallel, biomaRt, biomformat, Biostrings, DelayedArray, DESeq2, edgeR, genefilter, GenomeInfoDb, GenomicAlignments, GenomicFeatures, GenomicRanges, GEOquery, IRanges, limma, metagenomeSeq, oligo, Rsamtools, rtracklayer, sevenbridges, SummarizedExperiment, XVector |
SB Bioinformatics - R 4.1 - BioC 3.14 | This environment setup is based on the official Bioconductor image bioconductor/bioconductor_docker:RELEASE_3_14. For more information about the image, please see its Docker Hub repository. Here is a list of libraries that are installed by default. CRAN - BiocManager, devtools, doSNOW, ggfortify, gplots, pheatmap, Seurat, tidyverse. Bioconductor - AnnotationDbi, AnnotationHub, arrayQualityMetrics, ballgown, Biobase, BiocParallel, biomaRt, biomformat, Biostrings, DelayedArray, DESeq2, edgeR, genefilter, GenomeInfoDb, GenomicAlignments, GenomicFeatures, GenomicRanges, GEOquery, IRanges, limma, metagenomeSeq, oligo, Rsamtools, rtracklayer, sevenbridges, SummarizedExperiment, XVector |
SB Bioinformatics - R 4.1 - BioC 3.13 | This environment setup is based on the official Bioconductor image bioconductor/bioconductor_docker:RELEASE_3_13, which is built on top of rocker/rstudio:4.1.0. For more information about the image, please see its Docker Hub repository. Here is a list of libraries that are installed by default. CRAN - BiocManager, devtools, doSNOW, ggfortify, gplots, pheatmap, Seurat, tidyverse. Bioconductor - AnnotationDbi, AnnotationHub, arrayQualityMetrics, ballgown, Biobase, BiocParallel, biomaRt, biomformat, Biostrings, DelayedArray, DESeq2, edgeR, genefilter, GenomeInfoDb, GenomicAlignments, GenomicFeatures, GenomicRanges, GEOquery, IRanges, limma, metagenomeSeq, oligo, Rsamtools, rtracklayer, sevenbridges, SummarizedExperiment, XVector |
SB Bioinformatics - R 4.0 | This environment setup is based on the official Bioconductor image bioconductor/bioconductor_docker:RELEASE_3_11, which is built on top of rockerdev/rstudio:4.0.0-ubuntu18.04. For more information about the image, please see its Docker Hub repository. Here is a list of libraries that are installed by default. CRAN - BiocManager, devtools, doSNOW, ggfortify, gplots, pheatmap, Seurat, tidyverse. Bioconductor - AnnotationDbi, arrayQualityMetrics, ballgown, Biobase, BiocParallel, biomaRt, biomformat, Biostrings, DelayedArray, DESeq2, edgeR, genefilter, GenomeInfoDb, GenomicAlignments, GenomicFeatures, GenomicRanges, GEOquery, IRanges, limma, metagenomeSeq, oligo, Rsamtools, rtracklayer, SummarizedExperiment, XVector |
SB Bioinformatics - R 3.6 | This environment setup is based on the rocker/verse image from The Rocker Project and contains tidyverse, devtools, TeX and publishing-related packages. For more information about the image, please see its Docker Hub repository. Here is a list of libraries that are installed by default. CRAN - BiocManager, ggfortify, pheatmap, gplots. Bioconductor - ballgown, DESeq2, metagenomeSeq, biomformat, BiocInstaller |
SB Machine Learning - TensorFlow 1.13, R 3.6 | This environment setup is optimized for machine learning and execution on GPU instances. It is based on the rocker/ml-gpu image that is intended for machine learning and GPU-based computation in R. Learn more. |
All available environment setups also contain the sevenbridges-r API library, as well as htop and openvpn as general-purpose tools.