cmapR: R utilities for Connectivity Map Resources

Provided by the Connectivity Map, Broad Institute of MIT and Harvard. More information on our website

Other resources

Introductory info

Installation

Dependencies

Dependencies are listed in the DESCRIPTION file.

Installing from source

The easiest way to install the cmapR repository is to point your R’s install.packages function at a tarball of the cmapR archive. You can generate this archive by cloning this repository and doing the following:

# make a gzip tar ball of the repo
R CMD build cmapR # makes cmapR_1.0.tar.gz

# check that the package is ok
R CMD check cmapR_1.0.tar.gz

Once you have created the tarball, open an R terminal and execute the following:

install.packages("cmapR_1.0.tar.gz", type="source", repos=NULL)
library("cmapR")

You can also source individual files as needed instead of installing the entire package.

# For example, just load the IO methods
source("cmapR/R/io.R")``

High-level API reference

Data (data.R)

Loads data sets available for use. These include:

  • "gene_set": These are a collection of gene sets as used in the Lamb 2006 CMap paper.

  • "cdesc_char": An example table of metadata, as would be parsed from a text file or using parse.gctx.

    Warning

    Initially, all columns are of type character.

  • "ds": An example of a GCT object with row and column data, and gene expression values in the matrix.

IO (io.R)

Generally, this module contains the GCT class and relevant method definitions. These are:

GCT Object (S4 class)

#An S4 class to represent a GCT object

@slot mat a numeric matrix
@slot rid a character vector of row ids
@slot cid a character vector of column ids
@slot rdesc a \code{data.frame} of row descriptors
@slot rdesc a \code{data.frame} of column descriptors
@slot src a character indicating the source (usually file path) of the data

@description The GCT class serves to represent annotated
  matrices. The \code{mat} slot contains said data and the
  \code{rdesc} and \code{cdesc} slots contain data frames with
  annotations about the rows and columns, respectively

@seealso \code{\link{parse.gctx}}, \code{\link{write.gctx}}, \code{\link{read.gctx.meta}}, \code{\link{read.gctx.ids}}
@seealso \link{http://clue.io/help} for more information on the GCT format

setClass("GCT",
         representation(
             mat = "matrix",
             rid = "character",
             cid = "character",
             rdesc = "data.frame",
             cdesc = "data.frame",
             version = "character",
             src = "character"
         )
)

GCTX parsing functions

Parse a .gct or .gctx file to GCT object

parse.gctx <- function(fname, rid=NULL, cid=NULL, set_annot_rownames=F, matrix_only=F)

@param fname path to the GCTX file on disk
@param rid either a vector of character or integer
  row indices or a path to a grp file containing character
  row indices. Only these indicies will be parsed from the
  file.
@param cid either a vector of character or integer
  column indices or a path to a grp file containing character
  column indices. Only these indicies will be parsed from the
  file.
@param set_annot_rownames boolean indicating whether to set the
  rownames on the row/column metadata data.frames. Set this to
  false if the GCTX file has duplicate row/column ids.
@param matrix_only boolean indicating whether to parse only
  the matrix (ignoring row and column annotations)

@details \code{parse.gctx} also supports parsing of plain text
  GCT files, so this function can be used as a general GCT parser.

@examples
gct_file <- system.file("extdata", "modzs_n272x978.gctx", package="roller")
(ds <- parse.gctx(gct_file))

# matrix only
(ds <- parse.gctx(gct_file, matrix_only=T))

# only the first 10 rows and columns
(ds <- parse.gctx(gct_file, rid=1:10, cid=1:10))

@family GCTX parsing functions

Parse row/column metadata only

read.gctx.meta <- function(gctx_path, dimension="row", ids=NULL, set_annot_rownames=T)

@param gctx_path the path to the GCTX file
@param dimension which metadata to read (row or column)
@param ids a character vector of a subset of row/column ids
  for which to read the metadata
@param set_annot_rownames a boolean indicating whether to set the
  \code{rownames} addtribute of the returned \code{data.frame} to
  the corresponding row/column ids.

@return a \code{data.frame} of metadata

@examples
gct_file <- system.file("extdata", "modzs_n272x978.gctx", package="roller")
# row meta
row_meta <- read.gctx.meta(gct_file)
str(row_meta)
# column meta
col_meta <- read.gctx.meta(gct_file, dimension="column")
str(col_meta)
# now for only the first 10 ids
col_meta_first10 <- read.gctx.meta(gct_file, dimension="column", ids=col_meta$id[1:10])
str(col_meta_first10)

@family GCTX parsing functions

Parse row/column ids only

read.gctx.ids <- function(gctx_path, dimension="row")

#Read GCTX row or column ids

@param gctx_path path to the GCTX file
@param dimension which ids to read (row or column)

@return a character vector of row or column ids from the provided file

@examples
gct_file <- system.file("extdata", "modzs_n272x978.gctx", package="roller")
# row ids
rid <- read.gctx.ids(gct_file)
head(rid)
# column ids
cid <- read.gctx.ids(gct_file, dimension="column")
head(cid)

@family GCTX parsing functions

GCTX writing functions

Write a GCT object to disk in .gct format

write.gct <- function(ds, ofile, precision=4, appenddim=T, ver=3)

@param ds the GCT object
@param ofile the desired output filename
@param precision the numeric precision at which to
  save the matrix. See \code{details}.
@param appenddim boolean indicating whether to append
  matrix dimensions to filename
@param ver the GCT version to write. See \code{details}.

@details Since GCT is text format, the higher \code{precision}
  you choose, the larger the file size.
  \code{ver} is assumed to be 3, aka GCT version 1.3, which supports
  embedded row and column metadata in the GCT file. Any other value
  passed to \code{ver} will result in a GCT version 1.2 file which
  contains only the matrix data and no annotations.

@return NULL

@examples
\dontrun{
write.gct(ds, "dataset", precision=2)
}
@family GCTX parsing functions

Write a GCT object to disk in .gctx format

write.gctx <- function(ds, ofile, appenddim=T, compression_level=0, matrix_only=F)

@param ds a GCT object
@param ofile the desired file path for writing
@param appenddim boolean indicating whether the
  resulting filename will have dimensions appended
  (e.g. my_file_n384x978.gctx)
@param compression_level integer between 1-9 indicating
  how much to compress data before writing. Higher values
  result in smaller files but slower read times.
@param matrix_only boolean indicating whether to write
  only the matrix data (and skip row, column annotations)

@examples
\dontrun{
# assume ds is a GCT object
write.gctx(ds, "my/desired/outpath/and/filename")
}

@family GCTX parsing functions

Write a ``data.frame`` of metadata only to a GCTX file

write.gctx.meta <- function(ofile, df, dimension="row")

@param ofile the desired file path for writing
@param df the \code{data.frame} of annotations
@param dimension the dimension to annotate
  (row or column)

@examples
\dontrun{
# assume ds is a GCT object
write.gctx.meta("/my/file/path", cdesc_char, dimension="col")
}
@family GCTX parsing functions
@keywords internal

Parsing GRP files

Parse a .grp file to vector

parse.grp <- function(fname)

@param fname the file path to be parsed
@return a vector of the contents of \code{fname}

@examples
grp_path <- system.file("extdata", "lm_epsilon_n978.grp", package="roller")
values <- parse.grp(grp_path)
str(values)

@family CMap parsing functions
@seealso \link{http://clue.io/help} for details on the GRP file format

Writing to .grp files

Write a vector to a .grp file

write.grp <- function(vals, fname)

@param vals the vector of values to be written
@param fname the desired file name

@examples
\dontrun{
write.grp(letters, "letter.grp")
}

@family CMap parsing functions
@seealso \link{http://clue.io/help} for details on the GRP file format

Parsing GMX files

Parse a .gmx file to a list

parse.gmx <- function(fname)

@param fname the file path to be parsed

@return a list of the contents of \code{fname}. See details.

@details \code{parse.gmx} returns a nested list object. The top
  level contains one list per column in \code{fname}. Each of
  these is itself a list with the following fields:
  - \code{head}: the name of the data (column in \code{fname})
  - \code{desc}: description of the corresponding data
  - \code{len}: the number of data items
  - \code{entry}: a vector of the data items

@examples
gmx_path <- system.file("extdata", "lm_probes.gmx", package="roller")
gmx <- parse.gmx(gmx_path)
str(gmx)

@family CMap parsing functions
@seealso \link{http://clue.io/help} for details on the GMX file format

Parsing GMT files

Parse a .gmt file to a list

parse.gmt <- function(fname)

@param fname the file path to be parsed

@return a list of the contents of \code{fname}. See details.

@details \code{parse.gmt} returns a nested list object. The top
  level contains one list per row in \code{fname}. Each of
  these is itself a list with the following fields:
  - \code{head}: the name of the data (row in \code{fname})
  - \code{desc}: description of the corresponding data
  - \code{len}: the number of data items
  - \code{entry}: a vector of the data items

@examples
gmt_path <- system.file("extdata", "query_up.gmt", package="roller")
gmt <- parse.gmt(gmt_path)
str(gmt)

@family CMap parsing functions
@seealso \link{http://clue.io/help} for details on the GMT file format

Writing to GMT files

write.gmt <- function(lst, fname)

@param lst the nested list to write. See \code{details}.
@param fname the desired file name

@details \code{lst} needs to be a nested list where each
  sub-list is itself a list with the following fields:
  - \code{head}: the name of the data
  - \code{desc}: description of the corresponding data
  - \code{len}: the number of data items
  - \code{entry}: a vector of the data items

@examples
\dontrun{
write.gmt(gene_set, "gene_set.gmt")
}

@family CMap parsing functions
@seealso \link{http://clue.io/help} for details on the GMT file format

Writing a data.frame to a tsv file

write.tbl <- function(tbl, ofile, ...)

@param tbl the \code{data.frame} to be written
@param ofile the desired file name
@param ... additional arguments passed on to \code{write.table}

@details This method simply calls \code{write.table} with some
  preset arguments that generate a unquoted, tab-delimited file
  without row names.

@examples
\dontrun{
write.tbl(cdesc_char, "col_meta.txt")
}

@seealso \code{\link{write.table}}

utils (utils.R)

Melting GCT objects

Transform a GCT object to long form (aka ‘melt’).

melt.gct(g, suffixes=NULL, remove_symmetries=F, keep_rdesc=T, keep_cdesc=T)

@description Utilizes the \code{\link{data.table::melt}} function to transform the
  matrix into long form. Optionally can include the row and column
  annotations in the transformed \code{\link{data.table}}.

@param g the GCT object
@param keep_rdesc boolean indicating whether to keep the row
  descriptors in the final result
@param keep_cdesc boolean indicating whether to keep the column
  descriptors in the final result
@param remove_symmetries boolean indicating whether to remove
  the lower triangle of the matrix (only applies if \code{g@mat} is symmetric)
@param suffixes the character suffixes to be applied if there are
  collisions between the names of the row and column descriptors

@return a \code{\link{data.table}} object with the row and column ids and the matrix
  values and (optinally) the row and column descriptors

@examples
# simple melt, keeping both row and column meta
head(melt.gct(ds))

# update row/colum suffixes to indicate rows are genes, columns experiments
head(melt.gct(ds, suffixes = c("_gene", "_experiment")))

# ignore row/column meta
head(melt.gct(ds, keep_rdesc = F, keep_cdesc = F))

@family GCT utilities

Concatenating

Merge two GCT objects

merge.gct(g1, g2, dimension="row", matrix_only=F)

@param g1 the first GCT object
@param g2 the second GCT object
@param dimension the dimension on which to merge (row or column)
@param matrix_only boolean idicating whether to keep only the
  data matrices from \code{g1} and \code{g2} and ignore their
  row and column meta data

@examples
# take the first 10 and last 10 rows of an object
# and merge them back together
(a <- subset.gct(ds, rid=1:10))
(b <- subset.gct(ds, rid=969:978))
(merged <- merge.gct(a, b, dimension="row"))

@family GCT utilities
@export

Slicing

Slice a GCT object using the provided row and/or column ids

subset.gct(g, rid=NULL, cid=NULL)

@param g a gct object
@param rid a vector of character ids or integer indices for ROWS
@param cid a vector of character ids or integer indices for COLUMNS

@examples
# first 10 rows and columns by index
(a <- subset.gct(ds, rid=1:10, cid=1:10))

# first 10 rows and columns using character ids
(b <- subset.gct(ds, rid=ds@rid[1:10], cid=ds@cid[1:10]))

identical(a, b) # TRUE

@family GCT utilities

Annotating

Given a GCT object and either a data.frame or a path to an annotation table, apply the annotations to the GCT using the given keyfield.

annotate.gct(g, annot, dimension="row", keyfield="id")

@description Given a GCT object and either a \code{\link{data.frame}} or
a path to an annotation table, apply the annotations to the
gct using the given \code{keyfield}.

@param g a GCT object
@param annot a \code{\link{data.frame}} or path to text table of annotations
@param dimension either 'row' or 'column' indicating which dimension
  of \code{g} to annotate
@param keyfield the character name of the column in \code{annot} that
  matches the row or column identifiers in \code{g}

@return a GCT object with annotations applied to the specified
  dimension

@examples
\dontrun{
 g <- parse.gctx('/path/to/gct/file')
 g <- annotate.gct(g, '/path/to/annot')
}

@family GCT utilities

Transpose

transpose.gct(g)

@param g the \code{GCT} object

@return a modified verion of the input \code{GCT} object
  where the matrix has been transposed and the row and column
  ids and annotations have been swapped.

@examples
transpose.gct(ds)

@family GCT utilties
@export

Math

Convert values in a matrix to ranks

rank.gct(g, dim="row")
@param g the \code{GCT} object to rank
@param dim the dimension along which to rank
  (row or column)

@return a modified version of \code{g}, with the
  values in the matrix converted to ranks

@examples
(ranked <- rank.gct(ds, dim="column"))
# scatter rank vs. score for a few columns
plot(ds@mat[, 1:3], ranked@mat[, 1:3],
  xlab="score", ylab="rank")

@family GCT utilities

Meta-info about cmapR

Contribution guidelines

We welcome contributors! For your pull requests, please include the following:

  • Sample code/file that reproducibly causes the bug/issue
  • Documented code (include a docstring for new functions!) providing fix
  • Unit tests evaluating added/modified methods.

FAQ

We will be adding FAQs as they come up.

BSD 3-Clause License

Copyright (c) 2017, Connectivity Map (CMap) at the Broad Institute, Inc. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  • Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Citation Information

If you use GCTx and/or cmapR in your work, please cite Enache et al.