Make GRanges from a GFF/GTF file

makeGRangesFromGFF(file, level = c("genes", "transcripts"),
  .checkAgainstTxDb = FALSE)

makeGRangesFromGTF(file, level = c("genes", "transcripts"),
  .checkAgainstTxDb = FALSE)

Arguments

file

character(1). File path.

level

character(1). Return ranges at the gene level ("genes") or the transcript level ("transcripts").

.checkAgainstTxDb

logical(1). Enable strict mode, intended for development and unit testing only. Generates an internal TxDb using GenomicFeatures::makeTxDbFromGRanges() and checks that the ranges(), seqnames(), and identifiers defined in names() are identical. This is a useful sanity check and should be enabled when possible, but it is disabled by default because it doesn't work for all GFF/GTF files, owing to current limitations in the GenomicFeatures package. Generally, GenomicFeatures parses GTF files better than GFF files.

Details

Remote URLs and compressed files are supported.

Functions

  • makeGRangesFromGTF: GTF file extension alias. Runs the same internal code as makeGRangesFromGFF().

Recommendations

  • Use GTF over GFF3. We recommend using a GTF file instead of a GFF3 file, when possible. The file format is more compact and easier to parse.

  • Use Ensembl over RefSeq. We generally recommend using Ensembl over RefSeq, if possible. Ensembl is better supported in R and is used by most NGS vendors.

GFF/GTF specification

GFF (General Feature Format) consists of one line per feature, each containing 9 tab-separated columns of data, plus optional track definition lines.
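As a rough illustration of the 9-column layout, the base R sketch below reads two made-up GTF records into a data frame. The column labels, the records, and the gene_id extraction are assumptions for illustration only; this is not how makeGRangesFromGFF() imports files (it uses rtracklayer::import(), per the messages in the Examples).

```r
# Conventional labels for the 9 GFF/GTF columns (assumed naming).
gffCols <- c(
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute"
)

# Two hypothetical GTF records; fields are tab-separated.
records <- c(
    paste("1", "example", "gene", "1000", "5000", ".", "+", ".",
          "gene_id \"GENE1\"; gene_name \"abc\";", sep = "\t"),
    paste("1", "example", "exon", "1000", "1200", ".", "+", ".",
          "gene_id \"GENE1\"; transcript_id \"TX1\";", sep = "\t")
)

# quote = "" keeps the double quotes inside the attribute column intact.
gff <- read.delim(
    text = records,
    header = FALSE,
    col.names = gffCols,
    colClasses = c(
        "character", "character", "character", "integer", "integer",
        "character", "character", "character", "character"
    ),
    comment.char = "#",
    quote = ""
)

# Pull the gene_id value out of the free-text attribute column.
geneId <- sub(".*gene_id \"([^\"]+)\".*", "\\1", gff$attribute)
```

For real files, a dedicated parser is the safer choice, since attribute syntax differs between GTF and GFF3.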

GTF (Gene Transfer Format) is identical to GFF version 2.

The UCSC website has detailed conventions on the GFF3 format, including the metadata columns.

Feature type

  • CDS: CoDing Sequence. A contiguous sequence that contains a genomic interval bounded by start and stop codons. CDS refers to the portion of a genomic DNA sequence that is translated, from the start codon to the stop codon.

  • exon: Genomic interval containing 5' UTR (five_prime_UTR), CDS, and 3' UTR (three_prime_UTR).

  • mRNA: Processed (spliced) mRNA transcript.
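To make the relationship between these feature types concrete, here is a minimal base R sketch for a hypothetical single-exon transcript on the + strand. The coordinates and the helper intervalWidth() are invented for illustration; GFF/GTF coordinates are 1-based and inclusive, so a width is end - start + 1.

```r
# Hypothetical single-exon transcript on the + strand (1-based, inclusive).
exon <- c(start = 1000L, end = 2000L)
cds <- c(start = 1200L, end = 1800L)

# Width of a closed interval (illustrative helper, not a package function).
intervalWidth <- function(x) {
    x[["end"]] - x[["start"]] + 1L
}

# On the + strand, the 5' UTR precedes the CDS and the 3' UTR follows it.
fivePrimeUTR <- c(start = exon[["start"]], end = cds[["start"]] - 1L)
threePrimeUTR <- c(start = cds[["end"]] + 1L, end = exon[["end"]])

# The exon is fully accounted for by 5' UTR + CDS + 3' UTR.
total <- intervalWidth(fivePrimeUTR) +
    intervalWidth(cds) +
    intervalWidth(threePrimeUTR)
```

On the - strand the two UTRs swap sides, and multi-exon transcripts split the CDS across several exon records.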

Supported sources

Currently makeGRangesFromGFF() supports genomes from these sources:

  • Ensembl (GTF, GFF3).

  • GENCODE (GTF, GFF3).

  • RefSeq (GTF, GFF3).

  • FlyBase (GTF).

  • WormBase (GTF).

Ensembl

Note that makeGRangesFromEnsembl() offers native support for Ensembl genome builds and returns additional useful metadata that isn't defined inside a GFF/GTF file.

If you must load a GFF/GTF file directly, then use makeGRangesFromGFF().

GENCODE vs. Ensembl

Annotations available from Ensembl and GENCODE are very similar.

The GENCODE annotation is made by merging the manual gene annotation produced by the Ensembl-Havana team and the Ensembl-genebuild automated gene annotation. The GENCODE annotation is the default gene annotation displayed in the Ensembl browser. The GENCODE releases coincide with the Ensembl releases, although GENCODE can skip an Ensembl release if there is no update to the annotation with respect to the previous release. In practical terms, the GENCODE annotation is essentially identical to the Ensembl annotation.

However, GENCODE handles pseudoautosomal regions (PARs) differently from Ensembl. The Ensembl GTF file includes this annotation only once, for chromosome X, whereas GENCODE GTF/GFF3 files include the annotation in the PARs of both chromosomes. The chromosome Y copies of these genes carry a "_PAR_Y" suffix.

Additionally, GENCODE GFF/GTF files import with gene identifiers that contain a version suffix, which differs slightly from the Ensembl GFF/GTF spec (e.g. GENCODE: ENSG00000000003.14; Ensembl: ENSG00000000003).
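When comparing GENCODE and Ensembl identifiers by hand, both differences can be handled with base R string functions. The sketch below is illustrative only: the second identifier and its version number are made up to show the "_PAR_Y" pattern, and this is not necessarily how makeGRangesFromGFF() treats these identifiers internally.

```r
# Illustrative GENCODE-style gene identifiers (second one is hypothetical).
gencodeIds <- c("ENSG00000000003.14", "ENSG00000182378.14_PAR_Y")

# Flag the chromosome Y copies of PAR genes.
isParY <- grepl("_PAR_Y$", gencodeIds)
# FALSE TRUE

# Drop the version suffix to get Ensembl-style identifiers.
# The "_PAR_Y" suffix is kept so the X and Y copies stay distinct.
unversioned <- sub("\\.[0-9]+", "", gencodeIds)
# "ENSG00000000003" "ENSG00000182378_PAR_Y"
```

Removing the "_PAR_Y" suffix as well would produce duplicated identifiers for PAR genes, so decide how to handle those rows before using the identifiers as unique names.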

The GENCODE FAQ has additional details.

RefSeq

Refer to the current RefSeq spec for details.

See also:

  • RefSeq FAQ

  • ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz

UCSC

Loading UCSC genome annotations from a GFF/GTF file is intentionally not supported by this function.

We recommend using a pre-built TxDb package from Bioconductor instead. For example, load TxDb.Hsapiens.UCSC.hg38.knownGene for hg38.
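A minimal sketch of that approach follows, assuming the TxDb.Hsapiens.UCSC.hg38.knownGene and GenomicFeatures packages are installed from Bioconductor. The guard and the variable names are illustrative choices; the code does nothing if the packages are missing.

```r
pkg <- "TxDb.Hsapiens.UCSC.hg38.knownGene"

# Only run when both Bioconductor packages are available.
if (requireNamespace(pkg, quietly = TRUE) &&
    requireNamespace("GenomicFeatures", quietly = TRUE)) {
    # The package exports a TxDb object named after the package itself.
    txdb <- getExportedValue(pkg, pkg)
    # Gene-level and transcript-level ranges as GRanges objects.
    ucscGenes <- GenomicFeatures::genes(txdb)
    ucscTranscripts <- GenomicFeatures::transcripts(txdb)
}
```

Note that the ranges returned this way come from the TxDb and will not carry the extra metadata columns that makeGRangesFromGFF() reports in the Examples.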

For reference, note that UCSC doesn't provide direct GFF/GTF file downloads. Instead, the file must be exported from the hgTables Table Browser in a web browser.

Select the following options to download hg38:

  • clade: Mammal

  • genome: Human

  • assembly: Dec. 2013 (GRCh38/hg38)

  • group: Genes and Gene Predictions

  • track: GENCODE v29

  • table: knownGene

  • region: genome

  • output format: GTF - gene transfer format

  • output file: <Enter a file name>

Examples

file <- pasteURL(freerangeTestsURL, "ensembl.gtf", protocol = "none")

## Genes
x <- makeGRangesFromGFF(file = file, level = "genes")
#> Making GRanges from GFF file.
#> Importing ensembl.gtf using rtracklayer::import().
#> Ensembl GTF detected.
#> Defining broadClass using: geneBiotype, geneName, seqnames
#> Arranging by geneID.
#> 60 genes detected.
#> [1] "GRanges object with 60 ranges and 8 metadata columns"

## Transcripts
x <- makeGRangesFromGFF(file = file, level = "transcripts")
#> Making GRanges from GFF file.
#> Importing ensembl.gtf using rtracklayer::import().
#> Ensembl GTF detected.
#> Defining broadClass using: geneName, seqnames, transcriptBiotype
#> Arranging by transcriptID.
#> 167 transcripts detected.
#> [1] "GRanges object with 167 ranges and 16 metadata columns"