Database mining with biomaRt
Steffen Durinck
Illumina Inc.
BioC 2009

Overview
• The BioMart software suite
• biomaRt package
• biomaRt installation
• biomaRt example queries showing the variety of data types that can be retrieved, and questions that can be answered, for many organisms

BioMart 0.7
• BioMart is a query-oriented data management system developed jointly by the European Bioinformatics Institute (EBI) and Cold Spring Harbor Laboratory (CSHL).
• It was originally developed for the Ensembl project but has since been generalized.

BioMart 0.7
• BioMart data can be accessed through web, graphical, or text-based applications, or programmatically through web services or software libraries written in Perl and Java.
• http://www.biomart.org

Example BioMart databases
Ensembl, Wormbase, Reactome, Gramene
[Screenshots: MartView dataset-selection pages of Ensembl, Wormbase and Gramene]

BioMart databases
• De-normalized
• Tables with 'redundant' information
• Query optimized
• Fast and flexible
• Well suited for batch querying

biomaRt
• R interface to BioMart databases
• Performs online queries
• Current release version 2.0.0
• Depends on the RCurl and XML packages

Installing biomaRt & GenomeGraphs
• Platforms on which biomaRt has been installed:
  - Linux (curl, http://curl.haxx.se)
  - OSX (curl)
  - Windows

Installing biomaRt & GenomeGraphs
> source("http://www.bioconductor.org/biocLite.R")
> biocLite('GenomeGraphs')
Running biocinstall version 2.4.11 with R version 2.9.1
Your version of R requires version 2.4 of Bioconductor.
also installing the dependencies 'bitops', 'XML', 'RCurl', 'biomaRt'

List available BioMart databases
> library(biomaRt)
Loading required package: XML
Loading required package: RCurl
> listMarts()

List available BioMarts
   biomart               version
1  ensembl               ENSEMBL 55 GENES (SANGER UK)
2  snp                   ENSEMBL 55 VARIATION (SANGER UK)
3  functional_genomics   ENSEMBL 55 FUNCTIONAL GENOMICS
4  vega                  VEGA 35 (SANGER UK)
5  msd                   MSD PROTOTYPE (EBI UK)
6  htgt                  HIGH THROUGHPUT GENE TARGETING AND TRAPPING
7  QTL_MART              GRAMENE 29 QTL DB (CSHL US)
8  ENSEMBL_MART_ENSEMBL  GRAMENE 29 GENES
9  ENSEMBL_MART_SNP      GRAMENE 29 SNPs
10 GRAMENE_MARKER_29     GRAMENE 29 MARKERS

Ensembl
• Ensembl is a joint project between EMBL - European Bioinformatics Institute (EBI) and the Wellcome Trust Sanger Institute (WTSI).
• It is a software system that produces and maintains automatic annotation on selected eukaryotic genomes.
• http://www.ensembl.org

Ensembl - BioMart
> ensembl = useMart("ensembl")
[Screenshot: Ensembl MartView page (database Ensembl 44, dataset Homo sapiens genes (NCBI36)) with the steps: 1. Choose Dataset; 2. Click Attributes and make your selection; 3. Click Results]

Ensembl - Datasets
> listDatasets(ensembl)
Returns:
  - name: hsapiens_gene_ensembl
  - description: Homo sapiens genes
  - version: (GRCh37)
Ensembl currently contains 50 datasets (species).

Ensembl - Datasets
A dataset can be selected using the useMart function:
> ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
Checking attributes ...ok
Checking filters ...ok

biomaRt query: Attributes
• Attributes define the values the user is interested in.
• Conceptually, attributes are the output of the query.
• Example attributes:
  - chromosome_name
  - band

biomaRt query: Filters
• Filters define restrictions on the query.
• Conceptually, filters are the inputs.
• Example filters:
  - entrezgene
  - chromosome_name

biomaRt query
[Diagram: three components feed into a biomaRt query]
  - Attributes (e.g., chromosome and band)
  - Filters (e.g., "entrezgene")
  - Values (e.g., EntrezGene identifiers)

Three main biomaRt functions
• listFilters - lists the available filters
• listAttributes - lists the available attributes
• getBM - performs the actual query and returns a data.frame

Microarrays & Ensembl
• Ensembl does an independent mapping of array probe sequences to genomes (Affymetrix, Illumina, Agilent, ...)
• If there is no clear match, the probe is not assigned to a gene

TASK 1 - Ensembl
• Annotate the following Affymetrix probe identifiers from the human u133plus2 platform with the HUGO gene nomenclature symbol (hgnc_symbol) and chromosomal location information: 211550_at, 202431_s_at, 206044_s_at

TASK 1 - Ensembl
• Filters: affy_hg_u133_plus_2
• Attributes: affy_hg_u133_plus_2, ensembl_gene_id, hgnc_symbol, chromosome_name, start_position, end_position, band, strand
• Values: 211550_at, 202431_s_at, 206044_s_at

TASK 1 - Ensembl
> affyids = c("211550_at", "202431_s_at", "206044_s_at")
> annotation = getBM(attributes = c("affy_hg_u133_plus_2", "ensembl_gene_id", "hgnc_symbol", "chromosome_name", "start_position", "end_position", "band", "strand"), filters = "affy_hg_u133_plus_2", values = affyids, mart = ensembl)

TASK 1 - Ensembl
> annotation
  affy_hg_u133_plus_2 ensembl_gene_id hgnc_symbol chromosome_name start_position end_position band   strand
1 202431_s_at         ENSG00000136997 MYC         8               128748316      128753671    q24.21      1
2 206044_s_at         ENSG00000157764 BRAF        7               140433817      140624564    q34        -1
3 211550_at           ENSG00000146648 EGFR        7               55086714       55324313     p11.2       1

TASK 1* - Ensembl
Retrieve GO annotation for the following Illumina human_wg6_v2 identifiers: ILMN_1728071, ILMN_1662668

TASK 1* - Ensembl
> illuminaIDs = c("ILMN_1728071", "ILMN_1662668")
> goAnnot = getBM(c("illumina_humanwg_6_v2", "go_biological_process_id", "go_biological_process_linkage_type"), filters = "illumina_humanwg_6_v2", values = illuminaIDs, mart = ensembl)

TASK 1* - Ensembl
  illumina_humanwg_6_v2 go_biological_process_id go_biological_process_linkage_type
1 ILMN_1662668          GO:0000281               IMP
2 ILMN_1662668          GO:0006461               IDA
3 ILMN_1662668          GO:0006974               IDA
4 ILMN_1662668          GO:0007026               IDA
5 ILMN_1662668          GO:0007050               IDA

Using more than one filter
• getBM can be used with more than one filter
• Filters should be given as a vector
• Values should be a list of vectors, where the position of each vector corresponds to the position of the associated filter in the filters argument

TASK 2 - Ensembl
Retrieve all genes that are involved in Diabetes Mellitus Type I or Type II and have transcription factor activity

TASK 2 - Ensembl
1. Diabetes Mellitus type I MIM accession: 222100
2. Diabetes Mellitus type II MIM accession: 125853
3. GO id for "transcription factor activity": GO:0003700

TASK 2 - Ensembl
> diab = getBM(c("ensembl_gene_id", "hgnc_symbol"), filters = c("mim_morbid_accession", "go"), values = list(c("125853", "222100"), "GO:0003700"), mart = ensembl)

TASK 2 - Ensembl
  ensembl_gene_id hgnc_symbol
1 ENSG00000139515 PDX1
2 ENSG00000108753 HNF1B
3 ENSG00000148737 TCF7L2
4 ENSG00000106331 PAX4
5 ENSG00000162992 NEUROD1
6 ENSG00000135100 HNF1A

Boolean filters
• Filters can be numeric, string or boolean
• Boolean filters take either TRUE or FALSE as values
  - TRUE: return all information that complies with the given filter (e.g. return only genes that have an hgnc_symbol)
  - FALSE: return all information that doesn't comply with the given filter (e.g.
genes with no hgnc_symbol)

Boolean filters / filterType
The filterType function reports the type of each filter (it is currently only available in the devel version of biomaRt):
> filterType("affy_hg_u133_plus_2", mart = ensembl)
[1] "id_list"
> filterType("with_affy_hg_u133_plus_2", mart = ensembl)
[1] "boolean_list"

TASK 3 - Ensembl
Retrieve all miRNAs known on chromosome 13 and their chromosomal locations

TASK 3 - Ensembl
> miRNA = getBM(c("mirbase", "ensembl_gene_id", "start_position", "chromosome_name"), filters = c("chromosome_name", "with_mirbase"), values = list(13, TRUE), mart = ensembl)
> miRNA[1:5,]

TASK 3 - Ensembl
  mirbase   ensembl_gene_id start_position chromosome_name
1 MI0008190 ENSG00000211491 41301964       13
2 MI0003635 ENSG00000207652 41384902       13
3 MI0000070 ENSG00000208006 50623109       13
4 MI0000069 ENSG00000207718 50623255       13
5 MI0003636 ENSG00000207858 90883436       13

attributePages
• attributePages gives a brief overview of the available attribute pages (useful for displaying a subset of the attributes)
> attributePages(ensembl)
[1] "feature_page" "structure" "snp" "homologs" "sequences"
> listAttributes(ensembl, page = "feature_page")

Additional help to figure out which filter and attribute names to use
• Go to www.biomart.org and select the BioMart you use
• Select attributes and filters
• Press the XML button to get their names
• filterOptions function: enumerates all possible values for a filter (if available)

TASK 4 - Ensembl
Retrieve all entrezgene identifiers on chromosome 22 that have a non-synonymous coding SNP

TASK 4 - Ensembl
> filterOptions("snptype_filters", ensembl)
[1] "[STOP_GAINED,STOP_LOST,COMPLEX_INDEL,FRAMESHIFT_CODING,NON_SYNONYMOUS_CODING,STOP_GAINED,SPLICE_SITE,STOP_LOST,SPLICE_SITE,FRAMESHIFT_CODING,SPLICE_SITE,NON_SYNONYMOUS_CODING,SPLICE_SITE,SYNONYMOUS_CODING,SPLICE_SITE,SYNONYMOUS_CODING,5PRIME_UTR,SPLICE_SITE,5PRIME_UTR,3PRIME_UTR,SPLICE_SITE,3PRIME_UTR,INTRONIC,ESSENTIAL_SPLICE_SITE,INTRONIC,SPLICE_SITE,INTRONIC,UPSTREAM,DOWNSTREAM]"
> entrez = getBM("entrezgene", filters = c("chromosome_name", "snptype_filters"), values = list(22, "NON_SYNONYMOUS_CODING"), mart = ensembl)
> entrez[1:5,]
[1] 23784 81061 150160 150165 128954

getSequence
• Sequences can be retrieved from Ensembl using either the getBM function or the getSequence wrapper function
• The output of getSequence can be exported to a FASTA file using the exportFASTA function

getSequence
Available sequences in Ensembl:
• Exon
• 3'UTR
• 5'UTR
• Upstream sequences
• Downstream sequences
• Unspliced transcript/gene
• Coding sequence
• Protein sequence
[Diagram: gene structure schematics illustrating the sequence types]

getSequence
• Arguments of getSequence:
  - id: identifier
  - type: type of identifier used, e.g. hgnc_symbol or affy_hg_u133_plus_2
  - seqType: sequence type to retrieve, e.g. gene_exon, coding, 3utr, 5utr
  - upstream/downstream: number of base pairs upstream/downstream to retrieve

TASK 5 - Ensembl
Retrieve all exons of CDH1

TASK 5 - Ensembl
> seq = getSequence(id = "CDH1", type = "hgnc_symbol", seqType = "gene_exon", mart = ensembl)
> seq[1,]
  gene_exon
1 TACAAGGGTCAGGTGCCTGAGAACGAGGCTAACGTCGTAATCAC
  CACACTGAAAGTGACTGATGCTGATGCCCCCAATACCCCAGCGT
  GGGAGGCTGTATACACCATATTGAATGATGATGGTGGACAATTTG
  TCGTCACCACAAATCCAGTGAACAACGATGGCATTTTGAAAACAG
  CAAAG
  hgnc_symbol
1 CDH1

TASK 6 - Ensembl
Retrieve the 2000 bp sequence upstream of the APC and CUL1 translation start sites

TASK 6 - Ensembl
> promoter = getSequence(id = c("APC", "CUL1"), type = "hgnc_symbol", seqType = "coding_gene_flank", upstream = 2000, mart = ensembl)
> promoter

Homology - Ensembl
• The different species in Ensembl are interlinked
• biomaRt takes advantage of this to provide homology mappings between different species

Linking two datasets
• Two datasets (e.g.
two species in Ensembl) can be linked to each other using the getLDS (get linked dataset) function
• One has to connect to two different datasets and specify the linked dataset using the martL, filtersL, attributesL and valuesL arguments

TASK 7 - Ensembl
Retrieve the human gene symbols and the affy identifiers of their chicken homologs for the following two identifiers from the human affy_hg_u95av2 platform: 1434_at, 1888_s_at

TASK 7 - Ensembl
> human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
Checking attributes and filters ...ok
> chicken = useMart("ensembl", dataset = "ggallus_gene_ensembl")
Checking attributes and filters ...ok
> out = getLDS(attributes = c("affy_hg_u95av2", "hgnc_symbol"), filters = "affy_hg_u95av2", values = c("1888_s_at", "1434_at"), mart = human, attributesL = "affy_chicken", martL = chicken)
> out
  V1        V2   V3
1 1434_at   PTEN GgaAffx.25913.1.S1_a
2 1888_s_at KIT  Gga.606.1.S1_at

Variation BioMart
• dbSNP mapped to Ensembl
> snp = useMart("snp", dataset = "hsapiens_snp")

TASK 8 - Variation
Retrieve all refsnp_ids, with their alleles and positions, that are located on chromosome 8 between bp 148350 and 158612.

TASK 8 - Variation
> out = getBM(attributes = c("refsnp_id", "allele", "chrom_start"), filters = c("chr_name", "chrom_start", "chrom_end"), values = list(8, 148350, 158612), mart = snp)
> out[1:5,]
  refsnp_id     allele chrom_start
1 ENSSNP4490669 C/G    148729
2 ENSSNP5558526 T/C    148909
3 ENSSNP4089737 T/A    149060
4 ENSSNP9060169 C/T    149245
5 ENSSNP4351891 C/G    149250

Ensembl Archives
• Provide an alternate host
> listMarts(host = "may2009.archive.ensembl.org/biomart/martservice/")
  biomart              version
1 ENSEMBL_MART_ENSEMBL Ensembl 54
2 ENSEMBL_MART_SNP     Ensembl Variation 54
3 ENSEMBL_MART_VEGA    Vega 35
4 REACTOME             Reactome (CSHL US)
5 wormbase_current     WormBase (CSHL US)
6 pride                PRIDE (EBI UK)
> ensembl54 = useMart("ENSEMBL_MART_ENSEMBL", host = "may2009.archive.ensembl.org/biomart/martservice/")

Ensembl Archives
Archives can be accessed by setting archive=TRUE or by connecting to a specific host (note that the archive list is currently not up to date in the central repository)
> listMarts(archive = TRUE)
  biomart         version
1 ensembl_mart_51 Ensembl 51
2 snp_mart_51     SNP 51
3 vega_mart_51    Vega 32
4 ensembl_mart_50 Ensembl 50
5 snp_mart_50     SNP 50
> ensembl51 = useMart("ensembl_mart_51", archive = TRUE, dataset = "hsapiens_gene_ensembl")

Gramene
• Gramene is a curated, open-source data resource for comparative genome analysis in the grasses.
• Rice, Maize and Arabidopsis

TASK 9 - Gramene
Retrieve affy ATH1 ids and CATMA ids that map to Arabidopsis thaliana chromosome 1 between base pairs 30,000 and 41,000

TASK 9 - Gramene
> gramene = useMart("ENSEMBL_MART_ENSEMBL", dataset = "athaliana_gene_ensembl")
> getBM(c("affy_ath1_id", "catma_tigr5_id"), filters = c("chromosome_name", "start", "end"), values = list("1", "30000", "41000"), mart = gramene)

TASK 9 - Gramene
  affy_ath1_id catma_tigr5_id
1 261579_at    CATMA1a00040
2 261569_at    CATMA1a00045
3 261569_at    CATMA1a00045
4 261569_at    CATMA1a00045
5 261576_at    CATMA1a00050
6 261576_at    CATMA1a00050

Wormbase
• Database on the genetics of C. elegans and related nematodes.
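
The Task 9 (Gramene) result shown earlier lists some identifier pairs more than once. A minimal sketch of collapsing such repeats with base R's unique(); the data frame is typed in by hand from the slide output rather than fetched online, and the names task9 and task9_unique are only illustrative.

```r
# Task 9 output as shown on the slide, entered by hand (no online query is made).
task9 <- data.frame(
  affy_ath1_id   = c("261579_at", "261569_at", "261569_at",
                     "261569_at", "261576_at", "261576_at"),
  catma_tigr5_id = c("CATMA1a00040", "CATMA1a00045", "CATMA1a00045",
                     "CATMA1a00045", "CATMA1a00050", "CATMA1a00050"),
  stringsAsFactors = FALSE
)

# unique() on a data.frame drops every row that exactly repeats an earlier row.
task9_unique <- unique(task9)
rownames(task9_unique) <- NULL  # renumber the remaining rows from 1

print(task9_unique)
```

The same call can be applied to any data.frame returned by getBM when only the distinct combinations of attributes are of interest.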

TASK 10 - Wormbase
Determine the RNAi ids and the observed phenotypes for the gene with wormbase gene id WBGene00006763

TASK 10 - Wormbase
> worm = useMart("wormbase176", dataset = "wormbase_rnai")
> pheno = getBM(c("rnai", "phenotype_primary_name"), filters = "gene", values = "WBGene00006763", mart = worm)

TASK 10 - Wormbase
> pheno
  rnai           phenotype_primary_name
1 WBRNAi00021278 slow_growth
2 WBRNAi00021278 postembryonic_development_abnormal
3 WBRNAi00021278 embryonic_lethal
4 WBRNAi00021278 larval_lethal
5 WBRNAi00021278 larval_arrest
6 WBRNAi00021278 maternal_sterile
7 WBRNAi00021278 Abnormal
8 WBRNAi00021278 sterile_progeny
9 WBRNAi00026915 slow_growth

Discussion
• Using biomaRt to query public web services gets you started quickly, is easy, and gives you access to a large body of metadata in a uniform way
• You need to be online
• Online metadata can change behind your back, although it is possible to connect to a particular, immutable version of a dataset

Reporting bugs
• Check with MartView whether you get the same output
  - Yes: contact the database, e.g. helpdesk@ensembl.org
  - No: contact me - sdurinck@gmail.com

Acknowledgements
EBI
  - Rhoda Kinsella
  - Arek Kasprzyk
  - Ewan Birney
EMBL
  - Wolfgang Huber
Bioconductor users
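
The Discussion notes that queries need an online connection and that online metadata can change. One possible way to soften both points is to save a query result to disk the first time it is fetched and reuse the saved copy afterwards. The sketch below is an illustration only: cached_query is a hypothetical helper, not part of biomaRt, and fake_query stands in for an online call such as getBM(...) so that the example runs offline (its values are copied from the Task 1 output).

```r
# Hypothetical helper (not part of biomaRt): run a query once, then reuse the saved result.
cached_query <- function(cache_file, query_fun) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))   # reuse the stored result, no query is made
  }
  result <- query_fun()           # first call: actually run the query
  saveRDS(result, cache_file)     # store it for later sessions
  result
}

# Stand-in for an online call such as getBM(...); it counts how often it is run.
calls <- 0
fake_query <- function() {
  calls <<- calls + 1
  data.frame(hgnc_symbol     = c("MYC", "BRAF", "EGFR"),
             chromosome_name = c("8", "7", "7"),
             stringsAsFactors = FALSE)
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached_query(cache_file, fake_query)  # runs the query and writes the cache
second <- cached_query(cache_file, fake_query)  # read from disk, query not run again
```

In real use, query_fun would wrap the getBM call, for example function() getBM(..., mart = ensembl). The cached file then records exactly what the analysis used, whatever the online database returns later.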