Accessing Relational Data from R -- Simplified by SlashDB

September 8, 2017
Elliot Bricker
How-to

When data scientists or analysts outside the software development team analyze the data for insights, many use R for statistical summaries and publication-quality visualizations of the data. But first, to get to the data from the shared data store, those users, many of whom have limited background in SQL and data connectivity, first need to get to the data from the shared data store.

Getting to data is complicated (sort of)

The R language does not include built-in relational database access capability, but there are many data access interfaces available from the various RDBMS vendors and open source implementations. Unfortunately, setting up a connection requires an understanding of the network, server, and database and using each of the built-in access capabilities requires learning the syntax for that vendor. Furthermore, configuration requires testing using an actual username with database privileges.

For the data scientist/analyst in the organization, networking and configuring ODBC falls under IT functions. Sometimes this makes it difficult (depending on your company) to even get started in accessing server-side relational data sources. If you don’t have an IT department, you’ll have to make sure you have permissions to install the correct driver on the server and/or the client, as well. With the many RDBMS implementations, you will be writing code that is either platform-specific or requires relatively complex, platform-specific installation steps.

Data scientist and data analysts consider packages such RJDBC, RODBC and other. To add to the complexity, RJDBC uses a combination of a JDBC compliant database driver and Java Runtime Environment (JRE) to exchange data between R and the database server.

Here are some of the items the hapless data scientist needs for configuring the database listener:

Hostname or IP
Port, e.g., 1521
Service name or SID, e.g., ORCL
Username
Password

In addition, you may have install the R package and load the library for the low-level interface to Java VM – rJava.

An alternative to RJDBC is RODBC. Again, connecting to a database from the RODBC driver involves identifying the location of the server, the name of the database, and supporting credentials. Using the RODBC path will force the data scientist to learn and use R functions such as sqlFetch() to retrieve the data into R. Any valid SQL query can be sent to the database server using the sqlQuery() function. If you have the authority to DROP tables or remove data, then you can use the the sqlDrop() or sqlClear() functions. Obviously, there is a bit of a learning curve for each of these package functions. But first, you will have to code the database connection for the database vendor.

Some examples:

Connecting to Oracle using RODBC

library(RODBC)
channel <- odbcConnect("someDSN", uid="root", pwd="password")
dataFrame <- sqlQuery(channel, "SELECT * FROM dbname.myTable");
close(channel);

Connecting to MS SQL Server using RODBC

library(RODBC)
channel <- odbcDriverConnect('driver={SQL Server};server=MSqlsrvhost;database=mydbname;trusted_connection=true')
dataFrame <- sqlQuery(channel, 'select * from information_schema.tables')

Note that in the above connection string function, more flexibility is available via the function, odbcDriverConnect. Syntax can be described as “DSN=dsn;UID=uid;PWD=pwd” . Or the function can be constructed without a DSN if you specify a driver as in the example above — DRIVER=, as well as other driver-specific parameters that you can include. Definitely not a fun way to start off the process of just getting a data frame in place to do your exploratory data analysis and not what you were hired to do.

Connecting to MySQL can have you using another method, the RMySQL package.

install.packages("RMySQL")
library(RMySQL)

Once the RMySQL library is installed, you’ll have to create a database connection object:

myconn <- dbConnect(MySQL(), user='user', password='password', mydbname='database_name', host='urlForHost')

dbListTables(myconn)

… will list the base tables for the database (mydbname) user in the dbConnect function

You can now enter your SQL command via:

dataFrame   <-dbGetQuery(myconn,'select * from mytable')

You’ll also want to consider housekeeping chores such as a disconnect using:

dbDisconnect(myconn)

All the moving parts of ODBC

You should note that an ODBC environment consists of an ODBC Driver Manager and an ODBC compliant driver for the database server you would like to use.

On Windows, the ODBC Driver Manager is built into the platform, but on Linux or other platforms an ODBC Driver Manager will probably have to be installed. ODBC driver managers such as unixODBC and iODBC will be needed (not to mention iODBC is used by macOS). The connection to the particular RDBMS needs an ODBC driver. These may come with the RDBMS or the ODBC driver manager or be provided separately by the RDBMS vendor. This means that for some RDBMSs there are several different ODBC drivers available, and they can behave differently if and when you need to switch around from one backend data store to another.

The easy way

An alternative to the connectivity considerations seen above can be realized by the SlashDB data API gateway which serves as a web service API. If that sounds foreign, fear not. It all begins with a friendly web-based GUI.

Once SlashDB is installed and connected to your databases, using an intuitive user interface (above), authorized users and their apps can instantly reach your data via reference to URLs. This allows the data scientist/analyst to concentrate on what they do best. Using the Data Discovery feature data analysts browse through available data tables, filter down and even traverse to related records.

SlashDB is unique in that as you browse through data tables in HTML, the URL from the browser’s address bar will be the same for the API, only for a different data format (i.e. JSON, CSV, XML).

The following URL points to an individual record from the Employee table in the Chinook database in HTML format. You can look at it.

https://demo.slashdb.com/db/Chinook/Employee/EmployeeId/1.html

Corresponding API endpoints in JSON, CSV and XML for programs to consume are simply:

Example in R

Here is the above example of a SlashDB HTTP Restful Web Service integration into R.

Or fetch the whole table formatted as CSV straight into a data.frame, in just one line of code:

> read.csv('http://demo.slashdb.com/db/Chinook/Employee.csv')
  EmployeeId LastName FirstName               Title ReportsTo           BirthDate            HireDate                     Address       City State Country PostalCode             Phone               Fax                    Email
1          1    Adams    Andrew     General Manager        NA 1962-02-18T00:00:00 2002-08-14T00:00:00         11120 Jasper Ave NW   Edmonton    AB  Canada    T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457   andrew@chinookcorp.com
2          2  Edwards     Nancy       Sales Manager         1 1958-12-08T00:00:00 2002-05-01T00:00:00                825 8 Ave SW    Calgary    AB  Canada    T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322    nancy@chinookcorp.com
3          3  Peacock      Jane Sales Support Agent         2 1973-08-29T00:00:00 2002-04-01T00:00:00               1111 6 Ave SW    Calgary    AB  Canada    T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712     jane@chinookcorp.com
4          4     Park  Margaret Sales Support Agent         2 1947-09-19T00:00:00 2003-05-03T00:00:00            683 10 Street SW    Calgary    AB  Canada    T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
5          5  Johnson     Steve Sales Support Agent         2 1965-03-03T00:00:00 2003-10-17T00:00:00                7727B 41 Ave    Calgary    AB  Canada    T3B 1Y7  1 (780) 836-9987  1 (780) 836-9543    steve@chinookcorp.com
6          6 Mitchell   Michael          IT Manager         1 1973-07-01T00:00:00 2003-10-17T00:00:00        5827 Bowness Road NW    Calgary    AB  Canada    T3B 0C5 +1 (403) 246-9887 +1 (403) 246-9899  michael@chinookcorp.com
7          7     King    Robert            IT Staff         6 1970-05-29T00:00:00 2004-01-02T00:00:00 590 Columbia Boulevard West Lethbridge    AB  Canada    T1K 5N8 +1 (403) 456-9986 +1 (403) 456-8485   robert@chinookcorp.com
8          8 Callahan     Laura            IT Staff         6 1968-01-09T00:00:00 2004-03-04T00:00:00                 923 7 ST NW Lethbridge    AB  Canada    T1H 1Y8 +1 (403) 467-3351 +1 (403) 467-8772    laura@chinookcorp.com

As seen above, using SlashDB, you can decide whether you want the output to be returned in either JSON, XML, or CSV format. You can use tools such as jsonlite and xml2 to allow you to parse the response. These packages will turn HTTP responses into R objects that you can then use in your R scripts and RStudio environment (more below). You can optionally use the tidyverse package features to tidy the data and then transform it to get the data into the shape you want it in for your analysis and data visualization.

Live demo with R-Studio

To see SlashDB in action, we set up an RDBMS a test server environment containing relational data found in the Chinook sample database. Details of this relational database example project can be found at https://github.com/lerocha/chinook-database . The Chinook data model represents a digital media store (e.g. iTunes), including tables for artists, media tracks, invoices, customers and more.

In this R example, we first load up the libraries needed for retrieving and parsing the output of the relational data returned in the HTTP response. The SlashDB endpoint we’ll use is: http://demo.slashdb.com/db/Chinook/Invoice.json which enumerates 400+ purchases found within the Invoice base table in the Chinook database. The full script is presented here:


#  Example using SlashDB: Plot the total purchases from the media store by country
library(httr)
library(jsonlite)
library(magrittr)
library(ggplot2)

ChinookInvoices <- GET("http://demo.slashdb.com/db/Chinook/Invoice.json")

ChinookInvoices$status_code
ChinookInvoices$headers$'content-type'
names(ChinookInvoices)
# Get the content of the HTTP response
text_content <- content(ChinookInvoices, as = 'text', encoding = "UTF-8")

text_content
 

# parse with jsonlite
json_content <- text_content %>% fromJSON
json_content
BillingCountry <- json_content$BillingCountry
TotalPurchaseAmount <- json_content$Total

 

dat <- data.frame(
    BillingCountry,
    TotalPurchaseAmount
)

p <- ggplot(data=dat, aes(x=BillingCountry, y=TotalPurchaseAmount)) +
    geom_bar(stat="identity")


p

The bar chart is displayed as:

Summary

Using the connected data sources and single URL references will assure that you never need to extract data and do your analysis from stale CSV files and other client-side extract files. The data connectivity problems inherent in JDBC and ODBC are mitigated and the use of SlashDB’s SQL Pass-thru and data transversal features will allow easy navigation of the relational data — a pleasant alternative to writing and debugging SQL Select statements.