How to share data with a data scientist

Jeff Leek posted a very interesting guide for anyone who needs to share data with a statistician. The goals of this guide are to provide some instruction on the best way to share data to avoid the most common pitfalls and sources of delay in the transition from data collection to data analysis. The Leek group works … Continue Reading »

Merge R data frames with SQL

R provides many ways to select data from a data frame. You can use, e.g., [], logical conditions and functions like subset. If you know SQL, you might think that all this would be much easier if you could just use some of the SQL commands you already know. As I found on the Revolution Analytics blog, there is … Continue Reading »
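
To make the base R options mentioned above concrete, here is a minimal sketch. The data frame `df` and its columns `id`, `group` and `value` are made up for illustration and do not come from the original post; only `[]` and `subset` are shown.

```r
# Hypothetical example data (not from the original post)
df <- data.frame(
  id    = 1:5,
  group = c("a", "b", "a", "b", "a"),
  value = c(10, 25, 31, 8, 17)
)

# Selection with []: rows where value > 15, columns id and value
by_index <- df[df$value > 15, c("id", "value")]

# The same selection with subset()
by_subset <- subset(df, value > 15, select = c(id, value))

# Combining two logical conditions inside []
by_logical <- df[df$group == "a" & df$value > 15, ]

print(by_index)
print(by_logical)
```

Both `by_index` and `by_subset` select the same rows; `subset` is mostly a convenience for interactive work, while `[]` is usually the safer choice inside functions.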

Cluster Analysis with R

Cluster analysis is a useful method for finding structure in a mass of data. The main question in cluster analysis is: “Which objects are similar and which are not?” To answer this question, cluster analysis algorithms try to separate the data into clusters, where the similarity within each cluster is maximized and a … Continue Reading »
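
As a rough illustration of the idea, the sketch below clusters R's built-in `iris` data with two common base R approaches, k-means and hierarchical clustering. The choice of data set, the methods and `k = 3` are assumptions made for this example, not necessarily what the full post uses.

```r
# Assumption: the built-in iris data and k = 3 are chosen for illustration only
set.seed(42)                      # k-means uses random starting centers
features <- scale(iris[, 1:4])    # standardize so no variable dominates the distance

# k-means: partition the observations into 3 clusters
fit <- kmeans(features, centers = 3, nstart = 20)

# Hierarchical clustering on Euclidean distances, cut into 3 groups
hc <- hclust(dist(features), method = "ward.D2")
groups <- cutree(hc, k = 3)

# Compare the k-means clusters with the known species labels
print(table(cluster = fit$cluster, species = iris$Species))
```

The cross table shows how well the clusters line up with the species; the cluster numbers themselves are arbitrary labels.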

Databases in R

In some cases it is helpful to store data in a database rather than in a file. Databases have several advantages when working with large amounts of data. The most important is that only the data actually needed for a calculation has to be loaded into random access memory (RAM). Another advantage is that some database engines can run calculations themselves (stored procedures), which can speed up complex calculations on large data sets. A database also removes the need to export results for other programs, e.g. to plot the data with a GIS. A good and easy way to connect R to a database is the RODBC package.
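
A minimal sketch of such a connection with RODBC might look like the following. It assumes an ODBC data source has already been configured on the machine; the DSN name `my_dsn`, the table `measurements` and the column `site_id` are hypothetical placeholders. The point is that the filtering happens in the database, so only the matching rows are loaded into R.

```r
# Sketch only: the table "measurements" and column "site_id" are hypothetical.
fetch_site_rows <- function(dsn, site_id) {
  if (!requireNamespace("RODBC", quietly = TRUE)) {
    stop("package 'RODBC' is required")
  }

  # Open the connection; odbcConnect() returns -1 (not an RODBC object) on failure
  channel <- RODBC::odbcConnect(dsn)
  if (!inherits(channel, "RODBC")) {
    stop("could not connect to data source: ", dsn)
  }
  on.exit(RODBC::odbcClose(channel), add = TRUE)  # always close the connection

  # Let the database do the filtering; only the result is loaded into RAM
  query <- sprintf(
    "SELECT * FROM measurements WHERE site_id = %d",
    as.integer(site_id)
  )
  RODBC::sqlQuery(channel, query)
}
```

With a configured data source, a call such as `fetch_site_rows("my_dsn", 7)` would return the matching rows as a data frame.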