Sunday, March 2, 2014

Basic plots in R: plot, boxplot and hist

There are dozens of ways to plot a data set in R. Today I will present just 3 of them:
  • plot
  • boxplot
  • hist
To plot a vector of values, you can simply use the function plot. Try this:
# Get some data to plot
data(iris)
plot(iris$Sepal.Length)

To get a better idea of the distribution, the function boxplot is really useful:
# one set
boxplot(iris$Sepal.Length)
# 2 sets
boxplot(data.frame(iris$Sepal.Length, iris$Sepal.Width))
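If you want to compare the same measurement across groups, boxplot also accepts a formula. Here is a minimal sketch, assuming you want one box per species; the title and axis label are only illustrative choices:

```r
data(iris)

# One box per species, using the formula interface: value ~ group
bp <- boxplot(Sepal.Length ~ Species, data = iris,
              main = "Sepal length by species",
              ylab = "Sepal length")

# boxplot invisibly returns the statistics it drew
bp$names        # group labels, in plotting order
bp$stats[3, ]   # third row holds the median of each group
```

Keeping the returned object is optional, but it is handy when you want the numbers behind the picture.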
Finally, if you need more detail on the distribution, you can use hist:
hist(iris$Sepal.Length)
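The default histogram can be a bit coarse. As a small sketch, you can ask for more bins and add labels; note that breaks = 20 is only a suggestion to R, which may pick a nearby "pretty" number of bins, and the title and label here are just examples:

```r
data(iris)

# Ask for roughly 20 bins and label the plot
h <- hist(iris$Sepal.Length,
          breaks = 20,
          main = "Distribution of sepal length",
          xlab = "Sepal length")

# hist returns the bin edges and counts it used
h$breaks
sum(h$counts) == nrow(iris)   # every observation falls in exactly one bin
```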

Friday, February 28, 2014

Remove NA values in a data frame in R

Hi everybody,

Just a little post for today, but it is something that will be really useful on many occasions.

If you read the previous post, you probably noticed that there were some NA values in the final dataset. However, NA values are sometimes not desired, and you may want to replace them with another value.

We use the function is.na to select the NA values in the dataset and then replace them with 0.
myData[is.na(myData)] <- 0
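To see what this does, here is a small sketch on a made-up data frame (the column names and values are invented for illustration). It also shows na.omit, an alternative that drops incomplete rows instead of replacing values. The replacement by 0 is meant for numeric columns; on factor columns R will warn and keep the NA.

```r
# Toy data with a few missing values (illustrative only)
myData <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Count the NA values per column before cleaning
colSums(is.na(myData))

# Option 1: replace every NA with 0
cleaned <- myData
cleaned[is.na(cleaned)] <- 0

# Option 2: drop every row that contains at least one NA
dropped <- na.omit(myData)
```

Which option is right depends on the problem: replacing with 0 keeps all the rows but changes the data, while na.omit keeps only complete rows.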

Wednesday, February 26, 2014

How to merge csv files with R?

Hi everybody,

Today we are going to see how we can easily merge several csv files together using R.

When you start to work on a problem, you often need to begin by merging the data from many files into a single file. To illustrate this, I will use data from the Walmart store sales forecasting challenge on Kaggle.

Link to datasets: http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data

In this problem, the information is contained in 3 files (we don't consider the test data):
  • train.csv
  • stores.csv
  • features.csv
What we want to do is merge the data of stores.csv and features.csv into train.csv. To do this, we start by loading the datasets into R:
# Load datasets
dfTrain <- read.csv(file='train.csv')
dfStore <- read.csv(file='stores.csv')
dfFeatures <- read.csv(file='features.csv')
To merge train with stores we will use the function merge. By default, merge selects the columns that have the same name in train and stores, uses them as the key, and merges the corresponding rows together. If you want to specify the key columns yourself, you can use the parameter by (same names in both datasets) or by.x and by.y (different names in x and y). The parameter all.x=TRUE indicates that the resulting dataset will keep all the rows of x (dfTrain), even those without a match in y. If you are familiar with SQL, it is equivalent to a left outer join.
dfTrainTmp <- merge(x=dfTrain, y=dfStore, all.x=TRUE)
We can then merge the features:
dfTrainMerged <- merge(x=dfTrainTmp, y=dfFeatures, all.x=TRUE)
And finally save our new merged dataset:
write.table(x=dfTrainMerged,
            file='trainMerged.csv',
            sep=',', row.names=FALSE, quote=FALSE)
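If you want to check how merge behaves before running it on the real files, here is a minimal sketch on two tiny made-up data frames. The values are invented for illustration, and the key is passed explicitly with by to make the join column visible:

```r
# Toy versions of the two datasets (illustrative values only)
train  <- data.frame(Store = c(1, 1, 2, 3),
                     Weekly_Sales = c(100, 120, 90, 70))
stores <- data.frame(Store = c(1, 2),
                     Type = c("A", "B"))

# Columns shared by both datasets: this is the key merge uses by default
intersect(names(train), names(stores))

# Left outer join: keep every row of train, even without a match in stores
merged <- merge(x = train, y = stores, by = "Store", all.x = TRUE)
merged
```

Store 3 has no entry in stores, so its Type is NA in the result. This is where the NA values in a merged dataset typically come from.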

Tuesday, February 25, 2014

Loading and saving a dataset with R

Let's start with the basics!

How to load and save a dataset with R:

To load a dataset you can use the function read.table
myData <- read.table(file='/pathToMyFile/myDataset.txt', sep=';')
We used 2 parameters in this example: file is the path to the dataset to be loaded, and sep is the character used to separate the fields in the dataset.

To save our dataset we can use the function write.table
write.table(x=myData, file='myData.txt', sep=';', row.names=FALSE, col.names=FALSE)
This time we used 5 parameters. The first one, x, is the R object we want to save. Next, file is the name we want to give to the file, and sep is the character that will be used to separate the fields. Finally, row.names and col.names let us choose whether to include the row names and the headers (col.names) in the saved file.
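A quick way to check that saving and loading agree is a round trip through a temporary file. This is only a sketch on a made-up data frame; unlike the example above it keeps the headers (col.names=TRUE) and reads them back with header=TRUE, so the column names survive:

```r
# Toy dataset (illustrative values only)
myData <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

# Write to a temporary file, keeping the column headers
tmpFile <- tempfile(fileext = ".txt")
write.table(x = myData, file = tmpFile, sep = ";",
            row.names = FALSE, col.names = TRUE)

# Read it back, telling read.table that the first line is a header
reloaded <- read.table(file = tmpFile, sep = ";", header = TRUE)

# Compare the original and the reloaded data
all.equal(myData, reloaded)
```

If you save with col.names=FALSE instead, read.table will name the columns V1, V2, and so on when you load the file again.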

Monday, February 24, 2014

Welcome to myR!

I am Alexandre Bujard and I started this blog to share with people the little knowledge I have about R. I hope it will help at least a few R coders to find solutions to their daily R problems.

Cogito, ergo sum.
I think, therefore I am.
-- René Descartes,
Principia philosophiae (Principles of Philosophy) (1644), Part I, Article 7