Hi everybody,
Today we are going to see how we can easily merge several csv files together using R.
When you start to work on a problem, you often need to start by merge the data from many files into one unique file. To illustrate my example, I will use data from the
Walmart store sales forecasting challenge on
Kaggle.
Link to datasets:
http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data
In this problem, the information are contained into 3 files (we don't consider test data):
- train.csv
- stores.csv
- features.csv
What we want to do, is to merge the data of stores.csv and features.csv into into train.csv, in order to do this we gonna start by loading the datasets into R :
# Load datasets
dfTrain <- read.csv(file='train.csv')
dfStore <- read.csv(file='stores.csv')
dfFeatures <- read.csv(file='features.csv')
To merge train with stores we will use the function
merge. Merge will automatically select the column with the same name between train and stores to use them as a key and then merge the corresponding rows together. If you want to specify yourself the columns to use you can use the parameters by, by.x or by.y. The parameter all.x indicates that the resulting dataset will have all the rows in x (dfTrain) if you are familiar with SQL it is equivalent to a left outer join.
dfTrainTmp <- merge(x=dfTrain, y=dfStore, all.x=TRUE)
We can then merge the features:
dfTrainMerged <- merge(x=dfTrainTmp, y=dfFeatures, all.x=TRUE)
And finally save our new merged dataset:
write.table(x=dfTrainMerged,
file='trainMerged.csv',
sep=',', row.names=FALSE, quote=FALSE)