Today we are going to see how we can easily merge several csv files together using R.
When you start to work on a problem, you often need to start by merge the data from many files into one unique file. To illustrate my example, I will use data from the Walmart store sales forecasting challenge on Kaggle.
Link to datasets:http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data
In this problem, the information are contained into 3 files (we don't consider test data):
- train.csv
- stores.csv
- features.csv
# Load datasets dfTrain <- read.csv(file='train.csv') dfStore <- read.csv(file='stores.csv') dfFeatures <- read.csv(file='features.csv')To merge train with stores we will use the function merge. Merge will automatically select the column with the same name between train and stores to use them as a key and then merge the corresponding rows together. If you want to specify yourself the columns to use you can use the parameters by, by.x or by.y. The parameter all.x indicates that the resulting dataset will have all the rows in x (dfTrain) if you are familiar with SQL it is equivalent to a left outer join.
dfTrainTmp <- merge(x=dfTrain, y=dfStore, all.x=TRUE)We can then merge the features:
dfTrainMerged <- merge(x=dfTrainTmp, y=dfFeatures, all.x=TRUE)And finally save our new merged dataset:
write.table(x=dfTrainMerged,
file='trainMerged.csv',
sep=',', row.names=FALSE, quote=FALSE)
No comments:
Post a Comment