2015-04-12 14:51:24
Hadley Wickham and the RStudio team have created some new packages for R, which will be very useful for anyone who needs to read data into R (that is, everyone). The readr package provides functions for reading text data into R, and the readxl package provides functions for reading Excel spreadsheet data into R. Both are much faster than the functions you're probably using now.
The readr package provides several functions for reading tabular text data into R. This is a task normally accomplished with the read.table family of functions in R, and readr provides a number of replacement functions that provide additional functionality and are much faster.
First, there's read_table, which provides a near-replacement for read.table. Here's a comparison of using both functions on a file with 4 million rows of data (which I created by stacking copies of this file):
library(readr)
# readr version
dat <- read_table("biggerfile.txt",
                  col_names=c("DAY","MONTH","YEAR","TEMP"))
# base R version
dat2 <- read.table("biggerfile.txt",
                   col.names=c("DAY","MONTH","YEAR","TEMP"))
The commands look quite similar, but while read.table took just over 30 seconds to complete, readr's read_table accomplished the same task in less than a second. The trick is that read_table treats the data as a fixed-format file and uses C++ to process the data quickly. (One small caveat is that read.table supports arbitrary amounts of whitespace between columns, while read_table requires the columns to be lined up exactly. In practice, this isn't much of a restriction.)
Base R has a function for reading fixed-width data too, and here readr really shines:
dat <- read_fwf("biggerfile.txt",
                fwf_widths(c(3,15,16,12),
                           col_names=c("DAY","MONTH","YEAR","TEMP")))
dat2 <- read.fwf("biggerfile.txt", c(3,15,16,12),
                 col.names=c("DAY","MONTH","YEAR","TEMP"))
While readr's read_fwf again accomplished the task in about a second, the standard read.fwf took over 3 minutes — almost 200 times as long.
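If you want to try a comparison like this yourself, here is a minimal sketch of how it could be set up with system.time. The three-row sample file and its column widths are made up for illustration (the real biggerfile.txt is not reproduced here), and on a file this small the timings mean nothing, so point it at your own large file to see a difference. It also assumes your readr version accepts the compact col_types string.

```r
library(readr)

# Hypothetical sample data: day, month, year, temperature in fixed-width columns.
path <- tempfile(fileext = ".txt")
writeLines(c(
  " 1 1 2015 30.5",
  " 2 1 2015 28.0",
  " 3 1 2015 31.2"
), path)

widths <- c(2, 2, 5, 5)   # assumed widths for this sample only
cols   <- c("DAY", "MONTH", "YEAR", "TEMP")

# readr version; "iiid" = three integer columns, then one double column
t_readr <- system.time(
  dat <- read_fwf(path, fwf_widths(widths, col_names = cols),
                  col_types = "iiid")
)

# base R version
t_base <- system.time(
  dat2 <- read.fwf(path, widths, col.names = cols)
)

print(t_readr["elapsed"])
print(t_base["elapsed"])
```

Both calls should give you the same four columns; only the elapsed times are expected to diverge as the file grows.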
Other functions in the package include read_csv (and a European-friendly variant read_csv2) for comma-separated data, read_tsv for tab-separated data, and read_lines for line-by-line file extraction (great for complicated post-processing). The package also makes it much easier to read columns of dates in various formats, and sensibly always handles text data as strings (no more stringsAsFactors=FALSE).
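To make the date and string handling concrete, here is a small sketch. The file contents and column names are invented for illustration, and it assumes a readr release that provides the cols() helper, which may be newer than the 0.1.0 release discussed here.

```r
library(readr)

# Hypothetical CSV with a text column and a day/month/year date column.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "station,date,temp",
  "Alpha,12/04/2015,30.5",
  "Beta,13/04/2015,28.0"
), path)

dat <- read_csv(path, col_types = cols(
  station = col_character(),                 # stays a string, never a factor
  date    = col_date(format = "%d/%m/%Y"),   # parsed straight into Date
  temp    = col_double()
))
str(dat)

# read_lines returns the raw lines as a character vector for custom processing
raw_lines <- read_lines(path)
```

Spelling out the column types is optional, since readr will guess them, but the guess cannot know whether 12/04/2015 is day-first or month-first, so the explicit format is the safer choice for dates like these.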
For data in Excel format, there's also the new readxl package. This package provides functions to read Excel worksheets in both .xls and .xlsx formats. I haven't benchmarked the read_excel function myself, but like the readr functions it's based on a C++ library so should be quite snappy. And best of all, it has no external dependencies, so you can use it to read Excel data on just about any platform: there's no requirement that Excel itself be installed.
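Here is a rough sketch of what reading a worksheet could look like. It assumes a readxl version that ships the readxl_example helper and its bundled datasets.xlsx sample workbook, which later releases do; with your own data you would pass the path to your .xls or .xlsx file instead.

```r
library(readxl)

# Path to a sample workbook bundled with readxl (assumed to be available).
path <- readxl_example("datasets.xlsx")

# List the worksheet names, then read one of them by name.
sheets <- excel_sheets(path)
print(sheets)

dat <- read_excel(path, sheet = "mtcars")
head(dat)
```

A sheet can also be selected by position (for example sheet = 1) if you don't know its name in advance.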
If you try them yourself, let us know how it goes in the comments.
RStudio blog: readr 0.1.0