Padding a Time Series in R

Posted by Irene Ros

Jun 06 2012

When analyzing and visualizing a new dataset, you’ll often find yourself working with data over time. Most software assumes that the data in a time series is collected at regular intervals, without gaps in the data: while this is usually true of data collected in a laboratory experiment, this assumption is often wrong when working with “dirty” data sources found in the wild.

This can lead to irregularities in many charts. For example, imagine the following dataset:

Time	Observations
`2011/11/01`	`12`
`2011/12/01`	`100`
`2012/01/01`	`320`
`2012/06/01`	`7`

Note that the gaps between the data points vary in size, from 1 month to 5 months. When we visualize this using d3, the assumption will be to connect the data points in a way that indicates a gradual shift from one value to another.

While this might work for some cases, you may actually want to fill in the gaps in the data like so:

Time	Observations
`2011/11/01`	`12`
`2011/12/01`	`100`
`2012/01/01`	`320`
`2011/02/01`	`0`
`2011/03/01`	`0`
`2011/04/01`	`0`
`2011/05/01`	`0`
`2012/06/01`	`7`

Which would result in a much different chart!

There are many ways to pad the data. I have written scripts in many languages to accomplish this, but settled on R as the quickest way to transform my data. R is an open source programming language and software environment for statistical computing and graphics.

Here’s a quick way to pad your dataset with zero values for missing dates:

# Read a comma-delimited file that has the following content
# time,observations
# 2011/11/01,12
# 2012/01/01,320
# 2011/12/01,100
# 2012/06/01,7
raw.data <- read.delim("timefill.csv", header=T, sep=",")

# Convert the time column to a date column.
# Accessing a column is done by using the '$' sign
# like so: raw.data$time.
raw.data$time <- as.Date(raw.data$time)

# sort the data by time. The [*,] selects all rows that
# match the specified condition - in this case an order function
# applied to the time column.
sorted.data <- raw.data[order(raw.data$time),]

# Find the length of the dataset
data.length <- length(sorted.data$time)

# Find min and max. Because the data is sorted, this will be
# the first and last element.
time.min <- sorted.data$time[1]
time.max <- sorted.data$time[data.length]

# generate a time sequence with 1 month intervals to fill in
# missing dates
all.dates <- seq(time.min, time.max, by="month")

# Convert all dates to a data frame. Note that we're putting
# the new dates into a column called "time" just like the
# original column. This will allow us to merge the data.
all.dates.frame <- data.frame(list(time=all.dates))

# Merge the two datasets: the full dates and original data
merged.data <- merge(all.dates.frame, sorted.data, all=T)

# The above merge set the new observations to NA.
# To replace those with a 0, we must first find all the rows
# and then assign 0 to them.
merged.data$observations[which(is.na(merged.data$observations))] <- 0

This will result in the following dataset:

> merged.data
        time observations
1 2011-11-01           12
2 2011-12-01          100
3 2012-01-01          320
4 2012-02-01            0
5 2012-03-01            0
6 2012-04-01            0
7 2012-05-01            0
8 2012-06-01            7

A substantial portion of any data visualization project involves cleaning, transforming and analysing data. Although R can be intimidating at first, it is a powerful open source tool for working with your data.

Posted by
Irene Ros
on June 6th, 2012

Tagged in

Contact Us

We'd love to hear from you. Get in touch!

Email

hello@bocoup.com