# Padding a Time Series in R

When analyzing and visualizing a new dataset, you’ll often find yourself working with data over time. Most software assumes that the data in a time series is collected at regular intervals, without gaps in the data: while this is usually true of data collected in a laboratory experiment, this assumption is often wrong when working with “dirty” data sources found in the wild.

This can lead to irregularities in many charts. For example, imagine the following dataset:

Time Observations
`2011/11/01` `12`
`2011/12/01` `100`
`2012/01/01` `320`
`2012/06/01` `7`

Note that the gaps between the data points vary in size, from 1 month to 5 months. When we visualize this using d3, the assumption will be to connect the data points in a way that indicates a gradual shift from one value to another.

While this might work for some cases, you may actually want to fill in the gaps in the data like so:

Time Observations
`2011/11/01` `12`
`2011/12/01` `100`
`2012/01/01` `320`
`2011/02/01` `0`
`2011/03/01` `0`
`2011/04/01` `0`
`2011/05/01` `0`
`2012/06/01` `7`

Which would result in a much different chart!

There are many ways to pad the data. I have written scripts in many languages to accomplish this, but settled on R as the quickest way to transform my data. R is an open source programming language and software environment for statistical computing and graphics.

Here’s a quick way to pad your dataset with zero values for missing dates:

``````# Read a comma-delimited file that has the following content
# time,observations
# 2011/11/01,12
# 2012/01/01,320
# 2011/12/01,100
# 2012/06/01,7

# Convert the time column to a date column.
# Accessing a column is done by using the '\$' sign
# like so: raw.data\$time.
raw.data\$time <- as.Date(raw.data\$time)

# sort the data by time. The [*,] selects all rows that
# match the specified condition - in this case an order function
# applied to the time column.
sorted.data <- raw.data[order(raw.data\$time),]

# Find the length of the dataset
data.length <- length(sorted.data\$time)

# Find min and max. Because the data is sorted, this will be
# the first and last element.
time.min <- sorted.data\$time[1]
time.max <- sorted.data\$time[data.length]

# generate a time sequence with 1 month intervals to fill in
# missing dates
all.dates <- seq(time.min, time.max, by="month")

# Convert all dates to a data frame. Note that we're putting
# the new dates into a column called "time" just like the
# original column. This will allow us to merge the data.
all.dates.frame <- data.frame(list(time=all.dates))

# Merge the two datasets: the full dates and original data
merged.data <- merge(all.dates.frame, sorted.data, all=T)

# The above merge set the new observations to NA.
# To replace those with a 0, we must first find all the rows
# and then assign 0 to them.
merged.data\$observations[which(is.na(merged.data\$observations))] <- 0
``````

This will result in the following dataset:

``````> merged.data
time observations
1 2011-11-01           12
2 2011-12-01          100
3 2012-01-01          320
4 2012-02-01            0
5 2012-03-01            0
6 2012-04-01            0
7 2012-05-01            0
8 2012-06-01            7
``````

A substantial portion of any data visualization project involves cleaning, transforming and analysing data. Although R can be intimidating at first, it is a powerful open source tool for working with your data.

We moved off of Disqus for data privacy and consent concerns, and are currently searching for a new commenting tool.

1. Irene, I applaud your effort and as is the case with most R there is always another way. I’ve been using the R package lubridate from Hadley Wickham and it’s wonderful for performing manipulations as you have in the post. Keep hacking! Will

2. Nice one.

Here’s mine:

ts_grid = function(D, ntv=\”day\”,…) {
\u00a0\u00a0\u00a0 m1 = sort(as.character(D[[1]]), decreasing=FALSE)
\u00a0\u00a0\u00a0 mref = as.character(seq( as.Date(m1[1]), as.Date(m1[length(m1)]), by=ntv ))
\u00a0\u00a0\u00a0 idx = mref %in% m1
\u00a0\u00a0\u00a0 dx = mref[-idx]
\u00a0\u00a0\u00a0 stopifnot(dx >= 0)
\u00a0\u00a0\u00a0 D1 = D[1:length(dx),]
\u00a0\u00a0\u00a0 is.na(D1) = TRUE
\u00a0\u00a0\u00a0 D1[[1]] = as.Date(D1[[1]])
\u00a0\u00a0\u00a0 for (c in 1:nrow(D1)) {
\u00a0\u00a0\u00a0 \u00a0\u00a0 \u00a0D1[c,1] = dx[c]
\u00a0\u00a0\u00a0 }
\u00a0\u00a0\u00a0 rbind(D, D1)
}

Yours is shorter which suggests it is probably the better technique.

Is \”padding\” the generally accepted term for this? I refer to it as \”gridding\” because usually i am merging two irregular (incomplete) time series and to do that i usually place both on a single \”grid\” of time values fine-grained enough to accommodate both series.

• Hi Daniele,

You would first need to convert your string into an appropriate R date.
You can read more about it here, but the short answer is that you can provide a format in your asDate function like so:

dates <- as.Date(strDates, \”%m/%d/%Y\”)

Thanks!