Wednesday, April 25, 2012

Fog warning system: part three

Background: I am trying to evaluate the effect on traffic safety of a fog warning system deployed in California in November 1996.  The system was installed by CalTrans on a section of I-5 and SR-120 near Stockton where the accident rate is generally high, particularly during the morning commute when ground fog is common.  The warning system consists of (1) weather monitoring stations that detect fog and (2) changeable message signs that warn drivers to reduce speed.

I will post my findings as I go in order to solicit comments from professionals and demonstrate methods for students.  If I can get permission, I will also post my data and code so you can follow along at home.

Previously: In the first installment I reviewed the first batch of data I am working with, and ran some tests to confirm that Poisson regression is appropriate for modeling the number of accidents in a given day.  In part two I ran Poisson regressions to identify factors that influence the number of accidents per day.

Critical events

I have been waiting to get more details about several events that affected traffic safety during the observation period.  I was able to get in touch with a Transportation Engineer in the Traffic Safety Branch of Caltrans District 10, which includes the study area.  According to Caltrans records, the speed limit on the relevant section of I-5 was increased from 55 to 70 mph on March 25, 1996.  The speed limit on SR-120 was increased from 55 to 65 mph about a month later, on April 22, 1996.  Many thanks to my correspondent for this information!

The automated warning system was activated in November 1996.  My collaborator has collected data on weather measurements made by the system and the warning it displayed.  I hope to get this data processed soon.

Accidents per million vehicles

In the previous article, I ran models with raw accident counts as the dependent variable, and found that traffic volume is a significant explanatory variable.  Not surprisingly, more cars yield more accidents.

Rather than use volume as an explanatory variable, an alternative is to express the dependent variable in terms of accidents per million vehicles.  As a reminder, here's what the traffic volume (in thousands of cars per day) looks like during the observation period:

And here are the raw accident counts:

I divided counts by volume and converted to accidents per million cars.  At the same time I smoothed the curves by aggregating quarterly.  Here's what that looks like:

The vertical red lines show major events expected to affect traffic safety: increased speed limits in March and April 1996, and the activation of the warning system in November 1996.
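For readers following along at home, the conversion and quarterly smoothing can be sketched in pandas.  The numbers and column names below are made up, standing in for the real daily data:

```python
import pandas as pd

# Stand-in daily data: one row per day with an accident count and a
# traffic volume in cars per day (column names are illustrative).
daily = pd.DataFrame({
    "date": pd.date_range("1992-01-01", periods=8, freq="D"),
    "accidents": [0, 1, 0, 2, 0, 0, 1, 0],
    "volume": [52000, 53000, 51000, 52500, 54000, 50000, 52000, 53000],
})

# Aggregate by calendar quarter, then express accidents per million cars.
quarterly = daily.set_index("date").resample("QS")[["accidents", "volume"]].sum()
quarterly["per_million"] = quarterly["accidents"] / quarterly["volume"] * 1e6
```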

This graph suggests several observations:

1. In the control directions, the accident rate was flat from 1992 through 1994, increased quickly in 1995 (before the speed limits were increased) and has been flat ever since.
2. In the treatment directions, the accident rate was trending down until late 1996, including three quarters after the speed limit was increased.  The accident rate increased sharply in 1997 and possibly again in 2000.
3. The accident rate in both directions was unusually low during the third quarter of 1996, when the warning system was activated.  Other than that, there is no obvious relationship between accident rates and the events of 1996.
Since we don't expect the warning system to have much effect on the control directions (that's why they're called "control"), the speed limit changes are by far the most likely explanation for the accident rate changes.  But it is puzzling that a large part of the change occurred before the new speed limits went into effect.  One possibility is that as new speed limits were rolled out throughout California, drivers became accustomed to higher speeds and drove faster even on roads where the new limits were not in effect.  But if that's true, it doesn't explain the continuing decline in the treatment directions.

My collaborator has some data on actual driving speeds before and after 1996.  Once I process that data, I will be able to get back to this puzzle.

Injuries and fatal accidents

In response to a previous post, a reader suggested that if the warning system causes drivers to slow down, it might affect the severity of accidents more than the raw number.  To investigate that possibility, I also plotted the rates for injury accidents (including fatalities) and fatal accidents.

Here is the graph for injury accidents:

The patterns we saw in the previous graph appear here, too.  In addition, this graph suggests, more strongly, the possibility of a second changepoint in late 1999 or 2000.

And here is the graph for fatal accidents:

The number of fatal accidents is, fortunately, small.  During more than 10 years of observation, there were only 26 in the study area.  The trends in the other graphs are not apparent here, other than the general increase in the rate of fatal accidents in the second half of the observation period.

Summary

1. Accident rates in the control and treatment directions increased sharply around 1996, but neither effect is related in an obvious way to increased speed limits or deployment of the warning system.
2. Accident rates were unusually low in the quarter the warning system was activated; other than that, no effect of the warning system is apparent.
3. It looks like there was a second increase in accident rates in late 1999 or 2000.  I will ask my correspondent at Caltrans if he has an explanation.

Next steps

There's not much more I want to do with this data.  Now I need more numbers!  In particular, I will be able to get data from the warning system itself, including:

1. Conditions measured at roadside weather stations, which should be better than the data I have from the airport 8 miles away, and
2. Messages displayed when the warning system was active.
If the warning system has an effect, it should be apparent on the days it is active.  By comparing the treatment and control directions, it should be possible to quantify the effect.
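One way to quantify that comparison is a difference-in-differences style Poisson regression with an interaction between treatment direction and system activity.  This is only a sketch, not a model I have settled on; the column names are hypothetical and the data below is synthetic:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; the real rows would come from TASAS plus
# the warning-system logs.  Hypothetical columns:
#   treatment - 1 for the signed directions (5S, 120W), else 0
#   active    - 1 on days the system displayed a warning
#   aadt      - traffic volume in thousands of cars per day
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),
    "active": rng.integers(0, 2, n),
    "aadt": rng.normal(60, 5, n),
})
lam = np.exp(-1.5 + 0.02 * df.aadt + 0.3 * df.treatment * df.active)
df["accidents"] = rng.poisson(lam)

# The treatment:active coefficient estimates the extra effect of the
# warning on the treatment directions, beyond any change common to
# both directions on active days.
model = smf.poisson("accidents ~ treatment * active + aadt", data=df).fit(disp=0)
print(model.params["treatment:active"])
```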

Also, I have permission now to share the data; I will try to get it posted, along with my code, before the next update.

[UPDATE April 26, 2012]

A reader asks:

"I can think of two ways that overall traffic volume affects accident rates: (1) more cars = more accidents overall, which you control for by measuring accident rates, and now you're seeing rising accident rates per car. So this raises the next thought: (2) more cars = more traffic density, which raises accident rates per car for each car on the road. What happens if you regress on traffic volume squared, or include traffic volume as an independent variable in the accident rate regression? The density effect is likely nonlinear but it's a thought."
This is a great question.  If there is a non-linear relationship between traffic volume and the raw number of accidents, then even after we switch to accident rates, there might still be a positive relationship between traffic volume and accident rates.
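For the record, here is roughly what those regressions look like in statsmodels; the column names are hypothetical and the data is synthetic.  The first variant adds a quadratic volume term; the second moves volume into an offset so the model describes accident rates directly:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the daily data (hypothetical column names).
rng = np.random.default_rng(1)
df = pd.DataFrame({"aadt": rng.normal(60, 5, 3000)})
df["accidents"] = rng.poisson(np.exp(-4 + 0.05 * df.aadt))

# Variant 1: quadratic volume term in the count model.  Centering
# aadt reduces the collinearity between aadt and its square.
df["aadt_c"] = df.aadt - df.aadt.mean()
quad = smf.poisson("accidents ~ aadt_c + I(aadt_c ** 2)", data=df).fit(disp=0)

# Variant 2: model the accident *rate* by moving log(volume) into an
# offset; any remaining aadt coefficient reflects a per-car density
# effect rather than the mechanical more-cars-more-accidents effect.
df["log_volume"] = np.log(df.aadt * 1000)
rate = smf.poisson("accidents ~ aadt", data=df, offset=df.log_volume).fit(disp=0)
```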

I ran these regressions, and in fact there is a relationship, but with the limitations of the data I have, I don't think it means much.  Specifically, I only have annual estimates for traffic volume, so there's no fluctuation over time; traffic volume increases at a nearly constant rate for the entire observation period (see the figure above).

So traffic volume will have a positive relationship with anything else that's increasing, and a negative relationship with anything decreasing.  And that's what I see in the regressions:

All of the relationships are statistically significant, but notice that in the treatment directions, before 1996 when the accident rate was declining, the relationship with traffic volume is negative!

I don't think this variable has any explanatory content; any other ramp function would behave the same way.  If I can get finer-grain data on traffic volume, I might be able to look for a more meaningful effect.

Friday, April 20, 2012

Fog warning system: part two

Background: I am trying to evaluate the effect on traffic safety of a fog warning system deployed in California in November 1996.  The system was installed by CalTrans on a section of I-5 and SR-120 near Stockton where the accident rate is generally high, particularly during the morning commute when ground fog is common.  The warning system consists of (1) weather monitoring stations that detect fog and (2) changeable message signs that warn drivers to reduce speed.

I will post my findings as I go in order to solicit comments from professionals and demonstrate methods for students.  If I can get permission, I will also post my data and code so you can follow along at home.

Previously: In the previous installment I reviewed the first batch of data I'll work with, and ran some tests to confirm that Poisson regression is appropriate for modeling the number of accidents in a given day.

Poisson regressions

Traffic volume

To measure the effect of traffic volume on the number of accidents, I ran a Poisson regression with one row of data per day from January 1, 1992 through March 31, 2002, which is 3743 days.

• Dependent variable: number of accidents in a day.  "Accidents" includes all accidents, "injury" includes only accidents involving an injury or fatality, and "fatal" includes only fatal accidents.
• Explanatory variable: AADT, which is annualized average daily traffic, in units of 1000s of cars.
Here are the results:

The columns are:
• sig: statistical significance.  * indicates borderline significance (p-values near 0.05); ** is significant (p-values less than 0.01); *** is highly significant (very small p-values).
• coeff: the estimated coefficient of the regression.  For example, the coefficient 0.03 means that an increase of one unit of AADT (1000 cars) yields an increase of 0.03 in the expected log(count), where count is the number of accidents.
• % incr: the coefficient converted to percentage increase per unit of AADT.  In this example, the coefficient 0.03 indicates that for an increase of 1000 cars per day, we expect an increase in the number of accidents of roughly 3%.
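The conversion from coefficient to percentage is just exp(coeff) - 1; for example:

```python
import numpy as np

# A Poisson coefficient b multiplies the expected count by exp(b) per
# unit of the predictor, so the percentage increase is exp(b) - 1.
b = 0.03
pct = (np.exp(b) - 1) * 100
print(round(pct, 2))  # 3.05, i.e. "roughly 3%"
```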
Not surprisingly, increased traffic volumes are associated with more accidents (of all types).  In the control and treatment directions, the increase is about the same size, 3-4% for each additional 1000 cars.

For fatal accidents the association is less statistically significant, most likely because the number of fatal accidents is much smaller.  There were 1900 accidents, total, during the observation period; 705 involved injuries but no fatalities; only 26 were fatal.

I conclude that traffic volume has a substantial effect on the number of accidents, so
1. It should be included as an explanatory variable in subsequent models, and
2. It might be important to get additional traffic data, broken down by direction of travel and at a finer time scale than annual!
Weather

The next set of explanatory variables I considered is:
• Fog: binary variable, whether fog was detected at the airport weather station during the day.
• Heavy fog: same as above, but apparently based on a different visibility threshold.
• Precip: total precipitation for the day in 0.1 mm units.
And here's the first surprise.  Controlling for traffic volume, there is no significant relationship between fog and the number of accidents (of any kind).

For heavy fog, there is generally no significant relationship, but:
1. In the control directions only, heavy fog has a significant effect, but the coefficient is negative.  If this effect is real, heavy fog decreases the number of accidents by about 30%.
2. If we break the data set into before and after November 1996, the effect disappears in the "before" dataset.
3. There is no apparent effect on fatal accidents.
There are several possible conclusions:
1. This effect is real, and for some reason heavy fog actually decreases the number of accidents, but only in the control direction.
2. This effect is a statistical fluke, and the fog variables have no explanatory power.  In that case, it is possible that fog in the study area does cause accidents, but measurements at the airport do not reflect conditions in the study area (8 miles away).
On the other hand, the effect of precipitation is consistent, significant, and (as expected) dangerous.  Here are the results for precipitation (controlling for traffic volume):

Each millimeter of precipitation increases the number of accidents by about half a percent. [I am not sure how seriously to take that interpretation, since this relationship is probably non-linear.  It might be better to make binary variables like "rain" and "heavy rain".]
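If I do go that route, the binary variables are easy to construct.  A sketch with made-up values; recall that precip is reported in 0.1 mm units, so 3 mm corresponds to 30 units (the thresholds here are illustrative):

```python
import pandas as pd

# precip is in 0.1 mm units, so 3 mm corresponds to 30 units.
# Values and thresholds are illustrative.
precip = pd.Series([0, 2, 15, 30, 80])
rain = (precip > 0).astype(int)
heavy_rain = (precip >= 30).astype(int)
print(rain.tolist(), heavy_rain.tolist())  # [0, 1, 1, 1, 1] [0, 0, 0, 1, 1]
```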

Summary

Here's what we have so far:

1. As expected, more traffic yields more accidents.
2. Surprisingly, there is no statistical relationship between our fog measurements and accident rates.
3. There is a consistent relationship between precipitation and accidents, but I might have to come back and quantify it more carefully.
Before going further, I want to get more specific information about when the speed limits were changed on these road segments and when the warning system was deployed.  So that's all for now.

Thursday, April 19, 2012

Fog warning system: life saver or road hazard?

I've started a new project, working with a collaborator at another university, to evaluate the impact of a fog warning system deployed on a highway in California in November 1996.  The system was installed by CalTrans on a section of I-5 and SR-120 near Stockton where the accident rate is generally high, particularly during the morning commute when ground fog is common.

The warning system consists of (1) weather monitoring stations that detect fog and (2) changeable message signs that warn drivers to reduce speed.  Among people who study traffic safety, there are two theories about these kinds of systems:

1) The mainstream theory is that when drivers are warned to slow down, they slow down, and lower speeds reduce the accident rate.

2) The heterodox theory, which my collaborator holds, is that warning signs introduce perturbations into the flow of traffic so they can cause more accidents than they prevent.

My job is to evaluate which theory the data support.  Here's what I have to work with:

1) The warning system was activated in November 1996.

2) My collaborator and his students collected data from the CalTrans Traffic Accident Surveillance and Analysis System (TASAS).  It includes all fatal and injury accidents and a large portion of "property damage only" accidents.  Their data runs from Jan 1, 1992 to March 31, 2002, about five years before and six years after the warning system was activated.

3) NOAA's National Climatic Data Center (NCDC) operates a weather station at Stockton Airport (KSCK), about 8 miles from the study area.  I downloaded daily weather data from 1992 through 2002.

4) California DOT publishes average daily traffic volume (ADT) through several locations in the study area.  I downloaded their reports from 1992 through 2002.

But there are some challenges:

1) The biggest problem is that the speed limit in the study area changed in January 1996, 11 months before the warning system was deployed.  It will be difficult to separate the effect of the warning system from the effect of the speed limit.  I have exchanged email messages with someone at CalTrans who might be able to tell me exactly when the speed limit signs were changed on each stretch of road.

2) The traffic volume data is annualized, so it doesn't account for variation on smaller time scales, and it does not distinguish between traffic in different directions.

3) The weather station at the airport is about 8 miles from the study area, and at a higher altitude, so the fog data may not capture the actual conditions in the study area.

However, we have one ace in the hole: the system only shows warnings to traffic moving in one direction (toward the merger of the two highways), so traffic in the other direction acts as a natural control.  I'll refer to the segments with warning signs, 5S and 120W, as the "treatment directions," and the others,  5N and 120E, as the "control directions."

Accident data

To get a quick feel for the data, I plotted the number of accidents per month in the treatment and control directions.

In both directions, the number of accidents increases in 1996.  The change is larger in the treatment direction; before 1996, the treatment direction was safer; after 1996, it became more dangerous.  It looks like the treatment directions might have become more dangerous again in 2000, but just by looking at this figure, it's hard to say if that effect is significant.

Based on raw number of accidents, there is no evidence that the warning system is effective.  But there are several other factors to consider, including traffic volume, speed, and weather.

Traffic volume

This plot shows annualized estimates of average daily traffic volume (ADT) through the study area.

The traffic volume on SR-120 increases consistently during the observation period; volume on I-5 is mostly flat; volume after the merge point increases substantially.  Since many accidents occur near the merge point, I will use the estimates from after the merge point for analysis.

These estimates include both directions of travel, so we can't distinguish volumes in the treatment and control directions.

Weather

This plot shows the number of days per month with fog, heavy fog, or more than 3mm of precipitation, as observed at Stockton Airport:

Not surprisingly, all three variables show seasonal variability.  Other than that, there are no obvious trends.

Regression analysis

Having cleaned and processed the data, we can look for factors that contribute to accidents.  In total, there were 932 accidents in the control directions and 968 in the treatment directions.  Over 3743 days, the average number of accidents per day is 0.51.  Many days have no accidents.  On the worst day in the observation period, there were nine!

I'll use Poisson regression to model the number of accidents in each day as a function of the explanatory factors.  A requirement of Poisson regression is that the distribution of the dependent variable should be (wait for it) Poisson.  To check this requirement, I computed the number of accidents each day for the control and treatment directions, before and after November 15, 1996 (roughly when the warning system was activated).

To check whether these distributions are Poisson, I plotted the complementary CDF on a log-y scale; under this transform, a straight line is characteristic of a Poisson distribution.

In all four cases the transformed CDF is roughly a straight line, so Poisson regression with this data should be just fine.
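For anyone following along, here is a sketch of the CCDF computation, run on synthetic counts standing in for the real data; the plotting call is shown in a comment:

```python
import numpy as np

# Synthetic daily counts standing in for the real data.
rng = np.random.default_rng(2)
counts = rng.poisson(0.4, size=2000)

# Empirical complementary CDF: P(X > v) for each observed value v.
values, freqs = np.unique(counts, return_counts=True)
ccdf = 1.0 - np.cumsum(freqs) / freqs.sum()

# To plot on a log-y scale (dropping the final zero):
#   plt.plot(values[:-1], ccdf[:-1]); plt.yscale("log")
```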

The other characteristic of the Poisson distribution is that the mean and variance are the same; we can check that, too.

                       mean    variance
control    before      0.11    0.14
           after       0.37    0.46
treatment  before      0.06    0.07
           after       0.44    0.62

In each case, the variance is a little higher than the mean, which suggests that there are more multi-accident days than we expect in a Poisson process (it's easy to imagine an explanatory mechanism).  But the difference is small, so again I think we are safe using Poisson regression.
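The mean/variance comparison is a one-liner with grouped data; a sketch with made-up counts and hypothetical column names:

```python
import pandas as pd

# Made-up daily counts with hypothetical grouping columns.
df = pd.DataFrame({
    "direction": ["control"] * 4 + ["treatment"] * 4,
    "period": ["before", "before", "after", "after"] * 2,
    "accidents": [0, 1, 0, 2, 0, 0, 1, 3],
})

# For a Poisson process the mean and variance of daily counts match;
# a variance noticeably above the mean signals overdispersion.
check = df.groupby(["direction", "period"])["accidents"].agg(["mean", "var"])
print(check)
```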

This table also demonstrates the effect I mentioned earlier.  Before the changes in 1996 (increased speed limit and activation of the warning system) there were fewer accidents in the treatment directions (about half).  After 1996 it's the other way around: there are more accidents in the treatment directions.

That's enough with the preliminaries.  Next time we'll get into the analysis and see what factors contribute to the accident rate.

Monday, April 2, 2012

The passive voice is a hoax!

Excessive use of the passive voice in science writing is a self-perpetuated, mutually-perpetrated hoax.  To prove it, I am offering a $100 bounty for the first person who can find a journal that explicitly requires authors to write in the passive voice.

UPDATE April 4, 2012: To my shock and dismay, the bounty has been claimed!  See the new HALL OF SHAME below.

Most style guides recommend the active voice, and most readers prefer it. But in some academic fields, especially the sciences, authors use a stilted and awkward style that replaces clear concise sentences like, "We performed the experiment," with circumlocutions like "The experiment was performed."

Asked why they write like that, many scientists admit that they don't like it, but they are under the impression that journals require it.  They are wrong.  Of the journals that have style guides, the vast majority explicitly ask authors to write in the active voice.

Here's what you can do to help stop the carnage:
• If you are writing scientific articles in the passive voice, check the style guide for your journals. Unless you are explicitly required to write in the passive voice, don't!

• If you are reviewing articles, check the style guide for your journals. Unless the passive voice is explicitly required, don't "correct" sentences in the active voice.

• If you are the editor of a scientific journal, make sure that your style guide explicitly recommends the active voice, and make sure authors and reviewers are aware of your recommendation.

• If you are teaching students to write scientific papers in the passive voice, STOP! There is no reason for students to practice bad writing. If, at some point in the future, they actually have to write like that, they can write a first draft in the active voice and then translate.

• If you know of any other style guides that make a recommendation on this topic, let me know and I will add them to this page. So far I haven't found any that actually call for the passive voice.
Here are the style guides from some of the top journals in science:
• Nature
"Nature journals like authors to write in the active voice ("we performed the experiment..." ) as experience has shown that readers find concepts and results to be conveyed more clearly if written directly."
The Nature Editorial Staff comment on their style recommendations here, and here is a collection of letters to Nature on this topic.

• Science
"Use active voice when suitable, particularly when necessary for correct syntax (e.g., 'To address this possibility, we constructed a lZap library . . .,' not 'To address this possibility, a lZap library was constructed . . .')."
• Proceedings of the National Academy of Sciences USA (PNAS)

From personal correspondence with PNAS Editorial:
"... we do not have a style guide for authors beyond what can be found in the Information for Authors page (http://www.pnas.org/misc/iforc.shtml#prep). There are no rules recommending passive vs. active voice in research articles. I would recommend looking at some PNAS articles in your specific area of interest to get a flavor of the style used."
"[We] feel that accepted best practice in writing and editing favors active voice over passive."
Note: many thanks to my correspondent at PNAS for permission to include these quotes.
• IEEE

The  IEEE Editorial Style Manual doesn't make an explicit recommendation
on this issue, but  for "guidance on grammar and usage,"
it refers to the Chicago Manual of Style, which says:
"As a matter of style, passive voice {the matter will be given careful consideration} is typically, though not always, inferior to active voice {we will consider the matter carefully}."
Update: A reader sent the following: You might be pleased to note that the May 2007 template and instructions for IEEE Transactions articles (http://www.ieee.org/documents/TRANS-JOUR.pdf) is at least permissive:
"If you wish, you may write in the first person singular or plural and use the active voice ("I observed that ..." or  "We observed that ..." instead of "It was observed that ...")."
• ACM

If the ACM has a style guide I can't find it, but one of their publications, Crossroads, does, and it couldn't be clearer:
"Active voice replaces passive voice whenever possible."
• A reader sent me the following note:
The American Chemical Society Style Guide, 3rd Edition writes as follows: "Use the active voice when it is less wordy and more direct than the passive." And "Use first person when it helps to keep your meaning clear and to express a purpose or a decision."
The following are journals whose style guides do not address this issue, which I take as implicit permission to use the active voice, as recommended by virtually all non-scientific style guides:
• Physical Review Letters

Their style guide is silent on this issue.

• Applied Physics Letters

Their instructions call for "good scientific American English," but they don't address the issue of voice explicitly.

They also suggest, "For general format and style, consult recent issues of the Journal." I chose an
article at random and found that it was generally in the active voice:

"We realized the described structure by first creating a 2D hexagonal pattern of etch pits..." with only a few unfortunate uses of the passive voice: "...to reduce therewith the number of stitching
interfaces, the magnification of the FIB images was reduced."

So I take that as implicit permission to write in the active voice...and to use the word "therewith".

• Structure

Nothing explicit, but certainly no call for the passive voice:
"Research papers should be as concise as possible and written in a style that is accessible to the broad Structure readership."
Unfortunately, many journals provide no style guides at all:
Here is an interesting report from an author whose paper was transformed from active to passive by misguided editors.

Here is a note from another reader:
"I also found this gem which you may have already read: http://www.amwa.org/default/publications/journal/vol25.3/v25n3.098.feature.pdf  'Use of the Passive Voice in Medical Journal Articles,' Robert J. Amdur, MD; Jessica Kirwan, MA; and Christopher G. Morris, MS, AMWA Journal, Vol. 25, No. 3, 2010"

Amdur et al. measure the use of passive voice in medical articles and find that 20-30% of sentences are passive, compared with 3-5% in their reference corpus, the Wall Street Journal.  They write:
"We could not find a survey study or consensus statement  addressing the question of why authors of medical journal articles use the passive voice so frequently. No publication guideline mentions goals or limits for the use of the passive voice, and some of the most prestigious references are worded in a way that may encourage authors to use the passive voice whenever it is acceptable to do so. For example, the AMA Manual of Style says that, 'Authors should use the active voice, except in instances in which the actor is unknown or the interest focuses on what is acted on.'"

One point of clarification: I am not an absolutist on this issue. The passive voice has its uses. What I am objecting to is the obsolete tradition of writing scientific papers primarily in the passive voice.  Finally, please do not send me email triumphantly pointing out the (occasional and appropriate) use of the passive voice in my essays.

The fine print

I am offering a $100 bounty for the first person who can find a journal that explicitly requires authors to write in the passive voice.  It has to be a mainstream English-language journal with an online style guide that recommends or requires the passive voice for published articles.  I'll only pay one bounty, to the person whose email gets to my inbox first.  Send email to downey@allendowney.com.  I will be the sole arbiter of whether a submission meets these criteria, but I promise to be reasonable.  I will post the winning submission here.

Background: I wrote this essay several years ago.  Lately I have heard from several readers pointing to additional resources, so I thought I would update the article, clean up some broken links, and move it from my web page to my blog.  Comments and additional resources are welcome.

THE HALL OF SHAME

In response to my bounty, I heard from several readers who found journals that explicitly ask authors to use the passive voice.

The first to reply, and the winner of the bounty, is Jonathan Livengood, who fingered a journal that recommends the passive voice in its style guide:
Note too that the Journal prefers text to be written in the passive voice (e.g. “An experiment on XXX  was undertaken …”) rather than in the active voice (e.g. “I undertook an experiment on XXX …”), though modest use of the active voice is acceptable.
David Weisman reported the style guide for Clinical Oncology and Cancer Research, which recommends:
Materials and Methods:  Use the "passive voice" when describing experimental detail.
Note too that they compound the offence with spurious use of "quotation marks."

Donna Tucker found a borderline case.  She wrote, "The American Meteorological Society no longer recommends the use of passive voice.  It has not been that many years since they did...  They do, however, have specific requirements for the abstract...
First person construction should not be used in the abstract, and references should be omitted because they are not available per se to abstracting services.
Donna continues, "So if I cannot say 'We collected the data.'  I am left with 'The data were collected'.  Although this requirement does not explicitly mandate the use of the passive voice, it does make it unavoidable in certain circumstances."  The journal gets extra demerits for gratuitous use of "per se."

Finally, Michael Allen indicts the Journal of Animal Ecology for this:
The passive voice is preferred in describing methods and results. The active voice may be used occasionally to emphasize a personal opinion (typically in Introduction and Discussion sections).
Thank you to everyone who submitted a claim.  I welcome additional nominations to the Hall of Shame.