Writing about data and statistics

Choose the right statistic

In writing about data and statistics, your first step is to choose the right statistic to present to your audience.

Think about the main message

Any number of values can be derived from a set of data, so part of your job as a writer is to decide what you should quote. Usually, state the quantity that best sums up the main message of the data.

To capture the typical standard of living in a town, you could quote the average (or mean) income. That is, add up all the incomes and divide by the number of people or households earning them.

But what if there is 1 extremely rich person in town? That could result in an average that does not give a good indication of the typical experience of someone in that town.

In this case, you could find the median. To do so, sort all the incomes from smallest to biggest, then find the entry halfway down the list. Half the people earn less and half earn more than this income. One huge income is still just 1 person in the list, so it does not affect the median very much.

The median is a better statistic to quote to capture the typical experience of someone in that town.
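
The effect of a single extreme income can be checked with a short calculation. The sketch below uses made-up household incomes (the figures and the `summarise` helper are illustrative assumptions, not real data) to compare the average and the median before and after one very rich household is added.

```python
from statistics import mean, median

# Hypothetical annual household incomes in dollars (illustrative only).
incomes = [42_000, 48_000, 51_000, 55_000, 58_000,
           61_000, 64_000, 70_000, 75_000]

# The same town after one extremely rich household moves in.
incomes_with_outlier = incomes + [25_000_000]


def summarise(values):
    """Return (average, median) for a list of incomes."""
    return mean(values), median(values)


mean_before, median_before = summarise(incomes)
mean_after, median_after = summarise(incomes_with_outlier)

print(f"Without outlier: average {mean_before:,.0f}, median {median_before:,.0f}")
print(f"With outlier:    average {mean_after:,.0f}, median {median_after:,.0f}")
```

With these invented figures, the average jumps from about 58,000 to more than 2.5 million, while the median only moves from 58,000 to 59,500.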

Draw on expert advice from statisticians or mathematicians, if necessary, to make sure your message is accurate.

Make sure the statistic is valid

The fact that a calculation can be done does not make it valid:

Evaluating participant experience

Let’s say we’ve run a training workshop and we want to evaluate participant experience. We ask participants some questions with answers from a Likert scale. For example:

1. Overall, how would you rate the usefulness of course material?

❑ Very poor ❑ Poor ❑ Average ❑ Good ❑ Very good

To sum this up, we could say very poor = 1, poor = 2, average = 3, good = 4 and very good = 5, and then average the responses to get (say) 3.4.

This calculation is not valid, because Likert-scale questions give ordinal data. The ‘distances’ between consecutive categories vary – they must, because they are purely psychological and not actually measured. For example, the difference between ‘very poor’ and ‘poor’ need not be the same as the difference between ‘good’ and ‘very good’, so we cannot average the scores. The calculation can be done, but the underlying assumptions are invalid. This average is not meaningful.

To show aggregated Likert-scale responses, you can use the median (the middle value in the set) or the mode (the most frequent value in the set). Mode is generally preferred.
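
As a minimal sketch of this approach, the example below summarises a set of invented workshop responses by median and mode. The response list and the `likert_summary` helper are assumptions made for illustration. The category order is used only to sort the answers, never to average them.

```python
from collections import Counter
from statistics import median_low, mode

# Categories in order, from lowest to highest.
SCALE = ["Very poor", "Poor", "Average", "Good", "Very good"]
RANK = {label: position for position, label in enumerate(SCALE)}

# Hypothetical responses from 11 participants (illustrative only).
responses = [
    "Good", "Very good", "Average", "Good", "Poor", "Good",
    "Very good", "Average", "Good", "Very poor", "Good",
]


def likert_summary(answers):
    """Return (median label, mode label, counts per category)."""
    ranks = sorted(RANK[answer] for answer in answers)
    # median_low always returns a value that is in the data,
    # so the result maps back to a real category.
    median_label = SCALE[median_low(ranks)]
    mode_label = mode(answers)
    counts = Counter(answers)
    return median_label, mode_label, {label: counts[label] for label in SCALE}


median_label, mode_label, counts = likert_summary(responses)
print("Median response:", median_label)
print("Most common response:", mode_label)
print("Counts:", counts)
```

Reporting the counts per category alongside the median or mode lets readers see the full spread of answers.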

Likert scales are most commonly constructed with 5 or 7 anchor points. Scales with an odd number of possible responses allow the respondent to choose a ‘neutral’ answer because the scale has a midpoint. A Likert scale with an even number of anchor points, most commonly a 4-point scale, is known as a ‘forced choice’ scale because it does not allow for a neutral answer: the respondent must take a position.

Similarly, the correct formula to use when calculating correlations depends on the type of data – are they measurements (e.g. height, weight) or chosen from a list of categories (e.g. eye colour, educational level)?

Do not quote a statistic without understanding it, and knowing that it is valid.

Make sure the number of decimal places included is reasonable

The number of decimal places (or, more generally, significant figures) to which a result is quoted implies the precision of the number. Do not quote the output of a calculation in a way that implies that it is more precise than the inputs:

If we travel 103 km in 1.7 h, our average speed is:

$$\frac{103\mathrm{~km}}{1.7\mathrm{~h}}=60.5882353\mathrm{~km/h}$$

‘103 km’ contains 3 significant figures, and 1.7 h has 2 significant figures. In this case, the least precise input has 2 significant figures, so quote the average speed as 61 km/h.
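
If you are calculating results in code, the rounding can be done at the point of reporting. The sketch below is one possible way to round to a chosen number of significant figures; `round_sig` is a hypothetical helper written for this example, not a standard library function, and you still have to decide how many significant figures the inputs justify.

```python
from math import floor, log10


def round_sig(value, sig_figs):
    """Round value to the given number of significant figures."""
    if value == 0:
        return 0.0
    # Position of the leading digit, e.g. 1 for 60.58 and -2 for 0.0123.
    exponent = floor(log10(abs(value)))
    return round(value, sig_figs - 1 - exponent)


distance_km = 103   # 3 significant figures
time_h = 1.7        # 2 significant figures (the least precise input)

speed = distance_km / time_h
speed_rounded = round_sig(speed, 2)

print(f"Raw result: {speed} km/h")
print(f"Quoted as:  {speed_rounded:.0f} km/h")
```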

Be thorough and even-handed

Data are, generally, objective. Attentive readers will notice if data are misrepresented (accidentally or otherwise), and they may become disengaged from, or more critical of, your content.

Provide context

Without useful context, readers may misinterpret material, fail to recognise its significance, or become uncertain and dissatisfied:

If I say ‘1 in 5 people experience mental health problems’, do I mean:

1 in 5 people will experience a mental health problem at some point in their lives
1 in 5 people will experience some kind of mental health problem in the course of a year
1 in 5 adults experience a diagnosable mental health problem at any given time?

In fact, if we examine the references, we find that 1 in 5 adult Australians experienced a mental illness in 2007.

Source: Facts & figures about mental health, The Black Dog Institute

Always provide the necessary context. The reader only sees what is on the page (or screen), and they may lack your background in the topic. What do they need to know to make the information meaningful?

Avoid cherry picking

No matter how much you wish to see a certain result in the data, do not focus unduly on the aspects that support the result. This will make your conclusions invalid. Readers will notice, and trust will be lost.

Take care with emotive language

Emotive language is more open to disagreement. Think very carefully before using it:

Do not say:
… appallingly, 45% of respondents said …

This needlessly creates a chance for conflict. There may be no disagreement about the 45%, but the reader may not agree with your value judgement that it is ‘appalling’.

Review every adjective and adverb. When in doubt, leave it out.

Drawing out implications of data and statistics does leave room for drama and big statements, but the data themselves must be treated objectively.

Be clear about correlation versus causation

Summing up what the data show is different from explaining why the effect occurs. You might find a correlation between action A and result B. But, without further investigation or recourse to the literature, do not ascribe a cause:

If you say:

A improves/increases/decreases B

A causes B

A leads to B

to achieve B, do A

you are implying a causal link. Only do so if such a link has been shown.

Otherwise, say:

when A is larger, B improves/increases/decreases

when A is large, B tends to be large/small

A and B co-occur.

Caution! Always make a clear distinction between what the data show and what you are supposing.

You are summing up the results of a survey with 2 questions:

How often do you eat broccoli?
How many days of work have you missed because of illness in the last 6 months?

Do not say:
The survey shows that, because broccoli is a good source of many vitamins, people who eat more broccoli get sick less often.

[Why is this unacceptable? Because the survey does not reveal the cause of fewer sick days. Maybe people who eat a lot of broccoli tend to have other habits – like regular exercise – that contribute to their wellbeing.]

Say:
The survey shows that people who eat more broccoli tend to get sick less often, possibly because broccoli is a good source of many vitamins.
or avoid the explanation entirely:
The survey shows that people who eat more broccoli tend to get sick less often.

Use the correct terminology

Statistics has its own jargon. Always keep your audience in mind. Will they be comfortable with terms such as significance testing, multivariate analysis and confidence interval? If not, can you define the terms or rearrange the discussion to do without them? Conversely, avoid talking down to experts.

Here are a few examples of choosing the right terminology.

Average vs mean

The average is obtained by adding up the values and dividing by how many there are. Within statistics, it is known as the (arithmetic) mean. But, unless your audience is mathematically literate, use average.

Percentages, rates and proportions

In 2017, more than a quarter of deaths in Australia were caused by cardiovascular disease. This can be expressed in several ways:

Percentage	27%
Proportion	0.27 or ‘more than a quarter’
Rate	more than 1 in 4

Source: Cardiovascular disease

In a technical publication, use the percentage (or the proportion in decimal terms). It is precise. Use the rate (or the proportion in verbal terms, such as more than a quarter) when writing for a broad audience. This is more human and immediate, although it lacks the precision to please an expert.

Take care when discussing changes in percentages:

If 5% of the population was vegetarian in 2010 and 10% of the population was vegetarian in 2020, this does not mean:
Vegetarianism has grown by 5% over the decade. [Growth of 5% would be an increase of 5% of 5%, giving 5.25% in 2020. Going from 5% to 10% is a relative increase of 100%.]
It should be expressed as:
Vegetarianism has grown by 5 percentage points to 10%.
or:
Vegetarianism has grown from 5% of the population in 2010 to 10% in 2020.
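
The difference between a change in percentage points and a relative (percent) change can be made explicit with a small calculation. This is a sketch only; the function names `describe_change` and `apply_relative_growth` are invented for the example.

```python
def describe_change(old_pct, new_pct):
    """Return (change in percentage points, relative change in percent)."""
    point_change = new_pct - old_pct
    relative_change = (new_pct - old_pct) / old_pct * 100
    return point_change, relative_change


def apply_relative_growth(old_pct, growth_pct):
    """Return the new percentage after growing old_pct by growth_pct percent."""
    return old_pct * (1 + growth_pct / 100)


# Vegetarians as a share of the population: 5% in 2010, 10% in 2020.
points, relative = describe_change(5, 10)
print(f"Change: {points} percentage points ({relative:.0f}% relative increase)")

# What 'grown by 5%' would actually mean: roughly 5.25% of the population.
print(f"5% growth on 5%: about {apply_relative_growth(5, 5):.2f}%")
```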

Probabilities

Impossible events have a probability of 0, certain events a probability of 1. The probability of tossing a head (or tail) with a fair coin is 0.5.

Consider reminding your audience that probabilities cannot predict the outcome of a single event. Probabilities do not say what will happen, only what will happen on average.

Do not confuse probabilities with odds:

Probability of rolling a 2 on a fair dice: 1 ÷ 6 ≈ 0.167 (1 of 6 possible outcomes).
Odds of rolling a 2 on a fair dice: 1:5 (1 outcome is a 2; 5 outcomes are not a 2).
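
The relationship between the two can be sketched in a few lines. The helper names below are assumptions made for this example: probability compares favourable outcomes with all outcomes, while odds compare favourable outcomes with unfavourable ones.

```python
from fractions import Fraction


def probability(favourable, total):
    """Probability = favourable outcomes / all possible outcomes."""
    return Fraction(favourable, total)


def odds_in_favour(favourable, total):
    """Odds = (favourable outcomes, unfavourable outcomes)."""
    return favourable, total - favourable


def odds_to_probability(for_count, against_count):
    """Convert odds of for:against back into a probability."""
    return Fraction(for_count, for_count + against_count)


# Rolling a 2 on a fair six-sided dice.
p = probability(1, 6)
odds = odds_in_favour(1, 6)

print(f"Probability: {p} (about {float(p):.3f})")
print(f"Odds: {odds[0]}:{odds[1]}")
print(f"Odds converted back to a probability: {odds_to_probability(*odds)}")
```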

Avoid discussions in terms of odds unless you are sure your audience is familiar with the idea.

Taking care with technical terms

Some words have specific meanings within data and statistics. Use them with care. When in doubt, check. Here are a few examples.

Word	Does mean	Does not mean
Significant	The result has been shown by rigorous statistical tests to be likely to be real (with a specified probability)	Big; Impressive; Wow! I did not expect that
Trend	A consistent pattern that occurs over a fairly long period	Something that happens once (or even twice); Something that is common; A fashion (something ‘trendy’); A change in behaviour (such as a change in slope of a graph)
Average	I have added up all the values and divided by the number of values	Typical; Normal; Usual; Disappointing (‘that was pretty average’)
Normal	Distributed according to the ‘normal’ distribution (the technical term synonymous with the bell curve)	Typical; Usual; As expected

Always look out for terms that could be confusing, ambiguous or misunderstood. Use the word that has the fewest possible interpretations.

Combine words and images

Data-rich documents usually include text, tables and graphs. Ensure that:

all contribute to the discussion
any repetition of content between text, tables and graphs is for a good reason
the elements do not contradict each other
the data are presented in the most suitable form (Visuals and data discusses this topic in detail)
the rules for correct presentation of text, tables and figures are followed (see Visuals and data).

Graphs should use industry-standard formats and terminology, and work with tables and text to illustrate the points you are making. Keep them simple and clear. If your audience may not be comfortable with statistics, data and mathematics, help them to understand by writing plain English aided by instructive, uncluttered graphs.

Data can be very persuasive because they take us away from the realm of opinion, but are most effective when presented in a comprehensible, logical and clear way.
