Using administrative data for labour market research: getting the balance right

Isabel Stockton, PhD Student, School of Economics, Finance and Management, and panel participant in Research without Borders 2017

Administrative data: it’s one of those phrases that can generate much excitement among economists and some other social scientists, but will never make for scintillating party conversation in any other setting.

However, the possibilities and limits on the use of administrative data for research can have a big impact on the policymaking process and raise tricky ethical questions, so it is important that the conversation is as broad as it can possibly be.

What is administrative data?

Administrative data is collected by the government for a non-research purpose.

For example, as part of my doctoral research I analyse national insurance data on jobs, wages and commuting distances in Germany.

Whenever someone starts or leaves a job, starts to claim unemployment benefits, is assigned to a jobseekers’ training programme or goes on parental or sick leave, this leaves a paper trail.

Economists in particular are very interested in this information: Many of us still subscribe to the traditional credo “Believe what people do, not what they say”.

A better evidence base for policy

A new job, a jobseeker’s allowance claim or a place on a training course allocated by the employment agency all generate paper trails: this is administrative data. Credit – Isabel Stockton.

This kind of data has many advantages: It has already been collected in large quantities, which gives us more confidence in our findings and allows us to analyse very specific subgroups (say, immigrants in service occupations outside of urban areas).

It is very accurate – in many cases, it would even be an offence for employers to misreport wages for national insurance purposes.

Finally, we avoid an issue that often plagues surveys called sample selection.

This problem arises because a survey not only covers a limited number of people, these people are not always randomly selected from the general population.

A famous example of just how problematic this can be was electoral polling ahead of President Roosevelt’s reelection in 1936, still told as a cautionary tale to statistics students the world over.

Using a telephone survey, researchers predicted that Roosevelt would suffer a crushing defeat, but he was actually re-elected by a large margin.

Nowadays, the consensus is that this error arose chiefly because people who had access to a telephone were wealthier than those who didn’t and were also less likely to vote for FDR.

Because everyone is obliged to provide certain types of data to the government, we can avoid these sorts of mistakes using administrative data.

How can the unemployed be supported back into work? This paper uses a combination of administrative and survey data to find out how a series of labour market reforms in Germany managed to slash the unemployment rate (and whether it is an unmitigated success story). Credit – Isabel Stockton.

The data I draw on has been used to evaluate the effects of major labour market reforms like the introduction of the minimum wage in Germany in 2015 and to study the determinants of unemployment to understand what works and what doesn’t in terms of supporting different groups of people back into paid work.

Better evidence on these questions can inform policy and give a factual base to controversial policy debates, both of which are extremely valuable.

But workers and firms never explicitly consented to the use of this data for research, and in fact many of them probably don’t even know about it, which is a far cry from the informed consent that research ethics demand.

Protecting citizens’ privacy

The German data I work with can be used for research under strict conditions.

The data is anonymised as much as possible, which means that names, addresses and other identifiers are erased.

If I still tried to identify individuals in my data, for example through the combination of their occupation and place of residence, that would be a criminal offence punishable by a prison sentence.

The data is stored on secure servers in Germany and I had to travel there to work with it.

Other versions of the dataset that have less potential for identifying individuals can be accessed remotely.

Finally, when applying to use the data, researchers have to justify – to the Ministry of Work and Social Issues, not to other researchers – why their question is relevant to policy and practice, so that workers and firms whose data is used can ultimately benefit from better labour market policies based on sound evidence.

What about the UK?

A topic that has generated much interest: How have immigrants with different characteristics fared in the labour market? This paper studies discrimination against immigrants using data from the same database used in my research. Credit – Isabel Stockton

In Britain, comparable national administrative datasets on the labour markets have not been used by researchers – yet.

But the Digital Economy Act, which has recently been passed in the House of Commons opens up more sources of administrative data.

In addition, the ESRC (which also funds my PhD) has made major investments in creating the infrastructure for this research to take place, establishing the Administrative Data Research Network.

Their web presence offers a wealth of further information for both researchers and the public on the use of administrative data for research in the UK.

Going forward, it is up to us – policymakers, researchers and the public – to continue the debate on the right balance between protecting privacy and creating an evidence base for policy.

A framework on the use of administrative data for research has to set out procedures on a host of issues.

We need to decide who can access the data and what the criteria should be for deciding on applications.

Do researchers need to be affiliated with a UK university? What exactly makes a topic “policy relevant enough” to justify access? And who gets to decide this?

We need to specify under which conditions different datasets can be linked, and we need to establish anonymisation procedures.

We need to train researchers and very importantly, we need to establish monitoring procedures to continually evaluate whether the balance between enough access for good, relevant science and protection of privacy is right.

So the debate is not over when the first papers get written – that should only be its beginning.