A data journey pt 2: from collecting to publishing

In the second part of the journey we are taking with our data, we will look at the efforts to make the data ready for publication.

Collect data from IHHD

After the initial contact from the team in the Institute for Health and Human Development (IHHD), we met them in person to collect the data on a CD.

We were talked through the data, shown it open in SPSS, and given information on how to get the documentation. What struck us was how useful the documentation could be. We were eager to get access to the field manual, as it sounded like it could be useful as a standalone document.

Examine data and check for identifiers

Once we had the data files back in the office, we opened each one in SPSS to check it was valid. We also checked to make sure we were not publishing any identifiers which would compromise the anonymity of the participants.

This can be a complicated process with SPSS files. The data was split into many different files for different types of processing. Printing off a check sheet for every file in the collection (data and documentation) helps.
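A first pass over the variable names can also be scripted before the manual check. The sketch below is not the process we used. It assumes the variable names of each file have already been exported to plain lists (for example with a library such as pyreadstat), and the patterns and file names are made up for illustration.

```python
import re

# Hypothetical patterns for variable names that often hold direct identifiers.
# This list is an illustration only; adjust it for each collection.
IDENTIFIER_PATTERNS = [
    r"name", r"address", r"phone", r"mobile", r"dob", r"birth",
    r"village", r"gps", r"latitude", r"longitude",
]


def flag_identifier_columns(column_names):
    """Return the column names that match any identifier pattern."""
    pattern = re.compile("|".join(IDENTIFIER_PATTERNS), re.IGNORECASE)
    return [col for col in column_names if pattern.search(col)]


def build_check_sheet(files):
    """Map each file name to the columns that need a manual look."""
    return {fname: flag_identifier_columns(cols) for fname, cols in files.items()}


# Made-up file names and variable names, standing in for the real collection.
example_files = {
    "household.sav": ["hh_id", "respondent_name", "q1_visits", "village_code"],
    "expenditure.sav": ["hh_id", "q7_cost", "q8_insurer"],
}

check_sheet = build_check_sheet(example_files)
for fname, flagged in check_sheet.items():
    print(fname, "->", flagged or "nothing flagged")
```

A name match only tells you where to look. Every flagged column still needs a person to open the file and decide, and columns that are not flagged are not guaranteed to be safe.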

At this stage we created a record in our test EPrints repository. This allowed us to show the project team what was needed and how it would be arranged.

Translate/Get access to correct documentation

The documentation we were given, while very full, was written in two local Indian languages (both spoken by many millions of people). These languages do not use the Latin alphabet that English uses, so additional font files were needed to read them in Word format.

As our PCs are locked down by IT (and installing fonts requires admin privileges) this was not a trivial task. It was something we did not think of at all when planning how the ingest process would work. If others are going to be working with international data, language documentation/data issues could be a real problem.

Confirm metadata and other details are correct with all contributors

We had to contact all of the contributors to confirm that they were happy with the metadata, the files uploaded, and the wording of various parts of the abstract/metadata. This was again not something we had planned for. While the PI might be happy with what they’ve given us, there were many people to go through to make sure we were doing everything correctly. This could be the case for all larger teams: different members have different responsibilities and will need to be contacted to make sure we are describing their contribution correctly.

This took time, as they are not all in the same time zone and some are no longer working on the project.

Publish and mint DOI

Once everyone in the project team was happy with the data, we could move the record from our test server to our live server.

We made the record live on the 23rd of March 2015 and minted a DOI at the time (http://dx.doi.org/10.15123/DATA.4). 


When depositing data like this in a repository there are bound to be issues you can’t predict as you go along. We’ve learned from this experience, and will use it in future as an example for others within UEL.

Our experience also highlights the need for good relationships with the researchers, and how much self-depositing could smooth the process.


A data journey pt 1: from conception to completion

We’re delighted to make available our first research data collection on data.uel. It involved a major survey to demonstrate the take-up of healthcare services in one Indian state, compared to another state and to the situation several years earlier. The dataset and associated documentation are available at http://dx.doi.org/10.15123/DATA.4. The project was led by Mala Rao OBE, until recently Professor of International Health in the Institute of Health and Human Development (IHHD) at UEL.

The project was a major international collaboration with contributors from the Administrative Staff College of India (Hyderabad), ACCESS Health International (Hyderabad), SughaVazhvu Healthcare (Thanjavur), Indian School of Business (Hyderabad) and the Development Research Group (DECRG) of The World Bank, Washington, DC as well as colleagues in IHHD. Funding came from Canada (International Development Research Centre), UK (Wellcome Trust and Department for International Development), USA (Rockefeller Foundation) and from the World Bank. Professor Rao wrote a detailed guide to the contributions made to the study in a BMJ Open article reporting on the results:

[Mala Rao] conceived and designed the study, applied for funding, and was responsible for the supervision and management of all aspects of the study as well as the dissemination of its results. She is the guarantor. [Sofi Bergkvist] shared responsibility for the conception of the study, applications for funding, study design and data collation and analysis, contributed to the questionnaire design and commented on drafts of the report. [Prabal V Singh] contributed to the conception of the study and study design, led the questionnaire design and survey implementation, including training of survey staff, monitoring survey progress and data collation and verification, commented on drafts of the report and helped prepare the references. [Anuradha Katyal] undertook the data collation, verification and analysis, assisted with the survey and questionnaire design and survey implementation and prepared the tables for the report. [Amit Samarth] led the literature review, assisted with the study and questionnaire design, survey implementation and preparation and analysis of baseline data, and commented on drafts of the report. [Manjusha Kancharla] helped with the data analysis. [Adam Wagstaff] devised the methodology for the estimation of the programme impacts, advised during the data-collection and data-preparation stages, wrote and implemented the computer code for the model estimation, helped to oversee the production of the results, and contributed text to the report. [Gopalakrishnan Netuveli] provided technical advice on accounting for the complex survey structure in the analysis, developed a STATA equation, helped to compute an asset index, advised on the output tables, verified the analysis and commented on drafts of the report. [Adrian Renton] helped develop a conceptual framework for the evaluation, advised on funding proposals, the study design, analytical methodology and presentation of results and contributed text to the report. 
[Mala Rao] wrote the first draft of the paper and its redrafts in accordance with the comments of all other authors and reviewers.

It must have been a major undertaking to organise and coordinate research activity on three continents, with concurrent surveys in two Indian states. The data in http://dx.doi.org/10.15123/DATA.4 is available as an SPSS zip file in .SAV format. It comes with extensive documentation: for each of the two states (Andhra Pradesh and Maharashtra) there is a

  • Field training manual (detailing how to conduct household surveys in rural areas)
  • Bilingual household listing schedule
  • Household survey tool (the questions asked in the survey)
  • Code book of values encoded in SPSS

In addition, we have linked the data to several publications based on it, as well as to the baseline data from an earlier government survey:

  1. Rao, Mala and Katyal, Anuradha and Singh, Prabal V. and Samarth, Amit and Bergkvist, Sofi and Kancharla, Manjusha and Wagstaff, Adam and Netuveli, Gopalakrishnan and Renton, Adrian (2014) ‘Changes in addressing inequalities in access to hospital care in Andhra Pradesh and Maharashtra states of India: a difference-in-differences study using repeated cross-sectional surveys’, BMJ Open, 4(6), e004471. (10.1136/bmjopen-2013-004471).
  2. Narasimhan, H. and Boddu, V. and Singh, Prabal V. and Katyal, Anuradha and Bergkvist, Sofi and Rao, Mala (2014) ‘The Best Laid Plans: Access to the Rajiv Aarogyasri community health insurance scheme of Andhra Pradesh’, Health, Culture and Society, 6(1) (10.5195/hcs.2014.163).
  3. Bergkvist, Sofi and Wagstaff, Adam and Katyal, Anuradha and Singh, Prabal V. and Samarth, Amit and Rao, Mala (2014) What a difference a state makes: health reform in Andhra Pradesh. Working Paper. New York: World Bank. Available at http://documents.worldbank.org/curated/en/2014/05/19546767/difference-state-makes-health-reform-andhra-pradesh.
  4. Katyal, A., Singh, P. V., Samarth, A., Bergkvist, S., & Rao, M. (2013) ‘Using the Indian National Sample Survey data in public health research’, National Medical Journal of India, 26(5), pp. 291-294. Available at http://www.nmji.in/Volume-26-Issue-5.asp.
  5. Rao, M., Ramachandra, S. S., Bandyopadhyay, S., Chandran, A., Shidhaye, R., Tamisettynarayana, S., Thippaiah, A., Sitamma, M., Sunil George, M., Singh, V., Sivasakaran, S. and Bangdiwala, S. I. (2011) ‘Addressing healthcare needs of people living below the poverty line: A rapid assessment of the Andhra Pradesh Health Insurance Scheme’, National Medical Journal of India, 24(6), pp. 335-341. Available at http://www.nmji.in/Volume-24-Issue-6.asp.
  6. National Sample Survey Office (2004) ‘Survey on MORBIDITY AND HEALTH CARE: NSS 60th Round : January 2004 – June 2004’, [dataset] New Delhi: MOSPI, 2004. Available at http://mail.mospi.gov.in/index.php/catalog/138.

In the next part of this Data Journey, David will talk about the work involved in archiving and publishing this data collection.

Lords of the Data: psychology and research data

One of the work outputs on the TraD project is to deliver a course in research data management for Psychology postgraduate students at University of East London. We’ve therefore been keenly following an academic scandal directly affecting the world of social psychology involving the Dutch psychologists Diederik Stapel (falsifying data) and Dirk Smeesters (massaging data). Similar scandals and retractions in the field have also involved Lawrence Sanna (2012), Marc Hauser (fabricating data, 2010) and Karen Ruggiero (2001).

Brain storm: Social Psychology Theory / Distributed Cognition / CSCW by Rob Enslin

Dutch investigators have released their final report into the case of Stapel from Tilburg University, entitled: Flawed science: The fraudulent research practices of social psychologist Diederik Stapel.

Its findings reveal that Stapel fabricated data in 55 articles and book chapters. So far, 31 of those published papers have been retracted and three others have expressions of concern, although more might follow. In addition, 10 dissertations by students Stapel supervised were found to contain fraudulent data, and it is this that should be brought to the attention of our psychology postgraduate students here at UEL.

The final report makes for sobering reading (Stapel personally taught his department’s scientific ethics course, for example) and is a damning assessment of the discipline itself, referring to “a failure to meet normal standards of methodology. [bringing] into the spotlight a research culture in which… sloppy science, alongside out-and-out fraud, was able to remain undetected for so long.” P. 5

The report has highlighted a number of issues with regard to research data handling, standards and attitudes, which we are likely to cover in our course. More points from the report:

  • “The Stapel group had no protocols for, for example, the collection of data (including standards for questionnaires) or research reports. The PhD students in Mr Stapel’s group were not familiarized with fixed and clear standards. (Stapel) underscored the lack of (fixed and clear) standards; but that too was far from a local issue, according to Mr Stapel.” P. 42

We cannot be certain whether, as Stapel says, this is not a local issue but one that runs through social psychology as a whole. Either way, how research data is collected, and how standards are applied to that data (and taught), should be paramount in the teaching of postgraduates. The report again:

  • “The doctoral examination board must form a clear impression of the way in which research data has been collected.” P. 57

The importance of verifying research data and of replicating findings from it is something we are likely to emphasize to students studying for doctorates in our course.

  • “Research data that underlie psychology publications must remain archived and be made available on request to other scientific practitioners. This not only applies to the dataset ultimately used for the analysis, but also the raw laboratory data and all the relevant research material, including completed questionnaires, audio and video recordings, etc. It is recommended that a system be applied whereby on completion of the experiment, the protocols and data used are stored in such a way that they can no longer be modified. It must be clear who is responsible for the storage of and access to the data. The publications must indicate where the raw data is located and how it has been made permanently accessible. It must always remain possible for the conclusions to be traced back to the original data. Journals should only accept articles if the data concerned has been made accessible in this way.” P. 58
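The report does not say how data should be stored so that it "can no longer be modified". One common approach, offered here as a minimal sketch and not as anything the report prescribes, is to record a checksum for every file when the experiment is completed and compare against it later. The folder and file names below are invented for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_manifest(folder):
    """Map each file's relative path to its checksum."""
    folder = Path(folder)
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(folder, manifest):
    """List files added, removed or altered since the manifest was made."""
    current = make_manifest(folder)
    names = set(manifest) | set(current)
    return sorted(n for n in names if manifest.get(n) != current.get(n))


# Demonstration with a temporary folder standing in for an archive.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw_responses.csv"
    raw.write_text("participant,score\n1,4\n")
    manifest = make_manifest(tmp)           # taken when the study ends
    unchanged = changed_files(tmp, manifest)
    raw.write_text("participant,score\n1,5\n")  # a later edit
    modified = changed_files(tmp, manifest)

print("before edit:", unchanged)
print("after edit:", modified)
```

Checksums do not stop anyone changing a file; they only make a change detectable, and only if the manifest itself is kept somewhere the data holder cannot alter it, such as a repository record.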

There is not much we can add to this set of recommendations. Whether they can or will be implemented is as yet unclear, particularly as similar recommendations were made as far back as 2002, following the Ruggiero case. But there seems to be a desire to change, and the way data is managed in this field is something that, without having planned it, we are now a part of.

As Uri Simonsohn, the researcher who flagged up questionable data in studies by social psychologist Dirk Smeesters, has said: “We in psychology are actually trying to fix things… It would be ironic if that led to the perception that we are less credible than other sciences are. My hope is that five years from now, other sciences will look to psychology as an example of proper reporting of scientific research.”

We hope that the psychology course in research data management we will be running at UEL will be a part of this hoped-for progress and will be seen as an example of good practice for future psychologists.