A data journey pt 2: from collecting to publishing

In the second part of the journey we are taking with our data, we will look at the efforts to make the data ready for publication.

Collect data from IHHD

After the initial contact from the team in the Institute for Health and Human Development (IHHD), we met them in person to collect the data on a CD.

We were talked through the data, shown it open in SPSS, and given information on how to get the documentation. What struck us was how useful the documentation could be. We were eager to get access to the field manual, as it sounded like it could be useful as a standalone document.

Examine data and check for identifiers

Once we had the data files back in the office, we opened each one in SPSS to check it was valid. We also checked to make sure we were not publishing any identifiers which would compromise the anonymity of the participants.

This can be a complicated process with SPSS files. The data was split into many different files for different types of processing. Printing off a check sheet for every file in the collection (data and documentation) helps.
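A first pass over the variable names can help decide which files need the closest manual inspection. The following is a minimal sketch in Python, assuming the variable names have already been exported from each SPSS file into a plain list. The pattern list, the function name and the sample variable names are all hypothetical and are not taken from the published dataset. It only flags names for a person to review and does not replace checking the values themselves.

```python
import re

# Hypothetical patterns for variable names that often point to direct
# identifiers. Adjust these to the study's own code book.
SUSPECT_PATTERNS = [
    r"name", r"address", r"phone", r"mobile", r"dob", r"birth",
    r"gps", r"latitude", r"longitude", r"village", r"respondent_?id",
]


def flag_possible_identifiers(variable_names):
    """Return the variable names that match any suspect pattern."""
    regex = re.compile("|".join(SUSPECT_PATTERNS), re.IGNORECASE)
    return [v for v in variable_names if regex.search(v)]


# Illustrative variable names only (an assumption, not the real data).
variables = ["hh_size", "RespName", "age_years", "Village_Code", "q12_visits"]
flagged = flag_possible_identifiers(variables)
print(flagged)
```

Anything flagged this way would still go onto the check sheet for that file, so that the decision to keep, recode or remove a variable is recorded.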

At this stage we created a record in our test EPrints repository. This allowed us to show the project team what was needed and how it would be arranged.

Translate/Get access to correct documentation

The documentation we were given, while being very full, was written in two local Indian languages (both spoken by many millions of people). These languages do not use the Latin alphabet that English uses, so additional font files were needed to read them in Word format.

As our PCs are locked down by IT (and installing fonts requires admin privileges) this was not a trivial task. It was something we did not think of at all when planning how the ingest process would work. If others are going to be working with international data, language documentation/data issues could be a real problem.
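One way to anticipate this at the planning stage is to check which writing systems appear in the text of a document before trying to open it on a locked-down PC. The sketch below uses Python's standard `unicodedata` module and assumes the text has already been extracted from the file. The function names and the sample string are hypothetical, and the two non-Latin characters in the sample are illustrative only, not taken from the project documentation.

```python
import unicodedata


def scripts_in(text):
    """Return the set of script names used by the letters in text."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # The first word of a Unicode character name is usually the
            # script, e.g. "LATIN CAPITAL LETTER A" gives "LATIN".
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def non_latin_scripts(text):
    """Return the scripts that may need fonts beyond a default install."""
    return scripts_in(text) - {"LATIN"}


# Illustrative sample: English text plus one letter from each of two
# Indic scripts (an assumption made for the example).
sample = "Field manual \u0c15 \u0915"
print(sorted(non_latin_scripts(sample)))
```

A non-empty result is a prompt to ask IT about fonts early, rather than a guarantee that anything is missing.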

Confirm metadata and other details are correct with all contributors

We had to contact all of the contributors to confirm that they were happy with the metadata, the files uploaded, and the wording of various parts of the abstract/metadata. This was again not something we had planned for. While the PI might be happy with what they’ve given us, there were many people to go through to make sure we were doing everything correctly. This could be the case for all larger teams: different members have different responsibilities and will need to be contacted to make sure we are describing their contribution correctly.

This took time as they are not all in the same time zone, or working on the project anymore.

Publish and mint DOI

Once everyone in the project team was happy with the data, we could move the record from our test server to our live server.

We made the record live on the 23rd of March 2015 and minted a DOI at the time (http://dx.doi.org/10.15123/DATA.4). 


When depositing data like this in a repository there are bound to be issues you can’t predict as you go along. We’ve learned from this experience, and will use it in future as an example for others within UEL.

Our experience also highlights the need for good relationships with the researchers, and how much self-depositing could smooth the process.

A data journey pt 1: from conception to completion

We’re delighted to make available our first research data collection on data.uel. It involved a major survey to demonstrate the take-up of healthcare services in one Indian state, compared to another state and to the situation several years earlier. The dataset and associated documentation are available at http://dx.doi.org/10.15123/DATA.4. The project was led by Mala Rao OBE, until recently Professor of International Health in the Institute of Health and Human Development (IHHD) at UEL.

The project was a major international collaboration with contributors from the Administrative Staff College of India (Hyderabad), ACCESS Health International (Hyderabad), SughaVazhvu Healthcare (Thanjavur), Indian School of Business (Hyderabad) and the Development Research Group (DECRG) of The World Bank, Washington, DC as well as colleagues in IHHD. Funding came from Canada (International Development Research Centre), UK (Wellcome Trust and Department for International Development), USA (Rockefeller Foundation) and from the World Bank. Professor Rao wrote a detailed guide to the contributions made to the study in a BMJ Open article reporting on the results:

[Mala Rao] conceived and designed the study, applied for funding, and was responsible for the supervision and management of all aspects of the study as well as the dissemination of its results. She is the guarantor. [Sofi Bergkvist] shared responsibility for the conception of the study, applications for funding, study design and data collation and analysis, contributed to the questionnaire design and commented on drafts of the report. [Prabal V Singh] contributed to the conception of the study and study design, led the questionnaire design and survey implementation, including training of survey staff, monitoring survey progress and data collation and verification, commented on drafts of the report and helped prepare the references. [Anuradha Katyal] undertook the data collation, verification and analysis, assisted with the survey and questionnaire design and survey implementation and prepared the tables for the report. [Amit Samarth] led the literature review, assisted with the study and questionnaire design, survey implementation and preparation and analysis of baseline data, and commented on drafts of the report. [Manjusha Kancharla] helped with the data analysis. [Adam Wagstaff] devised the methodology for the estimation of the programme impacts, advised during the data-collection and data-preparation stages, wrote and implemented the computer code for the model estimation, helped to oversee the production of the results, and contributed text to the report. [Gopalakrishnan Netuveli] provided technical advice on accounting for the complex survey structure in the analysis, developed a STATA equation, helped to compute an asset index, advised on the output tables, verified the analysis and commented on drafts of the report. [Adrian Renton] helped develop a conceptual framework for the evaluation, advised on funding proposals, the study design, analytical methodology and presentation of results and contributed text to the report. 
[Mala Rao] wrote the first draft of the paper and its redrafts in accordance with the comments of all other authors and reviewers.

It must have been a major undertaking to organise and coordinate research activity on three continents, with concurrent surveys in two Indian states. The data in http://dx.doi.org/10.15123/DATA.4 is available as an SPSS zip file in .SAV format. It comes with extensive documentation; each of the two states (Andhra Pradesh and Maharashtra) has:

  • Field training manual (detailing how to conduct household surveys in rural areas)
  • Bilingual household listing schedule
  • Household survey tool (the questions asked in the survey)
  • Code book of values encoded in SPSS

In addition, we have linked the data to several publications based on it, as well as to the baseline data from an earlier government survey:

  1. Rao, Mala and Katyal, Anuradha and Singh, Prabal V. and Samarth, Amit and Bergkvist, Sofi and Kancharla, Manjusha and Wagstaff, Adam and Netuveli, Gopalakrishnan and Renton, Adrian (2014) ‘Changes in addressing inequalities in access to hospital care in Andhra Pradesh and Maharashtra states of India: a difference-in-differences study using repeated cross-sectional surveys’, BMJ Open, 4(6), e004471. (10.1136/bmjopen-2013-004471).
  2. Narasimhan, H. and Boddu, V. and Singh, Prabal V. and Katyal, Anuradha and Bergkvist, Sofi and Rao, Mala (2014) ‘The Best Laid Plans: Access to the Rajiv Aarogyasri community health insurance scheme of Andhra Pradesh’, Health, Culture and Society, 6(1) (10.5195/hcs.2014.163).
  3. Bergkvist, Sofi and Wagstaff, Adam and Katyal, Anuradha and Singh, Prabal V. and Samarth, Amit and Rao, Mala (2014) What a difference a state makes: health reform in Andhra Pradesh. Working Paper. New York: World Bank. Available at http://documents.worldbank.org/curated/en/2014/05/19546767/difference-state-makes-health-reform-andhra-pradesh.
  4. Katyal, A., Singh, P. V., Samarth, A., Bergkvist, S., & Rao, M. (2013) ‘Using the Indian National Sample Survey data in public health research’, National Medical Journal of India, 26(5), pp. 291-294. Available at http://www.nmji.in/Volume-26-Issue-5.asp.
  5. Rao, M., Ramachandra, S. S., Bandyopadhyay, S., Chandran, A., Shidhaye, R., Tamisettynarayana, S., Thippaiah, A., Sitamma, M., Sunil George, M., Singh, V., Sivasakaran, S. and Bangdiwala, S. I. (2011) ‘Addressing healthcare needs of people living below the poverty line: A rapid assessment of the Andhra Pradesh Health Insurance Scheme’, National Medical Journal of India, 24(6), pp. 335-341. Available at http://www.nmji.in/Volume-24-Issue-6.asp.
  6. National Sample Survey Office (2004) ‘Survey on MORBIDITY AND HEALTH CARE: NSS 60th Round : January 2004 – June 2004’, [dataset] New Delhi: MOSPI, 2004. Available at http://mail.mospi.gov.in/index.php/catalog/138.

In the next part of this Data Journey, David will talk about the work involved in archiving and publishing this data collection.

Open Access workflows

David and I have been thinking about workflows at UEL to support an institutional Open Access policy – which was driven by the recent HEFCE policy for the post-2014 REF.

I thought about the various processes for adding content to a repository like ROAR where an accepted version is being made available (or at least announced) ahead of publication, where some metadata would need to augment the record at a later date. And I thought about who might do this work. ROAR is a mediated repository where I add and check everything (barring a handful of self-depositing academics). Anticipating a great increase in work, we are investigating the case for extra staff in the library to continue this mediated service. So David created two workflows, one for the mediated service and another where researchers manage their own affairs on ROAR. This will help us discuss an institutional approach with academics and senior staff.

Here they are – I wonder how other universities are planning to deal with adding accepted manuscripts to their repositories?

Mediated workflow

Researcher workflow

Research Dialogues

Research Dialogues is a new initiative to help researchers at UEL share good practice in research activity. Facilitated by Library & Learning Services and the Graduate School, lunchtime sessions at each campus will offer a short presentation on a topic of interest and plenty of time for discussion. The first Dialogue takes place during International Open Access Week 2014 and will focus on UEL’s response to the new HEFCE policy for the next REF exercise. This will require articles to be made openly available (not behind a subscription paywall) even to be considered for submission in the next exercise. November’s Dialogue will look at ORCID, an open and persistent ID for researchers.
Here is the programme for October and November, but if you have a topic you want to cover, contact Stephen Grace, the Research Services Librarian, at s.grace@uel.ac.uk or David McElroy, the Research Data Management Officer, at d.mcelroy@uel.ac.uk.

Topic | Date | Venue
Open Access: why should I care, what should I do? | Tuesday 21 October 13.00-14.00 | ED.4.02
Open Access: why should I care, what should I do? | Thursday 23 October 13.00-14.00 | EB.1.39
ORCID – an open ID for researchers | Thursday 20 November 13.00-14.00 | EB.1.39
ORCID – an open ID for researchers | Friday 21 November 13.00-14.00 | ED.4.02

What, this old thing? When people actually use your outputs

In discussions this morning about research information, REF2020 and Open Access I was heartened to see our IT Director produce the survey on managing research data I conducted with John as part of the TraD project. He said that he kept “important documents” like this on the top of his filing system (also called a window sill), and this survey gave him good evidence of what is needed/wanted by UEL researchers. Too often, people start with a system (existing or potential) rather than their requirements.

So I thought I would remind myself what researchers said they wanted from a central RDM support service – and what we are offering in Research Data Services as a response:

What they wanted in 2012 | What we offer in 2014
Guidance and procedures | Website at www.uel.ac.uk/researchdata in final preparation, ad hoc advice in response to emails/phone calls
Training (staff and students) | Workshops via Graduate School’s Researcher Development Programme and embedded in School of Psychology
Help with writing DMPs | Drafting/consultancy offered, promoted by Research and Development Support colleagues
Storage | Data repository for archiving/sharing data; IT investigating storage for active research data

More to do, and plenty to communicate more fully with our audiences at UEL, but I feel this is good progress.

And I was reminded that I once visited a senior researcher, who had pinned the University’s RDM policy on his wall as “inspiration” when preparing funding bids. So these things are used, after all!

Fun with LARD

Friday afternoon saw the inaugural meeting of LARD, the London Area Research Data group for those involved in RDM services. LARD aims to be a practical and supportive environment for sharing news and views with colleagues at other institutions, and the meeting certainly delivered on that. Kindly hosted by King’s College London, people came from across London (and even farther afield): East London, Goldsmiths, Imperial, King’s College, London School of Hygiene and Tropical Medicine, LSE, Middlesex, NatCen Social Research, Queen Mary, Reading, Royal Holloway, Royal Veterinary College and University of London Computer Centre.

We all shared what we have been up to in our respective institutions, and it was interesting to hear how quite different approaches have been taken to supporting researchers. A few things I’ve taken from the meeting:

  • Go to where researchers are to talk to them about RDM
  • Institutional funds for examples of good practice
  • Access to training and information at different points of the research lifecycle
  • Not just formal workshops, but bite-sized resources too
  • Standard guidance texts to help draft DMPs
  • Ask Research Offices to notify of successful awards for RDM support

Lowlander Grand Cafe, copyright image by Ewan Munro CC BY-SA 2.0

Gareth Knight led us on a SWOT analysis of our respective RDM offers, to help us think about what we might want from LARD. We agreed to meet again in the autumn, with a special focus on data repositories. By the time we adjourned for the liquid meet along the street at Lowlander Grand Café, we had just about dried off from the downpour which struck as people arrived.

DataCite client meeting

Yesterday I was at the British Library, as one of 15 new members since the last meeting of DataCite clients managed by BL. We heard from the Cambridge Crystallographic Data Centre, which applied over 500,000 new DOIs to existing CIF crystal structure files in short order! And the ODIN project – a collaboration between ORCID and DataCite – has developed tools to make it easier to relate researchers and data. We will look at maximising the interaction in our new data repository once we are up and running. STFC assigns DOIs to the software it produces (the product, and each version and each release) to make citation easier, and there was discussion about the value of assigning DOIs to grey literature: good to remind everyone that not all data is a) structured b) numerical. We will look to assign DOIs to appropriate material in ROAR where UEL is the publisher, such as the working papers that various UEL research centres produce.

One issue to consider is where in a workflow a DOI is created and minted – before or after publication? I can see that authors would find it useful to know a DOI ahead of time to incorporate it into their texts, but this has to be mediated by our repository software which controls the coining/minting of DOIs. One to explore.
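If the repository's DOI pattern is predictable, an author could in principle be told the DOI before the record goes live. The sketch below is an illustration under stated assumptions, not a description of how our repository software works. It assumes a suffix of the form `DATA.<record number>`, inferred only from the single DOI quoted elsewhere on this blog, and the function names are hypothetical. Nothing here registers a DOI; it only builds the string that would later be minted.

```python
# Prefix taken from the DOI quoted in these posts. The suffix pattern
# "DATA.<id>" is an assumption based on that one example.
DOI_PREFIX = "10.15123"


def predict_doi(record_id, prefix=DOI_PREFIX):
    """Build the DOI a record would receive under the assumed pattern."""
    if not isinstance(record_id, int) or record_id < 1:
        raise ValueError("record_id must be a positive integer")
    return f"{prefix}/DATA.{record_id}"


def doi_url(doi):
    """Return the resolver link in the form used on this blog."""
    return f"http://dx.doi.org/{doi}"


reserved = predict_doi(4)
print(reserved)
print(doi_url(reserved))
```

The risk with quoting a DOI early is that it will not resolve until the record is published, which is part of why the timing needs exploring.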

Overall, it was a very worthwhile meeting, with a chance to hear from other institutions applying DOIs to their research data. And the bottle? A reminder of the “Comics Unmasked: Art and Anarchy in the UK” exhibition, showing at the British Library until 19 August.

Working with PGR students

David and I ran a concluding workshop for the 2nd-year clinical psychology professional doctorate students yesterday. They were a lively and motivated group – an opinion in no way influenced by their having plied us with mince pies and biscuits as this was their last session at university this semester.

David asked them to recall the introductory presentation I gave them on 1 October. Luckily they could: they remembered discussion of backing up data, issues with using USB drives and Dropbox, Data Protection etc. The timing was seen as “ideal” since they were starting to think about their research projects. Only one had started to look at the MANTRA material I recommended, though he found it useful and detailed. Others said they would look at relevant modules when they were underway with their data gathering activity.

I then gave them an exercise based around an existing thesis by one of their predecessors. Using the abstract (taken from the entry in ROAR, our institutional repository) and some bullet points on data aspects I extracted from the thesis, they worked in small groups to answer the questions posed in a template Data Management Plan. Our template adopted the one developed by Jez Cope during the University of Bath’s Research360 project (available at http://opus.bath.ac.uk/30772/). Split into four groups, the students each tackled one section from:

  • Defining your data
  • Looking after your data
  • Sharing your data
  • Archiving your data

They found the exercise worthwhile, and were able to relate it to their own concerns as students managing data as part of their own studies. There were also some challenging questions we couldn’t answer about some of the processes they have to engage with (like what they put in their research ethics application). I suggested they use the template as the basis for discussions with a supervisor.

We will repeat this exercise in the future, and adapt it to other disciplinary settings. I think this suggests a useful model for generating training material specific to a particular discipline.

Research Data Management Workshop

Firstly I should introduce myself; I’m UEL’s new Research Data Management Officer. I performed a similar role at the University of Glasgow, working on the C4D project. The outcome for Glasgow was a live data repository built on the EPrints platform. I am excited to now be part of the team at UEL, where we have an excellent opportunity to provide a fantastic new RDM infrastructure and service to our staff and students.


Stephen and I ran a Research Data Management Workshop yesterday at our Stratford Campus. We had 11 participants, from a variety of backgrounds. We aimed to give a wide outline of the importance of good RDM and the services we offer in the library.

12.00     Welcome and Introductions

12.15     Presentation on managing your research data

13.00     Briefing on exercise using a simple Data Management Plan template

13.30     Feedback and discussion on exercise, and next steps

14.00     Close

Stephen led the introduction and invited the participants to tell us and each other about the sort of research they do, and their relationship with data. There was a very good variety of research data being created and reused, from sensitive patient data, foreign government data, and interviews, to large quantitative datasets.

Stephen then started the presentation on managing your research data. I took over at one point and gave information on backing up and securing data. Once I had finished, Stephen concluded by talking about Data Management Plans.

We then took a short break and encouraged everyone to have a go at completing a sample data management plan we provided, based on work by the DCC. The feedback at the time suggested that this was very helpful, with some saying that it helped make clearer what they need to do for their research.

We gathered the DMPs and plan on providing feedback to those who left their email addresses.

Our feedback forms show that overall the workshop was very well received. It has also given us ideas on how we can improve the flow in future. We are very pleased with the level of interest shown by the participants, reinforcing our view on the importance of providing good support for research at UEL.

Writing a Data Management Plan

Yesterday I ran a 2hr workshop with Sarah Jones (Digital Curation Centre) on writing a DMP. We had six participants from across UEL, who were very engaged and willing to participate – making the trainers’ task more enjoyable. The aim was to introduce the rationale and structure of DMPs, to look at some real-world examples and to start drafting a plan for one’s own research project. Here is the outline of the workshop:

  • 12:00 Welcome and introductions
  • 12:15 Data Management Planning presentation by Sarah Jones
  • 12:45 Walk through example plans
  • 13:15 Work through a template to create a DMP
  • 13:45 Feedback and summary

In the introduction we heard about the data activity of participants, both research students and staff. Sarah then walked through the need to have a data management plan when seeking Research Council funding, but also stressed that they are useful tools for researchers themselves (even without an external requirement). She highlighted the common topics covered by plans, whether from funders or institutions. And we had a walkthrough of DMPonline (in its new improved version, in beta at http://dmponline-beta.dcc.ac.uk/) which helps create a plan customised to a particular need.

Next we looked at a couple of real DMPs – the sample AHRC Technical Plan offered by the University of Bristol (which helpfully includes the assessor’s comments on each section of the plan), and a UK Data Archive one from the ESRC/BBSRC/NERC Rural Economy and Land Use programme. These helped to reassure the participants that DMPs are not long or complicated, and laid the ground for the next exercise – drafting a plan using a straightforward template.

We reused the Research360 project’s template devised for PGR students at the University of Bath. This uses six basic headings, with more specific questions under each to prompt authors:

  • Overview
  • Defining your data
  • Looking after your data
  • Sharing your data
  • Archiving your data
  • Executing your plan

We’re grateful to Jez Cope, the template’s creator, and to Bath for making the template available under a CC-BY licence, which allows others to reuse and adapt it. The template is available in Bath’s OPUS repository (at http://opus.bath.ac.uk/30772/), as is a similar one for research staff.

I wrapped up the workshop with a quick mention of Research Data Services, the support service we are developing at UEL to help staff and students manage their research data. We got some good ideas about what this should cover from participants, so thanks for that. Participants took away a copy of Sarah’s DCC guide on writing DMPs and the UK Data Archive’s Managing and Sharing Data, and an offer to review any plan they worked up after the workshop.