A data journey pt 2: from collecting to publishing

In the second part of the journey we are taking with our data, we will look at the efforts to make the data ready for publication.

Collect data from IHHD

After the initial contact from the team in the Institute for Health and Human Development (IHHD), we met them in person to collect the data on a CD.

We were talked through the data, shown it open in SPSS, and given information on how to get the documentation. What struck us was how useful the documentation could be. We were eager to get access to the field manual, as it sounded like it could be useful as a standalone document.

Examine data and check for identifiers

Once we had the data files back in the office, we opened each one in SPSS to check it was valid. We also checked to make sure we were not publishing any identifiers which would compromise the anonymity of the participants.

This can be a complicated process with SPSS files. The data was split into many different files for different types of processing. Printing off a check sheet for every file in the collection (data and documentation) helps.
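As an illustration of the kind of check described above, here is a minimal Python sketch that flags variable names that look like identifiers and builds a printable check sheet with one line per file. It is not the process we actually followed: the pattern list, the function names and the example file and variable names are all hypothetical. It also assumes the variable names have already been exported from each SPSS file, since reading .sav files directly would need a third-party library that is not shown here.

```python
import re

# Hypothetical patterns for variable names that may identify a participant.
# A real list would need to be agreed with the project team.
IDENTIFIER_PATTERNS = [
    "name", "address", "phone", "mobile", "dob", "birth",
    "postcode", "email", "gps", "latitude", "longitude",
]

_IDENTIFIER_RE = re.compile("|".join(IDENTIFIER_PATTERNS), re.IGNORECASE)


def flag_possible_identifiers(variable_names):
    """Return the variable names that match any identifier pattern."""
    return [v for v in variable_names if _IDENTIFIER_RE.search(v)]


def build_check_sheet(files):
    """Build one tick-box line per file.

    `files` maps a file name to the list of variable names it contains.
    """
    lines = []
    for file_name in sorted(files):
        flagged = flag_possible_identifiers(files[file_name])
        if flagged:
            status = "REVIEW: " + ", ".join(flagged)
        else:
            status = "no obvious identifiers"
        lines.append(f"[ ] {file_name}: {status}")
    return lines


# Invented example data, for illustration only.
example_files = {
    "household.sav": ["hh_id", "village_name", "income"],
    "anthropometry.sav": ["child_id", "height_cm", "weight_kg"],
}

check_sheet = build_check_sheet(example_files)
for line in check_sheet:
    print(line)
```

A name-based check like this only catches the obvious cases. Each file still needs to be opened and inspected by a person, because free-text fields and combinations of variables can identify participants too.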

At this stage we created a record in our test EPrints repository. This allowed us to show the project team what was needed and how it would be arranged.

Translate/Get access to correct documentation

The documentation we were given, while very full, was written in two local Indian languages (both spoken by many millions of people). These languages do not use the Latin alphabet that English uses, so additional font files were needed to read them in Word format.

As our PCs are locked down by IT (and installing fonts requires admin privileges) this was not a trivial task. It was something we did not think of at all when planning how the ingest process would work. If others are going to be working with international data, language documentation/data issues could be a real problem.

Confirm metadata and other details are correct with all contributors

We had to contact all of the contributors to confirm that they were happy with the metadata, the files uploaded, and the wording of various parts of the abstract/metadata. This was again not something we had planned for. While the PI might be happy with what they’ve given us, there were many people to go through to make sure we were doing everything correctly. This could be the case for all larger teams: different members have different responsibilities and will need to be contacted to make sure we are describing their contribution correctly.

This took time, as the contributors are not all in the same time zone and some are no longer working on the project.

Publish and mint DOI

Once everyone in the project team was happy with the data, we could move the record from our test server to our live server.

We made the record live on the 23rd of March 2015 and minted a DOI at the time (http://dx.doi.org/10.15123/DATA.4). 


When depositing data like this in a repository there are bound to be issues you can’t predict as you go along. We’ve learned from this experience, and will use it in future as an example for others within UEL.

Our experience also highlights the need for good relationships with the researchers, and how much self-depositing could smooth the process.


LIKE a lot: Introducing RDM to a wider audience

On Thursday 25 April I gave an evening talk about research data and universities to a strange-sounding organisation by the name of LIKE. I’m still unsure how to pronounce this acronym (similar to Nike, I suppose), but it stands for London Information and Knowledge Exchange and, according to their web page, is a “community of Library, Information, Knowledge and Communication professionals. We meet monthly to share stories, learn and exchange knowledge in an informal and relaxed setting.” It’s a social gathering as well as a learning environment and, as people pay to attend, it provides a captive audience, for which I was grateful.

Velichka presenting on open data

M25 annual conference

No, it’s not for aficionados of London’s orbital motorway but for librarians – I attended the M25 Consortium’s annual conference at the Wellcome Collection yesterday.

The M25 Consortium of Academic Libraries is a collaborative organisation that works to improve library and information services within the M25 region and more widely across the East and Southeast.

The conference was wide-ranging – with MOOCs, engaging with students, linked open data and open source library management systems all on the agenda. There was also an hour of research data management before lunch. Dr Jonathan Tedds from Leicester gave a good presentation on open data and the depth of data creation in astronomy, then I talked about how libraries can support researchers in managing their data.

The conference theme was The Joy of Sharing, and I entitled my talk “Sharing the load: librarians and research data support services”. You can see it on SlideShare. I wanted to reassure the audience that researchers would be happy to have support and guidance in managing and sharing their research data, and that librarians had relevant skills. These skills may need to be augmented with specific expertise in RDM, but such an enhanced skillset will make one eminently employable. DCC publications, the supportDM course and the wealth of material from the other projects in Jisc’s MRD programme will allow other universities to get started in RDM support. We shall certainly be making productive use of MRD outputs when we plan our RDM support service at UEL this summer.

Open data for open science

The Royal Society issued a major report on 21 June 2012 – one that may have been overlooked in the blizzard of comments on the Finch report published three days earlier. “Science as an open enterprise: open data for open science” is well worth reading – quite inspiring but also practical in its ten recommendations. I paraphrase them here:

  1. Scientists should make their data free and open access, including in an appropriate data repository
  2. Universities should develop data strategies and their capacity to curate their own knowledge resources and support the data needs of researchers, having open data as a default position
  3. Assessment should reward the development of open data on the same scale as journal articles and other publications
  4. Professional bodies should promote the priorities of open science, explore how enhanced data management could benefit their members and how habits might need to change to achieve this
  5. Funders should recognise the costs of preparing data and metadata for curation as part of the costs of the research process
  6. Journals should enforce the requirement that data on which the argument of an article depends should be accessible, assessable, usable and traceable through information in the article, and the article should state conditions of access to the data
  7. Industry sectors and relevant regulators should work together on sharing data that is in the public interest, including negative or null results
  8. Governments should develop policies to open up scientific data that complement policies for open government data
  9. Datasets should be managed according to a system of proportionate governance – this means that personal data is only shared if it is necessary for research with the potential for high public value
  10. Follow good practice and common protocols for security and safety, but remember that security can come from greater openness as well as from secrecy

This report is welcome reading in the light of institutional efforts to build a good data management infrastructure. It places the primary responsibility on the culture of scientists (since it is a Royal Society report), and says that they need support and encouragement from institutions, funders, publishers and their peers. It talks about intelligently open data, which has four attributes:

  • accessible
  • intelligible
  • assessable
  • usable

and it presumes data are open by default. I shall certainly be making use of the Royal Society report in my data management advocacy at UEL.

One thing, though, that I probably won’t take from the report – especially when I talk to arts researchers – is the following definition of data on p. 14:

Data are numbers, characters or images that designate an attribute of a phenomenon

Finch et al (2012), Accessibility, sustainability, excellence: how to expand access to research publications

Royal Society (2012), Science as an open enterprise: open data for open science