Adventures in #LibraryCarpentry #1: Excel tips and APIs

Back in March 2017 I had the good fortune to join in with a Library Carpentry course run at Sheffield University. To make it clear from the start the tools I learnt to use there were not chisel, plane and screwdriver but Excel, OpenRefine and Python. ‘Library Carpentry‘, it turns out, is “Software and data skills for library professionals”.

I would recommend this course to other library professionals as we tend to spend more time than we would like staring at spreadsheets hoping that they will enable us to make a better job of managing our resources. In this series of posts I want to explore some to the techniques and tips I picked up through the Library Carpentry training and beyond.

Aim: Follow this tutorial to learn some tips on using Excel and OpenRefine to refine and enhance the data available to you.

Data Need: The problem I want to solve is how to enrich the data in an Institutional Repository by discovering and adding the relevant Document Object Identifier (DOI) for each citation. DOIs are a form of URL that provides a persistent link to content on the internet. Most published articles have a DOI assigned to them by the publisher and registered with a registration agency (the International DOI Foundation).

Having a DOI as an identifier is a useful key into other data sources, as I hope to explore in later posts.

First: find some data

My university has an Institutional Repository called DORA, the De Montfort Open Research Archive. It uses DSpace software to keep track of all the examples of research outputs contributed as citations or full text documents by the university’s research community. We have developed our own custom set of scripts for making reports, so my starting point may be slightly different to that of other repository managers.

I asked the repository for a report of all research outputs submitted by members of a particular department. The report has a column for DOIs, but this is blank where no DOI is known. Although I have 289 lines in the spreadsheet there are 78 that lack a DOI.

I found this by highlighting the whole range of data in the spreadsheet and turning it into a Table using Excel’s Insert > Table prompt. Having the data in a table allows you to filter the view, in this case to show only lines where the DOI cell is blank.

Get some more data

To manually find the DOI for a citation you can use the Crossref website (https://search.crossref.org/).  Paste a citation into the search box and see if it finds a match. For example:

Edwards, G. et al. (2015) Exploring Critical Perspectives of Toxic and Bad Leadership through Film. Advances in Developing Human Resources, 17 (3), pp. 363-375

There is a match in this case and the site reports the DOI as https://doi.org/10.1177/1523422315587903

It would be tedious to keep doing this manually for a long list of items. This is the kind of job that machines can be tasked to do for us. Crossref have provided an Application Programming Interface (API) that enables this to be developed. We are going to be using their REST API to develop such a tool. But the API, being a machine, prefers urlencoded strings to human readable citations with spaces between the words.

Encode a column

Excel has a function for turning text like that in the citation column into a urlencoded format, at least, since Excel 2013. The syntax is ENCODEURL(text), e.g.

ENCODEURL(Edwards, G. et al. (2015) Exploring Critical Perspectives of Toxic and Bad Leadership through Film. Advances in Developing Human Resources, 17 (3), pp. 363-375)

or to encode a cell ENCODEURL(A2)

Apply that function to all the rows in your spreadsheet. The result is the less human-readable, but more machine-friendly:

Edwards%2C%20G.%20et%20al.%20%282015%29%20Exploring%20Critical%20Perspectives%20of%20Toxic%20and%20Bad%20Leadership%20through%20Film.%20Advances%20in%20Developing%20Human%20Resources%2C%2017%20%283%29%2C%20pp.%20363-375

If you have a whole column of similar (but not identical) strings close the file as we are going to open it up again in another application, OpenRefine, so that we can run API calls to enhance the data.

OpenRefine

I am using OpenRefine to explore and enhance the data. there are lots of useful features that we could explore, but I want to press on with the current task. First  we want to create a project by selecting and opening our original spreadsheet in the new OpenRefine interface. Use the Browse button to locate your file and Open to select it. Then click on Next to import the data into OpenRefine. before we can get further we have to give the project a name and click the Create Project button on the top right of the screen.

For this exercise we do not want the lines which already have a DOI so we can use the drop-down arrow on the DOI column to make a Text Facet. This lists all the unique strings in the column. In this case I have 266 rows in the spreadsheet, but only 208 unique strings. The last line in Facet/Filter box that has appeared on the left of the screen shows that 78 cells here are blank.

 

 

 

 

 

 

 

 

Click on the ‘(blank)’  line to include on the non-DOI lines in the spreadsheet. OpenRefine should now be displaying and working on the 78 matching rows. By default 10 are displayed, but if you want to see more lines, you can select a view with 25 or 50 rows instead.

Crossref API

For an introduction to the Crossref API there is a useful website, but I have found the examples in their demo-scripts useful for getting the format of valid requests.

The Drop-down arrow on the OpenRefine column heading offers a menu with an ‘Edit column’ option and a sub-menu offering ‘Add column by fetching URLs…’.

The Preview looks a bit ugly, but is just repeating the value in the original column. We can change this by amending the Expression. First though we should give the Column a new name: ‘Crossref values’ and turn the Throttle delay down to 200 milliseconds.

The trickery is all in what we type into the Expression field. Note that the language operating here is the ‘General Refine Expression Language (GREL)’. Learning more about GREL from the help pages will give you lots of ideas about what you can do with the Expression field, particularly with Strings Functions.

Try this:

“https://api.crossref.org/works?query.bibliographic=”+value+”&mailto=someone@example.org&select=DOI,title”

  • We are going to use the API service at api.crossref.org.
  • We are letting them know that we want to use the works, rather than journals or funders data, and that we are sending them a bibliographic citation.
  • The mailto value should be your own email address so that we get our requests processed more quickly than unidentified requests.
  • We are asking for DOI and title data to be returned to us.

Click OK to run the query. When it completes we should have our Crossref values column populated with the responses from the API. The response is in JSON format, still more machine-readable than human readable at this stage.

How to tame your JSON

The response looks like:

{“status”:”ok”,”message-type”:”work-list”,”message-version”:”1.0.0″,”message”:{“facets”:{},”total-results”:18368540,”items”:[{“DOI”:”10.7202\/050376ar”,”title”:[“D.N. ASHTON, J.J. MAGUIRE : Young Adult in the Labour Market. Paper no 55, Leicester, University of Leicester, 1986, 163 pp.”]},{“DOI”:”10.1017\/s0963926800007033″,”title”:[“Michael Wolff, Urbanity and Journalism: The Victorian Connection, Leicester: Victorian Studies Centre, University of Leicester, 1980. 31 ppDavid A. Reeder, Suburbanity and the Victorian City, Leicester: Victorian Studies Centre, University of Leicester, 1980. 26 ppAsa Briggs, Cities and Countrysides: British and American Experience (1860\u20131914), Leicester: Victorian Studies Centre, University of Leicester, 1982. 22 ppPhilip Collins, Trollope’s London, Leicester: Victorian Studies Centre, University of Leicester, 1982. 26 pp. \u00a31.00 each.”]},…  [there’s more.]

We have several suggestions for matches being returned to us and we need to comb these out into a human readable format.

Use the Dropdown arrow on the new ‘Crossref values’ column to select: edit column > Add column based on this column…

We have a new GREL window and need a command to process the JSON returned to us.

Give the new column the name ‘Crossref title’.

As the expression, type in:

value.parseJson()[“”]

GREL is case-sensitive, if you type in ‘value.parsejson()’, you get an error message: Parsing error at offset 16: Unknown function parsejson. It often happens to me.

The preview shows ‘null’, as we have not yet stated what we want from the JSON. If we change this to:

value.parseJson()[“status”]

We can get the ‘ok’ message out from the jumble of results. Interesting, but not what we are looking for today. Instead, try:

value.parseJson()[“message”]

We get back more JSON, but less than we started with. Dig a bit further with:

value.parseJson()[“message”][“items”]

We are stepping through a nested array of data. Within the items there are a number of pairs of titles and DOIs. As we are dealing with computers, we have to remember that the first item in the array occupies position zero ‘0’.

value.parseJson()[“message”][“items”][0][“title”]

This returns something like ‘[“Flexible Working in Europe: A Review of the Evidence”]’ which is near enough what we are looking for in a title, so press the OK button to create this new column. However, we want to run this again to get the Crossref DOI out as well. it should be easier this time to construct the required GREL expression:

value.parseJson()[“message”][“items”][0][“DOI”]

But this results in something like ‘10.1007/978-3-322-90987-9_6’, which is not completely what we want. So we can change the expression to:

“https://doi.org/”+value.parseJson()[“message”][“items”][0][“DOI”]

We are using some quoted text joined to the results of our query with a plus ‘+’ to get a more useful result. Don’t forget to give you new column a name like ‘Crossref DOI’.

OpenRefine display tweaks

Now we have lots of extra data, but it is not as easy to read as we might like. Hide some of the columns that we don’t currently need with the View > Collapse this column option. use this on all the columns between Title (which we want) and the Crossref title (that we want to compare it to. In fact, I kept the Crossref DOI on display as well.

Scanning down the new columns enables you to spot the matches that show up.

 

They do not all match as some items don’t have DOIs set for them and others may show up as the second or third item in the array. You know how to find those now though, don’t you?

It is still worth investigating whether the content identified by the DOI is the same as the entry in you repository, but you do have a good list of potential candidates for enhancing your repository with persistent links to content. DOIs are also useful as a key to making more matches with the data available through other APIs, as we will explore in later posts.

 

About Philip Adams

Senior Assistant Librarian at De Montfort University. I am interested in digital preservation and the use of data to measure a library's impact. All comments own.
This entry was posted in Library Carpentry. Bookmark the permalink.