User:Pintoch/accessibility of sources

Here is a simple experiment to estimate how often the |url= parameter of cite journal templates leads to freely accessible full text.

It uses a parsed dump of the CS1 templates found on the English Wikipedia. (The parsed representation is the internal representation of the citation in the Lua code that generates structured metadata from citation templates.)

#!/bin/bash

wget https://zenodo.org/record/55004/files/enwiki_2016-06-01_CS1_citations.tsv.bz2
bzip2 -d enwiki_2016-06-01_CS1_citations.tsv.bz2

# Extract the cite journal templates whose parsed parameters contain a "URL" key
# (this only checks that the key is present, not that the URL is well-formed)
awk -F"\t" '($5 == "cite journal") && (index($6,"\"URL\":") != 0)' enwiki_2016-06-01_CS1_citations.tsv > cite_journal_with_urls.tsv

# Shuffle them
sort -R cite_journal_with_urls.tsv > shuffled.tsv

# Drop duplicate lines (sort -R leaves identical lines adjacent), then take the first 100
uniq shuffled.tsv | head -n 100 > samples.tsv
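
To make the manual check easier, the URLs can be pulled out of the sampled lines into a plain list. The following is only a sketch: it assumes that column 6 stores the URL as a double-quoted string right after the "URL": key (the same key the awk filter above looks for), and the names extract_urls and sample_urls.txt are made up for this example. If the dump escapes characters inside the value (for instance slashes), the output would need further cleaning.

```shell
#!/bin/bash

# Print the value of the "URL" key for each line read on stdin.
# Assumption: the value is a double-quoted string without embedded
# double quotes, e.g. "URL":"http://example.org/".
# Lines without a "URL" key produce no output.
extract_urls() {
    sed -n 's/.*"URL": *"\([^"]*\)".*/\1/p'
}

# Only run on the real sample if it is present in the current directory.
if [ -f samples.tsv ]; then
    cut -f6 samples.tsv | extract_urls > sample_urls.txt
fi
```

Each line of sample_urls.txt can then be opened in a browser and classified by hand.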

Once you have these templates, manually check whether you can access the full text from the URL given. I classified the results into three categories:

  • open: full text available, in a few clicks on the same website: 59
  • closed: subscription or registration required: 25
  • broken: the link no longer leads to the designated resource: 16
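
Since these counts come from a random sample of only 100 templates, the shares carry some sampling uncertainty. The sketch below turns the three counts above into percentages with an approximate 95% margin of error. The normal approximation to the binomial is an assumption added here, not part of the original experiment, and share_with_margin is a hypothetical helper name.

```shell
#!/bin/bash

# Print a category's share of the sample with an approximate 95% margin
# of error, in percentage points (normal approximation to the binomial).
# Usage: share_with_margin COUNT TOTAL
share_with_margin() {
    awk -v k="$1" -v n="$2" 'BEGIN {
        p = k / n
        moe = 1.96 * sqrt(p * (1 - p) / n)
        printf "%.1f%% +/- %.1f\n", 100 * p, 100 * moe
    }'
}

# Counts taken from the manual classification of the 100 samples.
for entry in open:59 closed:25 broken:16; do
    printf '%-7s ' "${entry%%:*}"
    share_with_margin "${entry##*:}" 100
done
```

Under this approximation the open share of 59% has a margin of about 9.6 percentage points, so the figures are best read as rough estimates.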

The full classification for my 100 samples can be found here: [1].

Note that many of the open links use cite journal but, given the source they refer to, should instead use another template such as cite news or cite magazine (cite journal is designed for "academic and scientific papers and journals").