flowchart TD
  linkStyle default interpolate basis
  A["Get last 10 posts with DOIs from RS"]
  subgraph loop["Loop through posts"]
    B["Read YAML preface"] --> C{"Is post a cross-post?"}
    C -->|yes| skip
    C -->|no| D{"Is post citeable?"}
    D -->|no| skip
    D -->|yes| E{"Do we have a DOI?"}
    E -->|no| skip
    E ---->|yes| F["Get DOI"]
  end
  A -----> E
  F --> apploop["Append loop to YAML"]
  apploop --> write["Write YAML"]
  write --> qr["Quarto Render"]
  write --> ghc["Github commit action"]
Auto-DOI for Quarto posts via Rogue Scholar
Oh, that’s mint. We can finally use Rogue Scholar to mint DOIs for Quarto posts and append them automagically.
I love posts that allow me to merge some of my addictions. In this case, it’s my love for Quarto project scripts (which I’ve written about elsewhere), my fondness for Rogue Scholar and the overuse of the word ‘mint’ to mean ‘generally really quite rather nice’.
Rogue Scholar is a fantastic tool for science bloggers, and while it’s a little artisanal (i.e. hand-made much of the time) at this point, it’s got some really cool automated features. One is that it registers (mints, hence the abundance of lame peppermint puns across this post) DOIs for your posts.
I’ve been using Rogue Scholar to mint DOIs for my posts for a while now, but it’s always been a bit of a manual process. I’d have to wait for the post to show up on the Rogue Scholar feed, then copy the DOI and paste it into the YAML front matter. It’s not a lot of work, but it’s a bit of a pain. I’ve been meaning to automate it for a while, but I’ve been busy with other things.
Just after I posted about this solution, Martin Fenner, who runs Rogue Scholar, pointed out that there’s now an API. The API is great, and would have spared me the part of having to scrape the HTML. I will, one of these days, switch over: if I had to build it today, I’d obviously use the API and simply parse the JSON result. The rest, ceteris paribus, holds true.
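If the scrape were swapped for the API, the parsing step might look something like the sketch below. The payload shape is purely an assumption for illustration: the field names `items`, `title` and `doi` are hypothetical and have not been checked against the actual Rogue Scholar API, so they would need adjusting to whatever the real response looks like.

```python
import json
from typing import Dict

DOI_PREFIX = "https://doi.org/"


def dois_from_api_payload(payload: str) -> Dict[str, str]:
    """Map post titles to bare DOIs from a JSON string.

    ASSUMPTION: the payload is an object with an "items" list whose entries
    carry "title" and "doi" keys. These names are hypothetical.
    """
    posts = {}
    for item in json.loads(payload).get("items", []):
        title, doi = item.get("title"), item.get("doi")
        if title and doi:
            # Accept either a bare DOI or a full doi.org link.
            posts[title] = doi[len(DOI_PREFIX):] if doi.startswith(DOI_PREFIX) else doi
    return posts


# Made-up sample payload, not a real API response.
sample = json.dumps({"items": [
    {"title": "Auto-DOI for Quarto posts via Rogue Scholar",
     "doi": "https://doi.org/10.59350/5hxdg-fz574"},
    {"title": "A post without a DOI yet", "doi": None},
]})
print(dois_from_api_payload(sample))
```

The function returns the same title-to-DOI dictionary as the scraper, so the rest of the script would not need to change.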
This weekend, I was laid up, being on the receiving end (for once) of the bounties of a clinical trial, so I decided to finally build it. It’s a bit of a hack, but it works.
First, we scrape Rogue Scholar for titles and DOIs. Rogue Scholar’s CSS isn’t really helpful here, as the link isn’t a particular class/id of its own as far as I could discern, so I just grabbed the link by the fact that only DOI links are formatted like DOI links. Not the most elegant way, but it does the job.
import logging
from typing import Dict

import requests
from bs4 import BeautifulSoup


def scrape_blog_for_dois(url) -> Dict[str, str]:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Only look at the ten most recent posts
    articles = soup.select('article')[:10]

    posts_with_dois = {}

    for article in articles:
        title = article.select_one('h3').text
        # 'a[href]' skips anchors without an href, which would otherwise raise a KeyError
        doi_link = next((a['href'] for a in article.select('a[href]')
                         if a['href'].startswith('https://doi.org')), None)
        if doi_link:
            # Drop the resolver prefix and keep only the DOI itself
            posts_with_dois[title] = doi_link.split('https://doi.org/')[1]

    logging.info(f"Found {len(posts_with_dois)} posts with DOIs:")
    for title, doi in posts_with_dois.items():
        logging.info(f"{title}: {doi}")
    return posts_with_dois
- The [:10] slice is technically unnecessary, as Rogue Scholar currently only displays ten links, but hey.
- The split on https://doi.org/ separates the DOI link into the link prefix and the DOI. We don’t need the prefix, so we just grab the second part of the split.
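To see the 'only DOI links look like DOI links' trick in isolation, here is a minimal offline sketch that uses only the standard library's HTMLParser rather than BeautifulSoup. The markup is made up for illustration and the real Rogue Scholar page may be structured differently; DoiLinkCollector is a hypothetical helper, not part of the script above.

```python
from html.parser import HTMLParser
from typing import List

DOI_PREFIX = "https://doi.org/"


class DoiLinkCollector(HTMLParser):
    """Collect the DOI part of every <a href> that points at doi.org."""

    def __init__(self) -> None:
        super().__init__()
        self.dois: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith(DOI_PREFIX):
            # Drop the resolver prefix and keep only the DOI itself.
            self.dois.append(href[len(DOI_PREFIX):])


# Made-up markup for illustration; the real page structure may differ.
html = """
<article>
  <h3>Auto-DOI for Quarto posts via Rogue Scholar</h3>
  <a href="https://chrisvoncsefalvay.com/posts/auto-doi/">Read</a>
  <a href="https://doi.org/10.59350/5hxdg-fz574">DOI</a>
  <a name="no-href-here">anchor</a>
</article>
"""
collector = DoiLinkCollector()
collector.feed(html)
print(collector.dois)
```

Only the doi.org link survives the filter; the ordinary link and the anchor without an href are ignored.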
Next, we iterate through each blog post. This is actually quite fast, since (1) we have relatively few of them, (2) they’re text documents. We parse the YAML preface at the beginning of each of them. This looks something like this:
categories:
- Quarto
citation: true
date: 2023-11-13
description: 'Oh, that''s mint. We can finally use Rogue Scholar to mint DOIs for Quarto posts and append them automagically.'
google-scholar: true
title: Auto-DOI for Quarto posts via Rogue Scholar
What this tells us is that we do want a citation (someday), which is why we’re doing this in the first place. That, according to our beautiful flowchart in Figure 1, means this post is eligible to get a DOI appended. We also know there isn’t one – DOIs are appended as key-value pairs (with the key being, unsurprisingly, doi) to the citation object in the YAML preface. So, we’ll see if we can get one by looking in the dictionary we scraped from Rogue Scholar in Listing 1.
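To make the target shape concrete, here is a minimal sketch of what 'appending a DOI' does to the parsed front matter, assuming the preface has already been loaded into a plain dictionary. The helper name attach_doi is hypothetical and only serves this illustration.

```python
from typing import Any, Dict, Optional


def attach_doi(front_matter: Dict[str, Any], doi: Optional[str]) -> Dict[str, Any]:
    """Return a copy of the front matter with the DOI attached, if eligible."""
    updated = dict(front_matter)
    # Only posts flagged `citation: true` and with a known DOI are touched.
    if updated.get("citation") is True and doi:
        updated["citation"] = {"doi": doi}
    return updated


before = {"title": "Auto-DOI for Quarto posts via Rogue Scholar", "citation": True}
after = attach_doi(before, "10.59350/5hxdg-fz574")
print(after["citation"])
```

A post that already carries a citation object, or one for which no DOI was found, passes through unchanged.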
import logging
import re
from typing import Dict

import yaml


def process_qmd_file(file_path: str, posts_with_dois: Dict[str, str]) -> None:
    with open(file_path, 'r') as stream:
        contents = stream.read()

    delim = re.compile(r'^---$', re.MULTILINE)
    # maxsplit=2 keeps any later '---' lines in the body instead of dropping text after them
    splits = re.split(delim, contents, maxsplit=2)
    # Only the preamble is parseable YAML; the rest is kept verbatim
    yaml_preamble = splits[1].strip() if len(splits) > 2 else ""
    rest_of_post = splits[2] if len(splits) > 2 else contents

    yaml_contents = yaml.safe_load(yaml_preamble) if yaml_preamble else None

    if yaml_contents:
        citation = yaml_contents.get('citation')
        google_scholar = yaml_contents.get('google-scholar')
        categories = yaml_contents.get('categories', [])
        title = yaml_contents.get('title')

        # Check files from crosspost categories
        if any("cross-post" in category.lower() for category in categories):
            # .get() results are used here so a missing key does not raise a KeyError
            if citation is True or google_scholar is True:
                yaml_contents["citation"] = False
                yaml_contents["google-scholar"] = False
                logging.info(f'Modified crosspost {title} to remove Google Scholar and/or citation reference.')
        else:
            # Ensure that google-scholar is set to true if citation is required
            if citation and not google_scholar:
                yaml_contents["google-scholar"] = True
                logging.info(f'Setting google-scholar to true for {title}')

            # If citation is true but no DOI, and post exists in scraped posts
            if citation is True and posts_with_dois.get(title):
                yaml_contents['citation'] = {'doi': posts_with_dois[title]}
                logging.info(f'Adding doi for {title}.')

        # rstrip() stops trailing newlines accumulating on every run
        new_preamble = yaml.dump(yaml_contents).rstrip()
        new_yaml_doc = f"---\n{new_preamble}\n---"

        # write the modified YAML document back to file
        with open(file_path, 'w') as yaml_file:
            yaml_file.write(new_yaml_doc + rest_of_post)
            logging.info(f'Updated file {title}')
- We have to split the document in two because only the preamble is proper, parseable YAML. The rest of the document is just text, so we have to recombine it later.
- If it’s a cross-post, we don’t want it to have a Google Scholar link, and we’ll definitely not attach a DOI. In theory, we could have built this to be overridable in case I’ll ever produce a cross-post I do want to have a DOI, but I don’t see that happening.
- While we’re at it, might as well prune the cross-posts.
- And anything with a DOI should also get a Google Scholar metadata.
- The .rstrip() is pretty useful – otherwise, every time you run this, you’ll get another newline appended to the YAML preface.
- Don’t forget the \n before the YAML block’s end, otherwise you’ll end up with a YAML block that’s not properly separated from the rest of the document and won’t parse.
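The split-and-recombine step, together with the two newline gotchas above, can be checked in isolation. The following is a minimal sketch using only the standard library; split_front_matter and join_front_matter are hypothetical helper names, and the sample document is made up. It assumes the preamble is delimited by lines consisting of exactly three dashes.

```python
import re
from typing import Tuple

FRONT_MATTER_DELIM = re.compile(r"^---$", re.MULTILINE)


def split_front_matter(text: str) -> Tuple[str, str]:
    """Return (preamble, body); the preamble is empty if none is found."""
    # maxsplit=2 keeps any later '---' lines (e.g. horizontal rules) in the body.
    parts = FRONT_MATTER_DELIM.split(text, maxsplit=2)
    if len(parts) < 3:
        return "", text
    return parts[1].strip(), parts[2]


def join_front_matter(preamble: str, body: str) -> str:
    """Reassemble the document; rstrip() stops newlines piling up on re-runs."""
    return f"---\n{preamble.rstrip()}\n---{body}"


doc = "---\ntitle: Test\ncitation: true\n---\n\nSome text.\n\n---\n\nMore text.\n"
once = join_front_matter(*split_front_matter(doc))
twice = join_front_matter(*split_front_matter(once))
print(once == doc, twice == once)
```

Running the round trip twice and comparing the results is a cheap way to confirm that repeated renders leave the file byte-for-byte stable.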
Finally, we write the YAML back to the file, and we’re done. We can now declare this as a project script, and we’re good:
project:
  type: website
  pre-render:
    - scripts/pre_doi_from_rogue_scholar.py
One thing worth noting is that we’re not actually running this on the Quarto project itself, but on a copy of it. The consequence is that the changes are made ‘on the fly’ to the .qmd files and do not necessarily propagate into the repo. This is a pain, because recall that we’re only fetching the last ten posts’ DOIs so as to be kind to the server: as time goes on, that means older posts ‘lose’ their DOI.
To prevent this, we can simply check our changes back in:
on:
  workflow_dispatch:
  push:
    branches: main

name: Quarto Publish

jobs:
  build-deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Check out repository
        uses: actions/checkout@v4

      - name: Set up Quarto
        uses: quarto-dev/quarto-actions/setup@v2

      - name: Install Python and dependencies
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
          cache: 'pip'
      - run: pip install jupyter
      - run: pip install -r requirements.txt

      - name: Render and Publish
        uses: quarto-dev/quarto-actions/publish@v2
        with:
          target: gh-pages
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Commit changes to reflect source file changes
        run: |
          git config --global user.name 'Chris von Csefalvay'
          git config --global user.email 'chrisvoncsefalvay@users.noreply.github.com'
          git diff-index --quiet HEAD || git commit -am "Automated commit of changes to source files"
          git push
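- The diff-index --quiet HEAD checks if there have been changes to the working tree: it exits with a non-zero status only if something differs from HEAD, so the commit after the || runs only when there is something to commit. git returns an error if you’re trying to commit on an empty working tree, so we’re checking for that first.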
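The exit-code logic behind that one-liner can be spelled out in Python. This is only a sketch of the same idea, not what the workflow runs: commit_if_changed and run_git are hypothetical helpers, and unlike the workflow step, this version also skips the push when nothing changed. The runner is passed in as a parameter so the logic can be dry-run without touching a real repository.

```python
import subprocess
from typing import Callable, List


def commit_if_changed(run: Callable[[List[str]], int], message: str) -> bool:
    """Commit and push only when the working tree differs from HEAD."""
    # `git diff-index --quiet HEAD` exits 0 when there is nothing to commit.
    if run(["git", "diff-index", "--quiet", "HEAD"]) == 0:
        return False
    run(["git", "commit", "-am", message])
    run(["git", "push"])
    return True


def run_git(cmd: List[str]) -> int:
    """Run a command in the current repository and return its exit code."""
    return subprocess.run(cmd).returncode


# Dry run with a stand-in runner, so nothing touches a real repository.
calls = []


def fake_dirty_tree(cmd: List[str]) -> int:
    calls.append(cmd[1])
    # Pretend the working tree has changes: diff-index exits non-zero.
    return 1 if cmd[1] == "diff-index" else 0


print(commit_if_changed(fake_dirty_tree, "Automated commit"), calls)
```

In a real repository, one would pass run_git instead of the stand-in runner.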
And that’s it. We can now run this as a GitHub Action, and it’ll automatically append DOIs to our posts.
As noted: Quarto project scripts are pretty awesome stuff. I’m thinking of setting up an awesome- for it on GitHub, because way too few of them are shared properly. I’m hoping this will change.
Citation
@misc{csefalvay2023,
author = {{Chris von Csefalvay}},
title = {Auto-DOI for {Quarto} Posts via {Rogue} {Scholar}},
date = {2023-11-13},
url = {https://chrisvoncsefalvay.com/posts/auto-doi/},
doi = {10.59350/5hxdg-fz574},
langid = {en-GB}
}