Crossref Scholarly Works Scraper

Search 150M+ scholarly works on Crossref and export DOI, authors, journal, citation count, and abstracts as JSON or CSV. Filter by type and date.



How it works

1. Open it on Apify. Hit Run on Apify; it opens the tool in the cloud, no install.
2. Set the inputs. Adjust query, filterType, fromDate (sensible defaults are pre-filled).
3. Click Run. The tool runs on Apify’s cloud and collects the data for you.
4. Export the results. Download as JSON, CSV or Excel, or pipe straight into your app, Google Sheets, or an AI agent.

Inputs

Field	What it does	Type
`query`	Keywords to search Crossref for across titles, authors, abstracts, and metadata (e.g. "deep learning", "CRISPR gene editing", "climate change adaptation").	string
`filterType`	Only return works of this Crossref type. Leave empty for all types. "journal-article" is the most common for research papers.	string
`fromDate`	Only return works published on or after this date, in YYYY-MM-DD format (e.g. 2020-01-01). Leave empty for no date floor.	string
`sort`	How to order results. "Relevance" matches the query best; "Most cited" surfaces influential papers; "Newest first" sorts by publication date descending.	string
`maxItems`	Maximum number of scholarly works to return. Uses deep cursor pagination to fetch beyond 100 reliably.	integer
`notionConnector`	Optional. Write each result as a page into your Notion when the run finishes. Authorize a Notion connector once in Settings → API & Integrations → MCP connectors, then pick it here. Leave empty to skip (default) — results are always saved to the dataset regardless.	string
`notionParentId`	Optional. The Notion data source ID of the database to write into (only used if a Notion connector is set). Leave empty to create the pages privately in your workspace instead.	string

What you get

A structured dataset — each result includes fields like:

abstract, authors, citations, doi, issn, journal, publishedDate, publisher, subjects, title, type, url

Export every run as JSON, CSV or Excel, or send it to your app, a database, Google Sheets, or an AI agent.

2 ready-to-run use cases

Most-Cited CRISPR Papers Ranked by Citations | Crossref

Rank CRISPR gene-editing papers by citation count from Crossref's 150M-work index. DOIs, titles, authors, and journals for literature reviews.

Microplastics Literature Search: All Crossref Works

Every microplastics publication on Crossref in one dataset: journal articles, books, datasets, and preprints with DOIs for systematic reviews.

Crossref Scholarly Works Scraper

Search the Crossref catalog of 150M+ scholarly works (journal articles, preprints, books, datasets, and more) via its public REST API. No API key or login is required, and the API has no anti-bot protection.

The actor is a polite Crossref client: it identifies itself with a contact User-Agent and a mailto query parameter so that Crossref routes it to the faster "polite pool". It pages with deep cursors (cursor=* on the first request, then the returned next-cursor value on each following request). Crossref caps a single request at 1,000 rows, and deep cursors are its recommended way to page through larger result sets.
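The polite-pool and cursor behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actor's actual source: the function names (`http_get_json`, `iter_works`), the User-Agent string, and the injectable `fetch` parameter are assumptions made for this example. Only the endpoint, the `mailto`, `rows`, and `cursor` parameters, and the `next-cursor` response field come from the Crossref REST API.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.crossref.org/works"


def http_get_json(url, user_agent):
    """Fetch a URL and decode its JSON body (requires network access)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_works(query, mailto, max_items=100, rows=100, fetch=http_get_json):
    """Yield up to max_items raw Crossref works using deep cursor paging."""
    # Hypothetical client name; the contact address is what matters here.
    user_agent = f"example-crossref-client/0.1 (mailto:{mailto})"
    cursor = "*"  # "*" asks Crossref to open a new cursor
    yielded = 0
    while yielded < max_items:
        params = {
            "query": query,
            "rows": min(rows, 1000),  # Crossref caps rows at 1,000 per request
            "cursor": cursor,
            "mailto": mailto,  # contact address for the polite pool
        }
        url = API_URL + "?" + urllib.parse.urlencode(params)
        message = fetch(url, user_agent)["message"]
        items = message.get("items", [])
        if not items:
            break  # an empty page means the result set is exhausted
        for item in items:
            yield item
            yielded += 1
            if yielded >= max_items:
                return
        cursor = message.get("next-cursor")
        if not cursor:
            break
```

Passing `fetch` as a parameter keeps the paging logic testable without touching the network; in real use the default `http_get_json` would be left in place.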

Input

Field	Type	Default	Description
`query`	string (required)	`deep learning`	Keywords searched across titles, authors, abstracts and metadata.
`filterType`	string	_all_	Restrict to a Crossref work type, e.g. `journal-article`.
`fromDate`	string `YYYY-MM-DD`	_none_	Only works published on/after this date.
`sort`	enum	`relevance`	`relevance`, `is-referenced-by-count` (most cited), or `published` (newest).
`maxItems`	integer	`100`	Max works to return (cursor pagination handles >100).
`proxyConfiguration`	object	_none_	Optional and off by default; Crossref is a public, no-key API with no anti-bot, so a proxy adds no benefit. Only enable it if you hit IP-level rate limits.
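As a rough illustration of how these inputs could map onto Crossref query parameters, the sketch below builds a parameter dictionary from an input object. The helper name `build_params` and its error messages are hypothetical, and the actor's real mapping is not shown in this document. The Crossref names used (`filter` with `type:` and `from-pub-date:`, `sort`, `order`) are the public API's own.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
SORT_FIELDS = {"relevance", "is-referenced-by-count", "published"}


def build_params(actor_input):
    """Translate actor-style input into Crossref /works query parameters."""
    query = (actor_input.get("query") or "").strip()
    if not query:
        raise ValueError("query is required")
    params = {"query": query}

    filters = []
    if actor_input.get("filterType"):
        filters.append(f"type:{actor_input['filterType']}")
    from_date = actor_input.get("fromDate")
    if from_date:
        if not DATE_RE.match(from_date):
            raise ValueError("fromDate must be YYYY-MM-DD")
        datetime.strptime(from_date, "%Y-%m-%d")  # rejects dates like 2020-13-40
        filters.append(f"from-pub-date:{from_date}")
    if filters:
        params["filter"] = ",".join(filters)  # Crossref joins filters with commas

    sort = actor_input.get("sort") or "relevance"
    if sort not in SORT_FIELDS:
        raise ValueError(f"unsupported sort: {sort}")
    params["sort"] = sort
    params["order"] = "desc"  # most relevant, most cited, or newest first
    return params
```

For example, an input with `filterType` set to `journal-article` and `fromDate` set to `2020-01-01` would produce the filter string `type:journal-article,from-pub-date:2020-01-01`.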

Output

Each successful row:

{
  "ok": true,
  "doi": "10.1038/nature14539",
  "title": "Deep learning",
  "authors": ["Yann LeCun", "Yoshua Bengio", "Geoffrey Hinton"],
  "journal": "Nature",
  "publisher": "Springer Science and Business Media LLC",
  "type": "journal-article",
  "publishedDate": "2015-05-28",
  "citations": 70000,
  "subjects": ["Multidisciplinary"],
  "issn": ["0028-0836", "1476-4687"],
  "abstract": null,
  "url": "https://doi.org/10.1038/nature14539"
}

authors are formatted "Given Family" (organizational authors fall back to their name).
publishedDate is assembled from Crossref's date-parts (may be year-only or year-month for older records).
citations is Crossref's is-referenced-by-count.
abstract is the JATS-XML abstract stripped to plain text, or null when Crossref has none.
Nullable fields: title, journal, publisher, type, publishedDate, abstract, and url may be null, and authors, subjects, and issn may be empty arrays, depending on what the publisher deposited with Crossref. doi is always present (rows without a DOI are dropped). citations defaults to 0 when absent.
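The field rules above can be sketched as a small normalizer over a raw Crossref work record. This is an assumption about how such a mapping might look, not the actor's code: the helper names are hypothetical, the fallback from `published` to `issued` is a guess, and tag stripping with a regular expression is a simplification of real JATS handling. The input keys (`DOI`, `title`, `author`, `container-title`, `is-referenced-by-count`, `subject`, `ISSN`, `abstract`, `URL`) are Crossref's.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def format_author(author):
    """Return "Given Family", or the organization name as a fallback."""
    given, family = author.get("given"), author.get("family")
    if family:
        return f"{given} {family}" if given else family
    return author.get("name") or given


def format_date(work):
    """Join Crossref date-parts into YYYY, YYYY-MM, or YYYY-MM-DD."""
    for key in ("published", "issued"):
        raw = ((work.get(key) or {}).get("date-parts") or [[]])[0]
        parts = []
        for value in raw:
            if value is None:  # Crossref can emit [[null]] for unknown dates
                break
            parts.append(int(value))
        if parts:
            year, *rest = parts
            return "-".join([f"{year:04d}"] + [f"{n:02d}" for n in rest])
    return None


def strip_jats(abstract):
    """Drop JATS/XML tags, decode entities, and collapse whitespace."""
    text = html.unescape(TAG_RE.sub(" ", abstract))
    return " ".join(text.split())


def first(values):
    return values[0] if values else None


def normalize_work(work):
    """Map a raw Crossref work to an output row; None when it has no DOI."""
    doi = work.get("DOI")
    if not doi:
        return None
    abstract = work.get("abstract")
    authors = [format_author(a) for a in work.get("author") or []]
    return {
        "ok": True,
        "doi": doi,
        "title": first(work.get("title")),
        "authors": [a for a in authors if a],
        "journal": first(work.get("container-title")),
        "publisher": work.get("publisher"),
        "type": work.get("type"),
        "publishedDate": format_date(work),
        "citations": work.get("is-referenced-by-count") or 0,
        "subjects": work.get("subject") or [],
        "issn": work.get("ISSN") or [],
        "abstract": strip_jats(abstract) if abstract else None,
        "url": work.get("URL"),
    }
```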

Results are deduplicated by DOI. Charging is per successful work (work event). Diagnostic / empty / blocked rows (ok: false with an errorCode) are never charged — this includes BAD_INPUT (empty query or malformed fromDate), NO_RESULTS, and any network/block error.
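On the consumer side, a downloaded dataset can be split into successful works and diagnostic rows with a few lines of code. The sketch below is illustrative only: `split_rows` is a hypothetical helper, and comparing DOIs case-insensitively is an assumption of this example (DOIs are case-insensitive by specification, but how the actor itself compares them is not documented here).

```python
def split_rows(rows):
    """Separate ok rows (deduplicated by DOI) from diagnostic rows."""
    seen = set()
    works, diagnostics = [], []
    for row in rows:
        if not row.get("ok"):
            diagnostics.append(row)  # carries an errorCode, e.g. NO_RESULTS
            continue
        key = row["doi"].lower()  # assumption: treat DOI case as insignificant
        if key in seen:
            continue
        seen.add(key)
        works.append(row)
    return works, diagnostics
```

Under the charging rule described above, only the rows that land in `works` would correspond to charged work events.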

Troubleshooting

BAD_INPUT row, no results: you left query empty or fromDate isn't YYYY-MM-DD. Fix the input and re-run — you were not charged.
NO_RESULTS row: your query/filter combination matched nothing in Crossref. Try broader keywords or drop the type/date filters.
RATE_LIMITED / BLOCKED row: rare for Crossref. The actor already retries with backoff; if it persists, enable a proxy to use a different IP.
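For readers calling Crossref directly, retry with backoff might look like the sketch below. The actor's real retry policy is not documented here, so the attempt count, the delays, the set of retryable status codes, and the `get_with_backoff` name are all assumptions. `request_once` is a hypothetical callable that performs one request and returns a `(status, body)` pair.

```python
import time

RETRYABLE = {429, 500, 502, 503, 504}  # assumed set of transient statuses


def get_with_backoff(request_once, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request_once until it returns 200, doubling the wait each time."""
    status = None
    for attempt in range(max_attempts):
        status, body = request_once()
        if status == 200:
            return body
        if status not in RETRYABLE:
            break  # e.g. 400 or 404: retrying will not help
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise RuntimeError(f"request failed with HTTP status {status}")
```

The `sleep` parameter exists so the waits can be replaced in tests; production code would leave it as `time.sleep`.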

Notes

Powered entirely by the public Crossref REST API (https://api.crossref.org/works). Please be considerate of the shared, free service.
Citation counts and abstracts depend on what publishers deposit with Crossref; coverage varies by record.