arXiv Scraper

Search arXiv and get clean JSON: titles, abstracts, authors, categories, DOI, dates and PDF links. No API key. Sort by relevance or date; push to Notion.



How it works

1. Open it on Apify
Hit Run on Apify. It opens the tool in the cloud, with nothing to install.
2. Set the inputs
Adjust query, sortBy, and maxItems (sensible defaults are pre-filled).
3. Click Run
The tool runs on Apify’s cloud and collects the data for you.
4. Export the results
Download as JSON, CSV or Excel, or pipe straight into your app, Google Sheets, or an AI agent.

Inputs

Field	What it does	Type
`query`	arXiv search query. Use arXiv field prefixes: all: (all fields), ti: (title), au: (author), abs: (abstract), cat: (category). Examples: "all:large language models", "cat:cs.CL", "ti:attention AND au:vaswani". Combine terms with AND / OR / ANDNOT.	string
`sortBy`	How to order results. Relevance ranks by match quality; Submitted date sorts by original submission; Last updated date sorts by most recent revision. Results are always returned in descending order (most relevant or newest first).	string
`maxItems`	Maximum number of papers to return. The actor paginates 100 per request and pauses ~3s between pages to respect arXiv's rate guidance. arXiv hard-limits total reachable results to about 30000.	integer
`notionConnector`	Optional. Write each paper as a page into your Notion when the run finishes — handy for building a literature-review database. Authorize a Notion connector once in Settings → API & Integrations → MCP connectors, then pick it here. Leave empty to skip (default) — results are always saved to the dataset regardless.	string
`notionParentId`	Optional. The Notion data source ID of the database to write papers into (only used if a Notion connector is set). Leave empty to create the pages privately in your workspace instead.	string

What you get

A structured dataset — each result includes fields like:

absUrl, abstract, arxivId, authors, categories, doi, pdfUrl, primaryCategory, publishedAt, title, updatedAt

Export every run as JSON, CSV or Excel, or send it to your app, a database, Google Sheets, or an AI agent.

2 ready-to-run use cases

arXiv LLM Paper Scraper - Abstracts, Authors, PDFs

Run a keyword search across arXiv for large language model papers, ranked by relevance, with abstracts, author lists, and direct PDF download links.

Latest arXiv cs.CL Papers - Newest NLP Preprints

Track new cs.CL (Computation and Language) preprints the moment they hit arXiv, sorted by submission date, so NLP researchers never miss a release.

arXiv Scraper

Search arXiv via the official arXiv API and get clean, structured paper metadata. No API key, no login, and no anti-bot measures to work around: just polite, paginated requests to export.arxiv.org.

What it does

Given an arXiv search query, the actor fetches the Atom feed from the arXiv API, parses every <entry>, dedupes by arXiv ID, and returns one clean record per paper.
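The parsing step described above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the actor's actual code: the helper names (`collapse`, `arxiv_id_from_url`, `parse_feed`) and the sample feed are hypothetical, and stripping the version suffix from the ID is inferred from the sample record in the Output section.

```python
import re
import xml.etree.ElementTree as ET

# Namespaces used by arXiv's Atom feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hypothetical two-entry feed; the second entry repeats the first paper's ID.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/2401.12345v1</id>
    <title>Paper
      title</title>
    <summary>  Whitespace-collapsed
      abstract  </summary>
    <author><name>First Author</name></author>
    <author><name>Second Author</name></author>
    <arxiv:primary_category term="cs.CL"/>
    <category term="cs.CL"/>
    <category term="cs.AI"/>
  </entry>
  <entry>
    <id>http://arxiv.org/abs/2401.12345v1</id>
    <title>Paper title</title>
  </entry>
</feed>"""


def collapse(text):
    """Collapse runs of whitespace; return None for missing or empty text."""
    if text is None:
        return None
    return " ".join(text.split()) or None


def arxiv_id_from_url(url):
    """Take the part after '/abs/' and drop a trailing version such as 'v1'."""
    if not url or "/abs/" not in url:
        return None
    tail = url.rsplit("/abs/", 1)[-1]
    return re.sub(r"v\d+$", "", tail) or None


def parse_feed(xml_text):
    """Return one record per <entry>, skipping arXiv IDs already seen."""
    root = ET.fromstring(xml_text)
    seen, papers = set(), []
    for entry in root.findall("atom:entry", NS):
        abs_url = collapse(entry.findtext("atom:id", namespaces=NS))
        arxiv_id = arxiv_id_from_url(abs_url)
        if arxiv_id is not None:
            if arxiv_id in seen:
                continue
            seen.add(arxiv_id)
        primary = entry.find("arxiv:primary_category", NS)
        papers.append({
            "arxivId": arxiv_id,
            "title": collapse(entry.findtext("atom:title", namespaces=NS)),
            "abstract": collapse(entry.findtext("atom:summary", namespaces=NS)),
            "authors": [
                collapse(author.findtext("atom:name", namespaces=NS))
                for author in entry.findall("atom:author", NS)
            ],
            "primaryCategory": primary.get("term") if primary is not None else None,
            "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
            "absUrl": abs_url,
        })
    return papers
```

Calling `parse_feed(SAMPLE_FEED)` yields a single record, because the second entry shares the first entry's ID.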

Input

Field	Type	Default	Description
`query`	string	`all:large language models`	arXiv search query (see syntax below). Required.
`sortBy`	select	`relevance`	`relevance`, `submittedDate`, or `lastUpdatedDate`.
`maxItems`	integer	`50`	Max papers to return (1–30000). Paginates 100/page.
`proxyConfiguration`	proxy	none	Optional. The public arXiv API has no anti-bot protection, so no proxy is used by default. Enable one only if you hit IP rate limits.
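To make the `maxItems` pagination concrete, here is a minimal sketch of how a run could be split into 100-paper pages. The function name `plan_pages` is hypothetical, and whether the actor trims the final page to the exact remainder is an assumption; only the page size, the 1 to 30000 range, and the three `sortBy` values come from the table above.

```python
PAGE_SIZE = 100      # papers per request, per the table above
MAX_TOTAL = 30000    # upper bound on maxItems, per the table above
SORT_VALUES = {"relevance", "submittedDate", "lastUpdatedDate"}


def plan_pages(max_items=50, sort_by="relevance"):
    """Return (start, max_results) pairs that together cover max_items papers."""
    if sort_by not in SORT_VALUES:
        raise ValueError(f"unsupported sortBy: {sort_by!r}")
    if not 1 <= max_items <= MAX_TOTAL:
        raise ValueError(f"maxItems must be between 1 and {MAX_TOTAL}")
    # Assumption: the last page asks only for the papers still needed.
    return [
        (start, min(PAGE_SIZE, max_items - start))
        for start in range(0, max_items, PAGE_SIZE)
    ]
```

With the default `maxItems` of 50 the plan is a single request. A 250-paper run needs three requests, so it would include two of the roughly 3-second pauses between pages.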

Search query syntax

arXiv uses field prefixes you can combine with AND / OR / ANDNOT:

all:transformer — search all fields
cat:cs.AI — papers in a category (e.g. cs.CL, cs.LG, stat.ML)
ti:attention — title
au:hinton — author
abs:diffusion — abstract
ti:attention AND au:vaswani — combine

Examples: all:large language models, cat:cs.CL, ti:"retrieval augmented generation".
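As a rough illustration of how such a query ends up in a request, the sketch below URL-encodes it with the arXiv API's `search_query`, `sortBy`, `sortOrder`, `start`, and `max_results` parameters. The helper names `and_terms` and `build_url` are hypothetical, and the exact URL the actor sends is an assumption; the source only states that requests go to export.arxiv.org.

```python
from urllib.parse import urlencode

# Public arXiv API endpoint on the host named above.
API_URL = "http://export.arxiv.org/api/query"


def and_terms(*terms):
    """Join prefixed terms such as 'ti:attention' and 'au:vaswani' with AND."""
    return " AND ".join(terms)


def build_url(query, sort_by="relevance", start=0, max_results=100):
    """Build one paginated request URL for the given search query."""
    params = {
        "search_query": query,
        "sortBy": sort_by,
        "sortOrder": "descending",  # assumption: newest or most relevant first
        "start": start,
        "max_results": max_results,
    }
    # urlencode percent-encodes colons and quotes and turns spaces into '+'.
    return f"{API_URL}?{urlencode(params)}"
```

For example, `build_url(and_terms("ti:attention", "au:vaswani"))` produces a URL whose query string contains `search_query=ti%3Aattention+AND+au%3Avaswani`.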

Output

One record per paper:

{
  "ok": true,
  "arxivId": "2401.12345",
  "title": "Paper title",
  "abstract": "Whitespace-collapsed abstract…",
  "authors": ["First Author", "Second Author"],
  "primaryCategory": "cs.CL",
  "categories": ["cs.CL", "cs.AI"],
  "publishedAt": "2024-01-23T18:00:00Z",
  "updatedAt": "2024-02-01T12:00:00Z",
  "doi": "10.1000/xyz123",
  "absUrl": "http://arxiv.org/abs/2401.12345v1",
  "pdfUrl": "http://arxiv.org/pdf/2401.12345v1"
}

Nullable fields: doi is null when the paper has no registered DOI. primaryCategory, publishedAt, updatedAt, and absUrl can also be null if the arXiv feed omits them for an entry. pdfUrl is derived from the arXiv ID when the feed doesn't include a PDF link, so it is null only when the ID itself is missing.
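The null-handling rules above might look like this in consumer or parser code. Both helpers are hypothetical, and the `http://arxiv.org/pdf/<id>` pattern is assumed from the sample record rather than confirmed as the actor's exact fallback.

```python
def resolve_pdf_url(arxiv_id, feed_pdf_url=None):
    """Prefer the feed's PDF link; otherwise derive one from the arXiv ID."""
    if feed_pdf_url:
        return feed_pdf_url
    if arxiv_id:
        # Assumed URL pattern, modelled on the sample pdfUrl above.
        return f"http://arxiv.org/pdf/{arxiv_id}"
    return None  # no link and no ID: nothing to derive from


def doi_url(record):
    """Return a doi.org link, or None when the paper has no registered DOI."""
    doi = record.get("doi")
    return f"https://doi.org/{doi}" if doi else None
```

Downstream code should treat `None` as a normal value for these fields instead of assuming every record is fully populated.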

Results are deduplicated by arxivId, falling back to absUrl and then title when a record has no arXiv ID.
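A minimal sketch of that fallback order follows. The function names are hypothetical, and keeping records that have none of the three fields is an assumption, since the source does not say what happens to them.

```python
def dedupe_key(record):
    """Return the first available identifier: arxivId, then absUrl, then title."""
    for field in ("arxivId", "absUrl", "title"):
        value = record.get(field)
        if value:
            # Tag the key with its field so a title never collides with an ID.
            return (field, value)
    return None


def dedupe(records):
    """Keep the first record seen for each key, preserving input order."""
    seen, unique = set(), []
    for record in records:
        key = dedupe_key(record)
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        # Assumption: records with no usable key are kept as-is.
        unique.append(record)
    return unique
```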

Notes

Be polite: the actor pauses ~3 seconds between pages, as arXiv requests.
arXiv hard-limits the total reachable results per query to about 30000.
On an error, empty result, or missing query, the actor writes a single diagnostic row (ok: false, with an errorCode such as NO_RESULTS, BAD_INPUT, RATE_LIMITED, or NETWORK) and does not charge. The run still finishes cleanly so you can inspect the reason in the dataset.
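When reading the dataset programmatically, it helps to separate paper records from the diagnostic row first. The sketch below assumes only the `ok` and `errorCode` fields described above; the function names and sample rows are hypothetical.

```python
def split_rows(rows):
    """Separate successful paper records from diagnostic rows."""
    papers = [row for row in rows if row.get("ok") is True]
    diagnostics = [row for row in rows if row.get("ok") is not True]
    return papers, diagnostics


def failure_reason(rows):
    """Return the errorCode of the first diagnostic row, or None if all rows are ok."""
    _, diagnostics = split_rows(rows)
    return diagnostics[0].get("errorCode") if diagnostics else None


# Hypothetical dataset contents for two different runs.
empty_run = [{"ok": False, "errorCode": "NO_RESULTS"}]
good_run = [{"ok": True, "arxivId": "2401.12345", "title": "Paper title"}]
```

Checking `failure_reason(rows)` before processing avoids treating a diagnostic row as a paper with missing fields.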

Charging

The actor charges one paper unit per successfully returned paper. Diagnostic and empty rows are never charged.