How the Geospatial Careers dataset is collected, normalised, and validated
Latest snapshot
6 May 2026
Composite quality
8.7/10
Jobs
24,885
Intel corpus
27,237
Papers
10,898
Benchmarks
2,045 · 13 sources
Sixteen sources feed intel_jobs (the intelligence corpus, append-only) and jobs (the site-facing table); all run daily except the quarterly DOL OFLC LCA disclosures. Each source row is gated by a 160-term ontology with multilingual broad-stem patterns covering 7 EU languages (DE / FR / ES / IT / NL / PT / EN) before persistence, so non-geospatial postings are rejected at the ingestion boundary rather than filtered downstream.
| Source | Type | Countries | Cadence |
|---|---|---|---|
| Adzuna | Aggregator API | 19 | Daily |
| JSearch (Google Jobs) | Aggregator API | 9 | Daily |
| Reed | Aggregator API | GB | Daily |
| USAJobs | Government | US | Daily |
| Jooble | Aggregator API | ~30 | Daily |
| Arbeitnow | Aggregator API (EU) | ~7 | Daily |
| Careerjet | Aggregator API | 8 locales | Daily |
| EURES | Government EU | 27 | Daily |
| Greenhouse | ATS direct | Global | Daily |
| Lever | ATS direct | Global | Daily |
| Ashby | ATS direct | Global | Daily |
| SmartRecruiters | ATS direct | Global | Daily |
| Remotive / RemoteOK | Remote board | Remote | Daily |
| ReliefWeb | UN humanitarian | Global | Daily |
| DOL OFLC LCA (H-1B) | US Govt visa | US | Quarterly |
| Bundesagentur für Arbeit | Government DE | DE | Daily |
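The ontology gate applied to each source row before persistence can be pictured roughly as follows. This is a minimal sketch, not the production gate: the stem list, the `is_geospatial` and `gate` names, and the `title` / `description` row keys are all assumptions made for illustration, and the real 160-term ontology is not reproduced here.

```python
import re

# Hypothetical broad stems; the real ontology has 160 terms across 7 languages.
STEMS = [
    r"geospatial", r"geoinformat", r"\bgis\b",
    r"cartogra", r"kartogra",
    r"remote sensing", r"fernerkundung", r"télédétection",
]
PATTERN = re.compile("|".join(STEMS), re.IGNORECASE)


def is_geospatial(title: str, description: str = "") -> bool:
    """Return True when any stem matches the title or description."""
    return PATTERN.search(f"{title} {description}") is not None


def gate(rows):
    """Split rows into (accepted, rejected) before anything is persisted."""
    accepted, rejected = [], []
    for row in rows:
        matched = is_geospatial(row.get("title", ""), row.get("description", ""))
        (accepted if matched else rejected).append(row)
    return accepted, rejected
```

Short acronyms such as "gis" are wrapped in word boundaries so that unrelated words containing the same letters ("Logistics") do not pass the gate, while longer stems match anywhere so that inflected forms in each language are caught.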
Slug discovery for ATS platforms (Greenhouse / Lever / Ashby / SmartRecruiters / Workable) runs weekly via .github/workflows/ats-discovery-weekly.yml. The candidate company list grows automatically; the daily collector picks up new viable boards without code changes.
No-scraping policy
Geospatial Careers does not scrape job listings from other curators, newsletters, or competitor platforms. Listings come exclusively from official APIs (Adzuna, JSearch, Reed, USAJobs, Bundesagentur), ATS endpoints (Greenhouse, Lever, Ashby, SmartRecruiters) where the employer publishes their board, public government open data (US DOL H-1B LCA), or one-off public datasets (Kaggle LinkedIn snapshots). 14,609 rows previously ingested from a public industry newsletter were removed on 2026-05-06; see migration 20260506050000_remove_scraped_newsletter_jobs.sql. Future curator partnerships go through explicit consent and attribution.
[…] data/intelligence/esco/esco-labels-multilingual.json.
[…] k / M suffixes, locale separators (EU 1.500 vs US 1,500 vs decimal 1.5), period inference from text and from magnitude, and per-period bounds validation. Replaced a previous parser that silently truncated $80k-$120k to 80-120 (1000× corruption).
[…] geocode_method column for downstream precision filtering.
[…] salary_normalisation_audit with the rate, rate_date, source, quality grade, and pipeline_run_id, so historical conversions are fully reproducible.
13 benchmark sources back the salary surfaces with citation-grade provenance. Every row in intel_salary_benchmarks carries a JSONB provenance object containing retrieval URL, retrieval date, dataset version, methodology URL, licence, and authority weight.
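The salary-string handling mentioned above (k / M suffixes, locale separators, period inference, per-period bounds) could look something like the sketch below. It is an illustration under stated assumptions rather than the site's parser: the function names, the magnitude thresholds, and the bounds in `BOUNDS` are hypothetical, and a lone separator followed by exactly three digits is assumed to be a thousands separator.

```python
import re

# Number with optional k/M suffix; the lookahead keeps the "m" of
# "monthly" from being read as a million suffix.
_NUM = re.compile(r"(\d[\d.,]*)\s*([kKmM])?(?![A-Za-z\d])")
_SUFFIX = {"k": 1_000, "m": 1_000_000}

_PERIODS = [
    ("hour", re.compile(r"per hour|hourly|/\s*(hour|hr|h)\b", re.I)),
    ("month", re.compile(r"per month|monthly|/\s*(month|mo)\b", re.I)),
    ("year", re.compile(r"per year|per annum|yearly|annual|/\s*(year|yr)\b", re.I)),
]

# Hypothetical per-period sanity bounds (min, max) in local currency.
BOUNDS = {"hour": (5, 500), "month": (500, 50_000), "year": (5_000, 2_000_000)}


def parse_number(token: str) -> float:
    """Parse '1.500', '1,500', '1.5', '1.500,50' or '1,500.50'."""
    token = token.rstrip(".,")
    last = max(token.rfind("."), token.rfind(","))
    if last == -1:
        return float(token)
    head = token[:last].replace(".", "").replace(",", "")
    tail = token[last + 1:]
    kinds = {c for c in token if c in ".,"}
    if len(kinds) == 1 and len(tail) == 3:
        return float(head + tail)  # assumed grouping: 1.500 or 1,500
    return float(f"{head}.{tail}")  # last separator is the decimal mark


def infer_period(text: str, amount: float) -> str:
    """Prefer explicit wording; otherwise fall back to magnitude."""
    for name, pattern in _PERIODS:
        if pattern.search(text):
            return name
    if amount < 500:  # hypothetical thresholds
        return "hour"
    return "month" if amount < 20_000 else "year"


def parse_salary(text: str):
    """Return {'min', 'max', 'period'} or None if absent or out of bounds."""
    amounts = [
        parse_number(m.group(1)) * _SUFFIX.get((m.group(2) or "").lower(), 1)
        for m in _NUM.finditer(text)
    ][:2]
    if not amounts:
        return None
    low, high = min(amounts), max(amounts)
    period = infer_period(text, low)
    lower, upper = BOUNDS[period]
    if not (lower <= low <= high <= upper):
        return None
    return {"min": low, "max": high, "period": period}
```

With this sketch, `parse_salary("$80k-$120k")` keeps the suffix and yields a yearly range of 80,000 to 120,000 rather than 80 to 120, and the bounds check rejects values that are implausible for the inferred period.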
| Source ID | Description | Coverage | Authority |
|---|---|---|---|
| eurostat_ses22_28 | Eurostat SES 2022 | 30 countries × ISCO-08 | 0.85 |
| bls_oews | BLS Occupational Employment & Wage Statistics | US, 6-digit SOC | 0.95 |
| ons_ashe | ONS Annual Survey of Hours and Earnings | GB, 4-digit SOC | 0.90 |
| statcan_wages | Statistics Canada wages | CA, 4-digit NOC | 0.90 |
| dol_h1b_lca | US DOL OFLC LCA Disclosures | US visa, exact wages | 0.95 |
| bls_qcew | BLS Quarterly Census of Employment & Wages | US, NAICS | 0.95 |
| entgeltatlas | Bundesagentur Entgeltatlas | DE, KldB | 0.90 |
| urisa_gpn | URISA Geospatial Practitioner Network | Survey | 0.65 |
| nz_linz_giss | LINZ NZ Geospatial Industry Survey | NZ | 0.75 |
| adzuna_history | Adzuna historical posting median | 19 countries | 0.55 |
| linkedin_kaggle | LinkedIn Kaggle dataset (filtered) | Global, snapshot | 0.55 |
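As an illustration of the provenance object attached to each benchmark row, the sketch below checks that the six documented fields are present before a row would be written. The key names, the `provenance_problems` helper, and the placeholder values are assumptions; only the list of fields and the 0.95 authority weight for `bls_oews` come from this page.

```python
import json

# Hypothetical key names for the six documented provenance fields.
REQUIRED_KEYS = {
    "retrieval_url",
    "retrieval_date",
    "dataset_version",
    "methodology_url",
    "licence",
    "authority_weight",
}


def provenance_problems(provenance: dict) -> list[str]:
    """List what is wrong with a provenance object; empty means it passes."""
    problems = [f"missing: {key}" for key in sorted(REQUIRED_KEYS - set(provenance))]
    weight = provenance.get("authority_weight")
    if weight is not None and not (0.0 <= weight <= 1.0):
        problems.append("authority_weight outside [0, 1]")
    return problems


# Placeholder values only; none of these are real retrieval details.
example = {
    "retrieval_url": "https://example.org/dataset",
    "retrieval_date": "2026-05-06",
    "dataset_version": "example-version",
    "methodology_url": "https://example.org/methodology",
    "licence": "example-licence",
    "authority_weight": 0.95,
}

print(provenance_problems(example))
print(json.dumps(example, indent=2))  # shape of the value stored as JSONB
```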
Two complementary collectors feed intel_papers:
DOIs are normalised on every write via a database trigger (doi_normalised): lowercase, with the leading https://doi.org/ prefix stripped. When the same normalised DOI appears under two distinct sources, the row is flagged cross_source_verified = true — a high-confidence signal for downstream consumers.
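The normalisation rule and the cross-source flag can be sketched in a few lines. The production logic lives in a database trigger, so this Python version is only an illustration of the same rule; the function names, the source labels, and the sample DOI are hypothetical.

```python
from collections import defaultdict

DOI_PREFIX = "https://doi.org/"


def normalise_doi(raw: str) -> str:
    """Lowercase the DOI and strip a leading https://doi.org/ prefix."""
    doi = raw.strip().lower()
    if doi.startswith(DOI_PREFIX):
        doi = doi[len(DOI_PREFIX):]
    return doi


def cross_source_verified(records):
    """Map each normalised DOI to True when two or more distinct sources report it.

    `records` is an iterable of (source, raw_doi) pairs.
    """
    sources_by_doi = defaultdict(set)
    for source, raw_doi in records:
        sources_by_doi[normalise_doi(raw_doi)].add(source)
    return {doi: len(sources) >= 2 for doi, sources in sources_by_doi.items()}
```

Lowercasing before stripping the prefix means a differently cased URL form normalises to the same key as the bare DOI, which is what allows the two records to be matched across sources.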
Daily snapshot. data_quality_snapshot captures volume, coverage, freshness, stale-source alerts, and six dimension scores (D1-D6) per day. Citable in every release; surfaced publicly at /research/quality.
Mart parity gate. CI test asserts that public.mart_* views and public_analytics.mart_* tables match on row counts and column counts; fails the workflow on schema drift.
Quarantine layer. stg_intel_jobs_rejected classifies every dropped row by reason (no_title, no_company, low_source, no_description, no_geo, no_salary, composite_low) so quality loss is observable instead of silent.
RPC hardening. 23 SECURITY DEFINER functions in public have explicit search_path, REVOKE PUBLIC and GRANT-only-to-needed-roles. Documented in migration 20260505130000.
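To make the quarantine layer described above concrete, the sketch below assigns each dropped row one of the seven documented reasons and tallies them. The reason names come from this page; the order of the checks, the row keys, and both thresholds are assumptions made for illustration.

```python
from collections import Counter


def rejection_reason(row: dict, min_source: float = 0.5, min_composite: float = 0.6):
    """Return the first matching rejection reason, or None if the row passes.

    Check order and thresholds are hypothetical.
    """
    if not (row.get("title") or "").strip():
        return "no_title"
    if not (row.get("company") or "").strip():
        return "no_company"
    if row.get("source_quality", 0.0) < min_source:
        return "low_source"
    if not (row.get("description") or "").strip():
        return "no_description"
    if not row.get("country"):
        return "no_geo"
    if row.get("salary_min") is None:
        return "no_salary"
    if row.get("composite_score", 0.0) < min_composite:
        return "composite_low"
    return None


def quarantine_summary(rows) -> Counter:
    """Count rejected rows per reason so quality loss stays observable."""
    reasons = (rejection_reason(row) for row in rows)
    return Counter(reason for reason in reasons if reason is not None)
```

Returning a single reason per row keeps the counts additive: the per-reason totals sum to the number of rejected rows, which makes day-over-day comparisons straightforward.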
[…] geospatial-careers repository.
mart_geo_jobs publishes weekly to Hugging Face under CC-BY 4.0. Each release tagged with the snapshot_date + snapshot_version from data_quality_snapshot for traceability.
[…] provenance JSONB.
From June 2026 the EU Pay Transparency Directive requires employers to disclose salary ranges. This dataset is positioned to back compliance + market-intelligence workflows with:
Methodology version 1 · 2026-05-05 · This page revalidates every hour. Cite via the snapshot ID on /research/quality.