← Back to examples
Agent skill

Cecil data analysis with xarray-sql

Use this skill when the user asks an analysis question about Cecil earth-observation data — forest carbon, biomass, deforestation, land cover, soil moisture, biodiversity, protected areas, ecosystem integrity. The skill writes SQL with xarray-sql (xql) against the user's existing subscription and presents query → result table → plain-English interpretation as one compact block.

Use this skill for questions like "how has tree cover changed in my AOI?" or "what's the dominant land cover here?" — anything that wants to query an existing Cecil subscription with SQL.

For setup, installation, and the runnable scripts that drive these queries, see tutorials/cecil-data-analysis-xql/. This SKILL.md covers only what an agent needs to know in-context.

When this skill is in scope

Yes: dominant-class, time-series, change-detection, area, or overlap questions on Cecil rasters (biomass, LULC, soil moisture, BII, FLII) or vector datasets (IBAT family — protected areas, KBA, species ranges).

No: weather forecasts, atmospheric chemistry / GHGs, ocean / marine data, generic Python help, anything without a spatial AOI. Cecil doesn't carry that data — say so plainly and stop. Don't force-fit a dataset.

Subscription gate (the rule that protects the wallet)

Default: read-only on existing subscriptions. Run scripts/list_subscriptions.py first. If a subscription for the chosen dataset already exists, go straight to the SQL.

If no matching subscription exists:

  1. Summarise what you'd subscribe to: dataset name, provider, coverage, relevant variables.
  2. Show the exact calls you'd make: client.create_aoi(geometry=...) (if no AOI) and client.create_subscription(aoi_id=..., dataset_id=...).
  3. Stop and ask the user to confirm before calling either method. Subscriptions are billed. Get explicit confirmation for this dataset now, even if the user agreed to a different one earlier in the conversation.

If the user declines, say so and stop. Don't try to find a free substitute that might silently answer a different question.

Golden rule: query → table → insight

Whenever this skill is in scope, present every analysis as one compact block, in this order:

  1. SQL — fenced sql block. Applies at every stage: DESCRIBE, exploratory queries, final analysis.
  2. Result table — Markdown, capped at 10 rows. If longer, end with "… N more in result.csv". Never dump 200 rows inline.
  3. Interpretation — two or three sentences of plain-English insight. No header, no "here are the results:" preamble.

For exploratory queries with nothing to interpret (DESCRIBE, etc.), just show 1 + 2.

Table naming

The xarray Dataset is registered under a name derived from the dataset (provider prefix stripped, slugified):

Dataset name Table name
Impact Observatory — Land Cover 9-Class land_cover_9_class
Chloris — Aboveground Biomass Stock and Change 10m aboveground_biomass_stock_and_change_10m
Lobelia Earth — Surface Soil Moisture 30m surface_soil_moisture_30m

Use the same name in your SQL (FROM land_cover_9_class, not FROM cecil). scripts/run_analysis.py does the derivation automatically and prints the table name.

SQL idioms

Always start with DESCRIBE to confirm column names before writing real SQL:

DESCRIBE land_cover_9_class;

Top class per timestampROW_NUMBER OVER:

SELECT time, ref."Name", cnt
FROM (
  SELECT time, land_cover_class, COUNT(*) AS cnt,
    ROW_NUMBER() OVER (PARTITION BY time ORDER BY COUNT(*) DESC) AS rn
  FROM land_cover_9_class
  GROUP BY time, land_cover_class
) r
JOIN ref_land_cover_class ref ON r.land_cover_class = ref."Index"
WHERE rn = 1
ORDER BY time;

Change detectionLAG OVER per pixel:

SELECT x, y, time,
  land_cover_class AS new_class,
  LAG(land_cover_class) OVER (PARTITION BY x, y ORDER BY time) AS prev_class
FROM land_cover_9_class;

Filter with WHERE prev_class IS DISTINCT FROM new_class AND prev_class IS NOT NULL to keep only changed pixels.

Time-series aggregation — plain GROUP BY time:

SELECT time, AVG(aboveground_biomass) AS mean_biomass
FROM aboveground_biomass_stock_and_change_10m
WHERE aboveground_biomass IS NOT NULL
GROUP BY time
ORDER BY time;

Quoting rule (datafusion gotcha)

Reference-table columns from the API are PascalCase ("Index", "Name"). Double-quote them in SQL — datafusion lower-cases unquoted identifiers and you'll see "Schema error: no field named index". Get this right and joins work.

Vector vs raster datasets

Most Cecil datasets are raster (load_xarray). The IBAT family (IUCN Red List, KBA, WDPA, STAR) is vector — use load_dataframe. run_analysis.py accepts --vector for these:

uv run python scripts/run_analysis.py --vector \
  --subscription <id> --sql 'SELECT * FROM wdpa_protected_areas LIMIT 5'

The full SDK gotchas (auth, subscription_id vs dataset_id, AOI geometry rules) live in tutorials/cecil-data-analysis-xql/references/sdk.md.

Failure modes — say so plainly

  • No subscriptions and no clear dataset match → say what you looked for, list candidates, stop.
  • SQL fails twice with the same error → stop and show the user. Often the fix is the DESCRIBE you skipped.
  • Question isn't really about Cecil data → decline. Better than silently producing a confident-sounding irrelevant answer.

For the full worked end-to-end example (dominant land cover 2020 vs 2023 with chart), see the tutorial.