Analysis Recipes (R)¶
Per-question-type recipes for going from fact_response to a chart in R.
The SQL is the same as in the Python recipes and is
canonical in Query Patterns by Question Type. This page
swaps the chart layer for ggplot2 and
the connection layer for the duckdb R
package.
quickq itself is a Python CLI, but its OLAP output is a DuckDB file that R
can query directly with no Python in the loop.
Setup: get a database to query¶
Every recipe below assumes a DuckDB OLAP file (analytics.duckdb) produced
by quickq refresh. Two ways to get one:
Option A: use the bundled demo¶
Run these once at the shell to produce a 500-respondent seeded study:
curl -O https://raw.githubusercontent.com/quickq-io/quickq/main/examples/health_intake_demo.yaml
quickq init study.db
quickq load health_intake_demo.yaml study.db
quickq seed study.db 1 --n 500 --seed 20260503
quickq refresh study.db analytics.duckdb
The link_id values in the recipes below match the demo (e.g.
demo.general_health, demo.self_management).
Option B: use your own study¶
If you have a populated study.db, just refresh it:
quickq refresh study.db analytics.duckdb
Adjust the link_id in each recipe to a question that exists in your
instrument. If your study spans multiple questionnaires, add
AND dq.questionnaire_id = ? to each query's WHERE clause.
R session setup¶
library(duckdb)
library(DBI)
library(ggplot2)
con <- dbConnect(duckdb::duckdb(), "analytics.duckdb", read_only = TRUE)
q <- function(sql) dbGetQuery(con, sql)
When you're done: dbDisconnect(con, shutdown = TRUE).
single_choice¶
Bar chart of the option distribution.
df <- q("
SELECT opt.option_text AS choice, MIN(opt.option_value) AS sort_key, COUNT(*) AS n
FROM fact_response fr
JOIN dim_question dq USING (question_id)
JOIN dim_response_option opt USING (option_id)
WHERE dq.link_id = 'demo.general_health'
GROUP BY 1
ORDER BY sort_key
")
df$choice <- factor(df$choice, levels = df$choice)
ggplot(df, aes(x = choice, y = n)) +
geom_col() +
labs(x = "Response", y = "Respondents")
multiple_choice¶
Selection rate (% of respondents who picked each option). Distinct-session count is the right denominator because one respondent can pick many options.
df <- q("
WITH n_resp AS (
SELECT COUNT(DISTINCT session_id) AS total FROM fact_response
)
SELECT
opt.option_text AS choice,
COUNT(DISTINCT fr.session_id) AS n_selected,
ROUND(100.0 * COUNT(DISTINCT fr.session_id) / n_resp.total, 1) AS pct
FROM fact_response fr
JOIN dim_question dq USING (question_id)
JOIN dim_response_option opt USING (option_id)
CROSS JOIN n_resp
WHERE dq.link_id = 'demo.management_strategies'
GROUP BY 1, n_resp.total
ORDER BY pct DESC
")
ggplot(df, aes(x = pct, y = reorder(choice, pct))) +
geom_col() +
labs(x = "% of respondents", y = NULL)
sata_other¶
Same as multiple_choice for the structured options. The free-text "Other"
content arrives as rows where option_id IS NULL and response_text holds
the typed-in value.
# Frequency, including the "Other" bucket
df <- q("
SELECT
COALESCE(opt.option_text, 'Other (free text)') AS choice,
COUNT(*) AS n
FROM fact_response fr
JOIN dim_question dq USING (question_id)
LEFT JOIN dim_response_option opt USING (option_id)
WHERE dq.link_id = 'demo.current_symptoms'
GROUP BY 1
ORDER BY n DESC
")
# Inspect the Other free-text values for thematic coding
others <- q("
SELECT response_text, COUNT(*) AS n
FROM fact_response fr
JOIN dim_question dq USING (question_id)
WHERE dq.link_id = 'demo.current_symptoms'
AND fr.option_id IS NULL
AND fr.response_text IS NOT NULL
GROUP BY 1
ORDER BY n DESC
")
boolean¶
Proportion positive with a Wilson 95% confidence interval.
df <- q("
SELECT
dq.link_id,
COUNT(*) AS n,
SUM(CASE WHEN fr.response_boolean = TRUE THEN 1 END) AS n_true
FROM fact_response fr
JOIN dim_question dq USING (question_id)
WHERE dq.question_type = 'boolean'
GROUP BY dq.link_id
")
wilson_ci <- function(k, n, z = 1.96) {
if (n == 0) return(c(NA_real_, NA_real_))
p <- k / n
denom <- 1 + z^2 / n
centre <- (p + z^2 / (2 * n)) / denom
half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
c(centre - half, centre + half)
}
ci <- t(mapply(wilson_ci, df$n_true, df$n))
df$pct_true <- 100 * df$n_true / df$n
df$ci_low <- 100 * ci[, 1]
df$ci_high <- 100 * ci[, 2]
ggplot(df, aes(x = pct_true, y = link_id)) +
geom_errorbarh(aes(xmin = ci_low, xmax = ci_high), height = 0.2) +
geom_point(size = 3) +
labs(x = "% true (95% CI)", y = NULL)
response_boolean is a typed BOOLEAN column on fact_response. The OLTP
stores boolean answers as 'true' / 'false' strings in response_text;
the OLAP refresh promotes them to a real BOOLEAN at load time.
text¶
Free text is rarely chart-friendly. Two useful summaries are response rate and length distribution; the actual content typically goes to qualitative review or downstream NLP.
df <- q("
SELECT LENGTH(fr.response_text) AS chars
FROM fact_response fr
JOIN dim_question dq USING (question_id)
WHERE dq.link_id = 'demo.additional_notes'
AND fr.response_text IS NOT NULL
AND LENGTH(fr.response_text) > 0
")
ggplot(df, aes(x = chars)) +
geom_histogram(bins = 30) +
labs(x = "Response length (characters)", y = "Responses")
numeric¶
Histogram. The OLAP refresh pre-computes percentile and spread metrics in
agg_numeric_stats if you want them without rolling your own.
df <- q("
SELECT response_numeric AS value
FROM fact_response fr
JOIN dim_question dq USING (question_id)
WHERE dq.link_id = 'demo.symptom_days'
AND fr.response_numeric IS NOT NULL
")
ggplot(df, aes(x = value)) +
geom_histogram(bins = 30) +
labs(x = "Symptom days in past 30", y = "Respondents")
date¶
Time series of when answers fell. For dateTime questions, swap
response_date for response_text (ISO 8601) and parse upstream.
df <- q("
SELECT
DATE_TRUNC('month', fr.response_date) AS month,
COUNT(*) AS n
FROM fact_response fr
JOIN dim_question dq USING (question_id)
WHERE dq.link_id = 'demo.diagnosis_date'
AND fr.response_date IS NOT NULL
GROUP BY 1
ORDER BY 1
")
df$month <- as.Date(df$month)
ggplot(df, aes(x = month, y = n)) +
geom_line() +
geom_point() +
labs(x = "Diagnosis month", y = "Respondents")
slider¶
Identical to numeric for analysis purposes. The slider's range and labels
matter for rendering, not aggregation. A density chart often reads better
than a histogram for slider distributions:
df <- q("
SELECT response_numeric AS value
FROM fact_response fr
JOIN dim_question dq USING (question_id)
WHERE dq.link_id = 'demo.pain_vas'
AND fr.response_numeric IS NOT NULL
")
ggplot(df, aes(x = value)) +
geom_density(fill = "steelblue", alpha = 0.4) +
xlim(0, 100) +
labs(x = "Pain VAS (0-100)", y = "Density")
likert¶
A Likert is an ordered single-choice. The canonical visualization is a diverging stacked bar centered on the neutral midpoint. The simpler form is a colored ordinal bar:
df <- q("
SELECT
CAST(opt.option_value AS INTEGER) AS score,
opt.option_text AS label,
COUNT(*) AS n
FROM fact_response fr
JOIN dim_question dq USING (question_id)
JOIN dim_response_option opt USING (option_id)
WHERE dq.link_id = 'demo.self_management'
GROUP BY 1, 2
ORDER BY 1
")
df$pct <- 100 * df$n / sum(df$n)
df$label <- factor(df$label, levels = df$label)
ggplot(df, aes(x = label, y = pct, fill = score)) +
geom_col() +
scale_fill_distiller(palette = "RdYlGn", direction = 1, guide = "none") +
labs(x = NULL, y = "% of respondents")
grid¶
Heatmap with rows = grid rows and columns = grid columns. The OLAP fact
table carries grid_row_id / grid_column_id; the row/column text labels
live in the OLTP. Attach the OLTP read-only and join in one query:
dbExecute(con, "ATTACH 'study.db' AS oltp (TYPE sqlite, READ_ONLY)")
df <- q("
SELECT
gr.row_text AS area,
gc.column_text AS severity,
CAST(gc.column_value AS INTEGER) AS severity_value,
COUNT(*) AS n
FROM fact_response fr
JOIN dim_question dq USING (question_id)
JOIN oltp.grid_row gr ON fr.grid_row_id = gr.row_id
JOIN oltp.grid_column gc ON fr.grid_column_id = gc.column_id
WHERE dq.link_id = 'demo.daily_impact'
GROUP BY 1, 2, 3
")
severity_order <- c("Not at all", "A little", "Moderately", "Quite a bit", "Extremely")
df$severity <- factor(df$severity, levels = severity_order)
ggplot(df, aes(x = severity, y = area, fill = n)) +
geom_tile() +
scale_fill_distiller(palette = "Blues", direction = 1) +
labs(x = NULL, y = NULL, fill = "Respondents")
ranked¶
response_numeric holds rank position (1 = first choice). Mean rank is the
natural summary; lower values indicate more frequent prioritization.
df <- q("
SELECT
opt.option_text AS priority,
ROUND(AVG(fr.response_numeric), 2) AS mean_rank,
COUNT(*) AS n
FROM fact_response fr
JOIN dim_question dq USING (question_id)
JOIN dim_response_option opt USING (option_id)
WHERE dq.link_id = 'demo.care_priorities'
GROUP BY 1
ORDER BY mean_rank
")
ggplot(df, aes(x = mean_rank, y = reorder(priority, -mean_rank))) +
geom_col() +
xlim(1, max(df$mean_rank) + 0.5) +
labs(x = "Mean rank (1 = most important)", y = NULL)
repeating_group¶
Children of a repeating group carry a repeat_index (0-based instance
number). The natural unit of analysis is (session_id, repeat_index) — one
loop instance per row. Pivot to one row per instance, then summarize across
or within instances as needed.
The bundled demo doesn't include a repeating group, so the snippet below
uses placeholder visits.* link_ids. Substitute your own.
df <- q("
SELECT
fr.respondent_id,
fr.repeat_index AS visit,
MAX(CASE WHEN dq.link_id = 'visits.week'
THEN fr.response_numeric END) AS gestational_week
FROM fact_response fr
JOIN dim_question dq USING (question_id)
WHERE dq.link_id LIKE 'visits.%'
GROUP BY 1, 2
")
# Distribution of visit count per respondent.
# The aggregate() call below errors on an empty data.frame, so guard it —
# this lets the snippet run cleanly against a demo that has no repeating
# group while still producing the chart whenever real data is present.
if (nrow(df) > 0) {
visit_count <- aggregate(visit ~ respondent_id, df, max)
visit_count$n_visits <- visit_count$visit + 1
ggplot(visit_count, aes(x = factor(n_visits))) +
geom_bar() +
labs(x = "Visits recorded", y = "Respondents")
}
Skip patterns¶
Two questions about a conditional question:
- Did the right people see it? Compare the count of answers against the count of respondents whose trigger answer met the show-when rule.
- What does the answer look like for the people who saw it? Standard per-type recipe, restricted to the conditional sub-cohort.
# Integrity check: every "yes" to the trigger should produce one date answer
df <- q("
SELECT
SUM(CASE WHEN dq.link_id = 'demo.has_chronic_condition'
AND fr.response_boolean = TRUE THEN 1 END) AS triggered,
SUM(CASE WHEN dq.link_id = 'demo.has_chronic_condition'
AND fr.response_boolean = FALSE THEN 1 END) AS suppressed,
SUM(CASE WHEN dq.link_id = 'demo.diagnosis_date'
THEN 1 END) AS conditional_answers
FROM fact_response fr
JOIN dim_question dq USING (question_id)
")
stopifnot(df$triggered == df$conditional_answers)
# Conditional analysis: only the triggered sub-cohort
df <- q("
SELECT fr.response_date AS diagnosis_date
FROM fact_response fr
JOIN dim_question dq USING (question_id)
WHERE dq.link_id = 'demo.diagnosis_date'
AND fr.session_id IN (
SELECT fr2.session_id FROM fact_response fr2
JOIN dim_question dq2 USING (question_id)
WHERE dq2.link_id = 'demo.has_chronic_condition'
AND fr2.response_boolean = TRUE
)
")
The EXISTS / IN (subquery) pattern is the workhorse for skip-aware
denominators. If you cohort the same way often, materialize a session-level
flag table once and join against it.
Cross-question comparisons¶
Two questions, one cohort. Pivot one row per session with both answers.
df <- q("
SELECT
fr.session_id,
MAX(CASE WHEN dq.link_id = 'demo.self_management'
THEN fr.response_numeric END) AS confidence,
MAX(CASE WHEN dq.link_id = 'demo.symptom_days'
THEN fr.response_numeric END) AS symptom_days
FROM fact_response fr
JOIN dim_question dq USING (question_id)
WHERE dq.link_id IN ('demo.self_management', 'demo.symptom_days')
GROUP BY fr.session_id
")
ggplot(df, aes(x = symptom_days, y = confidence)) +
geom_point(alpha = 0.5) +
labs(x = "Symptom days (past 30)", y = "Self-management confidence (1-5)")
For instrument-level scores (e.g. PHQ-9), don't recompute — use
agg_respondent_scores which is materialized by quickq refresh from the
scoring rule defined in the YAML. See
Analytics & Data Model for the score table schema.