This model defines how to handle CVR coding details that are easy to misread in raw data.
Why this exists
AI and humans should not have to guess where code lists live or how code formats change over time.
This page defines one consistent policy for:
- municipality codes (`kommunekode`)
- branch coding transitions (DB07 → DB25)
- DB/NACE hierarchy handling
- local codebook copies vs. source links
Operational Rules for Extraction
Apply these rules before joining, filtering, or aggregating CVR data:
- Normalize municipality codes to 4-digit strings before any match.
- Treat `registrering*` and `virkning*` as different time axes.
- Treat branch code systems as time-bound (`DB07` and `DB25` are not interchangeable).
- Treat `CVRAdresse` as text-based address components, not geometry.
- Do not geocode until address fields are normalized.
- For current-state extraction, filter `registreringTil IS NULL` and `virkningTil IS NULL` unless the question explicitly asks for history.
1) Municipality codes
Practical issue
The same municipality can appear as 101 or 0101.
Canonical rule
- Canonical storage format: 4-digit zero-padded string
- Example: Copenhagen = `0101`
Query/interoperability rule
- Input may be accepted as `101` or `0101`
- Normalize before joins and filtering
- Publish normalized value in outputs
Failure mode to avoid
If one side of a join uses 101 and the other uses 0101, the join will silently fail or undercount. Always normalize first.
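A minimal sketch of the normalization step, assuming codes arrive as ints or digit strings. The function name `normalize_kommunekode` and the sample rows are illustrative, not part of any CVR API:

```python
def normalize_kommunekode(value) -> str:
    """Return a municipality code as a 4-digit zero-padded string.

    Accepts ints or strings such as 101, "101" or "0101".
    """
    text = str(value).strip()
    if not (text.isascii() and text.isdigit()) or len(text) > 4:
        raise ValueError(f"not a municipality code: {value!r}")
    return text.zfill(4)


# Hypothetical rows: one side stores 101, the lookup is keyed on 0101.
firms = [{"cvr": "A", "kommunekode": 101}]
names = {"0101": "Copenhagen"}

# Normalizing the join key first lets the lookup succeed.
joined = [
    (f["cvr"], names.get(normalize_kommunekode(f["kommunekode"])))
    for f in firms
]
```

Raising on anything that is not one to four digits is a design choice here: a loud failure is easier to trace than a silently dropped row.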
2) Branche coding over time (DB07 and DB25)
Practical issue
A firm can have historical records coded in DB07 and newer records coded in DB25 (from 2025 updates). The switch can look like a change of activity even when the business meaning is stable.
Canonical rule
- Treat branch code as time-bound classification
- Do not merge DB07 and DB25 blindly
- Always record:
  - `code_system` (`DB07` or `DB25`)
  - `code_value`
  - time context (virkning/registrering scope)
Analysis rule
When combining periods across 2007-2024 and 2025+:
- keep native code system per record
- apply explicit mapping table between systems
- report mapping uncertainty where one-to-many or many-to-one mappings occur
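One way to make mapping uncertainty explicit is to classify every crosswalk entry before using it. The sketch below assumes the crosswalk is available as `(db07_code, db25_code)` pairs; the function name and all code values are invented placeholders, not real DB07 or DB25 codes:

```python
from collections import defaultdict


def classify_mappings(pairs):
    """Map each DB07 code to (sorted DB25 targets, mapping kind)."""
    forward = defaultdict(set)  # DB07 -> set of DB25 codes
    reverse = defaultdict(set)  # DB25 -> set of DB07 codes
    for db07, db25 in pairs:
        forward[db07].add(db25)
        reverse[db25].add(db07)

    result = {}
    for db07, targets in forward.items():
        if len(targets) > 1:
            # Also covers many-to-many; treat it as uncertain either way.
            kind = "one-to-many"
        elif len(reverse[next(iter(targets))]) > 1:
            kind = "many-to-one"
        else:
            kind = "one-to-one"
        result[db07] = (sorted(targets), kind)
    return result


# Placeholder pairs for illustration only.
pairs = [
    ("111111", "211111"),
    ("122222", "222222"),
    ("122222", "222223"),
    ("133333", "233333"),
    ("144444", "233333"),
]
mapping = classify_mappings(pairs)
```

Records whose mapping kind is not `one-to-one` can then be flagged in the output instead of being merged silently.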
Detection rule
- Records from 2025 onward should be assumed to use DB25 unless source metadata says otherwise.
- Historical records may still contain DB07 values.
- Never infer code-system equivalence from identical text labels alone.
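The detection rule could be sketched as follows. Several things here are assumptions rather than facts from the source: the cutover constant `DB25_START` (taken as 1 January 2025 from "2025 onward"), the use of `virkningFra` as the deciding date, the DB07 default for earlier records, and the names `BranchCode` and `detect_code_system`:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumption: "from 2025 onward" is read as a 1 January 2025 cutover.
DB25_START = date(2025, 1, 1)


@dataclass(frozen=True)
class BranchCode:
    code_system: str              # "DB07" or "DB25"
    code_value: str               # raw code as stored
    virkning_fra: date            # start of legal validity
    virkning_til: Optional[date]  # None means still valid


def detect_code_system(virkning_fra: date, declared: Optional[str] = None) -> str:
    """Prefer source metadata; otherwise fall back to the date rule."""
    if declared in ("DB07", "DB25"):
        return declared
    # Assumption: records before the cutover are treated as DB07.
    return "DB25" if virkning_fra >= DB25_START else "DB07"


record = BranchCode(
    code_system=detect_code_system(date(2025, 3, 1)),
    code_value="561110",
    virkning_fra=date(2025, 3, 1),
    virkning_til=None,
)
```

Source metadata takes precedence over the date rule, so a record that declares DB07 keeps that label even when it is dated after the cutover.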
3) DB/NACE representation
Practical issue
Branche values are often represented as six digits without dot separators.
Canonical rule
- Keep raw code as provided (6-digit string)
- Derive hierarchy by prefix instead of punctuation:
  - level 1 = first 2 digits
  - level 2 = first 4 digits
  - level 3 = all 6 digits
This allows stable grouping even when display punctuation differs.
Query rule
- If the user supplies dotted codes such as `56.11.10`, strip punctuation before matching raw CVR values.
- Preserve the raw stored code in output and optionally add derived hierarchical prefixes.
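A small sketch of the prefix-based hierarchy and the dotted-input rule. The helper names `to_raw_code` and `hierarchy` and the `level_*` keys are illustrative:

```python
def to_raw_code(user_code: str) -> str:
    """Strip dot separators so the value matches the raw 6-digit form."""
    raw = user_code.replace(".", "").strip()
    if len(raw) != 6 or not (raw.isascii() and raw.isdigit()):
        raise ValueError(f"expected a 6-digit branch code: {user_code!r}")
    return raw


def hierarchy(raw: str) -> dict:
    """Derive the three levels by prefix; the raw code stays unchanged."""
    return {"level_1": raw[:2], "level_2": raw[:4], "level_3": raw}


raw = to_raw_code("56.11.10")
levels = hierarchy(raw)
```

Grouping by `level_1` or `level_2` then works the same way whether the input was written with or without dots.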
4) Current vs Historical filtering
Current-state rule
Use both:
- `registreringTil IS NULL`
- `virkningTil IS NULL`
Historical rule
For a target date t, filter by legal validity first:
- `virkningFra <= t`
- `virkningTil IS NULL OR t < virkningTil`
Then decide whether system-registration time also matters for the use case.
Failure mode to avoid
Using only registreringTil IS NULL returns what is currently stored, not necessarily what was legally valid at a historical date.
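The two filters can be sketched as plain predicates over records held as dicts, with `None` standing in for SQL `NULL`. The record layout and the `branche` placeholder values are assumptions for illustration:

```python
from datetime import date


def is_current(rec: dict) -> bool:
    """Current state: open on both time axes."""
    return rec["registreringTil"] is None and rec["virkningTil"] is None


def valid_at(rec: dict, t: date) -> bool:
    """Legal validity at date t, using a half-open interval [fra, til)."""
    return rec["virkningFra"] <= t and (
        rec["virkningTil"] is None or t < rec["virkningTil"]
    )


# Hypothetical history for one firm.
records = [
    {"branche": "A", "virkningFra": date(2015, 1, 1),
     "virkningTil": date(2020, 1, 1), "registreringTil": None},
    {"branche": "B", "virkningFra": date(2020, 1, 1),
     "virkningTil": None, "registreringTil": None},
]

current = [r["branche"] for r in records if is_current(r)]
in_2018 = [r["branche"] for r in records if valid_at(r, date(2018, 6, 1))]
```

Both rows have `registreringTil` unset, so filtering on that axis alone would return the closed period `A` as if it were current; the `virkningTil` check is what excludes it.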
5) Address normalization for geocoding
Before matching CVRAdresse to an external address register:
- zero-pad `kommunekode`
- normalize `vejnavn` case and whitespace
- parse `husnummerfra` and `bogstav` into canonical house-number text
- preserve `postnummer` as string
- keep unmatched addresses in a separate exception set
Recommended match tuple
Use this order of preference:
- `kommunekode` + `vejnavn` + `husnummerfra` + `bogstav`
- add `postnummer` when ambiguity remains
- fall back to formatted free text only as a last resort
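The normalization steps and the preferred match tuple might look like this in practice. This is a sketch under several assumptions: rows are dicts keyed by the field names above, the house number and letter are joined into one upper-cased string such as `12A`, and `register_index` is a hypothetical dict from match key to an external register ID. The address values are invented:

```python
def normalize_address(rec: dict) -> dict:
    """Normalize CVRAdresse components before matching."""
    return {
        "kommunekode": str(rec["kommunekode"]).strip().zfill(4),
        # Collapse runs of whitespace and ignore case.
        "vejnavn": " ".join(str(rec["vejnavn"]).split()).casefold(),
        # Canonical house-number text, e.g. 12 + "a" -> "12A".
        "husnummer": str(rec["husnummerfra"]).strip()
        + (rec.get("bogstav") or "").strip().upper(),
        "postnummer": str(rec["postnummer"]).strip(),
    }


def match_key(addr: dict, with_postnummer: bool = False) -> tuple:
    """Preferred tuple; add postnummer only when ambiguity remains."""
    key = (addr["kommunekode"], addr["vejnavn"], addr["husnummer"])
    return key + (addr["postnummer"],) if with_postnummer else key


def split_matches(rows, register_index):
    """Return (matched, exceptions); unmatched rows are kept, not dropped."""
    matched, exceptions = [], []
    for row in rows:
        hit = register_index.get(match_key(normalize_address(row)))
        if hit is None:
            exceptions.append(row)
        else:
            matched.append((row, hit))
    return matched, exceptions


row = {"kommunekode": 101, "vejnavn": "  Eksempel   Vej ",
       "husnummerfra": 12, "bogstav": "a", "postnummer": "1234"}
addr = normalize_address(row)
```

The free-text fallback is deliberately left out of the sketch, since the policy treats it as a last resort.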
6) Local copy vs source-only links
Recommendation: Hybrid policy (required)
Use both:
- Authoritative source links for provenance
- Local frozen codebook snapshots for reproducibility
Why:
- source links provide legal/official traceability
- local snapshots prevent future source changes from silently altering historical analyses
- AI can resolve lookups deterministically without web guessing
7) Minimum codebook package per project
Each project should maintain local snapshots in sanctuary/lookup assets with metadata:
- `municipality_codes.vYYYYMMDD.csv`
- `branche_db07.vYYYYMMDD.csv`
- `branche_db25.vYYYYMMDD.csv`
- optional `db07_to_db25_crosswalk.vYYYYMMDD.csv`
Each snapshot must include:
- source URL
- retrieval date
- code system version
- checksum/hash
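A possible way to produce that metadata when a snapshot is frozen. The function name, the dict keys, and the choice of SHA-256 are assumptions; the policy only requires that a checksum or hash is recorded:

```python
import hashlib
from datetime import date
from pathlib import Path


def snapshot_metadata(path, source_url: str, code_system_version: str,
                      retrieved: date) -> dict:
    """Build the metadata record for one frozen codebook snapshot."""
    file_path = Path(path)
    digest = hashlib.sha256(file_path.read_bytes()).hexdigest()
    return {
        "file": file_path.name,
        "source_url": source_url,
        "retrieval_date": retrieved.isoformat(),
        "code_system_version": code_system_version,
        "sha256": digest,
    }
```

Recomputing the hash before each analysis run and comparing it with the stored value is a cheap check that the local snapshot has not changed since it was frozen.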
8) Common failure cases
| Failure case | Why it happens | Preventive rule |
|---|---|---|
| Empty municipality join | 101 vs 0101 mismatch | zero-pad before joins |
| Wrong sector aggregation | DB07 and DB25 mixed without mapping | detect code system per record |
| Missing geocodes | raw CVRAdresse text not normalized | normalize address components before DAR join |
| False current snapshot | only one temporal axis filtered | use both registreringTil and virkningTil for current state |
| Wrong location semantics | Virksomhed used instead of Produktionsenhed | decide legal entity vs operational site up front |
9) AI execution checklist
Before filtering or joining:
- Normalize municipality codes to 4-digit strings
- Detect branch code system by time and metadata
- Resolve DB07/DB25 mapping policy
- Use local snapshot lookup tables
- Log all normalization and mapping steps in Design_Rationale