Data Warehouse

A data warehouse (DWH) is a centralized repository optimized for analytical queries across multiple source systems — ERP, CRM, e-commerce, marketing, finance — rather than the transactional patterns of operational databases. Data warehousing has been the foundation of business intelligence since the 1990s, with cloud-native warehouses (Snowflake, Google BigQuery, Amazon Redshift, Databricks) dominating new deployments since the late 2010s.

Architecture and modeling

Two dominant modeling paradigms. Kimball dimensional modeling: facts (measurements like sales, order quantities) plus dimensions (descriptive attributes like time, customer, product) organized in star schemas. Optimized for query performance and business-user readability. Inmon corporate information factory: normalized data warehouse with data marts derived for specific analytical needs. More rigid but cleaner enterprise data structure. Modern cloud-warehouse practice often combines both: raw layer (operational source data preserved), staging layer (cleaned and conformed), presentation layer (dimensional for end-user consumption). Tools like dbt (data build tool) have standardized the SQL-based transformation patterns that build these layers.

Cloud data warehouses

Snowflake — cloud-native, multi-cloud (AWS, Azure, Google), strong presence across US mid-market and enterprise. Google BigQuery — serverless, pay-per-query, deep ML and AI integration. Amazon Redshift — AWS-native, more traditional warehouse model. Microsoft Azure Synapse Analytics (formerly SQL Data Warehouse) and Microsoft Fabric — integrated with the Microsoft data stack, common in organizations standardized on Microsoft 365. Databricks — data-lakehouse platform combining warehouse and data-lake capabilities. SAP Datasphere (formerly Data Warehouse Cloud) — SAP-native with direct S/4HANA integration. Open source: Apache Druid, ClickHouse, DuckDB (single-machine analytics). For US mid-market companies starting fresh, Snowflake and Microsoft Fabric are the most-commonly evaluated options.

Data lake versus lakehouse

Adjacent categories worth distinguishing. Data lake: raw-data storage in flexible formats (parquet, JSON, CSV) on object storage (S3, ADLS, GCS), supporting unstructured and semi-structured data. Lower cost than a warehouse, less optimized for SQL queries. Data lakehouse: combines lake-style storage with warehouse-style ACID transactions and SQL performance. Apache Iceberg, Delta Lake and Apache Hudi are the open table formats enabling lakehouse architectures; Databricks and Snowflake both support lakehouse patterns. For ERP-centric analytics in the US mid-market, the classical data-warehouse pattern suits most use cases; lakehouse becomes valuable when unstructured data (logs, IoT, documents) joins the analytics scope.

ERP-to-DWH integration

ERP data flows into the warehouse through ETL or ELT pipelines (see ETL). Common patterns. (1) Replication via CDC: tools like SLT (SAP), Debezium or Fivetran capture every ERP database change and stream to the warehouse with minute-level latency. (2) Scheduled batch extracts: nightly or hourly batch pulls of changed records via OData or vendor APIs. (3) Vendor data hubs: SAP Datasphere, Oracle Autonomous Data Warehouse, Microsoft Dataverse provide ERP-vendor-managed data layers that simplify integration. US mid-market companies typically combine tools: Fivetran or Airbyte for SaaS sources, vendor connectors for SAP and Oracle, custom Python for legacy on-premises ERPs. Downstream BI tools (Power BI, Tableau, Qlik) consume the warehouse for analytics and dashboards.

Data Warehouse

Architecture and modeling

Cloud data warehouses

Data lake versus lakehouse

ERP-to-DWH integration

Related Topics