Introduction to Data Transformation
Functional Overview
Data transformation gives data engineers and developers an efficient, professional, and intelligent data development platform. With capabilities such as script development, visual development, task orchestration, task publishing, and task operations and maintenance, it helps organizations and businesses build real-time data lakehouses efficiently.
Feature Details
Modeling
- Visual Mode (Recommended): You can create a new transformation model table through a graphical interface by navigating to: Data Source -> Output Source -> Transformation Warehouse -> ETL Layer -> New Table.
- DDL Mode: You can create a new table using SQL statements by navigating to: Data Source -> Output Source -> Transformation Warehouse -> ETL Layer -> Query.
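In DDL Mode, a table is defined with a standard CREATE TABLE statement. The sketch below is only illustrative and assumes a hypothetical order-summary table; the table name, columns, and the exact DDL syntax accepted by the transformation warehouse may differ.

```sql
-- Hypothetical DDL-mode table in the etl transformation layer.
-- Names and types are illustrative; check the Query panel for the
-- exact syntax your transformation warehouse accepts.
CREATE TABLE etl.dwd_order_summary (
    order_date    DATE,
    store_id      BIGINT,
    order_count   BIGINT,
    total_amount  DECIMAL(18, 2),
    PRIMARY KEY (order_date, store_id)
);
```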
Layer Concept
- All source database data is synchronized to the `input` layer of the data warehouse. All transformation tables are created in the `etl` transformation layer of the data warehouse.
- A task's level is determined by the maximum level of the transformation tasks that produce its input tables.
- If all input tables are from the `input` layer, the current task is Level 1.
- If the input tables include an output table from a Level n transformation task, the current task becomes Level n+1 (see the sketch after this list).
- Task levels ensure clear data transformation dependencies, enabling layered transformation and streamed triggering while preventing circular dependencies.
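As a sketch of how levels are derived (the table names and the `INSERT INTO ... SELECT` task form are assumptions for illustration), a task reading only `input`-layer tables is Level 1, while a task that reads that task's output table becomes Level 2:

```sql
-- Level 1 task: every input table comes from the input layer.
INSERT INTO etl.dwd_orders
SELECT order_id, store_id, amount, created_at
FROM input.ods_orders;

-- Level 2 task: reads etl.dwd_orders, the output of the Level 1 task above,
-- so its level is the producing task's level plus one.
INSERT INTO etl.dws_store_daily
SELECT store_id, CAST(created_at AS DATE) AS order_date, SUM(amount) AS total_amount
FROM etl.dwd_orders
GROUP BY store_id, CAST(created_at AS DATE);
```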
Scripting Guide
- To reduce code duplication, the platform supports using global variables `${var}` in SQL to replace repetitive code; see the sketch after this list.
- For a list of supported SQL functions, refer to the documentation: 👉 Yaoqing SQL
- For SQL transformation standards, see: 👉 Transformation Guidelines
- For SQL editor shortcuts, see: 👉 Keyboard Shortcuts
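A minimal sketch of `${var}` substitution, assuming a global variable named `biz_date` has been defined (the variable, table, and column names are hypothetical):

```sql
-- ${biz_date} is replaced with the variable's value before execution,
-- so the same date filter does not need to be repeated in every script.
SELECT store_id, SUM(amount) AS total_amount
FROM etl.dwd_orders
WHERE CAST(created_at AS DATE) = DATE '${biz_date}'
GROUP BY store_id;
```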
Task Details
- The platform processes data in real-time streams based on user-provided SQL. To prevent the real-time state from growing indefinitely, users must specify a time-based field in the `WHERE` clause to constrain the scope of real-time calculations (see the sketch at the end of this section):
  - Daily Job - Computes on the last 2 days of data, with support for hourly backfills of historical data.
  - Hourly Job - Computes on the last 3 hours of data, with support for backfills every 5 minutes.
  - Minute-level Job - Computes on the last 2 minutes of data, with support for backfills every 5 minutes.
- By default, when a task starts, it resumes incremental processing from the point where it last stopped or encountered an error. On its first run, it reads and processes data based on the `WHERE` clause.
- The platform provides two distinct runtime environments, optimized for small and large tasks respectively, and switches between them automatically based on the job's characteristics.
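For instance, a daily job might constrain its real-time state with a time-based filter like the one below. This is a sketch only; the `created_at` field, the interval syntax, and the table names are assumptions and may differ in Yaoqing SQL.

```sql
-- Daily job: limit streaming state to roughly the last 2 days of data.
-- On the first run the task reads data matching this WHERE clause;
-- afterwards it resumes incrementally from where it last stopped.
INSERT INTO etl.dws_store_daily
SELECT store_id, CAST(created_at AS DATE) AS order_date, SUM(amount) AS total_amount
FROM etl.dwd_orders
WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '2' DAY
GROUP BY store_id, CAST(created_at AS DATE);
```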