The Quiet Change in Data Engineering on Google Cloud

In 2026, the interesting question on GCP is no longer whether AI can write your SQL. The assistant has moved from the editor into the pipeline, and the real limit was never the model.

By: Mahesh T V, Author (Data Engineering with GCP)

For most of the last decade, creating data platforms on Google Cloud meant adapting to a specific kind of abstraction. You stopped thinking about clusters and began to focus on slots. BigQuery managed the scale, Dataflow handled the streaming, Composer coordinated various components, and your main job was to connect these elements smoothly while managing costs. The interesting work happened at the intersections.

The landscape is moving fast, and honestly, most teams are behind. Google didn't invent the data assistant in 2026; every vendor had shipped something by 2024. What's different is the scope. The assistant now operates inside the pipeline instead of just sitting in the editor, which changes what's actually possible.

From autocomplete to autonomy

The clearest sign of this shift is the Data Engineering Agent now integrated into BigQuery pipelines and Dataform. Earlier, Gemini features primarily provided smart autocomplete. They explained a query, completed a line of SQL, and suggested fixes. This was useful, but you still had to do the engineering work. The agent now offers a different experience. You can describe the pipeline you want in simple terms, and it generates the transformation code according to your specified conventions. A key operational aspect is that it can identify its own failures. When a job fails, instead of going through execution logs at 2 AM, you let the agent read the logs and pinpoint the root cause before suggesting a solution.

AI moved into the SQL engine

The second change is less flashy but likely more important. A surprising amount of what used to require three separate services can now be done with one SQL function.

Take OCR and layout parsing. That used to require Document AI, custom code, and a lot of cursing. Now it's one function: AI.PARSE_DOCUMENT. ObjectRef lets you query unstructured data (PDFs, images, audio) the same way you query tables, which most warehouses still treat as a separate problem. AI.CLASSIFY and AI.IF are simpler — they're classification and conditional logic in SQL. Add hybrid search and graph queries and you've got a platform that doesn't force you to pick between structured and unstructured.

The less glamorous but essential aspect is cost. Google introduced an "optimized mode" that trains small, task-specific models on the fly for these functions, claiming about a 230x reduction in tokens compared to calling a generative model row by row. That reduction is the difference between a clever feature you showcase once and one you can run across a billion rows without alarming finance. Native Gemma embeddings can now be created on standard CPUs, and embedding pipelines can automatically keep vector indexes updated as new data comes in. Unstructured data used to be something the warehouse could leave to other systems. Handling it is now expected.
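To put that claim in proportion, here is a back-of-the-envelope sketch in Python. The 230x factor is the figure quoted above; the row count matches the "billion rows" scenario, and the 200 tokens per row is purely an assumption for illustration, not a number from Google.

```python
def estimated_tokens(rows: int, tokens_per_row: int, reduction_factor: float = 1.0) -> float:
    """Total tokens needed to score `rows` rows, divided by a reduction factor."""
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    return rows * tokens_per_row / reduction_factor


ROWS = 1_000_000_000      # the "billion rows" case from the text
TOKENS_PER_ROW = 200      # ASSUMPTION: illustrative prompt + response size per row
CLAIMED_REDUCTION = 230   # the reduction Google claims for optimized mode

# Calling a generative model once per row.
baseline = estimated_tokens(ROWS, TOKENS_PER_ROW)

# The same workload if the claimed reduction holds in full.
optimized = estimated_tokens(ROWS, TOKENS_PER_ROW, CLAIMED_REDUCTION)

print(f"row-by-row: {baseline:,.0f} tokens")
print(f"optimized:  {optimized:,.0f} tokens")
```

Under those assumptions the workload drops from 200 billion tokens to roughly 870 million. Whether a real job sees the full factor depends on the task, so treat this as a way to size the conversation with finance, not as a forecast.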

It's the data, not the model

The most straightforward takeaway from Google Cloud Next this year is a simple acknowledgment. When agents struggle in enterprises, the model is rarely the problem. The data is. It often remains unused, inconsistent, and scattered across silos, leaving the agent confused about its meaning. An agent that mistakenly joins the wrong "revenue" column is worse than having no agent at all.

This is where the Knowledge Catalog comes in. It’s a semantic graph built across your data landscape that uses Gemini to tag and enrich your assets, mapping their relationships. This means that when anyone, whether human or agent, asks about a metric, it resolves to one defined meaning. When combined with BigQuery's measures and the new LookML agent, the strategy becomes clear: agents are only as effective as the semantic layer supporting them. Google believes that data readiness, not just the power of the model, is the real problem to address.
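The idea of "one defined meaning" can be illustrated with a toy metric registry in Python. This is not the Knowledge Catalog API; every name here (Metric, MetricRegistry, the table and the SQL expression) is hypothetical, and the sketch only shows the behaviour a semantic layer is meant to guarantee: a business term resolves to exactly one governed definition, or it fails loudly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str          # canonical metric name
    expression: str    # SQL expression (hypothetical)
    source: str        # table it is computed from (hypothetical)
    description: str


class MetricRegistry:
    """Maps business terms to exactly one governed metric definition."""

    def __init__(self) -> None:
        self._metrics: dict[str, Metric] = {}
        self._aliases: dict[str, str] = {}

    def register(self, metric: Metric, aliases: tuple[str, ...] = ()) -> None:
        keys = [k.strip().lower() for k in (metric.name, *aliases)]
        # Validate every key first so a rejected registration leaves no trace.
        for key in keys:
            if key in self._aliases:
                raise ValueError(
                    f"'{key}' already resolves to '{self._aliases[key]}'"
                )
        for key in keys:
            self._aliases[key] = metric.name
        self._metrics[metric.name] = metric

    def resolve(self, term: str) -> Metric:
        key = term.strip().lower()
        if key not in self._aliases:
            raise KeyError(f"no governed definition for '{term}'")
        return self._metrics[self._aliases[key]]


registry = MetricRegistry()
registry.register(
    Metric(
        name="net_revenue",
        expression="SUM(amount) - SUM(refund_amount)",
        source="finance.orders",
        description="Booked revenue net of refunds",
    ),
    aliases=("revenue", "sales"),
)

# A human and an agent asking about "Revenue" get the same definition.
print(registry.resolve("Revenue").expression)
```

The useful property is the refusal: a second team cannot quietly register its own "revenue", and an unknown term raises instead of letting the caller guess at a column.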

For those of us who have spent years pushing for governance and a proper semantic layer only to meet blank stares, there’s a small sense of validation. The aspect that nobody wanted to fund has become crucial for the things everyone now values.

What it means for the person doing the work

It’s easy to see all of this as the gradual removal of the data engineering role. However, I don’t think that's true, and everyday experience suggests otherwise. The tasks being automated are the tedious parts of the job: standard transformations, the first eighty percent of a pipeline, and the log digging after a 2 AM failure. What stays with you is deciding what the pipeline should do in the first place, what "correct" actually means for your business, where data can and can't go, and whether the agent's answer is right or just confident.

The skill mix shifts. You're no longer mainly writing SQL; you're setting the guardrails that the SQL gets written within. Your leverage comes from writing the instruction file that outlines your standards, designing the semantic layer, and reviewing code generated by the agent. You must be sharp enough to spot subtle cost or correctness mistakes that a quick read might overlook.
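Some of those guardrails can be automated before a human ever reads the generated query. The Python sketch below is a deliberately naive, regex-based check, not a SQL parser and not a feature of any Google tool; the three rules and the default partition column name event_date are assumptions chosen for illustration.

```python
import re


def review_sql(sql: str, partition_column: str = "event_date") -> list[str]:
    """Return human-readable findings for common cost mistakes in a query."""
    findings: list[str] = []
    normalized = re.sub(r"\s+", " ", sql).strip()

    # Rule 1: SELECT * reads every column, which is what columnar storage bills for.
    if re.search(r"\bselect\s+\*", normalized, re.IGNORECASE):
        findings.append("SELECT * scans every column; list the columns you need")

    # Rule 2: a query with no filter on the partition column scans the whole table.
    where = re.search(r"\bwhere\b(.*)", normalized, re.IGNORECASE)
    column = rf"\b{re.escape(partition_column)}\b"
    if not where or not re.search(column, where.group(1), re.IGNORECASE):
        findings.append(f"no filter on partition column '{partition_column}'")

    # Rule 3: CROSS JOIN is occasionally intended, but worth a second look.
    if re.search(r"\bcross\s+join\b", normalized, re.IGNORECASE):
        findings.append("CROSS JOIN found; confirm the row explosion is intended")

    return findings


generated = "SELECT * FROM sales.orders"
for finding in review_sql(generated):
    print("-", finding)
```

A check like this only catches the obvious cases. It cannot tell whether a join uses the right key or whether "correct" matches the business definition, which is exactly the part of the review that stays with you.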

The marketing around all this is loud. "Agentic data cloud" is going to be insufferable by next year, and most of these features are still rough — preview APIs, changing behaviour, missing edge cases. However, despite the branding, the direction is clear and real: data warehouses are becoming something you can interact with and delegate tasks to, not just something you query. The teams that benefit the most won’t be the ones chasing every new feature. They’ll be the ones who quietly established clean data lineage, sensible naming, and a well-governed semantic layer, allowing them to direct an agent at their data and trust the results.
