Google unifies data lakes and warehouses with BigLake
Unifying cloud-based big data promises lower risk and cost
Data lakes hold raw enterprise data until it’s ready to be analyzed; data warehouses process and transform that data. Together, they form the foundation of business intelligence (BI) systems. Google’s newest product for this space is BigLake. Google said that BigLake can reduce risk and lower the cost of querying big data by helping businesses unify their data warehouses and lakes.
“BigLake unifies data warehouses and data lakes into a consistent format for faster data analytics across multi-cloud storage and open formats,” said Google.
Gerrit Kazmaier, Google VP and GM of Database, Data Analytics and Looker, explained further.
“With BigLake, customers gain fine-grained access controls, with an API interface spanning Google Cloud and open file formats like Parquet, along with open-source processing engines like Apache Spark. These capabilities extend a decade’s worth of innovations with BigQuery to data lakes on Google Cloud Storage to enable a flexible and cost-effective open lake house architecture,” said Kazmaier.
BigQuery is Google’s managed, serverless data warehouse, capable of petabyte-scale analysis. Google provides BigQuery as a Platform as a Service (PaaS) that supports Structured Query Language (SQL) queries.
Features
Features of BigLake include table, row, and column-level security policies on object storage; multi-compute analytics across BigQuery, Vertex AI, Spark, Presto, Trino, and Hive; and multi-cloud governance spanning Amazon S3 and Azure Data Lake Storage Gen2.
Google said BigLake was developed to support open data formats including Parquet, Avro, ORC, CSV, and JSON. The API serves multiple compute engines through Apache Arrow, Google said.
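As a rough illustration of the row-level security mentioned among the features, the sketch below builds a BigQuery `CREATE ROW ACCESS POLICY` statement as a string. It assumes BigLake tables accept the same row access policy DDL as native BigQuery tables; the helper name, table, policy, grantee, and filter are all hypothetical, and nothing is sent to Google Cloud.

```python
def row_access_policy_ddl(policy, table, grantees, filter_expr):
    """Build a CREATE ROW ACCESS POLICY statement (hypothetical helper).

    Assumption: the BigLake table is governed with the same DDL that
    BigQuery uses for native tables. The filter expression is inserted
    as-is, so it must come from a trusted source.
    """
    if not grantees:
        raise ValueError("at least one grantee is required")
    # Grantees are quoted IAM principals, e.g. "user:analyst@example.com".
    grantee_list = ", ".join(f"'{g}'" for g in grantees)
    return (
        f"CREATE ROW ACCESS POLICY {policy}\n"
        f"ON `{table}`\n"
        f"GRANT TO ({grantee_list})\n"
        f"FILTER USING ({filter_expr});"
    )


# Example values are placeholders, not real resources.
policy_sql = row_access_policy_ddl(
    policy="us_only",
    table="my_project.my_dataset.sales",
    grantees=["user:analyst@example.com"],
    filter_expr="region = 'US'",
)
print(policy_sql)
```

With a policy like this in place, the named analyst would see only rows where `region` is `US`, without needing direct access to the underlying files.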
“By creating BigLake tables, BigQuery customers can extend their workloads to data lakes built on Google Cloud Storage (GCS), Amazon S3 and Azure data lake storage Gen 2. BigLake tables are created using a cloud resource connection, which is a service identity wrapper that enables governance capabilities. This allows administrators to manage access control for these tables similar to BigQuery tables, and removes the need to provide object store access to end users,” explained Justin Levandoski, Google Cloud software engineer, and Gaurav Saxena, Google Cloud product manager, who detailed some of BigLake’s features in a blog post.
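To make the idea of creating a table through a cloud resource connection concrete, here is a minimal sketch that assembles the kind of `CREATE EXTERNAL TABLE ... WITH CONNECTION` statement BigQuery uses for connection-backed external tables. The helper name, project, dataset, connection, and bucket are hypothetical, the format list is taken from the formats named in this article rather than from Google's reference, and the code only builds a string.

```python
# Formats named in the article; the exact values BigQuery accepts may differ.
ARTICLE_FORMATS = {"PARQUET", "AVRO", "ORC", "CSV", "JSON"}


def biglake_table_ddl(table, connection, uris, data_format="PARQUET"):
    """Build DDL for an external table that reads through a connection.

    Hypothetical helper: `table` is `project.dataset.table` and
    `connection` is `project.location.connection_id`.
    """
    fmt = data_format.upper()
    if fmt not in ARTICLE_FORMATS:
        raise ValueError(f"unsupported format: {data_format}")
    if not uris:
        raise ValueError("at least one source URI is required")
    uri_list = ", ".join(f"'{u}'" for u in uris)
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        f"OPTIONS (format = '{fmt}', uris = [{uri_list}]);"
    )


# Example values are placeholders, not real resources.
table_sql = biglake_table_ddl(
    table="my_project.my_dataset.sales",
    connection="my_project.us.my_connection",
    uris=["gs://my-bucket/sales/*.parquet"],
)
print(table_sql)
```

Because the connection's service identity reads the files, end users would be granted access to the table rather than to the bucket, which is the governance point Levandoski and Saxena describe.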
The two emphasized BigLake’s integration with Dataplex, Google’s data management service.
“Customers can logically organize data from BigQuery and GCS into lakes and zones that map to their data domains, and can centrally manage policies for governing that data. These policies are then uniformly enforced by Google Cloud and OSS query engines. Dataplex also makes management easier by automatically scanning Google Cloud storage to register BigLake table definitions in BigQuery, and makes them available via Dataproc Metastore. This helps end users discover these BigLake tables for exploration and querying using both OSS applications and BigQuery,” they said.
BigLake is currently available in preview.