How Notion Built and Expanded Its Data Lake

Notion is an all-in-one tool for managing notes, schedules, projects, and more, and it has grown rapidly in recent years. How does a company growing this fast manage its data efficiently? This post looks at how Notion built and scaled its data lake, and what that delivered.

Notion’s data has grown tenfold over the past three years, doubling every 6 to 12 months as users and content increased. That growth forced Notion to build and scale its data infrastructure, in particular to meet the data requirements of key products such as the recent Notion AI features and of various analytics use cases.
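As a quick sanity check on those figures, the short sketch below computes the doubling time implied by tenfold growth over three years. It assumes steady exponential growth, which is a simplification; the function name `doubling_time_months` is made up for this example.

```python
import math


def doubling_time_months(growth_factor: float, period_months: float) -> float:
    """Months needed to double, assuming steady exponential growth (a simplification)."""
    doublings = math.log2(growth_factor)  # how many doublings fit in the period
    return period_months / doublings


# Tenfold growth over three years (36 months), as stated in the article.
implied = doubling_time_months(growth_factor=10, period_months=36)
print(f"Implied doubling time: {implied:.1f} months")  # Implied doubling time: 10.8 months
```

An average of roughly 10.8 months sits inside the 6 to 12 month range quoted above, so the two figures are consistent.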

Challenges from Data Growth

At Notion, all data is modeled as “blocks” and stored in a Postgres database. In early 2021 there were over 20 billion blocks; now there are over 200 billion. At this volume, the existing pipeline, which used Fivetran to ingest data into Snowflake, ran into several challenges:

  • Operational overhead: Managing and monitoring 480 Fivetran connectors created significant overhead.
  • Speed and cost: Notion’s update-heavy workload slowed down data ingestion into Snowflake and increased costs.
  • Use case support: Some workloads required complex data transformation logic that went beyond what a standard SQL interface could handle.

Solving Problems Through Data Lake Construction

Notion addressed these challenges by building a data lake. The primary goals of the data lake were:

  • Establish a data repository capable of storing raw and processed data at scale
  • Enable fast, scalable, operable, and cost-effective data ingestion and computation for Notion’s update-heavy block data
  • Support use cases requiring AI, search, and other products needing denormalized data
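To make "denormalized data" concrete, here is a minimal sketch of flattening a tree of blocks so that each output row carries its full ancestor path. The article does not describe Notion's block schema, so the `Block` fields (`id`, `parent_id`, `text`) and both function names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    # Hypothetical schema: the article only says data is modeled as "blocks".
    id: str
    parent_id: Optional[str]
    text: str


def ancestor_path(block_id: str, blocks: dict[str, Block]) -> list[str]:
    """Return block ids from the root down to block_id by following parent pointers."""
    path: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = block_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at block {current}")
        seen.add(current)
        path.append(current)
        current = blocks[current].parent_id
    path.reverse()  # collected leaf-to-root, so flip to root-to-leaf
    return path


def denormalize(blocks: dict[str, Block]) -> list[dict]:
    """Produce one flat row per block with its ancestor path attached."""
    return [
        {"id": b.id, "text": b.text, "path": ancestor_path(b.id, blocks)}
        for b in blocks.values()
    ]


blocks = {
    "page": Block("page", None, "Roadmap"),
    "list": Block("list", "page", "Q3 goals"),
    "item": Block("item", "list", "Ship data lake"),
}
rows = denormalize(blocks)
print(rows[-1]["path"])  # ['page', 'list', 'item']
```

A loop like this is simple in a general-purpose language but awkward in plain SQL, which is one reason a lake with programmable compute suits this kind of workload.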

To achieve this, Notion combined several technologies. Its earlier ELT pipeline used Fivetran to ingest data from the Postgres WAL into Snowflake. For the data lake, Debezium CDC connectors collect incremental updates from Postgres into Kafka, and Apache Hudi writes those updates to S3. S3 serves as the data lake's repository, storing all raw and processed data.
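The core idea behind ingesting an update-heavy workload through change data capture can be sketched in a few lines: rather than reloading whole tables, the pipeline applies a stream of change events to a keyed snapshot and keeps only the latest row per key. The sketch below is plain Python, not the Debezium, Kafka, or Hudi APIs; the event shape (`op`, `id`, `row`) is a simplified assumption loosely modeled on create/update/delete operation codes, and `apply_changes` is a made-up name.

```python
from typing import Any


def apply_changes(
    snapshot: dict[str, dict[str, Any]],
    events: list[dict[str, Any]],
) -> dict[str, dict[str, Any]]:
    """Apply ordered change events to a keyed snapshot and return a new snapshot."""
    result = dict(snapshot)  # copy so the input snapshot is left untouched
    for event in events:
        key = event["id"]
        if event["op"] == "d":
            result.pop(key, None)  # delete: drop the row if present
        else:
            # "c" (create) or "u" (update): upsert, keeping only the latest row per key
            result[key] = event["row"]
    return result


snapshot = {"b1": {"text": "draft"}}
events = [
    {"op": "u", "id": "b1", "row": {"text": "edited"}},
    {"op": "c", "id": "b2", "row": {"text": "new block"}},
    {"op": "u", "id": "b1", "row": {"text": "edited again"}},
    {"op": "d", "id": "b2", "row": None},
]
latest = apply_changes(snapshot, events)
print(latest)  # {'b1': {'text': 'edited again'}}
```

Four events collapse to a single current row here, which shows why an upsert-oriented storage format is a natural fit when most changes are updates to existing blocks rather than new inserts.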

Results of a Successful Data Lake Implementation

Notion began developing its data lake infrastructure in the spring of 2022 and completed it by the fall of the same year. This resulted in net savings of over $1 million in 2022 alone, with even higher savings expected in 2023 and 2024. Additionally, end-to-end ingestion time from Postgres to S3 and Snowflake was reduced from over a day to just minutes for small tables and up to a few hours for large tables.

The data lake also underpinned the successful rollout of Notion AI features in 2023 and 2024.

Conclusion: The Importance of Data Management

Notion’s case shows what sound data management makes possible. Despite rapid growth and a fast-increasing data volume, the company cut costs and improved performance by building an efficient data lake, and other companies facing similar growth can learn from that.

If you want to learn more about data management, Notion’s own account of building its data lake is a good place to start. The problem is complex, but as this case shows, the right architecture keeps it tractable.

References: Notion, “Building and scaling Notion’s data lake”
