跳转至

Database basics

导言

数据库相关的常用名词和相关概念基础,商业化公司了解

基础概念

智能云

AI 加持的 云, 如:百度网盘AI检索

公有云、私有云和混合云

"公有云"、"私有云"和"混合云"是云计算中的不同部署模型,它们具有不同的概念和特点:

  1. 公有云(Public Cloud):
    • 公有云是由云服务提供商(如亚马逊AWS、微软Azure、谷歌云等)建立和管理的云计算基础设施。
    • 这些云服务提供商将计算资源(如虚拟机、存储、数据库等)提供给公众,客户可以按需租用这些资源,通常以按使用量付费。
    • 公有云是多租户模型,多个客户共享相同的基础设施,但彼此之间的资源是隔离的。
    • 客户无需关心硬件和基础设施管理,因为这由云服务提供商处理。
  2. 私有云(Private Cloud):
    • 私有云是建立在组织内部或由第三方托管的云基础设施,它是专为单一组织或实体使用的。
    • 私有云通常用于组织对数据隐私、安全性和合规性有严格要求的情况。
    • 它可以在内部数据中心搭建,也可以托管在专门的私有云提供商设施中。
    • 私有云通常提供更多的控制权和自定义选项,但通常需要更多的资本支出。
  3. 混合云(Hybrid Cloud):
    • 混合云是将公有云和私有云结合使用的模型,允许数据和应用在这两种环境之间进行移动和互操作。
    • 这种模型允许组织根据需求选择最合适的部署场景,以实现灵活性和成本效益。
    • 组织可以将敏感数据或特定工作负载保留在私有云中,同时在公有云中运行弹性工作负载。
    • 混合云需要有效的管理和集成,以确保数据流畅和安全地在不同云环境之间迁移。

总结:公有云、私有云和混合云是不同的云计算部署模型,适用于不同的业务需求。公有云提供商托管云基础设施,私有云专为单一组织提供,而混合云允许组织在两者之间实现灵活性和可操作性。选择哪种部署模型取决于组织的需求,包括安全性、合规性、成本、可扩展性和控制权等因素。

TP,AP,RTL

  1. HTAP (Hybrid Transactional/Analytical Processing): HTAP databases are designed to handle both transactional (OLTP) and analytical (OLAP) workloads within a single database system. This allows real-time data analysis without the need for complex data ETL (Extract, Transform, Load) processes.
  2. TP (Transaction Processing): This refers to the handling of transactional workloads, such as inserting, updating, and deleting records in a database.
  3. AP (Analytical Processing): This refers to the handling of analytical workloads, such as running complex queries and aggregations on large datasets.
  4. ETL (Extract, Transform, Load): ETL is a process used to extract data from source systems, transform it into the desired format, and load it into a target system (e.g., a data warehouse) for analysis.
  5. Multi-Tenant Serverless: This refers to a serverless computing model in which multiple tenants (different users or applications) share the same resources and can automatically scale their resource usage based on demand without having to manage infrastructure.
Serverless

Serverless 应用: Serverless 不意味着没有服务器,而是指开发者无需管理底层服务器和基础设施。在 Serverless 架构中,应用的部署、扩展、和管理都由云服务提供商处理。典型的 Serverless 架构包括函数即服务(Function as a Service,FaaS)和后端即服务(Backend as a Service,BaaS),例如 AWS Lambda、Azure Functions、或 Firebase。

常用大数据数据服务

Apache Hadoop, Apache Hive, and HMS

Apache Hadoop, Apache Hive, and the Hive Metastore (HMS) are components of the Hadoop ecosystem used for distributed data storage, processing, and querying. Here's an overview of each and their relationships:

  1. Apache Hadoop:
    • Apache Hadoop is an open-source framework for distributed storage and processing of large datasets. It provides a scalable, fault-tolerant, and distributed storage system called the Hadoop Distributed File System (HDFS) and a parallel processing framework known as MapReduce.
    • Hadoop is designed for handling Big Data and is widely used for data processing tasks, including data storage, batch processing, and data analysis.
    • While Hadoop originally used MapReduce for data processing, it has since evolved to support other processing models, including Apache Spark and Apache Flink.
  2. Apache Hive:
    • Apache Hive is a data warehousing and SQL-like query language system built on top of Hadoop. It provides a high-level abstraction for processing and querying data stored in Hadoop's HDFS.
    • Hive allows users to write SQL-like queries called HiveQL, which it then translates into MapReduce or other Hadoop processing jobs. This makes it easier for users who are familiar with SQL to work with Hadoop.
    • Hive is particularly useful for ad-hoc querying, data summarization, and analysis of large datasets.
  3. Hive Metastore (HMS):
    • The Hive Metastore (HMS) is a component of Hive that acts as a centralized metadata repository. It stores metadata about Hive tables, partitions, columns, and their associated schemas.
    • The Hive Metastore serves as a catalog for Hive, enabling it to locate and query data stored in HDFS. It provides a mapping between Hive tables and the data files they reference.
    • The metadata stored in the Hive Metastore includes information about the structure of the data, such as data types, column names, and file locations.

Relationships:

  • Apache Hive is a query and data analysis tool that works on top of Hadoop, making it easier to analyze data stored in HDFS using SQL-like queries.
  • The Hive Metastore (HMS) is a critical component of Hive. It stores the metadata required to interpret the data stored in HDFS by Hive. The metadata is essential for Hive to understand the structure of the data, such as table schemas and data file locations.
  • Together, Apache Hadoop, Hive, and the Hive Metastore form a powerful ecosystem for storing, processing, and querying large-scale data. Hive uses Hadoop for distributed data storage, and the Hive Metastore keeps track of metadata to enable efficient querying and analysis using HiveQL.

Hadoop Apache版本与第三方CDH版本的区别

开源,免费的CDH全称:Cloudera's Distribution, including Apache Hadoop;CDH是Hadoop众多分支中的一种,由Cloudera维护,基于稳定版本的Apache Hadoop构建;1

vivo 的 Hive Metastore 扩展基础

大数据元数据服务 Hive Metastore Service(以下简称 HMS),存储着数据仓库中所依赖的所有元数据并提供相应的查询服务,使得计算引擎(Hive、Spark、Presto)能在海量数据中准确访问到需要访问的具体数据,其在离线数仓的稳定构建上扮演着举足轻重的角色。vivo 离线数仓的 Hadoop 集群基于 CDH 5.14.4 版本构建,HMS 的版本选择跟随 CDH 大版本,当前使用版本为 1.1.0-cdh5.14.4。2

常见/流行的数据库

Here are the characteristics of MySQL, PostgreSQL, MongoDB, and ClickHouse, which are different types of database management systems:

  1. MySQL:

    • Relational Database Management System (RDBMS): MySQL is a popular open-source RDBMS known for its reliability, scalability, and ease of use.
    • Structured Data: It stores data in structured tables with predefined schemas, making it suitable for applications with well-defined data models.
    • ACID Compliance: MySQL ensures ACID (Atomicity, Consistency, Isolation, Durability) properties, making it a good choice for applications requiring data integrity.
    • Support for SQL: It uses SQL for querying and managing data, making it compatible with a wide range of SQL-based tools and applications.
  2. PostgreSQL:

    • Advanced Open-Source RDBMS: PostgreSQL is a powerful open-source RDBMS known for its extensibility and support for advanced data types, custom functions, and more.
    • Extensible: It allows you to define custom data types, operators, and functions, making it suitable for complex data requirements.
    • ACID Compliance: Like MySQL, PostgreSQL also ensures ACID compliance, ensuring data integrity.
    • Support for JSON and NoSQL: PostgreSQL offers support for JSON data, making it capable of handling both structured and semi-structured data.
  3. MongoDB:

    • NoSQL Database: MongoDB is a NoSQL database that stores data in a flexible, JSON-like format, making it suitable for unstructured or semi-structured data.
    • Document-Oriented: It uses a document-based data model, where data is stored in collections of documents. This allows for flexibility in data schemas.
    • Scalability: MongoDB is designed for horizontal scaling and is well-suited for applications with high data volumes and varied data structures.
    • Schemaless: It's schemaless, meaning you can add fields to documents on the fly, making it adaptable to changing data requirements.
  4. ClickHouse:

    • Columnar Database: ClickHouse is a columnar database management system designed for analytics and reporting purposes.
    • High Performance: It's optimized for analytical queries and is capable of handling large volumes of data with high query performance.
    • Columnar Storage: Data is stored in columns rather than rows, which is efficient for analytical workloads.
    • Distributed: ClickHouse is built for distributed computing, allowing you to scale it horizontally for handling big data workloads.

The choice of which database system to use depends on the specific needs of your application, including the data structure, query requirements, scalability, and performance considerations. Each of these databases has its own strengths and weaknesses, so it's essential to assess your project's requirements before making a decision.

结构化数据 Structured Data

Structured data is organized and follows a well-defined schema, typically in the form of tables with rows and columns. Each piece of data has a specific type, and it adheres to a fixed structure. Examples of structured data include:

  • Relational database tables with rows and columns (e.g., a table for customer information with columns for name, address, and phone number).
  • Spreadsheets where data is organized into rows and columns.
  • JSON data that adheres to a strict schema, where each key-value pair has a predetermined structure.

Semi-Structured Data

Semi-structured data is more flexible than structured data but has some level of structure. It doesn't conform to rigid schemas but still retains some organization. Examples of semi-structured data include:

  • JSON Data: JSON allows flexibility in data structure. While it can have a defined structure, it's possible to have optional fields or nested structures that vary between records. For example, JSON data representing products might include optional fields like "reviews" or "product images" that aren't present in every record.
  • XML Data: XML is another semi-structured format. It uses tags to organize data hierarchically but allows for optional or varying elements.

NoSQL Data

NoSQL databases, such as MongoDB, Cassandra, and Redis, often deal with semi-structured or unstructured data. They are designed to handle various data types and flexible schemas. Examples of NoSQL data include:

  • Document Stores: In MongoDB, data is stored as JSON-like documents. Each document can have different fields, making it semi-structured.
  • Key-Value Stores: NoSQL key-value stores like Redis store data in an unstructured manner, where values can be strings, numbers, or more complex data structures.
  • Wide-Column Stores: Cassandra is a wide-column store that allows for flexible data models within a column family, which is semi-structured.

differences between these data types

  • Structure: Structured data follows a rigid, predefined schema, while semi-structured data has some structure but allows for flexibility, and NoSQL data can be entirely unstructured or semi-structured.

  • Schema Flexibility: In structured data, any change in the schema often requires significant modifications. In semi-structured and NoSQL data, changes to the data structure can be accommodated more easily.

  • Query Flexibility: Structured data typically relies on SQL for querying. Semi-structured and NoSQL data often use other query languages or methods that adapt to the data structure.

  • Use Cases: Structured data is suitable for applications with fixed, well-defined data structures, like traditional business applications. Semi-structured and NoSQL data are better for applications where data formats may change over time or where flexibility is needed, like content management systems, e-commerce, or big data analytics.

Ultimately, the choice between structured, semi-structured, or NoSQL data depends on your specific application and its requirements.

ClickHouse how to get higher performance

TODO:

高性能数据库

NewSQL 和 MPP (Massively Parallel Processing) 数据库系统都是在处理大规模数据时的高性能数据库解决方案。以下是对 NewSQL 数据库系统(例如 TiDB 和 OceanBase)以及 MPP 数据库系统(例如 GaussDB)的特点和区别:

NewSQL 数据库(如 开源TiDB(1) 和 阿里蚂蚁自研OceanBase):

  1. TiDB is a distributed SQL database built by PingCAP and its open-source community.

  2. 分布式事务处理: NewSQL 数据库旨在提供关系数据库的 ACID 事务支持,使其适合需要数据完整性和事务一致性的应用程序。

  3. 水平扩展性: NewSQL 数据库可以轻松水平扩展,以适应不断增长的数据和用户负载,而不需要大规模数据迁移。
  4. 分布式架构: 它们使用分布式架构,允许数据在多个节点上分布存储和处理,以提高性能和容错性。
  5. 兼容 SQL: NewSQL 数据库支持标准 SQL 查询语言,这使得迁移到这些系统相对容易,因为开发人员可以继续使用他们熟悉的查询语言。
  6. 强调一致性: NewSQL 数据库强调强一致性,这意味着它们会确保所有数据在事务之间保持一致。

MPP 数据库(如 GaussDB 华为自研发的分布式关系型数据库):

  1. 并行处理: MPP 数据库系统专注于大规模数据处理和分析,通过在多个节点上并行执行查询来提高性能。
  2. 列存储: MPP 数据库通常使用列存储,这对于分析查询非常高效,因为它们只需要访问所需的列数据,而不是整个行。
  3. 复杂查询优化: MPP 数据库系统经过优化,以处理复杂的分析查询,如数据汇总、连接和聚合。
  4. 数据仓库: MPP 数据库通常用于数据仓库和分析任务,而不太适用于事务性应用程序。
  5. 高度并行: MPP 数据库系统将工作负载分布到多个节点,以实现高度并行处理,从而提高了性能。

区别:

  1. 应用领域: NewSQL 数据库系统适合于事务性应用程序,强调关系型数据和 ACID 事务。MPP 数据库系统主要用于大规模数据分析和数据仓库任务,重点是高性能查询和分析。
  2. 数据存储: NewSQL 数据库通常使用行存储,而 MPP 数据库使用列存储。列存储对于分析查询非常高效,但在事务处理中的性能可能较低。
  3. 数据模型: NewSQL 数据库遵循传统的关系模型,而 MPP 数据库更灵活,通常能够处理半结构化和非结构化数据。
  4. 查询类型: NewSQL 数据库主要处理事务性查询,而 MPP 数据库专注于复杂的分析查询。

总之,选择 NewSQL 还是 MPP 数据库取决于您的应用需求。如果您需要处理大量事务性数据和关系型数据,NewSQL 数据库可能更适合。如果您需要进行大规模数据分析和查询,那么 MPP 数据库可能更适合。

康俊彬大牛 commentary on TiDB

Tidb一开始炒作NewSQL,结果没有看出来哪里New了,只看出哪里都很慢。 然后炒作HTAP,TP和AP都不如人家专用数据库快,别人ETL就解决了。 现在又炒作多租户Serverless,我基于MySQL来做多租户serverless不比这个强。

企业的选择

NewSQL或MPP像Tidb,OB,高斯都面临一个尴尬,交易型SQL业务根本没有那么高的计算吞吐的需求,但对SQL兼容性以及事务延时有着极强的要求。 大家发现像Oracle一样,存储容量可以扩就完全满足需求了,出现了很多Oracle RAC弱化版的云原生数据库,很受欢迎。

参考文献

https://github.com/cosen1024/Java-Interview/blob/main/MySQL/MySQL.md

https://juejin.cn/post/6844903665367547918

上面回答部分来自ChatGPT-3.5,没有进行正确性的交叉校验。