Waylon Wang

Querying Data on the Web - 1

有 N 人看过

Lecture 1&2

Database Management Systems

DBMS

  • Managing very large volumes of data

  • Linguistic abstractions - 语言抽象

    • A DBMS supports:
      • DDL: definition
      • DML: manipulation
      • DQL: query
    • SQL is a DBMS language

DBMS: Languages

  1. Database languages VS general-purpose programming languages
    1. each sub-language of database languages are designed to cater for distinct needs while general-purpose languages are not.
    2. database languages are designed to operate on collections, without explicit iteration.
  2. Database languages are limited in some way
    1. They are not Turing-complete.

    2. They are only for operating on large collections and for simple calculations.

    3. These make QLs declarative and can be optimized.

DBMS-Centred Applications

Internal Architecture of Classical DBMSs

  1. 5 main service types:
    1. Query processing
      1. Parse the declarative query
      2. translate to algebraic expression
      3. rewrite into a logical query execution plan (QEP)
      4. select the algorithms and access methods to obtain a quasi-optimal and costed concrete QEP (近似最优的、消耗明确的QEP)
      5. execute
    2. Transaction processing
    3. Concurrency control
    4. Recovery
    5. Storage management
  2. 4 main types of data
    1. Metadata: schema-level info and some infrastructure description
    2. Statistics: info about sizes and cardinalities database instances
    3. Indices
    4. Data

DBMS: Architectural Variations

  1. Beyond structured data:
    1. support un- and semi-structured data in document form
    2. stored in content repositories
    3. using keyword-based search and access methods for graph/tree fragments.
  2. Beyond OLTP:
    1. pre-process, aggregate, materialize separately
    2. support OLAP
    3. using multidimensional denormalized(非规范化的) logical schemas
  3. Parallelization并行化
    1. Shared-disk parallelism
      • 在内存和相对较慢的磁盘之间放一个快的interconnect

      • 并行磁盘可以避免辅助内存的抢占

    2. Shared-memory parallelism
      1. 在process和内存之间放一个interconnect

      2. 并行内存可以避免主内存抢占

    3. Shared-nothing parallelism
      1. 在整个processing units之间放一个interconnect

      2. 使用黑盒模式来并行处理使得本地资源充裕

  4. Distribution

Lecture 3 The Relational Case

The Relational Model

  1. relational database is a set of relations.
  2. schema: 关系名和其属性的集合
  3. tuple: relation中除属性名那一行之外的其他行
  4. relations are considered to be a set. (意味着所有的tuple都是唯一的,并且顺序不影响)
  5. Arity:
    1. 元数
    2. 所有属性的数目
  6. Cardinality:
    1. 基数
    2. 所有tuple的数目
  7. Integrity constraint
    1. IC
    2. 完整性约束:某一个属性必须为true
  8. Domain Constraint
    1. 属性值需要为schema-specified domain
  9. Key:
    1. 不会重复
    2. 如果一个关系中有多个key,每个key叫做candidate key(关系键)
    3. primary key从candidate key中选择
    4. foreign key:
      1. 在其他关系中为主键
      2. 可以用来对应其他关系中的tuple
      3. 作为逻辑连接点
      4. 表示两个entities之间的关系

Lecture 4 Relational Query Languages

Query Languages

  1. Relational model defines three expressively-equivalent abstract languages:
  • the domain relational calculus DRC
  • the tuple relational calculus TRC - SQL是TRC的一种具体语言
  • the relational algebra RA
  1. View: 创建一个view,相当于代替一些sql语句

  1. View为何有用
    1. 可以只显示关键信息,隐藏不必要的细节
    2. views可以被具体化 - 指的是我们可以计算view并储存结果
    3. 在分布式环境中很有用

The Tuple Relational Calculus

“\” : 返回在第一个参数里不在第二个参数里的行

SQL

常见操作函数

  1. ⚠️select后面如果使用聚合函数(上面的几个),那必须对select的所有属性都聚合,或者其他属性都用GROUP BY
  2. HAVING语句:
    1. 因为WHERE语句不能和聚合函数一起用

    2. HAVING可以和聚合函数一起用,起到筛选的作用