DatabaseReview

This is a review reference and some notes/my own understanding of the database course.
Reference: CSCI 585, Spring 2020, lectured by Prof. Saty at the University of Southern California.

Introduction#

1. The difference between data and information#

| Data | Information |
| --- | --- |
| Raw facts (raw data that has not yet been processed to reveal its meaning) | Produced by processing data |
| Building blocks of information | Reveals the meaning of data |
| Data management covers the generation, storage, and retrieval of data | Enables knowledge creation; should be accurate, relevant, and timely to enable good decision making |
  • To keep it simple: data is raw fact that objectively exists in the real world, and information is processed data. After gathering and processing, data becomes information. For example, the data of a photo is its pixels, but people see it as a picture.
  • Knowledge is useful information: something you can act on. Knowledge puts the information we have found to use.

2. What a database is, the various types of databases, and why they are valuable assets for decision making.#

  • Database: a shared, integrated computer structure that stores a collection of:
    • End-user data: Raw facts of interest to the end user, such as a student's GPA
    • Metadata: Data about data, through which the end-user data are integrated and managed
      • Describes data characteristics and relationships, such as what the GPA means, and restrictions such as $0 \leq GPA \leq 4.0$ and $student\_age > 0$; also about how we store the information, or how many tables we have.
  • Database management system(DBMS)
    • Collection of programs
    • Manages the database structure
    • Controls access to data stored in the database(who can access database | access restrictions)
  • Role of the DBMS
    • Intermediary between the user and the database
    • Enables data to be shared
    • Presents the end user with an integrated view of the data
    • Receives and translates application requests(what are you searching for) into operations required to fulfill the request
    • Hides the database's internal complexity from the application programs and users (so data can be reorganised in the future without affecting them)
  • Advantages of DBMS
    • Better data integration and less data inconsistency (prevents having two different records of the same thing)
      • Data inconsistency: Different versions of the same data appear in different places
    • Increase end-user productivity
    • Data sharing
    • Data security // who gets to see what
    • Data access
    • Decision making
      • Data quality: Promoting accuracy, validity, and timeliness of data.
  • Types of databases
    • Single-user database: Supports one user at a time (Like a database in your phone/Desktop)
      • Desktop database: Runs on PC
    • Multiuser database: Supports multiple users at the same time (Like BOA or Amazon database)
      • Workgroup databases: Support a small number of users or a specific department
      • Enterprise database: Supports many users across many departments
    • Centralized database: Data is located at a single site
    • Distributed database: Data is distributed across different sites (different parts of one database live at different sites, possibly in different countries). Note: these are not copies, just different parts of one database at different sites. Google uses this.
    • Cloud database: Created and maintained using cloud data services that provide defined performance measures for the database. The companies do not actually own the hardware; Uber, for example, has no physical servers of its own, and its databases live on some cloud (e.g., AWS).
  • Types of databases (Another way to classify database)
    • General-purpose databases: Contain a wide variety of data used in multiple disciplines. For example, an Oracle database can be used wherever you want, and you can put anything you want into it -> general purpose
    • Discipline-specific databases: Contain data focused on specific subject areas. For example, scientists use the Protein Data Bank (PDB), where protein molecule structures are stored. Such databases have a different structure for a specific usage; they have their own building plans, and only their users know what the data represents.
  • Types of databases (another way to classify databases, purely from a business point of view)
    • Operational database: Designed to support a company’s day to day operations.
    • Analytical database: Stores historical data and business metrics used exclusively for tactical or strategic decision making
      • Data warehouse: Stores data in a format optimized for decision support.
    • Example: every purchase is constantly written into the operational database during the day. At night, the data is moved to one central place called the analytical database (used for analysis later); that one is a constantly-read database.
    • Where data warehouse calculations are done:
      • Online analytical processing (OLAP): commands that go to the warehouse and analyze the data
        • Enable retrieving, processing, and modeling data from the data warehouse.
      • Business intelligence: Captures and processes business data to generate information that supports decision making (data moves from the operational database to the analytical database piece by piece) -> data mining
  • Types of databases (Another way to classify database)
    • Unstructured data: Exists in its original state (no specific format, like music/video files)
    • Structured data: Results from formatting (like student information stored in each row of a STUDENT table)
      • Structure is applied based on the type of processing to be performed
    • Semistructured data: Processed to some extent, like a document; generally used to describe something; sits between structured and unstructured
    • Extensible Markup Language (XML); JSON (often considered better than XML) -> both can be used for structured and unstructured data
      • Represents data elements in textual format

3. The importance of database design#

4. How modern databases evolved from file systems#

  • Early Dbs: file systems
    • Manual file systems: Accomplished through a system of file folders and filing cabinets
    • Computerized File Systems
      • Data processing(DP) specialist: Create a computer-based system that would track data and produce required reports
    • File System Redux: Modern End-User Productivity Tools (includes spreadsheet programs such as Microsoft Excel)
  • File systems
    • Data -> Raw facts, such as a telephone number, a birth date, or a customer name. Data have little meaning unless they have been organized in some logical manner. Each cell!
    • Field -> A character or group of characters (alphabetic or numeric) that has a specific meaning. A field is used to define and store data; think of it as the definition of the meaning of the data it holds.
    • Record -> A row (contains a person's name, address, and so on)
    • File -> A collection of records
  • Problems with File System Data Processing
    • Lengthy development times
    • Difficulty of getting quick answers
    • Complex system administration
    • Lack of security and limited data sharing
    • Extensive programming
  • Structural and Data dependence
    • Data dependence
      • Data access changes when data storage characteristics change
    • Data independence
      • Data storage characteristics can be changed without affecting the program's ability to access the data
    • The practical significance of data dependence is the difference between the logical and physical formats
  • Data Redundancy
    • Unnecessarily storing same data at different places
    • Islands of information: Scattered data locations
      • Increases the probability of having different versions of the same data.
  • Data Redundancy Implications (the defects of data redundancy)
    • Poor data security
    • Data inconsistency
    • Increased likelihood of data-entry errors when complex entries are made in different files
    • Data anomaly: Develops when not all of the required changes in the redundant data are made successfully.
      • Update Anomalies (happens when data is already there)
      • Insertion Anomalies (create data for the first time)
      • Deletion Anomalies
  • Database System
    • Logically related data stored in a single logical data repository
      • Physically distributed among multiple storage facilities
      • DBMS eliminates most of file system’s problems
    • Current generation DBMS software
      • Stores data structures, relationships between structures, and access paths
      • Defines, stores, and manages all access paths and components.
    • DBMS Functions
      • Data dictionary management
      • Data storage management
        • Performance tuning: Ensures efficient performance of the database in terms of storage and access speed
      • Data transformation and presentation
        • Transforms entered data to conform to required data structures
      • Security management
        • Enforces user security and data privacy

5. Flaws in file system data management#

6. The main components of the database system#

7. The main functions of a database management system(DBMS)#

  • Data dictionary management == metadata (for example, the definition of one column)
    • Data dictionary: Stores definitions of the data elements and their relationships
  • Data storage management
    • Performance tuning: Ensures efficient performance of the database in terms of storage and access speed
  • Data transformation and presentation
    • Transforms entered data to conform to required data structures.
  • Security management
    • Enforces user security and data privacy.
  • Multiuser access control
    • Sophisticated algorithms ensure that multiple users can access the database concurrently without compromising its integrity.
  • Backup and recovery management
    • Enables recovery of the database after a failure
  • Data integrity management
    • Minimizes redundancy and maximizes consistency
  • Database access languages and application programming interfaces
    • Query language: Lets the user specify what must be done without having to specify how
    • Structured Query Language (SQL): De facto query language and data access standard supported by the majority of DBMS vendors
  • Database communication interfaces
    • Accept end-user requests via multiple, different network environments
  • Disadvantage of Database Systems
    • Increased costs
    • Management complexity
    • Maintaining currency
    • Vendor dependence
    • Frequent upgrade/replacement cycles

Data Modeling#

About data modeling and why data models are important#

Data Modeling and Data Models#

  • Data modeling: Iterative and progressive process of creating a specific data model for a determined problem domain
  • Data models: Simple representations of complex real-world data structures; useful for supporting a specific problem domain
  • Model - Abstraction of a real-world object or event

Importance of Data Models#

  • Are a communication tool
  • Give an overall view of the database
  • Organize data for various users
  • Are an abstraction for the creation of a good database

About the basic data-modeling building blocks#

  • Entity: Unique and distinct object used to collect and store data
    • Attribute: Characteristic of an entity
  • Relationship: Describes an association among entities(one row in an entity can relate to how many rows in other entities)
    • One-to-many (1:M)
    • Many-to-many (M:N or M:M)
    • One-to-one (1:1)
  • Constraint: Set of rules to ensure data integrity

What business rules are and how they influence database design#

  • Brief, precise, and unambiguous description of a policy, procedure, or principle
  • Enable defining the basic building blocks
  • Describe main and distinguishing characteristics of the data
  • Sources of Business Rules (somebody writes the manual)
    • Company managers
    • Policy makers
    • Department managers
    • Written documentation
    • Direct interviews with end users
  • Why we need business rules
    • Help standardize company’s view of data
    • Communications tool between users and designers
    • Allow designer to:
      • Understand the nature, role, scope of data, and business processes
      • Develop appropriate relationship participation rules and constraints
      • Create an accurate data model

Hierarchical modeling#

  • Manage large amounts of data for complex manufacturing projects
  • Represented by an upside-down tree which contains segments
  • Segments: Equivalent of a file system’s record type
  • Depicts a set of one-to-many (1:M) relationships
  • Advantage
    • Promotes data sharing
    • Parent/child relationship promotes conceptual simplicity and data integrity
    • Database security is provided and enforced by DBMS
    • Efficient with 1:M relationships
  • Disadvantages
    • Requires knowledge of physical data storage characteristics
    • Navigational system requires knowledge of hierarchical path
    • Changes in structure require changes in all application programs
    • Implementation limitations
    • No data definition
    • Lack of standards

Network modeling#

  • Represent complex data relationships
  • Improve database performance and impose a database standard
  • Depicts both one-to-many (1:M) and many-to-many (M:N) relationships
  • Advantages
    • Conceptual simplicity
    • Handles more relationship types
    • Data access is flexible
    • Data owner/member relationship promotes data integrity
    • Conformance to standards
    • Includes data definition language (DDL) and data manipulation language (DML)
  • Disadvantages
    • System complexity limits efficiency
    • Navigational system yields complex implementation, application development, and management
    • Structural changes require changes in all application programs

Difference between Schema and Sub Schema#

  • Schema is a Conceptual organization of the entire database as viewed by the database administrator(means the whole thing)
  • SubSchema is a Portion of the database seen by the application programs that produce the desired information from the data within the database.(only a sub part of schema)

Difference between DML and DDL#

  • Data manipulation language(DML): Environment in which data can be managed and is used to work with the data in the database(add / delete/ modify/ query data)
  • Schema data definition language(DDL): Enables the database administrator to define the schema components (create the structure)
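To make the DDL/DML split concrete, here is a minimal SQL sketch (the STUDENT table and its columns are hypothetical; exact types and constraint syntax vary slightly by DBMS):

```sql
-- DDL: define the schema (structure)
CREATE TABLE STUDENT (
    stu_id   INT PRIMARY KEY,
    stu_name VARCHAR(40) NOT NULL,
    stu_gpa  DECIMAL(3,2) CHECK (stu_gpa BETWEEN 0 AND 4.0)
);

-- DML: work with the data stored inside that structure
INSERT INTO STUDENT (stu_id, stu_name, stu_gpa) VALUES (1, 'Ann Lee', 3.70);
UPDATE STUDENT SET stu_gpa = 3.80 WHERE stu_id = 1;
SELECT stu_name, stu_gpa FROM STUDENT WHERE stu_gpa >= 3.5;
DELETE FROM STUDENT WHERE stu_id = 1;
```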

The Relational Model#

  • Produced an "automatic transmission" database that replaced the earlier "standard (manual) transmission" databases
  • Based on a relation
    • Relation or table: Matrix composed of intersecting tuples and attributes. (Note that "relation" is used in two ways here: one is the relationship between different tables; the other is that each cell is an intersection of a row and a column. For example, one score relates to one student and belongs to the GPA column, so it is an intersection.)
      • Tuple: Rows
      • Attribute: Columns
  • Describes a precise set of data manipulation constructs
  • Advantage
    • Structural independence is promoted using independent tables
    • Tabular view improves conceptual simplicity
    • Ad hoc query capability is based on SQL
    • Isolates the end user from physical-level details
    • Improves implementation and management simplicity

Relational DBMS(Database Management System)#

  • Performs basic functions provided by the hierarchical and network DBMS systems
  • Makes the relational data model easier to understand and implement
  • Hides the complexities of the relational model from the user

How the major data models evolved#

The Evolution of Data Model

Data Abstraction Levels#

Data abstraction levels mean that if you change something at one level, the levels above it do not need to change.

Physical independence means the physical model can change without forcing the internal model to change.

Logical independence means the internal model can change without forcing the conceptual model to change. For example, if the internal model is changed from the relational model to NoSQL, nobody at the conceptual level will notice.

The External Model#

  • End users’ view of the data environment
  • ER diagrams are used to represent the external views
  • External schema: Specific representation of an external view

The Conceptual Model#

  • Represents a global view of the entire database by the entire organization
  • Conceptual schema: Basis for the identification and high-level description of the main data objects
  • Has a macro-level view of data environment
  • Is software and hardware independent
  • Logical design: Task of creating a conceptual data model

The Internal Model#

  • Represents the database as seen by the DBMS; maps the conceptual model to the DBMS
  • Internal schema: Specific representation of an internal model
    • Uses the database constructs supported by the chosen database
  • Is software dependent and hardware independent
  • Logical independence: Changing internal model without affecting the conceptual model

The Physical Model#

  • Operates at lowest level of abstraction
  • Describes the way data are saved on storage media such as disks or tapes
  • Requires the definition of physical storage and data access methods
  • Relational model aimed at the logical level; it does not require physical-level details
  • Physical independence: Changes in physical model do not affect internal model

About emerging alternative data models and the need they fulfill#

How data models can be classified by their level of abstraction#

ER#

Entity Relationship Model (ERM)#

  • Basis of an entity relationship diagram (ERD)
  • ERD depicts the:
    • Conceptual database as viewed by end user
    • Database’s main components
      • Entities
      • Attributes
      • Relationships
  • Entity - Refers to the entity set and not to a single entity occurrence

Attributes#

  • Characteristics of entities
  • Required attribute: Must have a value, cannot be left empty -Id/Lastname
  • Optional attribute: Does not require a value, can be left empty -secondary Email
  • Domain - Set of possible values for a given attribute (the range of valid values, e.g., GPA must be between 0 and 4, or a name can be at most 40 characters long)
  • Identifiers: One or more attributes that uniquely identify each entity instance(must exist and cannot be null like primary key(s))
  • Composite attribute: Attribute that can be subdivided to yield additional attributes. For example, Color can contain R, G, B, or Name can contain first_name and last_name
  • Composite identifier: Primary key composed of more than one attribute
    • Don't confuse the two: a composite attribute is one column that can be split into several, while a composite identifier combines several attributes into the primary key
  • Simple attribute: Attribute that cannot be subdivided
  • Single-valued attribute: Attribute that has only a single value
  • Multivalued attributes: Attributes that have many values and require creating:
    • Several new attributes, one for each component of the original multivalued attribute
    • A new entity composed of the original multivalued attribute’s components
  • Derived attribute: Attribute whose value is calculated from other attributes
    • Derived using an algorithm
  • Advantages and disadvantages of storing derived attributes: storing the value wastes space but gives better read performance; computing it on the fly saves space but costs CPU at query time
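A minimal SQL sketch of how these attribute types tend to be implemented (the EMPLOYEE/EMP_PHONE tables and their columns are made up for illustration, and date arithmetic syntax differs between DBMSs):

```sql
-- Composite attribute: store the simple components instead of one combined EMP_NAME column
CREATE TABLE EMPLOYEE (
    emp_id       INT PRIMARY KEY,
    first_name   VARCHAR(20) NOT NULL,     -- required attribute
    last_name    VARCHAR(20) NOT NULL,
    birth_date   DATE,
    second_email VARCHAR(60)               -- optional attribute (may be left empty)
);

-- Multivalued attribute (one employee, several phone numbers):
-- create a new entity composed of the original multivalued attribute's components
CREATE TABLE EMP_PHONE (
    emp_id    INT REFERENCES EMPLOYEE(emp_id),
    phone_num VARCHAR(15),
    PRIMARY KEY (emp_id, phone_num)        -- composite identifier
);

-- Derived attribute: age is calculated from birth_date rather than stored
SELECT emp_id,
       (CURRENT_DATE - birth_date) / 365 AS emp_age   -- rough derivation algorithm
FROM EMPLOYEE;
```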

Relationships#

  • Association between entities that always operate in both directions
  • Participants: Entities that participate in a relationship
  • Connectivity: Describes the relationship classification
  • Cardinality: Expresses the minimum and maximum number of entity occurrences associated with one occurrence of related entity
  • Existence Dependence and Existence Independence
    • Existence dependence: the entity exists in the database only when it is associated with another related entity occurrence
    • Existence independence: the entity exists apart from all of its related entities; it is referred to as a strong entity or regular entity
    For example, consider EMPLOYEE (a strong/regular entity) and their children (note: the children are not weak entities!). When a primary key is shared between two entities, the entity that owns both its own key and the shared key is the weak entity, and the relationship between them is a strong (identifying) relationship.
    However, if one entity merely has a foreign key from another entity, the relationship is a weak one. The referencing entity is not a weak entity, but it can still be existence dependent.
  • Relationship Strength
    • Weak (non-identifying) relationship
      • Primary key of the related entity does not contain a primary key component of the parent entity
    • Strong (identifying) relationships
      • Primary key of the related entity contains a primary key component of the parent entity
        A strong (identifying) relationship makes the related (child) entity a weak entity. The child in a weak (non-identifying) relationship is not a weak entity, although it can still be existence dependent.
    • A weak entity meets two conditions:
      • It is existence-dependent
      • It has a primary key that is partially or totally derived from the parent entity in the relationship
  • Relationship Participation
    • Optional participation
      • One entity occurrence does not require a corresponding entity occurrence in a particular relationship
    • Mandatory participation
      • One entity occurrence requires a corresponding entity occurrence in a particular relationship
  • Relationship Degree
    • Indicates the number of entities or participants associated with a relationship
    • Unary relationship: Association is maintained within a single entity -> Some row in a table has some relationship with some row in the same table
      • Recursive relationship: Relationship exists between occurrences of the same entity set
    • Binary relationship: Two entities are associated
    • Ternary relationship: Three entities are associated
  • Associative Entities
    • Also known as composite or bridge entities
    • Used to represent an M:N relationship between two or more entities (see the SQL sketch after this list)
    • Is in a 1:M relationship with the parent entities
      • Composed of the primary key attributes of each parent entity
    • May also contain additional attributes that play no role in connective process
  • Developing an ER Diagram
    • Create a detailed narrative of the organization’s description of operations
    • Identify business rules based on the descriptions
    • Identify main entities and relationships from the business rules
    • Develop the initial ERD
    • Identify the attributes and primary keys that adequately describe entities
    • Revise and review ERD
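A minimal SQL sketch of an associative (bridge) entity for an M:N relationship, assuming hypothetical STUDENT and CLASS tables:

```sql
CREATE TABLE STUDENT (
    stu_id   INT PRIMARY KEY,
    stu_name VARCHAR(40)
);

CREATE TABLE CLASS (
    class_id   INT PRIMARY KEY,
    class_name VARCHAR(40)
);

-- ENROLL is the associative entity: it turns the M:N between STUDENT and CLASS
-- into two 1:M relationships, one with each parent
CREATE TABLE ENROLL (
    stu_id       INT REFERENCES STUDENT(stu_id),
    class_id     INT REFERENCES CLASS(class_id),
    enroll_grade CHAR(1),                  -- extra attribute that plays no role in the connection
    PRIMARY KEY (stu_id, class_id)         -- composed of the primary keys of both parents
);
```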

Relational Modeling#

A Logical View of Data#

  • Relational database model enables logical representation of the data and its relationships
  • Logical simplicity yields simple and effective database design methodologies
  • Facilitated by the creation of data relationships based on a logical construct called a relation

Keys#

  • Consist of one or more attributes that determine other attributes
  • Used to:
    • Ensure that each row in a table is uniquely identifiable
    • Establish relationships among tables and to ensure the integrity of the data
  • Primary key (PK): Attribute or combination of attributes that uniquely identifies any given row

Determination#

  • State in which knowing the value of one attribute makes it possible to determine the value of another
  • Is the basis for establishing the role of a key
  • Based on the relationships among the attributes

Dependencies#

  • Functional dependence: Value of one or more attributes determines the value of one or more other attributes
    • Determinant: Attribute whose value determines another
    • Dependent: Attribute whose value is determined by the other attribute
  • Full functional dependence: The entire collection of attributes in the determinant is necessary for the relationship. For example, if first_name and last_name together form the determinant, then I must use both of them to determine the dependent attribute

Types of Keys#

  • Composite key: Key that is composed of more than one attribute
  • Key attribute: Attribute that is a part of a key
  • Entity integrity: Condition in which each row in the table has its own unique identity
    • All of the values in the primary key must be unique
    • No key attribute in the primary key can contain a null
  • Null: Absence of any data value that could represent:
    • An unknown attribute value
    • A known, but missing, attribute value
    • An inapplicable condition
  • Referential integrity: Every reference to an entity instance by another entity instance must be valid (foreign key); a foreign key value that does not match an existing primary key in the referenced table violates referential integrity
  • Superkey: An attribute or combination of attributes that uniquely identifies each row in a table(A superkey is a group of single or multiple keys which identifies rows in a table. A Super key may have additional attributes that are not needed for unique identification.)
  • Candidate key: A minimal (irreducible) superkey; a superkey that does not contain a proper subset of attributes that is itself a superkey.
    It is a set of attributes that uniquely identify tuples in a table. Candidate Key is a super key with no repeated attributes. The Primary key should be selected from the candidate keys. Every table must have at least a single candidate key. A table can have multiple candidate keys but only a single primary key.
    • Properties of Candidate key:
      • It must contain unique values
      • Candidate key may have multiple attributes
      • Must not contain null values
      • It should contain minimum fields to ensure uniqueness
      • Uniquely identify each record in a table
  • Secondary key
    • The secondary key is a key that has not been selected to be the primary key. A secondary key is a non-identifying column or set of columns that can be used to find rows in a table. It is used only for data retrieval purposes
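A minimal SQL sketch that puts the key types side by side (DEPARTMENT, EMPLOYEE, and ASSIGNMENT are hypothetical tables):

```sql
CREATE TABLE DEPARTMENT (
    dept_code CHAR(4) PRIMARY KEY
);

CREATE TABLE EMPLOYEE (
    emp_id    INT PRIMARY KEY,             -- primary key: entity integrity (unique, never null)
    emp_ssn   CHAR(9) NOT NULL UNIQUE,     -- candidate key that was not chosen as the PK
    emp_lname VARCHAR(30) NOT NULL,
    dept_code CHAR(4) REFERENCES DEPARTMENT(dept_code)  -- foreign key: referential integrity
);

-- Secondary key: used purely for data retrieval, e.g., looking employees up by last name
CREATE INDEX emp_lname_idx ON EMPLOYEE (emp_lname);

-- Composite key: the combination of attributes uniquely identifies each row
CREATE TABLE ASSIGNMENT (
    emp_id   INT REFERENCES EMPLOYEE(emp_id),
    proj_id  INT,
    hours    DECIMAL(5,1),
    PRIMARY KEY (emp_id, proj_id)
);
```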

Ways to Handle Nulls#

  • Flags: Special codes used to indicate the absence of some value
  • NOT NULL constraint - Placed on a column to ensure that every row in the table has a value for that column
  • UNIQUE constraint - Restriction placed on a column to ensure that no duplicate values exist for that column
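A short SQL sketch of these techniques (the CUSTOMER table is hypothetical, and 'NA' is just one possible flag value):

```sql
CREATE TABLE CUSTOMER (
    cust_id    INT PRIMARY KEY,
    cust_email VARCHAR(60) UNIQUE,            -- UNIQUE: no duplicate values allowed in this column
    cust_state CHAR(2) NOT NULL DEFAULT 'NA'  -- NOT NULL plus a flag value instead of an ambiguous NULL
);

-- NULL is not an ordinary value: search for it with IS NULL, not "= NULL"
SELECT cust_id FROM CUSTOMER WHERE cust_email IS NULL;
```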

Relational Algebra#

  • Theoretical way of manipulating table contents using relational operators
  • Relvar: Variable that holds a relation
  • Heading contains the names of the attributes and the body contains the relation
  • Relational operators have the property of closure
    • Closure: Use of relational algebra operators on existing relations produces new relations

Relational Set Operators#

  • Select (Restrict)
    • Unary operator that yields a horizontal subset of a table
  • Project
    • Unary operator that yields a vertical subset of a table
  • Union
    • Combines all rows from two tables, excluding duplicate rows
    • Union-compatible: Tables share the same number of columns, and their corresponding columns share compatible domains
  • Intersect
    • Yields only the rows that appear in both tables
    • Tables must be union-compatible to yield valid results
  • Difference
    • Yields all rows in one table that are not found in the other table
    • Tables must be union-compatible to yield valid results
  • Product
    • Yields all possible pairs of rows from two tables
  • Join
    • Allows information to be intelligently combined from two or more tables
    • Types of Joins
      • Natural join: Links tables by selecting only the rows with common values in their common attributes
        • Join columns: Common columns -> columns that have exactly the same name; if the names are not the same, we do not call it a natural join
      • Equijoin: Links tables on the basis of an equality condition (=) that compares specified columns of each table
      • Theta join: An equijoin generalized to any comparison operator (=, <>, <, >, <=, >=); denoted by adding a theta subscript after the JOIN symbol
      • Inner join: Only returns matched records from the tables that are being joined
      • Outer join: Matched pairs are retained and unmatched values in the other table are left null
      • Left outer join: Yields all of the rows in the first table, including those that do not have a matching value in the second table
      • Right outer join: Yields all of the rows in the second table, including those that do not have matching values in the first table
  • Divide
    • Uses one 2-column table as the dividend and one single-column table as the divisor
    • Output is a single column that contains all values from the second column of the dividend that are associated with every row in the divisor
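Most of these algebra operators map directly onto SQL. A sketch, assuming hypothetical union-compatible tables LOCAL_CUSTOMER and WEB_CUSTOMER plus CUSTOMER and INVOICE tables sharing a cust_id column (INTERSECT/EXCEPT support varies by vendor; Oracle calls EXCEPT "MINUS"):

```sql
-- Select (restrict): horizontal subset; Project: vertical subset
SELECT * FROM CUSTOMER WHERE cust_state = 'CA';
SELECT cust_id, cust_lname FROM CUSTOMER;

-- Union, Intersect, Difference (tables must be union-compatible)
SELECT cust_id FROM LOCAL_CUSTOMER UNION     SELECT cust_id FROM WEB_CUSTOMER;
SELECT cust_id FROM LOCAL_CUSTOMER INTERSECT SELECT cust_id FROM WEB_CUSTOMER;
SELECT cust_id FROM LOCAL_CUSTOMER EXCEPT    SELECT cust_id FROM WEB_CUSTOMER;

-- Product: every possible pair of rows
SELECT * FROM CUSTOMER CROSS JOIN INVOICE;

-- Natural join: matches rows on the common column name (cust_id)
SELECT * FROM CUSTOMER NATURAL JOIN INVOICE;

-- Inner (equi)join: only matched rows
SELECT c.cust_id, i.inv_number
FROM CUSTOMER c JOIN INVOICE i ON c.cust_id = i.cust_id;

-- Left outer join: every customer, with NULLs where no matching invoice exists
SELECT c.cust_id, i.inv_number
FROM CUSTOMER c LEFT OUTER JOIN INVOICE i ON c.cust_id = i.cust_id;
```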

Data Dictionary and the System Catalog(hold metadata)#

  • Data dictionary: Description of all tables in the database created by the user and designer // Where is the table located and the characteristic of columns
  • System catalog: System data dictionary that describes all objects within the database // the same as data dictionary
  • Homonyms and synonyms must be avoided to lessen confusion
    • Homonym: The same name is used to label different attributes -> cannot use one name to describe different things
    • Synonym: Different names are used to describe the same attribute -> same

Extended ER (“EER”)-> Advanced data modeling#

Extended Entity Relationship Model (EERM)#

  • Result of adding more semantic constructs to the original entity relationship (ER) model
  • EER diagram (EERD): Uses the EER model

Entity Supertypes and Subtypes#

  • Entity supertype: Generic entity type related to one or more entity subtypes
    • Contains common characteristics
  • Entity subtype: Contains unique characteristics of each entity subtype
  • Criteria to determine the usage
    • There must be different, identifiable kinds of the entity in the user’s environment
    • The different kinds of instances should each have one or more attributes that are unique to that kind of instance

Specialization Hierarchy#

  • Depicts arrangement of higher-level entity supertypes and lower-level entity subtypes
  • Relationships are described in terms of “is-a” relationships -> Composition can have “Has-a” relation
  • Subtype exists within the context of a supertype
  • Every subtype has one supertype to which it is directly related # Cannot have more
  • Supertype can have many subtypes
  • Provides the means to:
    • Support attribute inheritance
    • Define a special supertype attribute known as the subtype discriminator
    • Define disjoint/overlapping constraints and complete/partial constraints
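One common way to implement a supertype/subtype pair in SQL, sketched with a hypothetical EMPLOYEE supertype and PILOT subtype (matching the 'p' -> pilot discriminator example that appears a bit later in these notes):

```sql
-- Supertype: holds the attributes shared by all employees plus the subtype discriminator
CREATE TABLE EMPLOYEE (
    emp_id    INT PRIMARY KEY,
    emp_lname VARCHAR(30) NOT NULL,
    emp_type  CHAR(1)                      -- subtype discriminator, e.g., 'P' for pilot
);

-- Subtype: holds only its unique attributes and shares the supertype's PK (1:1 at the implementation level)
CREATE TABLE PILOT (
    emp_id        INT PRIMARY KEY REFERENCES EMPLOYEE(emp_id),  -- inherited primary key
    pilot_license VARCHAR(20),
    pilot_ratings VARCHAR(40)
);
```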

Inheritance#

  • Enables an entity subtype to inherit attributes and relationships of the supertype
  • All entity subtypes inherit their primary key attribute from their supertype
  • At the implementation level, supertype and its subtype(s) maintain a 1:1 relationship
  • Entity subtypes inherit all relationships in which supertype entity participates
  • Lower-level subtypes inherit all attributes and relationships from their upper-level supertypes

Subtype Discriminator#

  • Attribute in the supertype entity that determines to which entity subtype the supertype occurrence is related
  • Default comparison condition is the equality comparison example:’p’ -> pilot

Disjoint and Overlapping Constraints#

  • Disjoint subtypes: Contain a unique subset of the supertype entity set
    • Known as nonoverlapping subtypes
    • Implementation is based on the value of the subtype discriminator attribute in the supertype
  • Overlapping subtypes: Contain nonunique subsets of the supertype entity set
    • Implementation requires the use of one discriminator attribute for each subtype

Completeness Constraint#

  • Specifies whether each supertype occurrence must also be a member of at least one subtype
  • Types
    • Partial completeness: Not every supertype occurrence is a member of a subtype
    • Total completeness: Every supertype occurrence must be a member of at least one subtype

Entity Cluster#

  • Virtual entity type used to represent multiple entities and relationships in ERD
  • Avoid the display of attributes to eliminate complications that result when the inheritance rules change

Normalization#

Normalization (do normalization one relation/entity at a time)#

  • Evaluating and correcting table structures to minimize data redundancies
  • Reduces data anomalies
  • Assigns attributes to tables based on determination
  • Normal forms
    • First normal form (1NF)
    • Second normal form (2NF)
    • Third normal form (3NF)

Need for Normalization#

  • Used while designing a new database structure
    • Analyzes the relationship among the attributes within each entity
    • Determines if the structure can be improved
  • Improves the existing data structure and creates an appropriate database design

Normalization Process#

  • Objective is to ensure that each table conforms to the concept of well-formed relations
    • Each table represents a single subject
    • No data item will be unnecessarily stored in more than one table
    • All nonprime attributes in a table are wholly dependent on the primary key
    • Each table is void of insertion, update, and deletion anomalies
  • Ensures that all tables are in at least 3NF
  • Higher forms are not likely to be encountered in business environment
  • Works one relation at a time
  • Starts by:
    • Identifying the dependencies of a relation (table)
    • Progressively breaking the relation into new set of relations

Normal Forms#

| Normal Form | Characteristic |
| --- | --- |
| First normal form (1NF) | Table format, no repeating groups, and PK identified -> each cell is atomic and cannot be subdivided any further |
| Second normal form (2NF) | 1NF and no partial dependencies |
| Third normal form (3NF) | 2NF and no transitive dependencies |

*Note: Normalization how-to, in one sentence: work on one relation (table) at a time: identify dependencies, then "normalize" by progressively breaking the relation down into a new set of smaller relations (tables), based on the dependencies identified in the original relation

Functional Dependence Concepts#

| Concept | Definition |
| --- | --- |
| Functional dependence | Attribute B is fully functionally dependent on attribute A if each value of A determines one and only one value of B -> A determines B |
| Functional dependence (generalized definition) | Attribute A determines attribute B if all of the rows in the table that agree in value for attribute A also agree in value for attribute B |
| Fully functional dependence (composite key) | If attribute B is functionally dependent on a composite key A but not on any subset of that composite key, then B is fully functionally dependent on A (A is a composite key, and no subset of A alone is enough to determine B) |

Types of Functional Dependencies#

  • Partial dependency: Functional dependence in which the determinant is only part of the primary key (A determines B -> A is determinant)
    • Assumption - One candidate key
    • Straight forward
    • Easy to identify
    • It means that B is fully determined by only part of the primary key -> you don't need the whole composite key
  • Transitive dependency: An attribute functionally depends on another nonkey attribute

Conversion to First Normal Form#

  • Repeating group: Group of multiple entries of same type can exist for any single key attribute occurrence
    • Existence proves the presence of data redundancies
  • Enable reducing data redundancies
  • Steps
    • Eliminate the repeating groups
    • Identify the primary key
    • Identify all dependencies
  • Dependency diagram: Depicts all dependencies found within given table structure
    • Helps to get an overview of all relationships among table’s attributes
    • Makes it less likely that an important dependency will be overlooked
  • (Figure: dependency diagram of the table in 1NF)

Conversion to Second Normal Form#

  • Steps
    • Make new tables to eliminate partial dependencies
    • Reassign corresponding dependent attributes
  • Table is in 2NF when it:
    • Is in 1NF
    • Includes no partial dependencies
  • (Figure: dependency diagram of the table in 2NF)

Conversion to Third Normal Form#

  • Steps
    • Make new tables to eliminate transitive dependencies
    • Determinant: Any attribute whose value determines other values within a row
    • Reassign corresponding dependent attributes
  • Table is in 3NF when it:
    • Is in 2NF
    • Contains no transitive dependencies
  • (Figure: dependency diagrams of the tables in 3NF)
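A sketch of the whole 1NF -> 2NF -> 3NF path in SQL, loosely following the PROJECT/EMPLOYEE/JOB example the attribute names in the next section hint at (all table and column names here are assumptions, not the course's exact schema):

```sql
-- Report data in 1NF: the PK is (proj_num, emp_num), but PROJ_NAME depends only on proj_num
-- and EMP_NAME/JOB_CLASS/JOB_CHG_HOUR depend only on emp_num (partial dependencies).

-- 2NF: move each partial dependency into its own table
CREATE TABLE PROJECT (
    proj_num  INT PRIMARY KEY,
    proj_name VARCHAR(40)
);

CREATE TABLE EMPLOYEE (
    emp_num      INT PRIMARY KEY,
    emp_name     VARCHAR(40),
    job_class    VARCHAR(20),
    job_chg_hour DECIMAL(6,2)   -- still a transitive dependency: it depends on job_class, a nonkey attribute
);

CREATE TABLE ASSIGNMENT (
    proj_num INT REFERENCES PROJECT(proj_num),
    emp_num  INT REFERENCES EMPLOYEE(emp_num),
    hours    DECIMAL(5,1),
    PRIMARY KEY (proj_num, emp_num)
);

-- 3NF: remove the transitive dependency by giving JOB its own table;
-- EMPLOYEE then keeps only job_class as a foreign key and drops job_chg_hour.
CREATE TABLE JOB (
    job_class    VARCHAR(20) PRIMARY KEY,
    job_chg_hour DECIMAL(6,2)
);
```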

Requirements for Good Normalized Set of Tables#

  • Evaluate PK assignments and naming conventions -> create a JOB_CODE
  • Evaluate naming conventions -> JOB_CHG_HOUR
  • Refine attribute atomicity -> EMP_NAME -> first name, last name
    • Atomic attribute: Cannot be further subdivided
    • Atomicity: Characteristic of an atomic attribute
  • Identify new attributes and new relationships -> EMP_HIREDATE (add more columns to make the table more useful), and PROJECT can have EMP_NO as an FK
  • Refine primary keys as required for data granularity -> ASSIGN_NUM
    • Granularity: Level of detail represented by the values stored in a table's row
  • Maintain historical accuracy and evaluate using derived attributes -> store JOB_CHG_HOUR in ASSIGNMENT, and ASSIGN_CHARGE can be calculated from other attributes (storing it wastes space but gives better performance)

Denormalization#

  • Design goals
    • Creation of normalized relations
    • Processing requirements and speed
  • Number of database tables expands when tables are decomposed to conform to normalization requirements
  • Joining a larger number of tables:
    • Takes additional input/output (I/O) operations and processing logic
    • Reduces system speed
  • Defects in unnormalized tables
    • Data updates are less efficient because tables are larger
    • Indexing is more cumbersome
    • No simple strategies for creating virtual tables known as views

Transaction Management#

Transaction#

  • Logical unit of work that must be entirely completed or aborted
  • Consists of:
    • SELECT statement
    • Series of related UPDATE statements
    • Series of INSERT statements
    • Combination of SELECT, UPDATE, and INSERT statements
  • Consistent database state: All data integrity constraints are satisfied // Before the transaction, the database is good and clean, and after the transaction completes, it leaves the database in a consistent state again.
    • Must begin with the database in a known consistent state to ensure consistency
  • Formed by two or more database requests
    • Database requests: Equivalent of a single SQL statement in an application program or transaction
  • Consists of a single SQL statement or a collection of related SQL statements

Transaction Properties#

  • Atomicity: All operations of a transaction must be completed. If not, the transaction is aborted.
  • Consistency: Permanence of database’s consistent state
  • Isolation: Data used during transaction cannot be used by second transaction until the first is completed
  • Durability: Ensures that once transactions are committed, they cannot be undone or lost
  • Serializability: Ensures that the schedule for the concurrent execution of several transactions should yield consistent results

Transaction Management with SQL#

  • SQL statements that provide transaction support
    • COMMIT
    • ROLLBACK -> roll back the current transaction to last commit
  • Transaction sequence must continue until:
    • COMMIT statement is reached
    • ROLLBACK statement is reached
    • End of program is reached
    • Program is abnormally terminated
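A small SQL sketch of a transaction (the PRODUCT/INVOICE tables and values are hypothetical; some DBMSs use START TRANSACTION, or start transactions implicitly):

```sql
BEGIN TRANSACTION;

-- Both changes belong to one logical unit of work: record a sale and reduce stock
UPDATE PRODUCT
   SET prod_qty_on_hand = prod_qty_on_hand - 2
 WHERE prod_id = 345;

INSERT INTO INVOICE (inv_number, cust_id, inv_total)
VALUES (1009, 21, 199.98);

-- If both statements succeeded, make the changes permanent:
COMMIT;
-- If anything went wrong (e.g., not enough stock), roll back to the last consistent state instead:
-- ROLLBACK;
```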

Transaction log#

  • Keeps track of all transactions that update the database
  • DBMS uses the information stored in a log for:
    • Recovery requirement triggered by a ROLLBACK statement
    • A program’s abnormal termination
    • A system failure

Concurrency Control#

  • Coordination of the simultaneous transactions execution in a multiuser database system
  • Objective - Ensures serializability of transactions in a multiuser database environment
  • Dirty read: a transaction can read data modified by another transaction that has not yet committed.
  • Nonrepeatable read: within the same transaction, the same query reads a row at time T1, and when it re-reads that row at time T2 the row has already been modified; it may have been updated or deleted.
  • Phantom read: within the same transaction, the same query run multiple times returns different result sets, because other transactions have committed inserts in the meantime.

The Scheduler#

  • Establishes the order in which the operations are executed within concurrent transactions
    • Interleaves the execution of database operations to ensure serializability and isolation of transactions
  • Based on concurrent control algorithms to determine the appropriate order
  • Creates serialization schedule
    • Serializable schedule: Interleaved execution of transactions yields the same results as the serial execution of the transactions

Concurrency Control with Locking Methods#

  • Locking methods - Facilitate isolation of data items used in concurrently executing transactions
  • Lock: Guarantees exclusive use of a data item to a current transaction
  • Pessimistic locking: Use of locks based on the assumption that conflict between transactions is likely
  • Lock manager: Responsible for assigning and policing the locks used by the transactions

Lock Granularity#

  • Indicates the level of lock use
  • Levels of locking
    • Database-level lock
    • Table-level lock
    • Page-level lock
      • Page or diskpage: Directly addressable section of a disk
    • Row-level lock
    • Field-level lock

Lock Types#

  • Binary lock
    • Has two states, locked (1) and unlocked (0)
    • If an object is locked by a transaction, no other transaction can use that object
    • If an object is unlocked, any transaction can lock the object for its use
  • Exclusive lock -> write
    • Exists when access is reserved for the transaction that locked the object
  • Shared lock -> read
    • Exists when concurrent transactions are granted read access on the basis of a common lock

Two-Phase Locking (2PL) -> guarantees serializability#

  • Defines how transactions acquire and relinquish locks
  • Guarantees serializability but does not prevent deadlocks
  • Phases
    • Growing phase - Transaction acquires all required locks without unlocking any data
    • Running operations(middle)
    • Shrinking phase - Transaction releases all locks and cannot obtain any new lock
  • Governing rules
    • Two transactions cannot have conflicting locks
    • No unlock operation can precede a lock operation in the same transaction
    • No data are affected until all locks are obtained
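A hedged SQL sketch of the two phases, using SELECT ... FOR UPDATE to acquire row locks explicitly (supported by PostgreSQL, MySQL/InnoDB, and Oracle, among others; the ACCOUNT table is hypothetical):

```sql
BEGIN TRANSACTION;

-- Growing phase: acquire every lock that will be needed before modifying any data
SELECT acct_balance FROM ACCOUNT WHERE acct_id = 'A' FOR UPDATE;  -- exclusive (write) lock on row A
SELECT acct_balance FROM ACCOUNT WHERE acct_id = 'B' FOR UPDATE;  -- exclusive (write) lock on row B

-- Operations run while all locks are held
UPDATE ACCOUNT SET acct_balance = acct_balance - 100 WHERE acct_id = 'A';
UPDATE ACCOUNT SET acct_balance = acct_balance + 100 WHERE acct_id = 'B';

-- Shrinking phase: COMMIT releases all locks at once; no new lock may be acquired after this point
COMMIT;
```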

Deadlocks#

  • Occurs when two transactions wait indefinitely for each other to unlock data
    • Known as deadly embrace
  • Control techniques
    • Deadlock prevention: a transaction requesting a lock that could create a deadlock is aborted and rescheduled (detect the potential conflict and kill one of the transactions) -> timestamp based
    • Deadlock detection
    • Deadlock avoidance -> a transaction must obtain all the locks it needs before it executes, so no deadlock can occur during the transaction
  • Choice of deadlock control method depends on database environment

Time Stamping -> prevent deadlock#

  • Assigns global, unique time stamp to each transaction
    • Produces explicit order in which transactions are submitted to DBMS
  • Properties
    • Uniqueness: Ensures no equal time stamp values exist
    • Monotonicity: Ensures time stamp values always increase
  • Disadvantages
    • Each value stored in the database requires two additional stamp fields
    • Increases memory needs
    • Increases the database’s processing overhead
    • Demands a lot of system resources

Wait/Die and Wound/Wait Concurrency Control Schema#

| Transaction Requesting Lock | Transaction Owning Lock | Wait/Die Scheme (the transaction currently owning the resource has higher priority) | Wound/Wait Scheme (the older transaction always has higher priority) |
| --- | --- | --- | --- |
| T1 (11548789) | T2 (19562545) | T1 waits until T2 is completed and releases its locks | T1 preempts (rolls back) T2, and T2 is rescheduled |
| T2 (19562545) | T1 (11548789) | T2 dies (rolls back) and is rescheduled with the same time stamp | T2 waits until T1 is completed and releases its locks |

Phases of Optimistic Approach#

  • Read
    • Transaction:
      • Reads the database
      • Executes the needed computations
      • Makes the updates to a private copy of the database values
  • Validation
    • Transaction is validated to ensure that the changes made will not affect the integrity and consistency of the database
  • Write
    • Changes are permanently applied to the database

Distributed DBs#

Vertical scaling and horizontal scaling#

  • Vertical scaling: scales poorly; you take one server and keep adding more resources (RAM, CPU) to it, but you are restricted to a single machine
  • Horizontal scaling: add more and more servers

Distributed Processing and Distributed Databases#

  • Distributed processing: Database’s logical processing is shared among two or more physically independent sites via network
  • Distributed database: Stores logically related database over two or more physically independent sites via computer network
    • Database fragments: the parts a distributed database is composed of. Example: take a table and split it into many pieces

Functions of Distributed DBMS#

  • Receives the request of an application
  • Validates, analyzes, and decomposes the request
  • Maps the request
  • Decomposes request into several I/O operations
  • Searches and validates data
  • Ensures consistency, security, and integrity
  • Validates data for specific conditions
  • Presents data in required format

DDBMS Components#

  • Computer workstations or remote devices
  • Network hardware and software components
  • Communications media
  • Transaction processor (TP): Software component of a system that requests data -> accesses the database and gets the query fulfilled (makes the request)
    • Also known as transaction manager (TM) or application processor (AP)
  • Data processor (DP) or data manager (DM): Software component on a system that stores and retrieves data from its location -> only serves data and does not care where the request came from

Single-Site Processing, Single-Site Data (SPSD) -> pretty simple#

Multiple-Site Processing, Single-Site Data (MPSD)#

  • Multiple processes run on different computers sharing a single data repository
  • Require network file server running conventional applications
    • Accessed through LAN
  • Client/server architecture
    • Reduces network traffic
    • Processing is distributed
    • Supports data at multiple sites

Multiple-Site Processing, Multiple-Site Data (MPMD)#

  • Fully distributed database management system
    • Support multiple data processors and transaction processors at multiple sites
  • Classification of DDBMS depending on the level of support for various types of databases
    • Homogeneous: Integrate multiple instances of same DBMS over a network -> all databases are in the same model(all oracle/mysql)
    • Heterogeneous: Integrate different types of DBMSs -> can have different vendors(oracle, mysql,sqlite) but must follow the same relational model
    • Fully heterogeneous: Support different DBMSs, each supporting different data model (some maybe noSQL data/graph/kv pairs)

Restrictions of DDBMS#

  • Remote access is provided on a read-only basis
  • Restrictions on the number of remote tables that may be accessed in a single transaction
  • Restrictions on the number of distinct databases that may be accessed
  • Restrictions on the database model that may be accessed

Distributed Requests and Distributed Transactions#

  • Remote request
    • Single SQL statement accesses data processed by a single remote database processor
  • Remote transaction
    • Accesses data at single remote site composed of several requests
  • Distributed transaction
    • Requests data from several different remote sites on network
  • Distributed request
    • Single SQL statement references data at several DP sites

Distributed Concurrency Control (2PC, two-phase commit)#

  • Concurrency control is important in distributed databases environment
    • Due to multi-site multiple-process operations that create inconsistencies and deadlocked transactions

Two-Phase Commit Protocol(2PC)#

  • Guarantees if a portion of a transaction operation cannot be committed, all changes made at the other sites will be undone

    • To maintain a consistent database state
  • Requires that each DP’s transaction log entry be written before database fragment is updated

  • DO-UNDO-REDO protocol: Roll transactions back and forward with the help of the system’s transaction log entries

  • Write-ahead protocol: Forces the log entry to be written to permanent storage before actual operation takes place

  • Defines operations between coordinator and subordinates

  • Phases of implementation

      1. Preparation -> ask whether every site can commit or not (this is where failures show up)
      2. The final COMMIT -> tell all the others they can commit
  • Details:

  • Phase 1: Preparation
    The coordinator sends a PREPARE TO COMMIT message to all subordinates.

    Each subordinate receives the message, uses the write-ahead protocol to write its transaction log entry, and sends an acknowledgment (YES/PREPARED TO COMMIT or NO/NOT PREPARED) back to the coordinator.
    The coordinator makes sure that all nodes either commit or abort.
    If all nodes reply PREPARED TO COMMIT, the transaction moves to phase 2. If one or more nodes reply NO or NOT PREPARED, the coordinator sends an ABORT to all subordinates.

    Phase 2: The final COMMIT

    The coordinator broadcasts a COMMIT message to all subordinates and waits for the replies.
    Each subordinate receives the COMMIT message and updates the database using the DO protocol.
    Each subordinate replies COMMITTED or NOT COMMITTED to the coordinator.
    If one or more subordinates did not commit, the coordinator sends an ABORT message, forcing all of them to undo all of their operations.

    The objective of two-phase commit is to ensure that every node commits its part of the transaction; otherwise, the transaction is aborted. If one of the nodes fails to commit, the information in the transaction log is used to recover the database, using the DO-UNDO-REDO protocol (remember that the write-ahead protocol is used to update the log first).

  • CUBRID -> a good example of how a distributed transaction manager works
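As one concrete (and vendor-specific) illustration, PostgreSQL exposes the participant side of two-phase commit through PREPARE TRANSACTION; the coordinator logic itself would live in middleware or application code, and the transaction name 'xfer-42' and the ACCOUNT table are made up:

```sql
-- Phase 1, on each subordinate (participant) site: do the local work, then prepare
BEGIN;
UPDATE ACCOUNT SET acct_balance = acct_balance - 100 WHERE acct_id = 'A';
PREPARE TRANSACTION 'xfer-42';      -- the "PREPARED TO COMMIT" vote is now written to durable storage

-- Phase 2, issued by the coordinator once every site has voted yes:
COMMIT PREPARED 'xfer-42';
-- If any site voted no (or crashed), the coordinator broadcasts an abort instead:
-- ROLLBACK PREPARED 'xfer-42';
```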

Transparency (Hide sth to users)#

  • Distribution transparency: the users don't know that the databases are distributed
    • Allows management of a physically dispersed database as if it were centralized
    • Levels
      • Fragmentation transparency
      • Location transparency
      • Local mapping transparency
  • Transaction transparency: the users don't know which parts of the transaction happen where
    • Ensures database transactions will maintain distributed database’s integrity and consistency
    • Ensures transaction completed only when all database sites involved complete their part
    • Distributed database systems require complex mechanisms to manage transactions
  • Failure transparency: if a query goes to a site that is down, the user will not notice it; the request is rerouted to other copies
    • Ensures the system will operate in case of network failure
    • Considerations for resolving requests in a distributed data environment
      • Data distribution
      • Data replication
        • Replica transparency: DDBMS’s ability to hide multiple copies of data from the user
  • Performance transparency: Hide latency.
    • Allows a DDBMS to perform as if it were a centralized database
  • Network and node availability
    • Network latency: delay imposed by the amount of time required for a data packet to make a round trip
    • Network partitioning: delay imposed when nodes become suddenly unavailable due to a network failure
  • Heterogeneity transparency: users don't know which operating system / database system is being used underneath

Distributed Database Design#

  • Data fragmentation: How to partition database into fragments
  • Data replication: Which fragments to replicate
  • Data allocation: Where to locate those fragments and replicas

Data Fragmentation#

  • Breaks a single object into many segments
    • Information is stored in the distributed data catalog (DDC)
  • Strategies
    • Horizontal fragmentation: Division of a relation into subsets (fragments) of tuples (rows)
    • Vertical fragmentation: Division of a relation into attribute (column) subsets
    • Mixed fragmentation: Combination of horizontal and vertical strategies
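A sketch of the two basic fragmentation strategies using plain CREATE TABLE ... AS SELECT (the CUSTOMER table and its columns are hypothetical; a real DDBMS would also record the fragments in the distributed data catalog and place them at different sites):

```sql
-- Horizontal fragmentation: split the rows, e.g., by region
CREATE TABLE CUSTOMER_WEST AS
    SELECT * FROM CUSTOMER WHERE cust_state IN ('CA', 'WA', 'OR');
CREATE TABLE CUSTOMER_EAST AS
    SELECT * FROM CUSTOMER WHERE cust_state NOT IN ('CA', 'WA', 'OR');

-- Vertical fragmentation: split the columns; every fragment keeps the key so rows can be rejoined
CREATE TABLE CUSTOMER_CONTACT AS
    SELECT cust_id, cust_name, cust_phone FROM CUSTOMER;
CREATE TABLE CUSTOMER_CREDIT AS
    SELECT cust_id, cust_limit, cust_balance FROM CUSTOMER;
```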

Data Replication#

  • Data copies stored at multiple sites served by a computer network
  • Mutual consistency rule: Replicated data fragments should be identical
  • Styles of replication -> how changes propagate
    • Push replication -> send changes to the other copies
    • Pull replication -> pull changes from the other copies
  • Helps restore lost data

Data Replication Scenarios#

  • Fully replicated database
    • Stores multiple copies of each database fragment at multiple sites
  • Partially replicated database
    • Stores multiple copies of some database fragments at multiple sites
  • Unreplicated database
    • Stores each database fragment at a single site

Data Allocation Strategies#

  • Centralized data allocation
    • Entire database stored at one site
  • Partitioned data allocation
    • Database is divided into two or more disjoined fragments and stored at two or more sites
  • Replicated data allocation
    • Copies of one or more database fragments are stored at several sites

The CAP Theorem (what you need to consider when you design a distributed database)#

  • CAP stands for:
    • Consistency: do all partitions/copies of the data stay the same
    • Availability: if some nodes are down, the system can still work
    • Partition tolerance: the cluster continues to function even if there is a “partition” (communication break) between two nodes (both nodes are up, but can’t communicate). -> Continue to work even some nodes fail
  • Basically available, soft state, eventually consistent (BASE)
    • Data changes are not immediate but propagate slowly through the system until all replicas are consistent

Database Connectivity#

Database middleware: Provides an interface between the application program and the database
Data repository/source - Data management application used to store data generated by an application program
Universal Data Access (UDA): Collection of technologies used to access any type of data source and manage the data through a common interface
ODBC, OLE-DB, ADO.NET form the backbone of MS UDA architecture

Various connectivity options#

  • Native SQL(provided by vendors)

    • Connection interface provided by database vendors, which is unique to each vendor
    • Interfaces are optimized for particular vendor’s DBMS
      • Maintenance is a burden for the programmer
  • M’soft: 1. ODBC (Open Database Connectivity, Microsoft's implementation of a superset of the SQL Access Group standard), DAO (Data Access Objects) + Jet, RDO (Remote Data Objects)

    • ODBC
      • Open Database Connectivity (ODBC):Microsoft’s implementation of a superset of SQL Access Group Call Level Interface (CLI) standard for database access
      • Widely supported database connectivity interface
      • Allows Windows application to access relational data sources by using SQL via standard application programming interface (API)
    • Data Access Objects(DAO)
      • Object-oriented API used to access MS Access, MS FoxPro, and dBase databases from Visual Basic programs
      • Provides an optimized interface that exposes the functionality of the Jet data engine to programmers
      • The DAO interface can also be used to access other relational-style data sources
      • DAO is effectively a wrapper over the Jet engine
    • Remote Data Objects (RDO)
      • Higher-level object-oriented application interface used to access remote database servers
    • Dynamic-link libraries (DLLs)
      • Implements ODBC, DAO, and RDO as shared code that is dynamically linked to the Windows operating environment
  • M’soft: OLE-DB (Object Linking and Embedding for Database)

  • M’soft: ADO.NET (ActiveX Data Objects for .NET)

  • JDBC (from Sun)

Components of ODBC Architecture#

  • High-level ODBC API through which application programs access ODBC functionality -> faces the application program
  • Driver manager that is in charge of managing all database connections -> sits between the API and the database drivers
  • ODBC driver that communicates directly with the DBMS

Object Linking and Embedding for Database (OLE-DB) -> the database becomes an object#

Before: OMG (Object Management Group)
CORBA: Common Object Request Broker Architecture -> objects can be exchanged between different operating systems regardless of memory layout

  • Database middleware that adds object-oriented functionality for access to data
  • Series of COM(component object model) objects provides low-level database connectivity for applications
  • Types of objects based on functionality
    • Consumers(request data)
    • Providers(produce data)
  • Does not provide support for scripting languages
  • ActiveX Data Objects (ADO) provides:
    • A high-level application-oriented interface to interact with OLE-DB, DAO, and RDO
    • A unified interface to access data from any programming language that uses the underlying OLE-DB objects
  • ADO.NET
    • Data access component of Microsoft’s .NET application development framework
  • Microsoft’s .NET framework
    • Component-based platform for developing distributed, heterogeneous, interoperable applications
    • Manipulates any type of data using any combination of network, operating system, and programming language
  • Features critical for the development of distributed applications
    • DataSet: Disconnected memory-resident representation of database
    • XML support
      • DataSet is internally stored in XML format
      • Data in DataSet is made persistent as XML documents

Database Internet Connectivity#

  • Allows new innovative services that:
    • Permit rapid response by bringing new services and products to market quickly
    • Increase customer satisfaction through creation of web-based support services
    • Allow anywhere, anytime data access using mobile smart devices via the Internet
    • Yield fast and effective information dissemination through universal access

Web-to-Database Middleware#

  • Web server is the main hub through which Internet services are accessed
  • Server-side extension: Program that interacts directly with the web server
    • Known as web-to-database middleware
    • Provides its services to the web server in a way that is totally transparent to the client browser
  • Middleware must be well integrated

Client-Side Extensions#

  • Add functionality to Web browser
  • Types
    • Plug-in: External application automatically invoked by the browser when needed
    • Java and JavaScript: Embedded in web page
      • Downloaded with the Web page and activated by an event
  • ActiveX and VBScript: Embedded in web page
    • Downloaded with page and activated by event
    • Oriented to Windows applications

Performance Tuning and Query Optimization#

Tuning(Database Performance-Tuning Concepts)#

  • Goal of database performance is to execute queries as fast as possible
  • Database performance tuning: Set of activities and procedures that reduce response time of database system
  • Fine-tuning the performance of a system requires that all factors must operate at optimum level with minimal bottlenecks

Performance Tuning: Client and Server#

Client side#

  • SQL performance tuning: Writing the SQL query so that it returns the correct answer in the least amount of time -> e.g., for a condition A AND B AND C, put the condition most likely to fail first; often this is just re-arranging the SQL query
    • Using minimum amount of resources at server

Server side#

  • DBMS performance tuning: DBMS environment configured to respond to clients’ requests as fast as possible
    • Optimum use of existing resources

DBMS Architecture#

  • All data in a database are stored in data files
    • Data files automatically expand in predefined increments known as extents
  • Data files are grouped in file groups or table spaces
    • Table space or file group: Logical grouping of several data files that store data with similar characteristics
  • Data cache or buffer cache: Shared, reserved memory area
    • Stores most recently accessed data blocks in RAM
  • SQL cache or procedure cache: Stores most recently executed SQL statements or PL/SQL procedures
  • DBMS retrieves data from permanent storage and places them in RAM
  • Input/output (I/O) request: Low-level data access operation that reads or writes data to and from computer devices
  • Working with the data cache is faster than working with data files
  • Majority of performance-tuning activities focus on minimizing I/O operations
  • Listener: The listener process listens for clients' requests and hands the SQL requests off to other DBMS processes. Once a request is received, the listener passes it to the appropriate user process
  • User: The DBMS creates a user process to manage each client session. Therefore, when you log on to the DBMS, you are assigned a user process that handles all the requests you submit to the server. There are many user processes - at least one per logged-in client
  • Scheduler: The scheduler process organizes the concurrent execution of SQL requests
  • Lock Manager: This process manages all locks placed on database objects, including disk pages
  • Optimizer: The optimizer analyzes SQL queries and finds the most efficient way to access the data // How can we make it better?

Database Query Optimization Modes#

  • Algorithms proposed for query optimization are based on:
    • Selection of the optimum order to achieve the fastest execution time // The order to visit each table
    • Selection of sites to be accessed to minimize communication costs
  • Evaluated on the basis of:
    • Operation mode
    • Timing of its optimization
    • Type of information(used for optimization)

Classification of Operation Modes#

  • Automatic query optimization: DBMS finds the most cost-effective access path without user intervention
  • Manual query optimization: Requires that the optimization be selected and scheduled by the end user or programmer

Based on Timing of Optimization#

  • Static query optimization: best optimization strategy is selected when the query is compiled by the DBMS
    • Takes place at compilation time
  • Dynamic query optimization: Access strategy is dynamically determined by the DBMS at run time, using the most up-to-date information about the database
    • Takes place at execution time: adds more processing overhead

Type of Information Used to Optimize#

  • Statistically based query optimization algorithm: Statistics are used by the DBMS to determine the best access strategy
  • Statistical information is generated by DBMS through:
    • Dynamic statistical generation mode - auto eval
    • Manual statistical generation mode - via user
  • Rule-based query optimization algorithm: based on a set of user-defined rules to determine the best query access strategy

Sample Database Statistics Measurements#

Advantages and disadvantages of storing derived attributes (table from the slides, not reproduced here)

Query Processing#

  • Parsing
    • DBMS parses the SQL query and chooses the most efficient access/execution plan
  • Execution
    • DBMS executes the SQL query using the chosen execution plan
  • Fetching
    • DBMS fetches the data and sends the result set back to the client

SQL Parsing Phase#

  • Query is broken down into smaller units
  • The original SQL query is transformed into a slightly different but fully equivalent and more efficient version of the SQL code
  • Query optimizer: Analyzes SQL query and finds most efficient way to access data
  • Access plans: DBMS-specific and translate client’s SQL query into a series of complex I/O operations
  • If access plan already exists for query in SQL cache, DBMS reuses it
  • If not, optimizer evaluates various plans and chooses one to be placed in SQL cache for use
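
A quick, hedged illustration of inspecting the access plan the optimizer chose; the PRODUCT table and its columns are hypothetical, and the output format differs per DBMS:

```sql
-- Oracle: compute and then display the access plan for a query
EXPLAIN PLAN FOR
SELECT P_CODE, P_PRICE FROM PRODUCT WHERE P_PRICE > 100;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- MySQL / PostgreSQL: show the plan directly
EXPLAIN
SELECT P_CODE, P_PRICE FROM PRODUCT WHERE P_PRICE > 100;
```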

SQL Execution Phase#

  • All I/O operations indicated in the access plan are executed
    • Locks are acquired
    • Data are retrieved and placed in data cache
    • Transaction management commands are processed

SQL Fetching Phase#

  • Rows of resulting query result set are returned to client
    • DBMS may use temporary table space to store temporary data
    • Database server coordinates the movement of the result set rows from the server cache to the client cache

Query Processing Bottlenecks#

  • Delay introduced in the processing of an I/O operation that slows the system
  • Caused by the:
    • CPU
    • RAM
    • Hard disk
    • Network
    • Application code

Indexes and Query Optimization#

  • Indexes
    • Help speed up data access
    • Facilitate searching, sorting, using aggregate functions, and join operations
    • Ordered set of values that contain the index key and pointers
    • More efficient than a full table scan

  • Data sparsity: Number of different values a column could have; low sparsity (few distinct values) means an index on that column may be useless
  • Data structures used to implement indexes:
    • Hash indexes -> like Bitly generating a short link via a hash function
    • B-tree indexes
    • Bitmap indexes
  • DBMSs determine best type of index to use
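
A minimal sketch of creating indexes, assuming a hypothetical PRODUCT table; CREATE BITMAP INDEX is Oracle-specific syntax, and the DBMS ultimately decides whether a given index is actually used:

```sql
-- B-tree-style index on a high-sparsity column used in searches and sorts
CREATE INDEX P_PRICE_NDX ON PRODUCT (P_PRICE);

-- Bitmap index on a low-sparsity column (few distinct values) - Oracle syntax
CREATE BITMAP INDEX P_INSTOCK_NDX ON PRODUCT (P_INSTOCK);

-- A query that can now avoid a full table scan
SELECT P_CODE, P_PRICE
FROM   PRODUCT
WHERE  P_PRICE > 100
ORDER  BY P_PRICE;
```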

Index Selectivity#

  • Measure of the likelihood that an index will be used in query processing. So we strive to create indexes that have high selectivity.
  • Indexes are used when a subset of rows from a large table is to be selected based on a given condition
  • Index cannot always be used to improve performance
  • Function-based index: Based on a specific SQL function or expression (e.g., EMP_SALARY + EMP_COMMISSION; see the sketch after this list)
  • Indexes are useful when
    • an indexable column occurs in a WHERE or HAVING search expression
    • an indexable column appears in a GROUP BY or ORDER BY clause
    • MAX or MIN is applied to an indexable column
    • there is high data sparsity on an indexable column
  • Worth creating indexes on single columns that appear in WHERE, HAVING, ORDER BY, GROUP BY and join conditions.
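
As referenced above, a hedged sketch of a function-based index (Oracle-style feature; the EMPLOYEE table and column names are hypothetical):

```sql
-- Index the expression itself so predicates on EMP_SALARY + EMP_COMMISSION
-- can use an index instead of scanning the whole table
CREATE INDEX EMP_TOTAL_PAY_NDX
    ON EMPLOYEE (EMP_SALARY + EMP_COMMISSION);

-- This WHERE clause can now be resolved via the function-based index
SELECT EMP_NUM, EMP_LNAME
FROM   EMPLOYEE
WHERE  EMP_SALARY + EMP_COMMISSION > 100000;
```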

Optimizer Choices -> statistics-based#

  • Rule-based optimizer: Uses preset rules and points to determine the best approach to execute a query
  • Cost-based optimizer: Uses algorithms based on statistics about objects being accessed to determine the best approach to execute a query(collect all kinds of costs and choose on lowest cost)

Using Hints to Affect Optimizer Choices#

  • Optimizer might not choose the best execution plan
    • Makes decisions based on existing statistics, which might be old
    • Might make less-efficient choices
  • Optimizer hints: Special instructions for the optimizer, embedded in the SQL command text
  • Example: optimizer hints (see the sketch below)
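
A hedged sketch of Oracle-style hints; hint names and behavior are vendor-specific, and PRODUCT / P_PRICE_NDX are hypothetical names:

```sql
-- Ask the optimizer to return the first rows as fast as possible
SELECT /*+ FIRST_ROWS */ P_CODE, P_PRICE
FROM   PRODUCT
WHERE  P_PRICE > 100;

-- Suggest that a specific index be used for this query
SELECT /*+ INDEX(PRODUCT P_PRICE_NDX) */ P_CODE, P_PRICE
FROM   PRODUCT
WHERE  P_PRICE > 100;
```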

Conditional Expressions#

  • Equality comparisons are faster than inequality comparisons
  • Transform conditional expressions to use literals
  • Write equality conditions first when using multiple conditional expressions
  • When using multiple AND conditions, write the condition most likely to be false first
  • When using multiple OR conditions, put the condition most likely to be true first
  • Avoid the use of NOT logical operator
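
Two illustrative rewrites on a hypothetical PRODUCT table; modern optimizers may do some of this automatically, so treat these as rules of thumb:

```sql
-- Prefer literals: compare against a constant instead of an expression
-- slower:  WHERE P_PRICE - 10 = 7
SELECT * FROM PRODUCT WHERE P_PRICE = 17;

-- With multiple ANDed conditions, test the most selective / most likely-to-fail
-- condition first, and prefer equality over inequality
SELECT *
FROM   PRODUCT
WHERE  V_CODE = 21344        -- equality, likely to filter out most rows
  AND  P_PRICE > 100;        -- inequality, checked second
```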

Query Formulation#

  • Identify what columns and computations are required
  • Identify source tables
  • Determine how to join tables
  • Determine what selection criteria are needed
  • Determine the order in which to display the output
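
Putting the steps above together in a single query, assuming hypothetical PRODUCT and VENDOR tables:

```sql
SELECT   V.V_NAME,
         P.P_DESCRIPT,
         P.P_QOH * P.P_PRICE AS TOTAL_VALUE      -- columns and computations required
FROM     PRODUCT P                               -- source tables
JOIN     VENDOR  V ON P.V_CODE = V.V_CODE        -- how to join them
WHERE    P.P_QOH > 0                             -- selection criteria
ORDER BY V.V_NAME, TOTAL_VALUE DESC;             -- display order
```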

DBMS Performance Tuning#

  • Managing DBMS processes in primary memory and the structures in physical storage
  • DBMS performance tuning at server end focuses on setting parameters used for:
    • Data cache
    • SQL cache
    • Sort cache
    • Optimizer mode
  • In-memory database: Stores large portions of the database in primary storage (RAM)
  • Recommendations for physical storage of databases:
    • Use RAID (Redundant Array of Independent Disks) to provide a balance between performance improvement and fault tolerance
    • Minimize disk contention
    • Put high-usage tables in their own table spaces
    • Assign separate data files in separate storage volumes for indexes, system, and high-usage tables
  • Take advantage of the various table storage organizations in the database
  • Index-organized table or clustered index table: Stores the end-user data and the index data in consecutive locations in permanent storage
  • Partition tables based on usage
  • Use denormalized tables where appropriate
  • Store computed and aggregate attributes in tables
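
Hedged, Oracle-flavored DDL sketches for two of the recommendations above (index-organized tables and partitioning); clause names differ across DBMSs and the tables are hypothetical:

```sql
-- Index-organized (clustered-index) table: rows are stored in primary-key order
CREATE TABLE INVOICE (
    INV_NUM   NUMBER PRIMARY KEY,
    CUS_CODE  NUMBER,
    INV_DATE  DATE
) ORGANIZATION INDEX;

-- Partition a high-usage table based on how it is queried (here, by date)
CREATE TABLE SALES (
    SALE_ID   NUMBER,
    SALE_DATE DATE,
    AMOUNT    NUMBER(10,2)
)
PARTITION BY RANGE (SALE_DATE) (
    PARTITION sales_2019 VALUES LESS THAN (DATE '2020-01-01'),
    PARTITION sales_2020 VALUES LESS THAN (DATE '2021-01-01')
);
```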

Business Intelligence#

Business Intelligence(BI)#

  • Comprehensive, cohesive, integrated set of tools and processes
    • Captures, collects, integrates, stores, and analyzes data
  • Generates and presents information to support business decision making
  • Allows a business to transform:
    • Data into information
    • Information into knowledge
    • Knowledge into wisdom
  • ETL: extraction, transformation, and loading
  • Query & reporting: can include ML algorithms
  • Governance: how often to update the warehouse
  • Management: how the ETL happens

Business Intelligence Benefits#

  • Improve decision making
  • Integrating architecture
  • Common user interface for data reporting and analysis
  • Common data repository fosters single version of company data
  • Improve organizational performance

Decision Support Data#

  • Effectiveness of BI depends on quality of data gathered at operational level
  • Operational data
    • Seldom well-suited for decision support tasks
    • Stored in relational database with highly normalized structures
    • Optimized to support transactions representing daily operations
  • Decision support data differ from operational data in:
    • Time span
    • Granularity
      • Drill down: Decomposing data into a lower (more detailed) level
      • Roll up: Aggregating data into a higher (more summarized) level
    • Dimensionality

Decision Support Database Requirements#

  • Database schema
    • Must support complex, non-normalized data representations
    • Data must be aggregated and summarized
    • Queries must be able to extract multidimensional time slices
  • Data extraction and loading
    • Allow batch and scheduled data extraction
    • Support different data sources and check for inconsistent data or data validation rules
    • Support advanced integration, aggregation, and classification
  • Database size should support
    • Very large databases (VLDBs)
    • Advanced storage technologies
    • Multiple-processor technologies

Data Marts#

  • Small, single-subject data warehouse subset
  • Provide decision support to a small group of people
  • Benefits over data warehouses:
    • Lower cost and shorter implementation time
    • Technologically advanced
  • Inevitable people issues

Components of Star Schemas#

  • Facts
    • Numeric values that represent a specific business aspect
  • Dimensions
    • Qualifying characteristics that provide additional perspectives to a given fact
  • Attributes
    • Used to search, filter, and classify facts
    • Slice and dice: Ability to focus on slices of the data cube for more detailed analysis.
  • Attribute hierarchy
    • Provides a top-down data organization
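
A minimal star-schema sketch with hypothetical table and column names: numeric facts live in a central fact table and are qualified by foreign keys into the dimension tables:

```sql
CREATE TABLE DIM_TIME (
    TIME_ID  INT PRIMARY KEY,
    T_YEAR   INT,
    T_MONTH  INT
);

CREATE TABLE DIM_PRODUCT (
    PROD_ID   INT PRIMARY KEY,
    PROD_NAME VARCHAR(50),
    CATEGORY  VARCHAR(30)     -- attributes used to search, filter, and classify facts
);

CREATE TABLE FACT_SALES (
    TIME_ID  INT REFERENCES DIM_TIME (TIME_ID),     -- dimension
    PROD_ID  INT REFERENCES DIM_PRODUCT (PROD_ID),  -- dimension
    UNITS    INT,                                    -- fact
    REVENUE  DECIMAL(10,2)                           -- fact
);
```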

Performance-Improving Techniques for the Star Schema#

  • Normalizing dimensional tables
    • Snowflake schema: Dimension tables can have their own dimension tables
  • Maintaining multiple fact tables to represent different aggregation levels
  • Denormalizing fact tables

Data Analytics#

  • Encompasses a wide range of mathematical, statistical, and modeling techniques to extract knowledge from data
    • A subset of BI functionality
  • Classification of tools
    • Explanatory analytics: Focuses on discovering and explaining data characteristics and relationships based on existing data
    • Predictive analytics: Focuses on predicting future outcomes with a high degree of accuracy

OLAP-> Online Analytical Processing#

  • Advanced data analysis environment that supports decision making, business modeling, and operations research
  • Characteristics:
    • Multidimensional data analysis techniques
    • Advanced database support
    • Easy-to-use end-user interfaces

Relational Online Analytical Processing (ROLAP)#

  • Provides OLAP functionality using relational databases and familiar relational tools to store and analyze multidimensional data
  • Extensions added to traditional RDBMS technology
  • Multidimensional data schema support within the RDBMS
  • Data access language and query performance optimized for multidimensional data
  • Support for very large databases (VLDBs)

Multidimensional Online Analytical Processing (MOLAP) - pointer/array structures form the multidimensional database#

  • Extends OLAP functionality to multidimensional database management systems (MDBMSs)
  • MDBMS: Uses proprietary techniques to store data in matrix-like n-dimensional arrays
  • End users visualize stored data as a 3D data cube
    • Cubes can grow to n dimensions, becoming hypercubes
  • Held in memory in a cube cache to speed access
  • Sparsity: Measures the density of the data held in the data cube

SQL Extensions for OLAP#

The ROLLUP extension (aggregates at progressively higher dimension levels, ordered by the listed columns)#

  • Used with GROUP BY clause to generate aggregates by different dimensions
  • Enables subtotal for each column listed except for the last one, which gets a grand total
  • Order of column list important
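
A minimal sketch on a hypothetical DW_SALES fact table: ROLLUP gives the detail rows, subtotals per V_CODE, and a grand total; the order of the listed columns matters:

```sql
SELECT   V_CODE, P_CODE, SUM(SALE_UNITS * SALE_PRICE) AS TOTAL_SALES
FROM     DW_SALES
GROUP BY ROLLUP (V_CODE, P_CODE)
ORDER BY V_CODE, P_CODE;
```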

The CUBE extension (also gets subtotals by each listed dimension separately)#

  • Used with GROUP BY clause to generate aggregates by the listed columns
  • Includes the last column
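
Same hypothetical table with CUBE: subtotals are produced for every combination of the listed columns, including by P_CODE alone, plus the grand total:

```sql
SELECT   V_CODE, P_CODE, SUM(SALE_UNITS * SALE_PRICE) AS TOTAL_SALES
FROM     DW_SALES
GROUP BY CUBE (V_CODE, P_CODE)
ORDER BY V_CODE, P_CODE;
```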

Data lake#

  • A traditional data warehouse is an ETL-based, historical record of transactions - very RDB-like
  • A ‘modern’ alternative is a ‘data lake’, which offers a more continuous form of analytics, driven by the rise of unstructured data, streaming, cloud storage, etc. In a data lake, data is NOT ETL'd; rather, it is stored in its ‘raw’ (“natural”) form [even incomplete, untransformed…].

Discussion1#

  • Data - origins?
    • copper
    • parchment
    • leaves
    • stone
  • Data - tech?
    • hologram
  • Data vs info vs knowledge
    • info comes from searching, sorting, … calculation - computation over data (collected from somewhere) | exact information
    • knowledge is useful information
  • Modeling
    • Why? to leave out the parts I don't want (abstraction) and keep the relevant features!
  • Relational -> relating different entity
  • Another example of a ‘unary’ relationship? (employee-manager, within the employee table)
  • Another example of a ternary relationship? (e.g., a manufacturing scenario)
  • Instead of TinyCollege, consider USC - more entities?
  • Why is full functional dependence a good thing?
  • Diff between entity and referential integrity, in terms of…
  • Examples of ‘blind’ (non-identifying) keys in RL?
  • How can we use tables to model unary relationships? (see the SQL sketch at the end of this discussion)
    • On the same table, add a manager column that points to another row's employee_id -> drawback: if many people do not have a manager, it becomes a sparse column
    • Use two tables -> create a manager_employee table with only two columns (employee1, employee2), where employee1 is the employee and employee2 is the manager
  • Closure - where else?
  • Why is closure a good property?
  • Set diff is non-commutative - what else is?
  • What is the conceptual shift, from a ‘spreadsheet DB’ to a relational one?
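
As referenced in the unary-relationship bullet above, a minimal SQL sketch of the two options, assuming a hypothetical EMPLOYEE table:

```sql
-- Option 1: self-referencing column (sparse if most employees have no manager)
CREATE TABLE EMPLOYEE (
    EMP_ID     INT PRIMARY KEY,
    EMP_NAME   VARCHAR(50),
    MANAGER_ID INT REFERENCES EMPLOYEE (EMP_ID)   -- points back into the same table
);

-- Option 2: separate two-column relationship table
CREATE TABLE MANAGER_EMPLOYEE (
    EMPLOYEE_ID INT REFERENCES EMPLOYEE (EMP_ID),
    MANAGER_ID  INT REFERENCES EMPLOYEE (EMP_ID),
    PRIMARY KEY (EMPLOYEE_ID, MANAGER_ID)
);
```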

Discussion 2#

  • SQL: ____ Oriented Programming - what? What else?

    • Imperative programming: you use loops etc. to spell out how to accomplish the goal. SQL is not imperative, but Java, C, and C++ are
    • Object-oriented -> packages data and the methods on that data together
    • Functional programming: functions can be passed in as input and returned as output. SQL is function-oriented
      • Function-oriented programming language
      • Set-oriented language -> the SQL engine figures out how to do it
      • Declarative … you tell it what you want, without fine-grained control over how
    • Aspect-oriented programming -> AOP
      • An aspect is a cross-cutting task a program needs to do: draw, email, send to printer
      • Gather all such task classes together behind an interface
      • Implementations of that interface are called aspects
      • SQL is also a domain-specific language (DSL)
  • Pattern matching - https://www.rexegg.com/regex-quickstart.html ; in SQL LIKE, '_' matches a single character and '%' matches zero or more characters (see the sketch at the end of this discussion)

  • More datatypes?

    • Color type
    • currency type
  • Breaking closure -> the result cannot be chained further; it stops the chaining of operations

  • RL(real life) situations (not DB) where transactions are done sequentially?

    • dating - you cannot date many people at the same time
  • Robert's Rules of Order - how to make sure everybody gets a chance to speak

  • DB cases in RL where interleaving in inevitable?

    • travel book ticket
    • buy books from amazon
    • Linkedin Job
  • Database state, during lost updates vs inconsistent analysis?

    • lost updates -> T2 overwrites T1's modification -> data is destroyed; the table's contents are not necessarily fine afterwards
    • inconsistent analysis -> a non-repeatable read -> the integrity of the table is not harmed; the table's structure does not change
  • Serial schedule - how many exist, for n serializable transactions? (n! serial orderings)

    • How you rearrange/interleave many transactions matters a lot
    • Transaction isolation levels -> 4 levels, e.g., whether to allow dirty reads or non-repeatable reads …
      • avoid dirty reads
      • avoid non-repeatable reads
      • avoid phantom reads
  • Why not make all schedules serial? -> consider performance

  • 2PL (Static/Conservative) vs Strict 2PL (“S2PL”) {hold write locks until commit, but read locks may be released earlier} vs Rigorous (Stricter/Strong-Strict/SS2PL!) {hold all locks until the commit completes}

  • Newton and his 2 cats

  • Replication - three reasons?

    • fault tolerance (cope with failures)
    • load balancing across the network
  • Replication tradeoffs with consistency? -> amazon

  • 2PC - failure, how? -> rollback, or hang (block) while waiting on the coordinator
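
As referenced in the pattern-matching bullet above, a tiny LIKE example on a hypothetical STUDENT table: '_' matches exactly one character and '%' matches zero or more characters:

```sql
SELECT * FROM STUDENT WHERE STU_LNAME LIKE 'Sm_th';   -- matches Smith, Smyth
SELECT * FROM STUDENT WHERE STU_LNAME LIKE 'Jo%';     -- matches Jo, Jones, Johnson
```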