DatabaseReview
This is a review reference and some notes/my own understanding of datbase course.
REFERENCE to CSCI 585 2020 SPRING lectured by Prof. Saty in University of Southern California
- Introduction
- 1. The difference between data and and information
- 2. What a database is, the various types of databases, and why they are valuable assets for decision making.
- 3. The importance of database design
- 4. How modern databases envolved from file systems
- 5. Flaws in file system data management
- 6. The main components of the database system
- 7. The main functions of a database management system(DBMS)
- Data Modeling
- About data modeling and why data models are important
- About the basic data-modeling building blocks
- What business rules are and how they influence database design
- Hierarchical modeling
- Network modeling
- Difference between Schema and Sub Schema
- Difference between DML and DDL
- The Relational Model
- Relational DBMS(Database Management System)
- How the major data models evolved
- Data Abstraction Levels
- The External Model
- The Conceptual Model
- The Internal Model
- The Physical Model
- About emerging alternative data models and the need they fulfill
- How data models can be classified by their level of abstraction
- ER
- Relational Modeling
- Extended ER (“EER”)-> Advanced data modeling
- Normalization
- Normalization (Do normalization once a time one entity a time)
- Need for Normalization
- Normalization Process
- Normal Forms
- Functional Dependence Concepts
- Types of Functional Dependencies
- Conversion to First Normal Form
- Conversion to Two Normal Form
- Conversion to Third Normal Form
- Requirements for Good Normalized Set of Tables
- Denormalization
- Transaction Management
- Transaction
- Transaction Properities
- Transaction Management with SQL
- Transaction log
- Concurrency Control
- The Scheduler
- Concurrency Control with Locking Methods
- Lock Granularity
- Lock Types
- Two-Phase Locking (2PL) -> avoid deadlock
- Deadlocks
- Time Stamping -> prevent deadlock
- Wait/Die and Wound/Wait Concurrency Control Schema
- Phases of Optimistic Approach
- Distributed DBs
- Vertical scalling and horizontal scalling
- Distributed Processing and Distributed Databases
- Functions of Distributed DBMS
- DDBMS Components
- Single-Site Processing, Single-Site Data(MPSD) pretty simple
- Multiple-Site Processing, Single-Site Data (MPSD)
- Multiple-Site Processing, Single-Site Data (MPSD)
- Restrictions of DDBMS
- Distributed Requests and Distributed Transactions
- Distributed Concurrency Control (2PC 2 phrase commit)
- Two-Phase Commit Protocol(2PC)
- Transparency (Hide sth to users)
- Distributed Database Design
- Data Fragmentation
- Data Replication
- Data Replication Scenarios
- Data Allocation Strategies
- The CAP Theorem (That’s what u need consider when you design a database)
- Database Connectivity
- Various connectivity options
- Components of ODBC Architecture
- Object Linking and Embedding for Database (OLE-DB) Database becomes object
- Database Internet Connectivity
- Web-to-Database Middleware
- Client-Side Extensions
- Performance Tuning and Query Optimization
- Tuning(Database Performance-Tuning Concepts)
- Performance Tuning: Client and Server
- DBMS Architecture
- Database Query Optimization Modes
- Classification of Operation Modes
- Based on Timing of Optimization
- Type of Information Used to Optimize
- Sample Database Statistics Measurements
- Query Processing
- SQL Parsing Phase
- SQL Execution Phase
- SQL Fetching Phase
- Query Processing Bottlenecks
- Indexes and Query Optimization
- Indexes and Query Optimization
- index Selectivity
- Optimizer Choices -> statisitc-based
- Using Hints to Affect Optimizer Choices
- Conditional Expressions
- Query Formulation
- DBMS Performance Tuning
- Business Intelligence
- Business Intelligence(BI)
- Business Intelligence Benefits
- Decision Support Data
- Decision Support Database Requirements
- Data Marts
- Components of Star Schemas
- Performance-Improving Techniques for the Star Schema
- Data Anaytics
- OLAP-> Online Analytical Processing
- Relational Online Analytical Processing (ROLAP)
- Multidimensional Online Analytical Processing (MOLAP) pointer to pointer to formulate multi-dimensional database
- SQL Extensions for OLAP
- Data lake
- Discussion1
- Discussion 2
Introduction#
1. The difference between data and and information#
Data | Information |
---|---|
Raw fact(Raw data- Not yet been processed to reveal the meaning) | produced by processing data |
Building blocks of information | Reveals the meaning of data |
Data management(Generation, storage, and retrieval of data) | Enables knowledge creation |
Should be accurate, relevant, and timely to enable good decision making |
- To be simple, Data is raw fact, objective exists in the real word. Information is a kind of processed data. After gathering and processing, data becomes information. For example, the data of phone is pixel but the people can see it is a picture.
- Knowledge is useful information. Something you can do with the information. Knowledge can use the information we found.
2. What a database is, the various types of databases, and why they are valuable assets for decision making.#
- Database is a shared, integrated computer structure that sotres a collection of:
- End-user data: Raw facts of interest to end user, such as student’s gpa…
- Metadata: Data about data, which the end-user data are integrated and managed
- describe data characteristics and relationships such as what does the GPA mean and some restriction such as $0 \leq GPA \leq 4.0$ and $student-age > 0$. About how we store the information. Or How many tables we have.
- Database management system(DBMS)
- Collection of programs
- Manages the database structure
- Controls access to data stored in the database(who can access database | access restrictions)
- Role of the DBMS
- Intermediary between the user and the database
- Enables data to be shared
- Presents the end user with an integrated view of the data
- Receives and translates application requests(what are you searching for) into operations required to fulfill the request
- Hides database’s internal complexity from the application programs and users(People can reorganise data in the future)
- Advantages of DBMS
- Better data integration and less data inconsistency(pervent having two difference record of the same things)
- Data inconsistency: Different versions of the same data appear in different places
- Increase end-user productivity
- Data sharing
- Data security // Who gonna see what
- Data access
- Decision making
- Data quality: Promoting accuracy, validity, and timeliness of data.
- Better data integration and less data inconsistency(pervent having two difference record of the same things)
- Types of databases
- Single-user database: Supports one user at a time (Like a database in your phone/Desktop)
- Desktop database: Runs on PC
- Multiuser database: Supports multiple users at the same time (Like BOA or Amazon database)
- Workgroup databases: Support a small number of users or a specific department
- Enterprise database: Supports many users across many departments
- Centralized database: Data is located at a single site
- Distributed database: Data is distributed across different sites (Part of database from xxx part of database from xxx different contries) Note: It is not copies. Just different parts of one database from different sites. – google uses it
- Cloud database: Created and maintained using cloud data services that provide defined performance measures for the database. Companies do not actually using the hardwares. Like UBER, they dont have their physical, their databases live on some cloud.(AWS)
- Single-user database: Supports one user at a time (Like a database in your phone/Desktop)
- Types of databases (Another way to classify database)
- General-purpose databases: Contains a wide variety of data used in multiple disciplines. Like Oracle database, you can use the database where you want. And you can put anything you want in Oracle. –> general purpose
- Discipline-specific databases: Contains data focused on specific subject areas. Like scientists use protein database(PDB -> protein data bank) Every protein molecules are stored in database. Different structure for specific usage. They have their own buliding plans and only the users know what it is or what it represents.
- Types of databases (Another way to classify database) -> purly for business point of view
- Operational database: Designed to support a company’s day to day operations.
- Analytical database: Stores historical data and business metrics used exclusively for tactical or strategic decision making
- Data warehouse: Stores data in a format optimized for decision support.
- Example: Constantly writing data…. every purchase into Operational database daily. At night, they put the data to one central place called analytical database(used to analytical later) It is constantly reading database.
- where do data warehouse caculations
- Online analytical processing(OLAP) Command to go to the warehouse and analysis data
- Enable retrieving, processing, and modeling data from the data warehouse.
- Business intelligence: Captures and processes business data to generate information that support decision making. (stock from Operational database to Analytical database piece to piece) -> data mining
- Online analytical processing(OLAP) Command to go to the warehouse and analysis data
- Types of databases (Another way to classify database)
- Unstructured data: It exists in their original state (dont have a specific format like musci/video files)
- Structured data: It results from formatting (like students information stored in each row of studnet table)
- Structure is applied based on type of processing to be performed
- Semistructured data: Processed to some extent like document. Generally it is used to describe sth. Between structured and unstructured.
- Extensible Markup Language(XML) Json better than xml can be used in structured and unstructured
- Represents data elements in textual format
3. The importance of database design#
4. How modern databases envolved from file systems#
- Early Dbs: file systems
- Manual file systems: Accompulished through a system of file folders and filing cabinets
- Computerized File Systems
- Data processing(DP) specialist: Create a computer-based system that would track data and produce required reports
- File System Redux: Mdern End-User Productivity Tools(Includes spreadsheet programs such as Microsoft Excel)
- File systems
- Data -> Raw facts, such as a Tel number, birth date or customer name. Data have little meaning unless they have been organized in some logical manner. Each cell!
- Field -> A character of group of characters(alphabetic or numeric) that has a specific meaning. A field is used to define and store data. Like the definition of the meaning of data in this field.
- Record -> A row (contains person’s name, address….)
- File -> a collection of record
- Problems with Fie System Data Processing
- Lengthy development times
- Difficulty of getting quick answers
- Complex system administration
- Lack of security and limited data sharing
- Extensive programming
- Structural and Data dependence
- Data dependence
- Data access changes when data storage characteristics chage
- Data independence
- Data storage characteristics is changed without affecting the program’s ability to access the data
- Practical significance of data dependence is difference between logical and physical format
- Data dependence
- Data Redundancy
- Unnecessarily storing same data at different places
- Islands of information: Scattered data locations
- Increases the probability of having different versions of the same data.
- Data Redundancy Implications (the defects of data redundancy)
- Poor data security
- Data inconsistency
- Increased likelihood of data-entry errors when complex enries are made in different files
- Data anomaly: Develops when not all of the required changes in the redundant data are made successfully.
- Update Anomalies (happens when data is already there)
- Insertion Anomalies (create data for the first time)
- Deletion Anomalies
- Database System
- Logically related data stored in a single logical data repository
- Physically distributed among multiple storage facilities
- DBMS eliminates most of file system’s problems
- Curent generation DBMS software
- Stores data structures, relationships between structures, and access pahts
- Defines, stores, and manages all access paths and components.
- DBMS Functions
- Data dictionary management
- Data storage management
- Per
- Data transformation and presentation
- Transforms entrered data to conform required data structures
- Security management
- Enforces user security and data privacy
- Logically related data stored in a single logical data repository
5. Flaws in file system data management#
6. The main components of the database system#
7. The main functions of a database management system(DBMS)#
- Data dictionary management == meta data example for one column
- Data dictionary: Stores definitions of the data elements and their relationships
- Data storage management
- Performance tuning: Ensures efficient performance of the database in terms of storage and access speed
- Data transformation and presentation
- Transforms entered data to conform to required data structures.
- Security management
- Enforces user security and data privacy.
- Multiuser access control
- Sophisticated algorithms ensure that multiple users can access the database concurrently without compromising its integrity.
- Backup and recovery management
- Enables recovery of the database after a failure
- Data integrity management
- Minimizes redundancy and maximizes consistency
- Database access languages and application programming interfaces
- Query language: Lets the user specify what must be done without having to specify how
- Structured Query Language (SQL): De facto query language and data access standard supported by the majority of DBMS vendors
- Database communication interfaces
- Accept end-user requests via multiple, different network environments
- Disadvantage of Database Systems
- Increased costs
- Management complexity
- Maintaining currency
- Vendor dependence
- Frequent upgrade/replacement cycles
Data Modeling#
About data modeling and why data models are important#
Data Modeling and Data Models#
- Data modeling: Iterative and progressive process of creating a specific data model for a determined problem domain
- Data models: Simple representations of complex real-world data structures
Useful for supporting a specific problem domain - Model - Abstraction of a real-world object or event
Importance of Data Models#
- Are a communication tool
- Give an overall view of the database
- Organize data for various users
- Are an abstraction for the creation of good database
About the basic data-modeling building blocks#
- Entity: Unique and distinct object used to collect and store data
- Attribute: Characteristic of an entity
- Relationship: Describes an association among entities(one row in an entity can relate to how many rows in other entities)
- One-to-many (1:M)
- Many-to-many (M:N or M:M)
- One-to-one (1:1)
- Constraint: Set of rules to ensure data integrity
What business rules are and how they influence database design#
- Brief, precise, and unambiguous description of a policy, procedure, or principle
- Enable defining the basic building blocks
- Describe main and distinguishing characteristics of the data
- Sources of Business Rules(sb. write the manual)
- Company managers
- Policy makers
- Department managers
- Written documentation
- Direct interviews with end users
- Why we need business rules
- Help standardize company’s view of data
- Communications tool between users and designers
- Allow designer to:
- Understand the nature, role, scope of data, and business processes
- Develop appropriate relationship participation rules and constraints
- Create an accurate data model
Hierarchical modeling#
- Manage large amounts of data for complex manufacturing projects
- Represented by an upside-down tree which contains segments
- Segments: Equivalent of a file system’s record type
- Depicts a set of one-to-many (1:M) relationships
- Advantage
- Promotes data sharing
- Parent/child relationship promotes conceptual simplicity and data integrity
- Database security is provided and enforced by DBMS
- Efficient with 1:M relationships
- Disadvantages
- Requires knowledge of physical data storage characteristics
- Navigational system requires knowledge of hierarchical path
- Changes in structure require changes in all application programs
- Implementation limitations
- No data definition
- Lack of standards
Network modeling#
- Represent complex data relationships
- Improve database performance and impose a database standard
- Depicts both one-to-many (1:M) and many-to-many (M:N) relationships
- Advantages
- Conceptual simplicity
- Handles more relationship types
- Data access is flexible
- Data owner/member relationship promotes data integrity
- Conformance to standards
- Includes data definition language (DDL) and data manipulation language (DML)
- Disadvantages
- System complexity limits efficiency
- Navigational system yields complex implementation, application development, and management
- Structural changes require changes in all application programs
Difference between Schema and Sub Schema#
- Schema is a Conceptual organization of the entire database as viewed by the database administrator(means the whole thing)
- SubSchema is a Portion of the database seen by the application programs that produce the desired information from the data within the database.(only a sub part of schema)
Difference between DML and DDL#
- Data manipulation language(DML): Environment in which data can be managed and is used to work with the data in the database(add / delete/ modify/ query data)
- Schema data definition language(DDL): Enables the database administrator to define the schema components (create the structure)
The Relational Model#
- Produced an automatic transmission database that replaced standard transmission databases
- Based on a relation
- Relation or table: Matrix composed of intersecting tuple and attribute(Here we need to note that we have two kinds of relation.One is the relation between different tables. And the other thing is one cell related to rows and columns. For example, one score can relate to one student and it is belonged to gpa column it is an intersection)
- Tuple: Rows
- Attribute: Columns
- Relation or table: Matrix composed of intersecting tuple and attribute(Here we need to note that we have two kinds of relation.One is the relation between different tables. And the other thing is one cell related to rows and columns. For example, one score can relate to one student and it is belonged to gpa column it is an intersection)
- Describes a precise set of data manipulation constructs
- Advantage
- Structural independence is promoted using independent tables
- Tabular view improves conceptual simplicity
- Ad hoc query capability is based on SQL
- Isolates the end user from physical-level details
- Improves implementation and management simplicity
Relational DBMS(Database Management System)#
- Performs basic functions provided by the hierarchical and network DBMS systems
- Makes the relational data model easier to understand and implement
- Hides the complexities of the relational model from the user
How the major data models evolved#
Data Abstraction Levels#
It means you change sth, the level above this layer will no change
Physical independence means pyhsical model can change but internal Model could not change.
Logical independence means if internal model changes but conceptual could not change. Such as if Internal Model is changed to NoSQL from RDM, nobody will notice that.
The External Model#
- End users’ view of the data environment
- ER diagrams are used to represent the external views
- External schema: Specific representation of an external view
The Conceptual Model#
- Represents a global view of the entire database by the entire organization
- Conceptual schema: Basis for the identification and high-level description of the main data objects
- Has a macro-level view of data environment
- Is software and hardware independent
- Logical design: Task of creating a conceptual data model
The Internal Model#
- Representing database as seen by the DBMS mapping conceptual model to the DBMS
- Internal schema: Specific representation of an internal model
- Uses the database constructs supported by the chosen database
- Is software dependent and hardware independent
- Logical independence: Changing internal model without affecting the conceptual model
The Physical Model#
- Operates at lowest level of abstraction
- Describes the way data are saved on storage media such as disks or tapes
- Requires the definition of physical storage and data access methods
- Relational model aimed at logical level
Does not require physical-level details - Physical independence: Changes in physical model do not affect internal model
About emerging alternative data models and the need they fulfill#
How data models can be classified by their level of abstraction#
ER#
Entity Relationship Model (ERM)#
- Basis of an entity relationship diagram (ERD)
- ERD depicts the:
- Conceptual database as viewed by end user
- Database’s main components
- Entities
- Attributes
- Relationships
- Entity - Refers to the entity set and not to a single entity occurrence
Attributes#
- Characteristics of entities
- Required attribute: Must have a value, cannot be left empty -Id/Lastname
- Optional attribute: Does not require a value, can be left empty -secondary Email
- Domain - Set of possible values for a given attribute(range the possible like GPA 0-4 name can only be 40chars long ..valid values)
- Identifiers: One or more attributes that uniquely identify each entity instance(must exist and cannot be null like primary key(s))
- Compondend attribute: Attribute that can be subdivided to yield additional attributes(Different from compond. One column can contain more than one colums in it, such as color can contain R,G,B or Name can contain first_name and last_name
- Composite identifier: Primary key composed of more than one attribute
- Composite attribute: composit is like the combinations of several columns same as Composite identifier, but identifier is primary key.
- Simple attribute: Attribute that cannot be subdivided
- Single-valued attribute: Attribute that has only a single value
- Multivalued attributes: Attributes that have many values and require creating:
- Several new attributes, one for each component of the original multivalued attribute
- A new entity composed of the original multivalued attribute’s components
- Derived attribute: Attribute whose value is calculated from other attributes
- Derived using an algorithm
- Advantages and Disadvantages of Storing Derived Attributes
Relationships#
- Association between entities that always operate in both directions
- Participants: Entities that participate in a relationship
- Connectivity: Describes the relationship classification
- Cardinality: Expresses the minimum and maximum number of entity occurrences associated with one occurrence of related entity
- Existence Dependence and Existence Independence
Existence dependence Existence independence Entity exists in the database only when it is associated with another related entity occurrence Entity exists apart from all of its related entities. Referred to as a strong entity or regular entity For example, the Employee(String entity/regular entity) and their children (*** Children is not week entity!!!) Please notice that When a primary is shared between two Entity, the other entity(which owns both its only key and shared key) is weak entity And the relationship between them is strong relationship. - Relationship Strength
- Weak (non-identifying) relationship
- Primary key of the related entity does not contain a primary key component of the parent entity
- Strong (identifying) relationships
- Primary key of the related entity contains a primary key component of the parent entity
A strong relationship makes the right side be weak. But the right hand side of a Weak entiy is not weak though it is still existant dependency.
- Primary key of the related entity contains a primary key component of the parent entity
- A weak entity has two Conditions
+Existence-dependent- Has a primary key that is partially or totally derived from parent entity in the relationship
- Weak (non-identifying) relationship
- Relationship Participation
- Optional participation
- One entity occurrence does not require a corresponding entity occurrence in a particular relationship
- Mandatory participation
- One entity occurrence requires a corresponding entity occurrence in a particular relationship
- Optional participation
- Relationship Degree
- Indicates the number of entities or participants associated with a relationship
- Unary relationship: Association is maintained within a single entity -> Some row in a table has some relationship with some row in the same table
- Recursive relationship: Relationship exists between occurrences of the same entity set
- Binary relationship: Two entities are associated
- Ternary relationship: Three entities are associated
- Associative Entities
- Also known as composite or bridge entities
- Used to represent an M:N relationship between two or more entities
- Is in a 1:M relationship with the parent entities
- Composed of the primary key attributes of each parent entity
- May also contain additional attributes that play no role in connective process
- Developing an ER Diagram
- Create a detailed narrative of the organization’s description of operations
- Identify business rules based on the descriptions
- Identify main entities and relationships from the business rules
- Develop the initial ERD
- Identify the attributes and primary keys that adequately describe entities
- Revise and review ERD
Relational Modeling#
A Logical View of Data#
- Relational database model enables logical representation of the data and its relationships
- Logical simplicity yields simple and effective database design methodologies
- Facilitated by the creation of data relationships based on a logical construct called a relation
Keys#
- Consist of one or more attributes that determine other attributes
- Used to:
- Ensure that each row in a table is uniquely identifiable
- Establish relationships among tables and to ensure the integrity of the data
- Primary key (PK): Attribute or combination of attributes that uniquely identifies any given row
Determination#
- State in which knowing the value of one attribute makes it possible to determine the value of another
- Is the basis for establishing the role of a key
- Based on the relationships among the attributes
Dependencies#
- Functional dependence: Value of one or more attributes determines the value of one or more other attributes
- Determinant: Attribute whose value determines another
- Dependent: Attribute whose value is determined by the other attribute
- Full functional dependence: Entire collection of attributes in the determinant is necessary for the relationship. For example, like first_name and last_name are determinant. So if I want to query sth, I must git all of them
Types of Keys#
- Composite key: Key that is composed of more than one attribute
- Key attribute: Attribute that is a part of a key
- Entity integrity: Condition in which each row in the table has its own unique identity
- All of the values in the primary key must be unique
- No key attribute in the primary key can contain a null
- Null: Absence of any data value that could represent:
- An unknown attribute value
- A known, but missing, attribute value
- A inapplicable condition
- Referential integrity: Every reference to an entity instance by another entity instance is valid (foreign key) if reference is null, it violates the integrity
- Superkey: An attribute or combination of attributes that uniquely identifies each row in a table(A superkey is a group of single or multiple keys which identifies rows in a table. A Super key may have additional attributes that are not needed for unique identification.)
- Candidate key: A minimal(irreducible) superkey; a superkey that does contain a subset of attributes that is itself a superkey.
It is a set of attributes that uniquely identify tuples in a table. Candidate Key is a super key with no repeated attributes. The Primary key should be selected from the candidate keys. Every table must have at least a single candidate key. A table can have multiple candidate keys but only a single primary key.- Properties of Candidate key:
- It must contain unique values
- Candidate key may have multiple attributes
- Must not contain null values
- It should contain minimum fields to ensure uniqueness
- Uniquely identify each record in a table
- Properties of Candidate key:
- Secondary key
- Secondary Key is the key that has not been selected to be the primary key. A secondary key is a non-identifying column or set of columns that can be used to find row in a table. It is only for data retrieval purposes
Ways to Handle Nulls#
- Flags: Special codes used to indicate the absence of some value
- NOT NULL constraint - Placed on a column to ensure that every row in the table has a value for that column
- UNIQUE constraint - Restriction placed on a column to ensure that no duplicate values exist for that column
Relational Algebra#
- Theoretical way of manipulating table contents using relational operators
- Relvar: Variable that holds a relation
- Heading contains the names of the attributes and the body contains the relation
- Relational operators have the property of closure
- Closure: Use of relational algebra operators on existing relations produces new relations
Relational Set Operators#
- Select (Restrict)
- Unary operator that yields a horizontal subset of a table
- Project
- Unary operator that yields a vertical subset of a table
- Union
- Combines all rows from two tables, excluding duplicate rows
- Union-compatible: Tables share the same number of columns, and their corresponding columns share compatible domains
- Intersect
- Yields only the rows that appear in both tables
- Tables must be union-compatible to yield valid results
- Difference
- Yields all rows in one table that are not found in the other table
- Tables must be union-compatible to yield valid results
- Product
- Yields all possible pairs of rows from two tables
- Join
- Allows information to be intelligently combined from two or more tables
- Types of Joins
- Natural join: Links tables by selecting only the rows with common values in their common attributes
- Join columns: Common columns -> the columns have exactly the same name. -> If not the same we do not call it nature join
- Equijoin: Links tables on the basis of an equality condition that compares specified columns of each table -> == != < > <= >= ….
- Theta join: Extension of natural join, denoted by adding a theta subscript after the JOIN symbol
- Inner join: Only returns matched records from the tables that are being joined
- Outer join: Matched pairs are retained and unmatched values in the other table are left null
- Left outer join: Yields all of the rows in the first table, including those that do not have a matching value in the second table
- Right outer join: Yields all of the rows in the second table, including those that do not have matching values in the first table
- Natural join: Links tables by selecting only the rows with common values in their common attributes
- Divide
- Uses one 2-column table as the dividend and one single-column table as the divisor
- Output is a single column that contains all values from the second column of the dividend that are associated with every row in the divisor
Data Dictionary and the System Catalog(hold metadata)#
- Data dictionary: Description of all tables in the database created by the user and designer // Where is the table located and the characteristic of columns
- System catalog: System data dictionary that describes all objects within the database // the same as data dictionary
- Homonyms and synonyms must be avoided to lessen confusion
- Homonym: Same name is used to label different attributes -> Cannot use the same name to describe diff things
- Synonym: Different names are used to describe the same attribute -> same
Extended ER (“EER”)-> Advanced data modeling#
Extended Entity Relationship Model (EERM)#
- Result of adding more semantic constructs to the original entity relationship (ER) model
- EER diagram (EERD): Uses the EER model
Entity Supertypes and Subtypes#
- Entity supertype: Generic entity type related to one or more entity subtypes
- Contains common characteristics
- Entity subtype: Contains unique characteristics of each entity subtype
- Criteria to determine the usage
- There must be different, identifiable kinds of the entity in the user’s environment
- The different kinds of instances should each have one or more attributes that are unique to that kind of instance
Specialization Hierarchy#
- Depicts arrangement of higher-level entity supertypes and lower-level entity subtypes
- Relationships are described in terms of “is-a” relationships -> Composition can have “Has-a” relation
- Subtype exists within the context of a supertype
- Every subtype has one supertype to which it is directly related # Cannot have more
- Supertype can have many subtypes
- Provides the means to:
- Support attribute inheritance
- Define a special supertype attribute known as the subtype discriminator
- Define disjoint/overlapping constraints and complete/partial constraints
Inheritanc#
- Enables an entity subtype to inherit attributes and relationships of the supertype
- All entity subtypes inherit their primary key attribute from their supertype
- At the implementation level, supertype and its subtype(s) maintain a 1:1 relationship
- Entity subtypes inherit all relationships in which supertype entity participates
- Lower-level subtypes inherit all attributes and relationships from its upper-level supertypes
Subtype Discriminator#
- Attribute in the supertype entity that determines to which entity subtype the supertype occurrence is related
- Default comparison condition is the equality comparison example:’p’ -> pilot
Disjoint and Overlapping Constraints#
- Disjoint subtypes: Contain a unique subset of the supertype entity set
- Known as nonoverlapping subtypes
- Implementation is based on the value of the subtype discriminator attribute in the supertype
Overlapping subtypes: Contain nonunique subsets of the supertype entity set - Implementation requires the use of one discriminator attribute for each subtype
Completeness Constraint#
- Specifies whether each supertype occurrence must also be a member of at least one subtype
- Types
- Partial completeness: Not every supertype occurrence is a member of a subtype
- Total completeness: Every supertype occurrence must be a member of any
Entity Cluster#
- Virtual entity type used to represent multiple entities and relationships in ERD
- Avoid the display of attributes to eliminate complications that result when the inheritance rules change
Normalization#
Normalization (Do normalization once a time one entity a time)#
- Evaluating and correcting table structures to minimize data redundancies
- Reduces data anomalies
- Assigns attributes to tables based on determination
- Normal forms
- First normal form (1NF)
- Second normal form (2NF)
- Third normal form (3NF)
Need for Normalization#
- Used while designing a new database structure
- Analyzes the relationship among the attributes within each entity
- Determines if the structure can be improved
- Improves the existing data structure and creates an appropriate database design
Normalization Process#
- Objective is to ensure that each table conforms to the concept of well-formed relations
- Each table represents a single subject
- No data item will be unnecessarily stored in more than one table
- All nonprime attributes in a table are wholely dependent on the primary key
- Each table is void of insertion, update, and deletion anomalies
- Ensures that all tables are in at least 3NF
- Higher forms are not likely to be encountered in business environment
- Works one relation at a time
- Starts by:
- Identifying the dependencies of a relation (table)
- Progressively breaking the relation into new set of relations
Normal Forms#
NORMAL FORM | CHARACTERISTIC |
---|---|
First Normal form(1NF) | Table format, no repeating groups, and PK identified -> Each cell is atomic, cannot be seperated any more |
Second normal form(2NF) | 1NF and no partial dependencies |
Third normal form(3NF) | 2NF and no transitive dependencies |
*Note: Normalization how-to, in one sentence: work on one relation (table) at a time: identify dependencies, then ‘normalize’ - progressively break it down into smaller relations (tables), based on the dependencies we identify in the original relation |
Functional Dependence Concepts#
Concept | Definition |
---|---|
Functional dependence | The attribute B is fully functionally dependent on the attribute A if each value of A determines one and only one value of B. -> A determines B |
Functional dependence (Generalized definition) | Attribute A determines attribute B if all of the rows in the table that agree in value for attribute A also agree in value for attribute B. |
Fully functional dependence (composite key) | If attribute B is functionally dependent on a composite key A but not on any Subset of that composite key, the attribute B is fully functionally dependent on A. A is composite Key and any subsets of A is not enough to determine B |
Types of Functional Dependencies#
- Partial dependency: Functional dependence in which the determinant is only part of the primary key (A determines B -> A is determinant)
- Assumption - One candidate key
- Straight forward
- Easy to identify
- It means that the B only fully determined by part of primary key. -> dont need whole compositive key
- Transitive dependency: An attribute functionally depends on another nonkey attribute
Conversion to First Normal Form#
- Repeating group: Group of multiple entries of same type can exist for any single key attribute occurrence
- Existence proves the presence of data redundancies
- Enable reducing data redundancies
- Steps
- Eliminate the repeating groups
- Identify the primary key
- Identify all dependencies
- Dependency diagram: Depicts all dependencies found within given table structure
- Helps to get an overview of all relationships among table’s attributes
- Makes it less likely that an important dependency will be overlooked
- 1NF
Conversion to Two Normal Form#
- Steps
- Make new tables to eliminate partial dependencies
- Reassign corresponding dependent attributes
- Table is in 2NF when it:
- Is in 1NF
- Includes no partial dependencies
- 2NF
Conversion to Third Normal Form#
- Steps
- Make new tables to eliminate transitive dependencies
- Determinant: Any attribute whose value determines other values within a row
- Reassign corresponding dependent attributes
- Table is in 3NF when it:
- Is in 2NF
- Contains no transitive dependencies
- 3NF
Requirements for Good Normalized Set of Tables#
- Evaluate PK assignments and naming conventions -> create a job_code
- evaluate naming conventions -> JOB_CHG_HOUR
- Refine attribute atomicity -> EMP_NAME -> first name, last name
- Atomic attribute: Cannot be further subdivided
- Atomicity: Characteristic of an atomic attribute
- Identify new attributes $\And$ new relationships -> EMP_HIREDATE(Add more columns to make it more useful) $\And$ PROJECT can have EMP_NO as FK
- Refine primary keys as required for data granularity -> ASSIGN_NUM
- Granularity: Level of detail represented by the values stored in a table’s row
- Maintain historical accuracy $\And$ evaluate using derived attributes -> sotre JOB_CHG_HOUR IN assignment $\And$ ASSIGN_CHARGE (It can be caculatd by xx xx waste space but get better performance))
Denormalization#
- Design goals
- Creation of normalized relations
- Processing requirements and speed
- Number of database tables expands when tables are decomposed to conform to normalization requirements
- Joining a larger number of tables:
- Takes additional input/output (I/O) operations and processing logic
- Reduces system speed
- Defects in unnormalized tables
- Data updates are less efficient because tables are larger
- Indexing is more cumbersome
- No simple strategies for creating virtual tables known as views
Transaction Management#
Transaction#
- Logical unit of work that must be entirely completed or aborted
- Consists of:
- SELECT statement
- Series of related UPDATE statements
- Series of INSERT statements
- Combination of SELECT, UPDATE, and INSERT statements
- Consistent database state: All data integrity constraints are satisfied // Before transaction, the database is good and clean and after transaction is completed, the transaction will leave the database again in a consistent sate.
- Must begin with the database in a known consistent state to ensure consistency
- Formed by two or more database requests
- Database requests: Equivalent of a single SQL statement in an application program or transaction
- Consists of a single SQL statement or a collection of related SQL statements
Transaction Properities#
- Atomicity: All operations of a transaction must be completed. If not, the transaction is aborted.
- Consistency: Permanence of database’s consistent state
- Isolation: Data used during transaction cannot be used by second transaction until the first is completed
- Durability: Ensures that once transactions are committed, they cannot be undone or lost
- Serializability: Ensures that the schedule for the concurrent execution of several transactions should yield consistent results
Transaction Management with SQL#
- SQL statements that provide transaction support
- COMMIT
- ROLLBACK -> roll back the current transaction to last commit
- Transaction sequence must continue until:
- COMMIT statement is reached
- ROLLBACK statement is reached
- End of program is reached
- Program is abnormally terminated
Transaction log#
- Keeps track of all transactions that update the database
- DBMS uses the information stored in a log for:
- Recovery requirement triggered by a ROLLBACK statement
- A program’s abnormal termination
- A system failure
Concurrency Control#
- Coordination of the simultaneous transactions execution in a multiuser database system
- Objective - Ensures serializability of transactions in a multiuser database environment
- dirty read:一个事务可以读取另一个尚未提交事务的修改数据。
- nonrepeatable read:在同一个事务中,同一个查询在T1时间读取某一行,在T2时间重新读取这一行时候,这一行的数据已经发生修改,可能被更新了(update),也可能被删除了(delete)。
- phantom read:在同一事务中,同一查询多次进行时候,由于其他插入操作(insert)的事务提交,导致每次返回不同的结果集。
The Scheduler#
- Establishes the order in which the operations are executed within concurrent transactions
- Interleaves the execution of database operations to ensure serializability and isolation of transactions
- Based on concurrent control algorithms to determine the appropriate order
- Creates serialization schedule
- Serializable schedule: Interleaved execution of transactions yields the same results as the serial execution of the transactions
Concurrency Control with Locking Methods#
- Locking methods - Facilitate isolation of data items used in concurrently executing transactions
- Lock: Guarantees exclusive use of a data item to a current transaction
- Pessimistic locking: Use of locks based on the assumption that conflict between transactions is likely
- Lock manager: Responsible for assigning and policing the locks used by the transactions
Lock Granularity#
- Indicates the level of lock use
- Levels of locking
- Database-level lock
- Table-level lock
- Page-level lock
- Page or diskpage: Directly addressable section of a disk
- Row-level lock
- Field-level lock
Lock Types#
- Binary lock
- Has two states, locked (1) and unlocked (0)
- If an object is locked by a transaction, no other transaction can use that object
- If an object is unlocked, any transaction can lock the object for its use
- Exclusive lock -write
- Exists when access is reserved for the transaction that locked the object
- Shared lock -> read
- Exists when concurrent transactions are granted read access on the basis of a common lock
Two-Phase Locking (2PL) -> avoid deadlock#
- Defines how transactions acquire and relinquish locks
- Guarantees serializability but does not prevent deadlocks
- Phases
- Growing phase - Transaction acquires all required locks without unlocking any data
- Running operations(middle)
- Shrinking phase - Transaction releases all locks and cannot obtain any new lock
- Governing rules
- Two transactions cannot have conflicting locks
- No unlock operation can precede a lock operation in the same transaction
- No data are affected until all locks are obtained
Deadlocks#
- Occurs when two transactions wait indefinitely for each other to unlock data
- Known as deadly embrace
- Control techniques
- Deadlock prevention possible deadlock detect and kill one of the potential confilct -> timestamp based
- Deadlock detection
- Deadlock avoidance -> The transaction must obtain all the lock to execute, in the transaction process, no deadlock will occur
- Choice of deadlock control method depends on database environment
Time Stamping -> prevent deadlock#
- Assigns global, unique time stamp to each transaction
- Produces explicit order in which transactions are submitted to DBMS
- Properties
- Uniqueness: Ensures no equal time stamp values exist
- Monotonicity: Ensures time stamp values always increases
- Disadvantages
- Each value stored in the database requires two additional stamp fields
- Increases memory needs
- Increases the database’s processing overhead
- Demands a lot of system resources
Wait/Die and Wound/Wait Concurrency Control Schema#
Transaction Request Lock | Transaction Owning Lock | WAIT/DIE Scheme(People who currently has resources has higher privelege) | Wound/Wait Scheme -> Old always first if Old has higher privelege |
---|---|---|---|
T1(11548789) | T2(19562545) | T1 wait until T2 is complete | T1 preempts(rollback) T2 and T2 reschedule |
T2(19562545) | T1(11548789) | T2 dies and reschedule with the same timestamp | T2 waits T1 is completed and T1 releases lock |
Phases of Optimistic Approach#
- Read
- Transaction:
- Reads the database
- Executes the needed computations
- Makes the updates to a private copy of the database values
- Transaction:
- Validation
- Transaction is validated to ensure that the changes made will not affect the integrity and consistency of the database
- Write
- Changes are permanently applied to the database
Distributed DBs#
Vertical scalling and horizontal scalling#
- Vertical scalling: very bad, it’s take one server and add more and more Gigbyte it run, but it is restricted in one machine
- Horizontal scalling: add more and more servers
Distributed Processing and Distributed Databases#
- Distributed processing: Database’s logical processing is shared among two or more physically independent sites via network
- Distributed database: Stores logically related database over two or more physically independent sites via computer network
- Database fragments: Database composed of many parts in distributed database system. EXAMPle: take a table and split it to many pieces
Functions of Distributed DBMS#
- Receives the request of an application
- Validates analyzes, and decomposes the request
- Maps the request
- Decomposes request into several I/O operations
- Searches and validates data
- Ensures consistency, security, and integrity
- Validates data for specific conditions
- Presents data in required format
DDBMS Components#
- Computer workstations or remote devices
- Network hardware and software components
- Communications media
- Transaction processor (TP): Software component of a system that requests data -> access the database and get the query fullfill(make a request)
- Known as transaction manager (TM) or application processor (AP)
- Data processor (DP) or data manager (DM) -> Only serve data dont care where the request come from
- Software component on a system that stores and retrieves data from its location
Single-Site Processing, Single-Site Data(MPSD) pretty simple#
Multiple-Site Processing, Single-Site Data (MPSD)#
- Multiple processes run on different computers sharing a single data repository
- Require network file server running conventional applications
- Accessed through LAN
- Client/server architecture
Reduces network traffic
Processing is distributed
Supports data at multiple sites
Multiple-Site Processing, Single-Site Data (MPSD)#
- Fully distributed database management system
- Support multiple data processors and transaction processors at multiple sites
- Classification of DDBMS depending on the level of support for various types of databases
- Homogeneous: Integrate multiple instances of same DBMS over a network -> all databases are in the same model(all oracle/mysql)
- Heterogeneous: Integrate different types of DBMSs -> can have different vendors(oracle, mysql,sqlite) but must follow the same relational model
- Fully heterogeneous: Support different DBMSs, each supporting different data model (some maybe noSQL data/graph/kv pairs)
Restrictions of DDBMS#
- Remote access is provided on a read-only basis
- Restrictions on the number of remote tables that may be accessed in a single transaction
- Restrictions on the number of distinct databases that may be accessed
- Restrictions on the database model that may be accessed
Distributed Requests and Distributed Transactions#
- Remote request
- Single SQL statement accesses data processed by a single remote database processor
- Remote transaction
- Accesses data at single remote site composed of several requests
- Distributed transaction
- Requests data from several different remote sites on network
- Distributed request
- Single SQL statement references data at several DP sites
Distributed Concurrency Control (2PC 2 phrase commit)#
- Concurrency control is important in distributed databases environment
- Due to multi-site multiple-process operations that create inconsistencies and deadlocked transactions
Two-Phase Commit Protocol(2PC)#
Guarantees if a portion of a transaction operation cannot be committed, all changes made at the other sites will be undone
- To maintain a consistent database state
Requires that each DP’s transaction log entry be written before database fragment is updated
DO-UNDO-REDO protocol: Roll transactions back and forward with the help of the system’s transaction log entries
Write-ahead protocol: Forces the log entry to be written to permanent storage before actual operation takes place
Defines operations between coordinator and subordinates
Phases of implementation
- Preparation -> request if others can commit or not(failure happends)
- The final COMMIT -> Tell all others they can commit
Details:
第一阶段:准备
协调器发送一个PREPARE TO COMMIT信息给所有的从属者。从属者接到这个信息,使用先写协议,写入事务日志并发送一个确认信息(YES/PREPARED TO COMMIT或NO/NO PREPARED)给协调器。
协调器确保所有节点要么提交,要么取消。
如果所有节点为PREPARED TO COMMIT,事务调转到第二阶段。如果有一个或多个节点为NO或NO PREPARED协调器就给所有从属者发送一个ABORT。第二阶段:最后的提交
协调器广播一个COMMIT信息给所有从属者并等待回复。
每个从属者接收到COMMIT信息并且使用DO协议更性数据库。
从属者回复一个 COMMITTED或者NOT COMMITTED信息给协调器。
如果一个或者多个从属者没有提交,协调器就发送一个ABORT消息,这样就使它们都取消所有的操作。两阶段提交的目的是为了确保每个节点都提交事务自己所操作的部分;否则,事务就取消。如果其中一个节点提交失败,就用事务日志来恢复数据库的信息,而且使用的是DO-UNDO-REDO协议来恢复(记住使用先写协议更新日志信息)。
CUBRID -> how transaction manager/distribute a good example
Transparency (Hide sth to users)#
- Distribution transparency: The users dont know the databases are distrubuted
- Allows management of physically dispersed database as if centralized(好像是centralized的)
- Levels
- Fragmentation transparency
- Location transparency
- Local mapping transparency
- Transaction transparency: The users dont know what parts of transaction happens where
- Ensures database transactions will maintain distributed database’s integrity and consistency
- Ensures transaction completed only when all database sites involved complete their part
- Distributed database systems require complex mechanisms to manage transactions
- Failure transparency: If the query go to one place, the place is down, they the user will not feel that. reroute to other copies
- Ensures the system will operate in case of network failure
- Considerations for resolving requests in a distributed data environment
- Data distribution
- Data replication
- Replica transparency: DDBMS’s ability to hide multiple copies of data from the user
- Performance transparency: Hide latency.
- Allows a DDBMS to perform as if it were a centralized database
- Network and node availability
- Network latency: delay imposed by the amount of time required for a data packet to make a round trip
- Network partitioning: delay imposed when nodes become suddenly unavailable due to a network failure
- Heterogeneity transparency: dont know what operating system/ database system …
Distributed Database Design#
- Data fragmentation: How to partition database into fragments
- Data replication: Which fragments to replicate
- Data allocation: Where to locate those fragments and replicas
Data Fragmentation#
- Breaks single object into many segments
Information is stored in distributed data catalog (DDC) - Strategies
- Horizontal fragmentation: Division of a relation into subsets (fragments) of tuples (rows)
- Vertical fragmentation: Division of a relation into attribute (column) subsets
- Mixed fragmentation: Combination of horizontal and vertical strategies
Data Replication#
- Data copies stored at multiple sites served by a computer network
- Mutual consistency rule: Replicated data fragments should be identical
- Styles of replication -> How to make changes propogate
- Push replication -> send changes to others
- Pull replication -> pull others’ changes
- Helps restore lost data
Data Replication Scenarios#
- Fully replicated database
- Stores multiple copies of each database fragment at multiple sites
- Partially replicated database
- Stores multiple copies of some database fragments at multiple sites
- Unreplicated database
- Stores each database fragment at a single site
Data Allocation Strategies#
- Centralized data allocation
- Entire database stored at one site
- Partitioned data allocation
- Database is divided into two or more disjoined fragments and stored at two or more sites
- Replicated data allocation
- Copies of one or more database fragments are stored at several sites
The CAP Theorem (That’s what u need consider when you design a database)#
- CAP stands for:
- Consistency: Are all partitions of data keeping the same
- Availability: if some nodes are down, the system can still work
- Partition tolerance: the cluster continues to function even if there is a “partition” (communication break) between two nodes (both nodes are up, but can’t communicate). -> Continue to work even some nodes fail
- Basically available, soft state, eventually consistent (BASE)
- Data changes are not immediate but propagate slowly through the system until all replicas are consistent
Database Connectivity#
Database middleware: Provides an interface between the application program and the database
Data repository/source - Data management application used to store data generated by an application program
Universal Data Access (UDA): Collection of technologies used to access any type of data source and manage the data through a common interface
ODBC, OLE-DB, ADO.NET form the backbone of MS UDA architecture
Various connectivity options#
Native SQL(provided by vendors)
- Connection interface provided by database vendors, which is unique to each vendor
- Interfaces are optimized for particular vendor’s DBMS
- Maintenance is a burden for the programmer
M’soft: 1.ODBC(Open database connectivity(Microsoft’s implementation of a superset of SQL Access Group)), DAO(Data Access Object)+JET, RDO(Remote Data Object)
- ODBC
- Open Database Connectivity (ODBC):Microsoft’s implementation of a superset of SQL Access Group Call Level Interface (CLI) standard for database access
- Widely supported database connectivity interface
- Allows Windows application to access relational data sources by using SQL via standard application programming interface (API)
- Data Access Objects(DAO)
- Object-oriented API used to access MS Access, MS FoxPro, and dBase databases from Visual Basic programs
- Provides an optimized interface that expose functionality of Jet data engine to programmers
- DAO interface can be used to access other relational style data sources
- Jet is a wrapper of DAO
- Remote Data Objects (RDO)
- Higher-level object-oriented application interface used to access remote database servers
- Dynamic-link libraries (DLLs)
- Implements ODBC, DAO, and RDO as shared code that is dynamically linked to the Windows operating environment
- ODBC
M’soft: OLE-DB(open linking embeding db)
M’soft: ADO.NET(Access Data Object)
JDBC(from Sum)
Components of ODBC Architecture#
- High-level ODBC API through which application programs access ODBC functionality -> face to program
- Driver manager that is in charge of managing all database connections -> face to table
- ODBC driver that communicates directly to DBMS
Object Linking and Embedding for Database (OLE-DB) Database becomes object#
Before: OMG(object management group)
CORBA: COMMON OBJECT REQUEST BREAKER ARCHITECHTURE -> Object exchange from different os regardless different memory
- Database middleware that adds object-oriented functionality for access to data
- Series of COM(component object model) objects provides low-level database connectivity for applications
- Types of objects based on functionality
- Consumers(request data)
- Providers(produce data)
- Does not provide support for scripting languages
- ActiveX Data Objects (ADO): Provides:
- High-level application-oriented interface to interact with OLE-DB, DAO, and RDO
- Unified interface to access data from any programming language that uses the underlying OLE-DB objects
- ADO.NET
- Data access component of Microsoft’s .NET application development framework
- Microsoft’s .NET framework
- Component-based platform for developing distributed, heterogeneous, interoperable applications
- Manipulates any type of data using any combination of network, operating system, and programming language
- Features critical for the development of distributed applications
- DataSet: Disconnected memory-resident representation of database
- XML support
- DataSet is internally stored in XML format
- Data in DataSet is made persistent as XML documents
- DataSet is internally stored in XML format
- Data in DataSet made persistent as XML documents
Database Internet Connectivity#
- Allows new innovative services that:
- Permit rapid response by bringing new services and products to market quickly
- Increase customer satisfaction through creation of web-based support services
- Allow anywhere, anytime data access using mobile smart devices via the Internet
- Yield fast and effective information dissemination through universal access
Web-to-Database Middleware#
- Web server is the main hub through which Internet services are accessed
- Server-side extension: Program that interacts directly with the web server
- Known as web-to-database middleware
- Provides its services to the web server in a way that is totally transparent to the client browser
- Middleware must be well integrated
Client-Side Extensions#
- Add functionality to Web browser
- Types
- Plug-in: External application automatically invoked by the browser when needed
- Java and JavaScript: Embedded in web page
- Downloaded with the Web page and activated by an event
- ActiveX and VBScript: Embedded in web page
- Downloaded with page and activated by event
- Oriented to Windows applications
Performance Tuning and Query Optimization#
Tuning(Database Performance-Tuning Concepts)#
- Goal of database performance is to execute queries as fast as possible
- Database performance tuning: Set of activities and procedures that reduce response time of database system
- Fine-tuning the performance of a system requires that all factors must operate at optimum level with minimal bottlenecks
Performance Tuning: Client and Server#
Client side#
- SQL performance tuning: Generates SQL query that returns correct answer in least amount of time -> suppose A and B and C-> A should fail many times. Re-arranging sql query language
- Using minimum amount of resources at server
Server side#
- DBMS performance tuning: DBMS environment configured to respond to clients’ requests as fast as possible
- Optimum use of existing resources
DBMS Architecture#
- All data in a database are stored in data files
- Data files automatically expand in predefined increments known as extends
- Data files are grouped in file groups or table spaces
- Table space or file group: Logical grouping of several data files that store data with similar characteristics
- Data cache or buffer cache: Shared, reserved memory area
- Stores most recently accessed data blocks in RAM
- SQL cache or procedure cache: Stores most recently executed SQL statements or PL/SQL procedures
- DBMS retrieves data from permanent storage and places them in RAM Input/output request: Low-level data access operation that reads or writes data to and from computer devices
- Data cache is faster than working with data files
- Majority of performance-tuning activities focus on minimizing I/O operations
- Listener The listener process listens for clients’ requests and handles the processing of the SQL requests to other DBMS processes. Once a request is received, the listener passes the request to the appropriate user process
- User The DBSM creates a user process to manager each client session. Therefore, when you log on to the DBMS, you are assigned a user process. This process handles all requests you submit to the server. There are many user processes-at least onper logged-in client.
- Scheduler The scheduler process organizes the concurrent execution of SQL requests.
- Lock Mnager This process manages all locks placed on database objects, including disk pages
- Optimizer The optimizer analyzes SQL queries and finds the most efficient way to access the data. // How can we make it better.
Database Query Optimization Modes#
- Algorithms proposed for query optimization are based on:
- Selection of the optimum order to achieve the fastest execution time // The order to visit each table
- Selection of sites to be accessed to minimize communication costs
- Evaluated on the basis of:
- Operation mode
- Timing of its optimization
- Type of information(used for optimization)
Classification of Operation Modes#
- Automatic query optimization: DBMS finds the most cost-effective access path without user intervention
- Manual query optimization: Requires that the optimization be selected and scheduled by the end user or programmer
Based on Timing of Optimization#
- Static query optimization: best optimization strategy is selected when the query is compiled by the DBMS
- Takes place at compilation time
- Dynamic query optimization: Access strategy is dynamically determined by the DBMS at run time, using the most up-to-date information about the database
- Takes place at execution time: more processing ahead
Type of Information Used to Optimize#
- Statistically based query optimization algorithm: Statistics are used by the DBMS to determine the best access strategy
- Statistical information is generated by DBMS through:
- Dynamic statistical generation mode - auto eval
- Manual statistical generation mode - via user
- Rule-based query optimization algorithm: based on a set of user-defined rules to determine the best query access strategy
Sample Database Statistics Measurements#
Query Processing#
- Parsing
- DBMS parses the SQL query and chooses the most efficient access/execution plan
- Execution
- DBMS executes the SQL query using the chosen execution plan
- Fetching
- DBMS fetches the data and sends the result set back to the client
SQL Parsing Phase#
- Query is broken down into smaller units
- Original SQL query transformed into slightly different version of original SQL code which is fully equivalent and more efficient
- Query optimizer: Analyzes SQL query and finds most efficient way to access data
- Access plans: DBMS-specific and translate client’s SQL query into a series of complex I/O operations
- If access plan already exists for query in SQL cache, DBMS reuses it
- If not, optimizer evaluates various plans and chooses one to be placed in SQL cache for use
SQL Execution Phase#
- All I/O operations indicated in the access plan are executed
- Locks are acquired
- Data are retrieved and placed in data cache
- Transaction management commands are processed
SQL Fetching Phase#
- Rows of resulting query result set are returned to client
- DBMS may use temporary table space to store temporary data
- Database server coordinates the movement of the result set rows from the server cache to the client cache
Query Processing Bottlenecks#
- Delay introduced in the processing of an I/O operation that slows the system
- Caused by the:
- CPU
- RAM
- Hard disk
- Network
- Application code
Indexes and Query Optimization#
- Indexes
- Help speed up data access
- Facilitate searching, sorting, using aggregate functions, and join operations
- Ordered set of values that contain the index key and pointers
- More efficient than a full table scan
Indexes and Query Optimization#
- Data sparsity: Number of different values a column could have; low sparsity => index might be useless
- Data structures used to implement indexes:
- Hash indexes -> Bitl generate short-link by hash function
- B-tree indexes
- Bitmap indexes
- DBMSs determine best type of index to use
index Selectivity#
- Measure of the likelihood that an index will be used in query processing. So we strive to create indexes that have high selectivity.
- Indexes are used when a subset of rows from a large table is to be selected based on a given condition
- Index cannot always be used to improve performance
- Function-based index: Based on a specific SQL function or expression(eg. EMP_SALARY + EMP_COMMISSION)
- Indexes are useful when
- an indexable column occurs in a WHERE or HAVING search expression
- an indexable column appears in a GROUP BY or ORDER BY clause
- MAX or MIN is applied to an indexable column
- there is high data sparsity on an indexable column
- Worth creating indexes on single columns that appear in WHERE, HAVING, ORDER BY, GROUP BY and join conditions.
Optimizer Choices -> statisitc-based#
- Rule-based optimizer: Uses preset rules and points to determine the best approach to execute a query
- Cost-based optimizer: Uses algorithms based on statistics about objects being accessed to determine the best approach to execute a query(collect all kinds of costs and choose on lowest cost)
Using Hints to Affect Optimizer Choices#
- Optimizer might not choose the best execution plan
- Makes decisions based on existing statistics, which might be old
- Might choose less-efficient decisions
- Optimizer hints: Special instructions for the optimizer, embedded in the SQL command text
- Example
Conditional Expressions#
- Equality comparisons are faster than inequality comparisons
- Transform conditional expressions to use literals
- Write equality conditions first when using multiple conditional expressions
- When using multiple AND conditions, write the condition most likely to be false first
- When using multiple OR conditions, put the condition most likely to be true first
- Avoid the use of NOT logical operator
Query Formulation#
- Identify what columns and computations are required §Identify source tables
- Determine how to join tables
- Determine what selection criteria are needed
- Determine the order in which to display the output
DBMS Performance Tuning#
- Managing DBMS processes in primary memory and the structures in physical storage
- DBMS performance tuning at server end focuses on setting parameters used for:
- Data cache
- SQL cache
- Sort cache
- Optimizer mode
- In-memory database: Store large portions of the database in primary storage
- Recommendations for physical storage of databases:
- Use RAID (Redundant Array of Independent Disks) to provide a balance between performance improvement and fault tolerance
- Minimize disk contention
- Put high-usage tables in their own table spaces
- Assign separate data files in separate storage volumes for indexes, system, and high-usage tables
- Take advantage of the various table storage organizations in the database
- Index-organized table or clustered index table: Stores the end-user data and the index data in consecutive locations in permanent storage
- Partition tables based on usage
- Use denormalized tables where appropriate
- Store computed and aggregate attributes in tables
Business Intelligence#
Business Intelligence(BI)#
- Comprehensive, cohesive, integrated set of tools and processes
- Captures, collects, integrates, stores, and analyzes data
- Generates and presents information to support business decision making
- Allows a business to transform:
- Data into information
- Information into knowledge
- Knowledge into wisdom
- ETL: extraction transformation and loading
- QUERY & Reporting : include ml algorithm
- Governance: How often to upadte warehouse
- Management: How ETL happens
Business Intelligence Benefits#
- Improve decision making
- Integrating architecture
- Common user interface for data reporting and analysis
- Common data repository fosters single version of company data
- Imporeve organizational performance
Decision Support Data#
- Effectiveness of BI depends on quality of data gathered at operational level
- Operational data
- Seldom well-suited for decision support tasks
- Stored in relational database with highly normalized structures
- Optimized to support transactions representing daily operations
- Differ from operational data in:
- Time span
- Granularity
- Drill down: Decomposing a data to a lower level
- Roll up: Aggregating a data into a higher level
- Dimensionality
Decision Support Database Requirements#
- Database schema
- Must support complex, non-normalized data representations
- Data must be aggregated and summarized
- Queries must be able to extract multidimensional time slices
- Data extraction and loading
- Allow batch and scheduled data extraction
- Support different data sources and check for inconsistent data or data validation rules
- Support advanced integration, aggregation, and classification
- Database size should support
- Very large databases (VLDBs)
- Advanced storage technologies
- Multiple-processor technologies
Data Marts#
- Small, single-subject data warehouse subset
- Provide decision support to a small group of people
- Benefits over data warehouses
- Lower cost and shorter implementation time § Technologically advanced
- Inevitable people issues
Components of Star Schemas#
- Facts
- Numeric values that represent a specific business aspect
- Dimensions
- Qualifying characteristics that provide additional perspectives to a given fact
- Attributes
- Used to search, filter, and classify facts
- Slice and dice: Ability to focus on slices of the data cube for more detailed analysis.
- Attribute hierarchy
- Provides a top-down data organization
Performance-Improving Techniques for the Star Schema#
- Normalizing dimensional tables
- Snowflake schema: Dimension tables can have their own dimension tables
- Maintaining multiple fact tables to represent different aggregation levels
- Denormalizing fact tables
Data Anaytics#
- Encompasses a wide range of mathematical, statistical, and modeling techiques to extract knowledge from data
- subset of BI functionality
- Classfication of tools
- Explanatory analytics: Focuses on discovering and explaining data characteristics and relationships based on existing data
- Predictive analytics: Focuses on predicting future outcomes with a high degree of accuracy
OLAP-> Online Analytical Processing#
- Advanced data analysis environment that supports decision making, business modeling, and operations research
- Characteristics:
- Multidimensional data analysis techniques
- Advanced database support
- Easy-to-use end-user interfaces
Relational Online Analytical Processing (ROLAP)#
- Provides OLAP functionality using relational databases and familiar relational tools to store and analyze multidimensional data
- Extensions added to traditional RDBMS technology
- Multidimensional data schema support within the RDBMS
- Data access language and query performance optimized for multidimensional data
- Support for very large databases (VLDBs)
Multidimensional Online Analytical Processing (MOLAP) pointer to pointer to formulate multi-dimensional database#
- Extends OLAP functionality to multidimensional database management systems (MDBMSs)
- MDBMS: Uses proprietary techniques store data in matrix-like n-dimensional arrays
- End users visualize stored data as a 3D data cube §Grow to n dimensions, becoming hypercubes
- Held in memory in a cube cache to speed access
- Sparsity: Measures the density of the data held in the data cube
SQL Extensions for OLAP#
The ROLLUP extension(get upper domian data sorted by lower dimension)#
- Used with GROUP BY clause to generate aggregates by different dimensions
- Enables subtotal for each column listed except for the last one, which gets a grand total
- Order of column list important
The CUBE extension(only get total by one dimension separately)#
- Used with GROUP BY clause to generate aggregates by the listed columns
- Includes the last column
Data lake#
- A traditional data warehouse is an ETL-based, historical record of transactions - very RDB-like
- A ‘modern’ alternative is a ‘data lake’, which offers a more continuous form of analytics, driven by the rise of unstructed data, streaming, cloud storage, etc. In a data lake, data is NOT ETLd, rather, it is stored in its ‘raw’ (“natural”) form [even incomplete, untransformed…].
Discussion1#
- Data - origins?
- copper
- parchment
- leaves
- stone
- Data - tech?
- hologram
- …
- Data vs info vs knowledge
- info is by Searching sorting … calculation - computation of data(collected from somewhere) |exact information
- knowledge is useful information
- Modeling
- Why? leave the part I dont want (abstraction) features!
- Relational -> relating different entity
- Another example of a ‘unary’ relationship? (employee table manager)
- Another example of a ternary relationship? (manufactory)
- Instead of TinyCollege, consider USC - more entities?
- Why is full functional dependence a good thing?
- Diff between entity and referential integrity, in terms of…
- Examples of ‘blind’ (non-identifying) keys in RL?
- How can we use tables to model unary relationships?
- one the same table a column called manager column -> point to itself’s employee_id -> drawback if many people donot have manager, it would be sparse column
- use two tables -> make a table called manager_employee which only has two cols (employee1, employee2) em1 -> employeers, em2-> managers
- Closure - where else?
- Why is closure a good property?
- Set diff is non-commutative - what else is?
- What is the conceptual shift, from a ‘spreadsheet DB’ to a relational one?
Discussion 2#
SQL: ____ Oriented Programming - what? What else?
- imperative programming: use for loop to accomplish the same goal. SQL is not but java c c++ is
- Object Oriented -> take data and method of data package them together
- Functional programming language: put function as input in and get function out as output. SQL is functional oriented
- Function oriented programming language
- Set oriented language -> SQL engine figuer out how to do 21:46
- Declearative … tell what u want fine control
- Aspect oriented programming -> AOP
- aspect means some tasks a program need to do. draw, email, send to printer
- get all the task classes together and make an internface
- if we implement the interface called aspect
- SQL is also domain specific language DSL
Pattern matching - https://www.rexegg.com/regex-quickstart.html _-> single word % -> more than one word
More datatypes?
- Color type
- currency type
Breaking closure -> It cannot chain further stop chaining operation
RL(real life) situations (not DB) where transactions are done sequentially?
- dating cannot date with many peoples
Robert rules room how to make sure everybody get a chance to speak
DB cases in RL where interleaving in inevitable?
- travel book ticket
- buy books from amazon
- Linkedin Job
Database state, during lost updates vs inconsistent analysis?
- lost updates -> t2 override t1’s modification -> data destroy/ table is not always fine.
- inconsistent analysis -> nonrepeatable read -> integrate of table not harm. table not change the datastructure.
Serial schedule - how many exist, for n serializable transactions?
- rearrange many transactions very important
- transaction isolation level -> 4 levels for example, allow dirty read allow nonrepeatable read …
- avoid dirty read
- avoid nonrepeatable read
- avoid phantom read
Why not make all schedules serial? -> consider performance
2PL(Static/Conservative) vs Strict 2PL(“S2PL”){hold on some writes and release all the reads, release read lock } vs Rigorous(Stricter/StrongStrict/SS2PL!)(hold on all the lock until commit complete)
Newton and his 2 cats
Replication - three reasons?
- failure reasons
- network-balancing
Replication tradeoffs with consistency? -> amazon
2PC - failure, how? rollback hang