This is a review reference and some notes/my own understanding of the database course.
REFERENCE to CSCI 585 2020 SPRING lectured by Prof. Saty at the University of Southern California

Introduction#

1. The difference between data and information#

| Data | Information |
| --- | --- |
| Raw facts (raw data: not yet processed to reveal meaning) | Produced by processing data |
| Building blocks of information | Reveals the meaning of data |
| Data management (the generation, storage, and retrieval of data) | Enables knowledge creation |
| | Should be accurate, relevant, and timely to enable good decision making |
  • Simply put: data are raw facts that objectively exist in the real world, while information is processed data. After gathering and processing, data become information. For example, a phone camera's data are pixels, but people see a picture.
  • Knowledge is useful, actionable information: something you can do with the information you have found.

2. What a database is, the various types of databases, and why they are valuable assets for decision making.#

  • Database is a shared, integrated computer structure that stores a collection of:
    • End-user data: Raw facts of interest to end user, such as student’s gpa…
    • Metadata: Data about data, through which the end-user data are integrated and managed
      • Describes data characteristics and relationships, such as what GPA means, and restrictions such as $0 \leq GPA \leq 4.0$ and $student\_age > 0$; also about how we store the information, or how many tables we have (see the SQL sketch at the end of this list)
  • Database management system(DBMS)
    • Collection of programs
    • Manages the database structure
    • Controls access to data stored in the database(who can access database | access restrictions)
  • Role of the DBMS
    • Intermediary between the user and the database
    • Enables data to be shared
    • Presents the end user with an integrated view of the data
    • Receives and translates application requests(what are you searching for) into operations required to fulfill the request
    • Hides database’s internal complexity from the application programs and users(People can reorganise data in the future)
  • Advantages of DBMS
    • Better data integration and less data inconsistency (prevents having two different records of the same thing)
      • Data inconsistency: Different versions of the same data appear in different places
    • Increase end-user productivity
    • Data sharing
    • Data security // who gets to see what
    • Data access
    • Decision making
      • Data quality: Promoting accuracy, validity, and timeliness of data.
  • Types of databases
    • Single-user database: Supports one user at a time (Like a database in your phone/Desktop)
      • Desktop database: Runs on PC
    • Multiuser database: Supports multiple users at the same time (Like BOA or Amazon database)
      • Workgroup databases: Support a small number of users or a specific department
      • Enterprise database: Supports many users across many departments
    • Centralized database: Data is located at a single site
    • Distributed database: Data is distributed across different sites (parts of the database come from different countries). Note: these are not copies, just different parts of one database at different sites. Google uses this.
    • Cloud database: Created and maintained using cloud data services that provide defined performance measures for the database. Companies do not actually own the hardware. Uber, for example, has no physical servers of its own; its databases live on a cloud (AWS).
  • Types of databases (Another way to classify database)
    • General-purpose databases: Contain a wide variety of data used in multiple disciplines. Like an Oracle database: you can use it wherever you want and put anything you want in it -> general purpose
    • Discipline-specific databases: Contain data focused on specific subject areas. For example, scientists use the protein database (PDB -> Protein Data Bank), where every protein molecule is stored. A different structure for a specific usage; such databases have their own building plans and only their users know what the contents represent.
  • Types of databases (Another way to classify databases) -> purely from a business point of view
    • Operational database: Designed to support a company’s day to day operations.
    • Analytical database: Stores historical data and business metrics used exclusively for tactical or strategic decision making
      • Data warehouse: Stores data in a format optimized for decision support.
    • Example: the operational database is constantly written to; every purchase goes into it daily. At night, the data is moved to one central place called the analytical database (used for analysis later), which is constantly read.
    • Where do data warehouse calculations happen?
      • Online analytical processing (OLAP): commands go to the warehouse to analyze data
        • Enables retrieving, processing, and modeling data from the data warehouse
      • Business intelligence: Captures and processes business data to generate information that supports decision making (stock moves from the operational database to the analytical database piece by piece) -> data mining
  • Types of databases (Another way to classify database)
    • Unstructured data: Exists in its original state (no specific format, like music/video files)
    • Structured data: Results from formatting (like student information stored in each row of a student table)
      • Structure is applied based on the type of processing to be performed
    • Semistructured data: Processed to some extent, like documents; generally used to describe something. Between structured and unstructured.
    • Extensible Markup Language (XML): Represents data elements in textual format (JSON, arguably better than XML, can likewise be used for structured and semistructured data)
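
As a minimal sketch of the metadata idea above (rules like $0 \leq GPA \leq 4.0$ stored alongside the data), here is how such constraints might be declared in SQL; the STUDENT table and column names are made up for illustration:

```sql
-- Hypothetical STUDENT table: the CHECK and NOT NULL clauses are metadata,
-- rules about the data that the DBMS stores and enforces alongside the data.
CREATE TABLE STUDENT (
    stu_id   INTEGER      NOT NULL,                          -- must always exist
    stu_name VARCHAR(40)  NOT NULL,
    stu_age  INTEGER      CHECK (stu_age > 0),               -- student_age > 0
    stu_gpa  DECIMAL(3,2) CHECK (stu_gpa BETWEEN 0 AND 4.0), -- 0 <= GPA <= 4.0
    PRIMARY KEY (stu_id)
);
```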

3. The importance of database design#

4. How modern databases evolved from file systems#

  • Early DBs: file systems
    • Manual file systems: Accomplished through a system of file folders and filing cabinets
    • Computerized file systems
      • Data processing (DP) specialist: Created computer-based systems that would track data and produce required reports
    • File system redux: Modern end-user productivity tools (includes spreadsheet programs such as Microsoft Excel)
  • File systems
    • Data -> Raw facts, such as a telephone number, a birth date, or a customer name. Data have little meaning unless they are organized in some logical manner. Each cell!
    • Field -> A character or group of characters (alphabetic or numeric) that has a specific meaning. A field is used to define and store data; it is like the definition of the meaning of the data it holds.
    • Record -> A row (contains a person's name, address, ...)
    • File -> A collection of related records
  • Problems with File System Data Processing
    • Lengthy development times
    • Difficulty of getting quick answers
    • Complex system administration
    • Lack of security and limited data sharing
    • Extensive programming
  • Structural and Data dependence
    • Data dependence
      • Data access changes when data storage characteristics change
    • Data independence
      • Data storage characteristics can be changed without affecting the program's ability to access the data
    • The practical significance of data dependence is the difference between the logical and the physical format
  • Data Redundancy
    • Unnecessarily storing same data at different places
    • Islands of information: Scattered data locations
      • Increases the probability of having different versions of the same data.
  • Data Redundancy Implications (the defects of data redundancy)
    • Poor data security
    • Data inconsistency
    • Increased likelihood of data-entry errors when complex entries are made in different files
    • Data anomaly: Develops when not all of the required changes in the redundant data are made successfully.
      • Update Anomalies (happens when data is already there)
      • Insertion Anomalies (create data for the first time)
      • Deletion Anomalies
  • Database System
    • Logically related data stored in a single logical data repository
      • Physically distributed among multiple storage facilities
      • DBMS eliminates most of file system’s problems
    • Current generation DBMS software
      • Stores data structures, relationships between structures, and access paths
      • Defines, stores, and manages all access paths and components.
    • DBMS Functions
      • Data dictionary management
      • Data storage management
        • Performance tuning: ensures efficient database performance in terms of storage and access speed
      • Data transformation and presentation
        • Transforms entered data to conform to required data structures
      • Security management
        • Enforces user security and data privacy

5. Flaws in file system data management#

6. The main components of the database system#

7. The main functions of a database management system(DBMS)#

  • Data dictionary management == metadata (e.g., the definition of one column)
    • Data dictionary: Stores definitions of the data elements and their relationships
  • Data storage management
    • Performance tuning: Ensures efficient performance of the database in terms of storage and access speed
  • Data transformation and presentation
    • Transforms entered data to conform to required data structures.
  • Security management
    • Enforces user security and data privacy.
  • Multiuser access control
    • Sophisticated algorithms ensure that multiple users can access the database concurrently without compromising its integrity.
  • Backup and recovery management
    • Enables recovery of the database after a failure
  • Data integrity management
    • Minimizes redundancy and maximizes consistency
  • Database access languages and application programming interfaces
    • Query language: Lets the user specify what must be done without having to specify how
    • Structured Query Language (SQL): De facto query language and data access standard supported by the majority of DBMS vendors
  • Database communication interfaces
    • Accept end-user requests via multiple, different network environments
  • Disadvantage of Database Systems
    • Increased costs
    • Management complexity
    • Maintaining currency
    • Vendor dependence
    • Frequent upgrade/replacement cycles

Data Modeling#

About data modeling and why data models are important#

Data Modeling and Data Models#

  • Data modeling: Iterative and progressive process of creating a specific data model for a determined problem domain
  • Data models: Simple representations of complex real-world data structures
    Useful for supporting a specific problem domain
  • Model - Abstraction of a real-world object or event

Importance of Data Models#

  • Are a communication tool
  • Give an overall view of the database
  • Organize data for various users
  • Are an abstraction for the creation of a good database

About the basic data-modeling building blocks#

  • Entity: Unique and distinct object used to collect and store data
    • Attribute: Characteristic of an entity
  • Relationship: Describes an association among entities(one row in an entity can relate to how many rows in other entities)
    • One-to-many (1:M)
    • Many-to-many (M:N or M:M)
    • One-to-one (1:1)
  • Constraint: Set of rules to ensure data integrity

What business rules are and how they influence database design#

  • Brief, precise, and unambiguous description of a policy, procedure, or principle
  • Enable defining the basic building blocks
  • Describe main and distinguishing characteristics of the data
  • Sources of Business Rules (somebody writes the manual)
    • Company managers
    • Policy makers
    • Department managers
    • Written documentation
    • Direct interviews with end users
  • Why we need business rules
    • Help standardize company’s view of data
    • Communications tool between users and designers
    • Allow designer to:
      • Understand the nature, role, scope of data, and business processes
      • Develop appropriate relationship participation rules and constraints
      • Create an accurate data model

Hierarchical modeling#

  • Manage large amounts of data for complex manufacturing projects
  • Represented by an upside-down tree which contains segments
  • Segments: Equivalent of a file system’s record type
  • Depicts a set of one-to-many (1:M) relationships
  • Advantages
    • Promotes data sharing
    • Parent/child relationship promotes conceptual simplicity and data integrity
    • Database security is provided and enforced by DBMS
    • Efficient with 1:M relationships
  • Disadvantages
    • Requires knowledge of physical data storage characteristics
    • Navigational system requires knowledge of hierarchical path
    • Changes in structure require changes in all application programs
    • Implementation limitations
    • No data definition
    • Lack of standards

Network modeling#

  • Represent complex data relationships
  • Improve database performance and impose a database standard
  • Depicts both one-to-many (1:M) and many-to-many (M:N) relationships
  • Advantages
    • Conceptual simplicity
    • Handles more relationship types
    • Data access is flexible
    • Data owner/member relationship promotes data integrity
    • Conformance to standards
    • Includes data definition language (DDL) and data manipulation language (DML)
  • Disadvantages
    • System complexity limits efficiency
    • Navigational system yields complex implementation, application development, and management
    • Structural changes require changes in all application programs

Difference between Schema and Sub Schema#

  • Schema is the conceptual organization of the entire database as viewed by the database administrator (i.e., the whole thing)
  • Subschema is the portion of the database seen by the application programs that produce the desired information from the data within the database (only a part of the schema)

Difference between DML and DDL#

  • Data manipulation language (DML): Environment in which data can be managed; used to work with the data in the database (add/delete/modify/query data)
  • (Schema) data definition language (DDL): Enables the database administrator to define the schema components (create the structure)
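
A small example of the split, with a made-up COURSE table; the first statement is DDL (it defines structure), the rest are DML (they work with the data):

```sql
-- DDL: define the schema (create the structure).
CREATE TABLE COURSE (
    course_id   INTEGER     NOT NULL,
    course_name VARCHAR(60) NOT NULL,
    PRIMARY KEY (course_id)
);

-- DML: add, modify, query, and delete data inside that structure.
INSERT INTO COURSE (course_id, course_name) VALUES (585, 'Database Systems');
UPDATE COURSE SET course_name = 'Databases' WHERE course_id = 585;
SELECT course_id, course_name FROM COURSE;
DELETE FROM COURSE WHERE course_id = 585;
```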

The Relational Model#

  • Produced an “automatic transmission” database that replaced the “standard transmission” databases that preceded it
  • Based on a relation
    • Relation or table: Matrix composed of intersecting tuples and attributes. (Note that “relation” is used in two senses here: there are relationships between different tables, and within one table a cell relates a row and a column. For example, one score relates to one student and belongs to the GPA column; it sits at an intersection.)
      • Tuple: Rows
      • Attribute: Columns
  • Describes a precise set of data manipulation constructs
  • Advantages
    • Structural independence is promoted using independent tables
    • Tabular view improves conceptual simplicity
    • Ad hoc query capability is based on SQL
    • Isolates the end user from physical-level details
    • Improves implementation and management simplicity

Relational DBMS(Database Management System)#

  • Performs basic functions provided by the hierarchical and network DBMS systems
  • Makes the relational data model easier to understand and implement
  • Hides the complexities of the relational model from the user

How the major data models evolved#

The Evolution of Data Models

Data Abstraction Levels#

Data abstraction levels mean that if you change something at one level, the levels above it do not need to change.

Physical independence means the physical model can change without the internal model having to change.

Logical independence means the internal model can change without the conceptual model having to change. For example, if the internal model is switched from the relational model to NoSQL, nobody at the conceptual level will notice.

The External Model#

  • End users’ view of the data environment
  • ER diagrams are used to represent the external views
  • External schema: Specific representation of an external view

The Conceptual Model#

  • Represents a global view of the entire database by the entire organization
  • Conceptual schema: Basis for the identification and high-level description of the main data objects
  • Has a macro-level view of data environment
  • Is software and hardware independent
  • Logical design: Task of creating a conceptual data model

The Internal Model#

  • Represents the database as seen by the DBMS; maps the conceptual model to the DBMS
  • Internal schema: Specific representation of an internal model
    • Uses the database constructs supported by the chosen database
  • Is software dependent and hardware independent
  • Logical independence: Changing internal model without affecting the conceptual model

The Physical Model#

  • Operates at lowest level of abstraction
  • Describes the way data are saved on storage media such as disks or tapes
  • Requires the definition of physical storage and data access methods
  • Relational model aimed at logical level
    Does not require physical-level details
  • Physical independence: Changes in physical model do not affect internal model

About emerging alternative data models and the need they fulfill#

How data models can be classified by their level of abstraction#

ER#

Entity Relationship Model (ERM)#

  • Basis of an entity relationship diagram (ERD)
  • ERD depicts the:
    • Conceptual database as viewed by end user
    • Database’s main components
      • Entities
      • Attributes
      • Relationships
  • Entity - Refers to the entity set and not to a single entity occurrence

Attributes#

  • Characteristics of entities
  • Required attribute: Must have a value, cannot be left empty (e.g., ID, last name)
  • Optional attribute: Does not require a value, can be left empty (e.g., secondary email)
  • Domain: Set of possible values for a given attribute (the range of valid values: GPA is 0-4, a name can only be 40 chars long, ...)
  • Identifiers: One or more attributes that uniquely identify each entity instance (must exist and cannot be null, like primary key(s))
  • Composite attribute: Attribute that can be subdivided to yield additional attributes. One column can contain more than one component, such as a color containing R, G, B or a name containing first_name and last_name
  • Composite identifier: Primary key composed of more than one attribute; like a composite key, it combines several columns, but an identifier is specifically a primary key
  • Simple attribute: Attribute that cannot be subdivided
  • Single-valued attribute: Attribute that has only a single value
  • Multivalued attributes: Attributes that can have many values; handling them requires creating either (see the SQL sketch after this list):
    • Several new attributes, one for each component of the original multivalued attribute, or
    • A new entity composed of the original multivalued attribute's components
  • Derived attribute: Attribute whose value is calculated from other attributes
    • Derived using an algorithm
  • Advantages and Disadvantages of Storing Derived Attributes
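
A minimal sketch of the second option above (a new entity built from the multivalued attribute's components), assuming a hypothetical employee with several phone numbers:

```sql
-- EMPLOYEE once held a multivalued 'phones' attribute; the values move into
-- a new entity in a 1:M relationship with EMPLOYEE.
CREATE TABLE EMPLOYEE (
    emp_id   INTEGER     NOT NULL,
    emp_name VARCHAR(40) NOT NULL,
    PRIMARY KEY (emp_id)
);

CREATE TABLE EMPLOYEE_PHONE (
    emp_id    INTEGER     NOT NULL,
    phone_num VARCHAR(15) NOT NULL,
    PRIMARY KEY (emp_id, phone_num),                  -- composite identifier
    FOREIGN KEY (emp_id) REFERENCES EMPLOYEE (emp_id)
);
```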

Relationships#

  • Association between entities that always operate in both directions
  • Participants: Entities that participate in a relationship
  • Connectivity: Describes the relationship classification
  • Cardinality: Expresses the minimum and maximum number of entity occurrences associated with one occurrence of related entity
  • Existence Dependence and Existence Independence

    | Existence dependence | Existence independence |
    | --- | --- |
    | Entity exists in the database only when it is associated with another related entity occurrence | Entity exists apart from all of its related entities; referred to as a strong entity or regular entity |

    For example, EMPLOYEE (strong/regular entity) and their children. (*** A child entity is not necessarily a weak entity!!!) Note that when a primary key is shared between two entities, the entity that owns both its own key and the shared key is a weak entity, and the relationship between them is a strong relationship.
    However, if one entity merely has a foreign key from the other entity, the relationship is weak. The referencing entity is not a weak entity, but it is still existence-dependent.
  • Relationship Strength
    • Weak (non-identifying) relationship
      • Primary key of the related entity does not contain a primary key component of the parent entity
    • Strong (identifying) relationships
      • Primary key of the related entity contains a primary key component of the parent entity
        A strong (identifying) relationship makes the child side a weak entity. But the child side of a weak (non-identifying) relationship is not a weak entity, though it can still be existence-dependent.
    • A weak entity has two conditions
      • Existence-dependent
      • Has a primary key that is partially or totally derived from the parent entity in the relationship
  • Relationship Participation
    • Optional participation
      • One entity occurrence does not require a corresponding entity occurrence in a particular relationship
    • Mandatory participation
      • One entity occurrence requires a corresponding entity occurrence in a particular relationship
  • Relationship Degree
    • Indicates the number of entities or participants associated with a relationship
    • Unary relationship: Association is maintained within a single entity -> Some row in a table has some relationship with some row in the same table
      • Recursive relationship: Relationship exists between occurrences of the same entity set
    • Binary relationship: Two entities are associated
    • Ternary relationship: Three entities are associated
  • Associative Entities
    • Also known as composite or bridge entities
    • Used to represent an M:N relationship between two or more entities
    • Is in a 1:M relationship with the parent entities
      • Composed of the primary key attributes of each parent entity
    • May also contain additional attributes that play no role in the connective process (see the SQL sketch after this list)
  • Developing an ER Diagram
    • Create a detailed narrative of the organization’s description of operations
    • Identify business rules based on the descriptions
    • Identify main entities and relationships from the business rules
    • Develop the initial ERD
    • Identify the attributes and primary keys that adequately describe entities
    • Revise and review ERD
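
A sketch of an associative (bridge) entity, assuming hypothetical STUDENT(stu_id, ...) and CLASS(class_id, ...) parent tables in an M:N relationship:

```sql
-- ENROLL bridges STUDENT and CLASS: it is in a 1:M relationship with each
-- parent, its PK is composed of both parents' PKs, and it may carry extra
-- attributes of its own (the grade).
CREATE TABLE ENROLL (
    stu_id       INTEGER NOT NULL,
    class_id     INTEGER NOT NULL,
    enroll_grade CHAR(1),
    PRIMARY KEY (stu_id, class_id),
    FOREIGN KEY (stu_id)   REFERENCES STUDENT (stu_id),
    FOREIGN KEY (class_id) REFERENCES CLASS (class_id)
);
```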

Relational Modeling#

A Logical View of Data#

  • Relational database model enables logical representation of the data and its relationships
  • Logical simplicity yields simple and effective database design methodologies
  • Facilitated by the creation of data relationships based on a logical construct called a relation

Keys#

  • Consist of one or more attributes that determine other attributes
  • Used to:
    • Ensure that each row in a table is uniquely identifiable
    • Establish relationships among tables and to ensure the integrity of the data
  • Primary key (PK): Attribute or combination of attributes that uniquely identifies any given row

Determination#

  • State in which knowing the value of one attribute makes it possible to determine the value of another
  • Is the basis for establishing the role of a key
  • Based on the relationships among the attributes

Dependencies#

  • Functional dependence: Value of one or more attributes determines the value of one or more other attributes
    • Determinant: Attribute whose value determines another
    • Dependent: Attribute whose value is determined by the other attribute
  • Full functional dependence: The entire collection of attributes in the determinant is necessary for the relationship. For example, if first_name and last_name together are the determinant, then to determine the dependent I must use all of them

Types of Keys#

  • Composite key: Key that is composed of more than one attribute
  • Key attribute: Attribute that is a part of a key
  • Entity integrity: Condition in which each row in the table has its own unique identity
    • All of the values in the primary key must be unique
    • No key attribute in the primary key can contain a null
  • Null: Absence of any data value that could represent:
    • An unknown attribute value
    • A known, but missing, attribute value
    • An inapplicable condition
  • Referential integrity: Every reference to an entity instance by another entity instance must be valid (foreign key); a reference to a nonexistent instance violates the integrity
  • Superkey: An attribute or combination of attributes that uniquely identifies each row in a table(A superkey is a group of single or multiple keys which identifies rows in a table. A Super key may have additional attributes that are not needed for unique identification.)
  • Candidate key: A minimal (irreducible) superkey; a superkey that does not contain a subset of attributes that is itself a superkey.
    It is a set of attributes that uniquely identifies tuples in a table. A candidate key is a superkey with no redundant attributes. The primary key should be selected from the candidate keys. Every table must have at least one candidate key. A table can have multiple candidate keys but only a single primary key.
    • Properties of Candidate key:
      • It must contain unique values
      • Candidate key may have multiple attributes
      • Must not contain null values
      • It should contain minimum fields to ensure uniqueness
      • Uniquely identify each record in a table
  • Secondary key
    • The secondary key is a key that has not been selected to be the primary key: a non-identifying column or set of columns that can be used to find rows in a table. It is used only for data retrieval purposes

Ways to Handle Nulls#

  • Flags: Special codes used to indicate the absence of some value
  • NOT NULL constraint - Placed on a column to ensure that every row in the table has a value for that column
  • UNIQUE constraint - Restriction placed on a column to ensure that no duplicate values exist for that column
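
To tie the key and null concepts together, a hedged sketch with made-up DEPARTMENT and EMPLOYEE tables:

```sql
CREATE TABLE DEPARTMENT (
    dept_id   INTEGER     NOT NULL,
    dept_name VARCHAR(40) NOT NULL UNIQUE, -- candidate key not chosen as PK
    PRIMARY KEY (dept_id)                  -- entity integrity: unique, no nulls
);

CREATE TABLE EMPLOYEE (
    emp_id    INTEGER     NOT NULL,
    emp_email VARCHAR(60) UNIQUE,          -- may be null, but never duplicated
    dept_id   INTEGER,                     -- null allowed: optional participation
    PRIMARY KEY (emp_id),
    FOREIGN KEY (dept_id) REFERENCES DEPARTMENT (dept_id) -- referential integrity
);
```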

Relational Algebra#

  • Theoretical way of manipulating table contents using relational operators
  • Relvar: Variable that holds a relation
  • Heading contains the names of the attributes and the body contains the relation
  • Relational operators have the property of closure
    • Closure: Use of relational algebra operators on existing relations produces new relations

Relational Set Operators#

  • Select (Restrict)
    • Unary operator that yields a horizontal subset of a table
  • Project
    • Unary operator that yields a vertical subset of a table
  • Union
    • Combines all rows from two tables, excluding duplicate rows
    • Union-compatible: Tables share the same number of columns, and their corresponding columns share compatible domains
  • Intersect
    • Yields only the rows that appear in both tables
    • Tables must be union-compatible to yield valid results
  • Difference
    • Yields all rows in one table that are not found in the other table
    • Tables must be union-compatible to yield valid results
  • Product
    • Yields all possible pairs of rows from two tables
  • Join
    • Allows information to be intelligently combined from two or more tables
    • Types of Joins
      • Natural join: Links tables by selecting only the rows with common values in their common attributes
        • Join columns: Common columns -> columns with exactly the same name; if the names are not the same, we do not call it a natural join
      • Equijoin: Links tables on the basis of an equality condition (=) that compares specified columns of each table
      • Theta join: Extension of the natural join, denoted by adding a theta subscript after the JOIN symbol; the comparison can use any operator (=, <>, <, >, <=, >=)
      • Inner join: Only returns matched records from the tables that are being joined
      • Outer join: Matched pairs are retained and unmatched values in the other table are left null
      • Left outer join: Yields all of the rows in the first table, including those that do not have a matching value in the second table
      • Right outer join: Yields all of the rows in the second table, including those that do not have matching values in the first table
  • Divide
    • Uses one 2-column table as the dividend and one single-column table as the divisor
    • Output is a single column that contains all values from the second column of the dividend that are associated with every row in the divisor
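
A few of the operators above written in SQL, assuming hypothetical tables CUSTOMER(cust_id, cust_name, agent_id) and AGENT(agent_id, agent_name):

```sql
-- Inner (equi)join: only customers with a matching agent.
SELECT c.cust_name, a.agent_name
FROM   CUSTOMER c JOIN AGENT a ON c.agent_id = a.agent_id;

-- Natural join: matches automatically on the identically named agent_id column.
SELECT cust_name, agent_name
FROM   CUSTOMER NATURAL JOIN AGENT;

-- Left outer join: all customers; agent columns are null where unmatched.
SELECT c.cust_name, a.agent_name
FROM   CUSTOMER c LEFT OUTER JOIN AGENT a ON c.agent_id = a.agent_id;

-- Theta join: the join condition may use any comparison operator.
SELECT c.cust_name, a.agent_name
FROM   CUSTOMER c JOIN AGENT a ON c.agent_id < a.agent_id;

-- Product: all possible pairs of rows (no join condition).
SELECT * FROM CUSTOMER CROSS JOIN AGENT;
```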

Data Dictionary and the System Catalog(hold metadata)#

  • Data dictionary: Description of all tables in the database created by the user and designer // Where is the table located and the characteristic of columns
  • System catalog: System data dictionary that describes all objects within the database // the same as data dictionary
  • Homonyms and synonyms must be avoided to lessen confusion
    • Homonym: The same name is used to label different attributes -> cannot use the same name to describe different things
    • Synonym: Different names are used to describe the same attribute -> likewise avoided

Extended ER (“EER”)-> Advanced data modeling#

Extended Entity Relationship Model (EERM)#

  • Result of adding more semantic constructs to the original entity relationship (ER) model
  • EER diagram (EERD): Uses the EER model

Entity Supertypes and Subtypes#

  • Entity supertype: Generic entity type related to one or more entity subtypes
    • Contains common characteristics
  • Entity subtype: Contains unique characteristics of each entity subtype
  • Criteria to determine the usage
    • There must be different, identifiable kinds of the entity in the user’s environment
    • The different kinds of instances should each have one or more attributes that are unique to that kind of instance

Specialization Hierarchy#

  • Depicts arrangement of higher-level entity supertypes and lower-level entity subtypes
  • Relationships are described in terms of “is-a” relationships -> Composition can have “Has-a” relation
  • Subtype exists within the context of a supertype
  • Every subtype has one supertype to which it is directly related # cannot have more than one
  • Supertype can have many subtypes
  • Provides the means to:
    • Support attribute inheritance
    • Define a special supertype attribute known as the subtype discriminator
    • Define disjoint/overlapping constraints and complete/partial constraints

Inheritance#

  • Enables an entity subtype to inherit attributes and relationships of the supertype
  • All entity subtypes inherit their primary key attribute from their supertype
  • At the implementation level, supertype and its subtype(s) maintain a 1:1 relationship
  • Entity subtypes inherit all relationships in which supertype entity participates
  • Lower-level subtypes inherit all attributes and relationships from their upper-level supertypes
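
A sketch of the 1:1 supertype/subtype implementation, with hypothetical EMPLOYEE (supertype) and PILOT (subtype) tables:

```sql
CREATE TABLE EMPLOYEE (
    emp_id   INTEGER     NOT NULL,
    emp_name VARCHAR(40) NOT NULL,
    emp_type CHAR(1),                  -- subtype discriminator, e.g. 'P' = pilot
    PRIMARY KEY (emp_id)
);

CREATE TABLE PILOT (
    emp_id      INTEGER     NOT NULL,  -- inherited PK: same as the supertype's
    pil_license VARCHAR(20) NOT NULL,  -- attribute unique to this subtype
    PRIMARY KEY (emp_id),
    FOREIGN KEY (emp_id) REFERENCES EMPLOYEE (emp_id) -- enforces the 1:1 link
);
```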

Subtype Discriminator#

  • Attribute in the supertype entity that determines to which entity subtype the supertype occurrence is related
  • Default comparison condition is the equality comparison; example: 'P' -> pilot

Disjoint and Overlapping Constraints#

  • Disjoint subtypes: Contain a unique subset of the supertype entity set
    • Known as nonoverlapping subtypes
    • Implementation is based on the value of the subtype discriminator attribute in the supertype
  • Overlapping subtypes: Contain nonunique subsets of the supertype entity set
    • Implementation requires the use of one discriminator attribute for each subtype

Completeness Constraint#

  • Specifies whether each supertype occurrence must also be a member of at least one subtype
  • Types
    • Partial completeness: Not every supertype occurrence is a member of a subtype
    • Total completeness: Every supertype occurrence must be a member of at least one subtype

Entity Cluster#

  • Virtual entity type used to represent multiple entities and relationships in ERD
  • Avoid the display of attributes to eliminate complications that result when the inheritance rules change

Normalization#

Normalization (do normalization one relation/entity at a time)#

  • Evaluating and correcting table structures to minimize data redundancies
  • Reduces data anomalies
  • Assigns attributes to tables based on determination
  • Normal forms
    • First normal form (1NF)
    • Second normal form (2NF)
    • Third normal form (3NF)

Need for Normalization#

  • Used while designing a new database structure
    • Analyzes the relationship among the attributes within each entity
    • Determines if the structure can be improved
  • Improves the existing data structure and creates an appropriate database design

Normalization Process#

  • Objective is to ensure that each table conforms to the concept of well-formed relations
    • Each table represents a single subject
    • No data item will be unnecessarily stored in more than one table
    • All nonprime attributes in a table are wholly dependent on the primary key
    • Each table is void of insertion, update, and deletion anomalies
  • Ensures that all tables are in at least 3NF
  • Higher forms are not likely to be encountered in business environment
  • Works one relation at a time
  • Starts by:
    • Identifying the dependencies of a relation (table)
    • Progressively breaking the relation into new set of relations

Normal Forms#

| NORMAL FORM | CHARACTERISTIC |
| --- | --- |
| First normal form (1NF) | Table format, no repeating groups, and PK identified -> each cell is atomic, cannot be subdivided any further |
| Second normal form (2NF) | 1NF and no partial dependencies |
| Third normal form (3NF) | 2NF and no transitive dependencies |
*Note: Normalization how-to, in one sentence: work on one relation (table) at a time: identify dependencies, then ‘normalize’ - progressively break it down into smaller relations (tables), based on the dependencies we identify in the original relation

Functional Dependence Concepts#

| Concept | Definition |
| --- | --- |
| Functional dependence | The attribute B is fully functionally dependent on the attribute A if each value of A determines one and only one value of B. -> A determines B |
| Functional dependence (generalized definition) | Attribute A determines attribute B if all of the rows in the table that agree in value for attribute A also agree in value for attribute B. |
| Fully functional dependence (composite key) | If attribute B is functionally dependent on a composite key A but not on any subset of that composite key, then B is fully functionally dependent on A. (A is a composite key, and no subset of A alone is enough to determine B.) |

Types of Functional Dependencies#

  • Partial dependency: Functional dependence in which the determinant is only part of the primary key (A determines B -> A is the determinant)
    • Assumption: one candidate key
    • Straightforward
    • Easy to identify
    • It means that B is fully determined by only part of the primary key -> the whole composite key is not needed
  • Transitive dependency: An attribute functionally depends on another nonkey attribute

Conversion to First Normal Form#

  • Repeating group: Group of multiple entries of same type can exist for any single key attribute occurrence
    • Existence proves the presence of data redundancies
  • Enable reducing data redundancies
  • Steps
    • Eliminate the repeating groups
    • Identify the primary key
    • Identify all dependencies
  • Dependency diagram: Depicts all dependencies found within given table structure
    • Helps to get an overview of all relationships among table’s attributes
    • Makes it less likely that an important dependency will be overlooked

Conversion to Second Normal Form#

  • Steps
    • Make new tables to eliminate partial dependencies
    • Reassign corresponding dependent attributes
  • Table is in 2NF when it:
    • Is in 1NF
    • Includes no partial dependencies

Conversion to Third Normal Form#

  • Steps
    • Make new tables to eliminate transitive dependencies
    • Determinant: Any attribute whose value determines other values within a row
    • Reassign corresponding dependent attributes
  • Table is in 3NF when it:
    • Is in 2NF
    • Contains no transitive dependencies
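
As a sketch of the whole process, take a hypothetical ASSIGNMENT(emp_id, proj_id, emp_name, job_class, chg_hour) relation with PK (emp_id, proj_id): emp_id -> emp_name is a partial dependency, and job_class -> chg_hour is a transitive one. A 3NF decomposition:

```sql
CREATE TABLE JOB (
    job_class VARCHAR(20)  NOT NULL,
    chg_hour  DECIMAL(7,2) NOT NULL,   -- was transitively dependent via job_class
    PRIMARY KEY (job_class)
);

CREATE TABLE EMPLOYEE (
    emp_id    INTEGER     NOT NULL,
    emp_name  VARCHAR(40) NOT NULL,    -- was partially dependent on emp_id alone
    job_class VARCHAR(20) REFERENCES JOB (job_class),
    PRIMARY KEY (emp_id)
);

CREATE TABLE ASSIGNMENT (
    emp_id  INTEGER NOT NULL REFERENCES EMPLOYEE (emp_id),
    proj_id INTEGER NOT NULL,
    PRIMARY KEY (emp_id, proj_id)      -- only fully dependent attributes remain
);
```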

Requirements for Good Normalized Set of Tables#

  • Evaluate PK assignments and naming conventions -> create a JOB_CODE
  • Evaluate naming conventions -> JOB_CHG_HOUR
  • Refine attribute atomicity -> EMP_NAME -> first name, last name
    • Atomic attribute: Cannot be further subdivided
    • Atomicity: Characteristic of an atomic attribute
  • Identify new attributes and new relationships -> EMP_HIREDATE (add more columns to make the table more useful) and PROJECT can have EMP_NO as an FK
  • Refine primary keys as required for data granularity -> ASSIGN_NUM
    • Granularity: Level of detail represented by the values stored in a table's row
  • Maintain historical accuracy and evaluate using derived attributes -> store JOB_CHG_HOUR in ASSIGNMENT, and ASSIGN_CHARGE can be calculated from them (wastes space but gives better performance)

Denormalization#

  • Design goals
    • Creation of normalized relations
    • Processing requirements and speed
  • Number of database tables expands when tables are decomposed to conform to normalization requirements
  • Joining a larger number of tables:
    • Takes additional input/output (I/O) operations and processing logic
    • Reduces system speed
  • Defects in unnormalized tables
    • Data updates are less efficient because tables are larger
    • Indexing is more cumbersome
    • No simple strategies for creating virtual tables known as views

Transaction Management#

Transaction#

  • Logical unit of work that must be entirely completed or aborted
  • Consists of:
    • SELECT statement
    • Series of related UPDATE statements
    • Series of INSERT statements
    • Combination of SELECT, UPDATE, and INSERT statements
  • Consistent database state: All data integrity constraints are satisfied // before the transaction the database is good and clean, and after the transaction completes, it leaves the database in a consistent state again
    • Must begin with the database in a known consistent state to ensure consistency
  • Formed by two or more database requests
    • Database requests: Equivalent of a single SQL statement in an application program or transaction
  • Consists of a single SQL statement or a collection of related SQL statements

Transaction Properties#

  • Atomicity: All operations of a transaction must be completed. If not, the transaction is aborted.
  • Consistency: Permanence of database’s consistent state
  • Isolation: Data used during transaction cannot be used by second transaction until the first is completed
  • Durability: Ensures that once transactions are committed, they cannot be undone or lost
  • Serializability: Ensures that the schedule for the concurrent execution of several transactions should yield consistent results

Transaction Management with SQL#

  • SQL statements that provide transaction support
    • COMMIT
    • ROLLBACK -> rolls back the current transaction to the last COMMIT
  • Transaction sequence must continue until:
    • COMMIT statement is reached
    • ROLLBACK statement is reached
    • End of program is reached
    • Program is abnormally terminated
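
A minimal sketch with a made-up ACCOUNT table: both updates of a transfer succeed together or not at all.

```sql
-- One logical unit of work (some DBMSs start it explicitly with
-- BEGIN / START TRANSACTION).
UPDATE ACCOUNT SET balance = balance - 100 WHERE acct_id = 1;
UPDATE ACCOUNT SET balance = balance + 100 WHERE acct_id = 2;
COMMIT;      -- make both changes permanent

-- If anything went wrong before the COMMIT:
-- ROLLBACK;  -- undo everything back to the last consistent state
```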

Transaction log#

  • Keeps track of all transactions that update the database
  • DBMS uses the information stored in a log for:
    • Recovery requirement triggered by a ROLLBACK statement
    • A program’s abnormal termination
    • A system failure

Concurrency Control#

  • Coordination of the simultaneous transactions execution in a multiuser database system
  • Objective - Ensures serializability of transactions in a multiuser database environment
  • Dirty read: a transaction can read data modified by another transaction that has not yet committed.
  • Nonrepeatable read: within one transaction, the same query reads a row at time T1; when it re-reads that row at time T2, the row has been modified (updated) or deleted.
  • Phantom read: within one transaction, the same query run multiple times returns different result sets each time, because other transactions committed inserts in between.
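
Standard SQL lets a session choose how many of these anomalies to tolerate; a quick sketch of the four isolation levels:

```sql
-- READ UNCOMMITTED : dirty reads possible
-- READ COMMITTED   : no dirty reads; nonrepeatable reads still possible
-- REPEATABLE READ  : no nonrepeatable reads; phantom reads still possible
-- SERIALIZABLE     : none of the above anomalies
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
```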

The Scheduler#

  • Establishes the order in which the operations are executed within concurrent transactions
    • Interleaves the execution of database operations to ensure serializability and isolation of transactions
  • Based on concurrency control algorithms to determine the appropriate order
  • Creates serialization schedule
    • Serializable schedule: Interleaved execution of transactions yields the same results as the serial execution of the transactions

Concurrency Control with Locking Methods#

  • Locking methods - Facilitate isolation of data items used in concurrently executing transactions
  • Lock: Guarantees exclusive use of a data item to a current transaction
  • Pessimistic locking: Use of locks based on the assumption that conflict between transactions is likely
  • Lock manager: Responsible for assigning and policing the locks used by the transactions

Lock Granularity#

  • Indicates the level of lock use
  • Levels of locking
    • Database-level lock
    • Table-level lock
    • Page-level lock
      • Page or diskpage: Directly addressable section of a disk
    • Row-level lock
    • Field-level lock

Lock Types#

  • Binary lock
    • Has two states, locked (1) and unlocked (0)
    • If an object is locked by a transaction, no other transaction can use that object
    • If an object is unlocked, any transaction can lock the object for its use
  • Exclusive lock -> write
    • Exists when access is reserved for the transaction that locked the object
  • Shared lock -> read
    • Exists when concurrent transactions are granted read access on the basis of a common lock
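
How shared and exclusive locks surface in SQL is vendor-specific; a sketch using PostgreSQL/Oracle-style syntax on a made-up ACCOUNT table:

```sql
-- Shared (read) lock on selected rows (PostgreSQL syntax):
SELECT * FROM ACCOUNT WHERE acct_id = 1 FOR SHARE;

-- Exclusive (write) lock on selected rows (PostgreSQL and Oracle):
SELECT * FROM ACCOUNT WHERE acct_id = 1 FOR UPDATE;

-- Coarser granularity: lock the whole table (PostgreSQL and Oracle):
LOCK TABLE ACCOUNT IN EXCLUSIVE MODE;
```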

Two-Phase Locking (2PL) -> guarantees serializability, does not prevent deadlock#

  • Defines how transactions acquire and relinquish locks
  • Guarantees serializability but does not prevent deadlocks
  • Phases
    • Growing phase - Transaction acquires all required locks without unlocking any data
    • Operations run in between (middle)
    • Shrinking phase - Transaction releases all locks and cannot obtain any new lock
  • Governing rules
    • Two transactions cannot have conflicting locks
    • No unlock operation can precede a lock operation in the same transaction
    • No data are affected until all locks are obtained

Deadlocks#

  • Occurs when two transactions wait indefinitely for each other to unlock data
    • Known as deadly embrace
  • Control techniques
    • Deadlock prevention: a transaction that might cause a deadlock is aborted (killed) and rescheduled -> timestamp based
    • Deadlock detection
    • Deadlock avoidance -> the transaction must obtain all the locks it needs before it executes, so no deadlock can occur while the transaction runs
  • Choice of deadlock control method depends on database environment

Time Stamping -> prevent deadlock#

  • Assigns global, unique time stamp to each transaction
    • Produces explicit order in which transactions are submitted to DBMS
  • Properties
    • Uniqueness: Ensures no equal time stamp values exist
    • Monotonicity: Ensures time stamp values always increase
  • Disadvantages
    • Each value stored in the database requires two additional stamp fields
    • Increases memory needs
    • Increases the database’s processing overhead
    • Demands a lot of system resources

Wait/Die and Wound/Wait Concurrency Control Schema#

| Transaction requesting lock | Transaction owning lock | Wait/Die scheme (the transaction currently holding the resource has higher privilege) | Wound/Wait scheme (the older transaction always goes first) |
| --- | --- | --- | --- |
| T1 (11548789) | T2 (19562545) | T1 waits until T2 completes | T1 preempts (rolls back) T2, and T2 is rescheduled |
| T2 (19562545) | T1 (11548789) | T2 dies and is rescheduled with the same time stamp | T2 waits until T1 completes and releases its locks |

Phases of Optimistic Approach#

  • Read
    • Transaction:
      • Reads the database
      • Executes the needed computations
      • Makes the updates to a private copy of the database values
  • Validation
    • Transaction is validated to ensure that the changes made will not affect the integrity and consistency of the database
  • Write
    • Changes are permanently applied to the database

Distributed DBs#

Vertical scaling and horizontal scaling#

  • Vertical scaling: take one server and add more and more gigabytes to it; it works, but it is restricted to one machine, so it scales poorly
  • Horizontal scaling: add more and more servers

Distributed Processing and Distributed Databases#

  • Distributed processing: Database’s logical processing is shared among two or more physically independent sites via network
  • Distributed database: Stores logically related database over two or more physically independent sites via computer network
    • Database fragments: the many parts a distributed database is composed of. Example: take a table and split it into many pieces

Functions of Distributed DBMS#

  • Receives the request of an application
  • Validates, analyzes, and decomposes the request
  • Maps the request
  • Decomposes request into several I/O operations
  • Searches and validates data
  • Ensures consistency, security, and integrity
  • Validates data for specific conditions
  • Presents data in required format

DDBMS Components#

  • Computer workstations or remote devices
  • Network hardware and software components
  • Communications media
  • Transaction processor (TP): Software component of a system that requests data -> accesses the database and gets the query fulfilled (makes a request)
    • Known as transaction manager (TM) or application processor (AP)
  • Data processor (DP) or data manager (DM) -> only serves data; doesn't care where the request comes from
    • Software component on a system that stores and retrieves data from its location

Single-Site Processing, Single-Site Data (SPSD): pretty simple#

Multiple-Site Processing, Single-Site Data (MPSD)#

  • Multiple processes run on different computers sharing a single data repository
  • Require network file server running conventional applications
    • Accessed through LAN
  • Client/server architecture
    • Reduces network traffic
    • Processing is distributed
    • Supports data at multiple sites

Multiple-Site Processing, Multiple-Site Data (MPMD)#

  • Fully distributed database management system
    • Support multiple data processors and transaction processors at multiple sites
  • Classification of DDBMS depending on the level of support for various types of databases
    • Homogeneous: Integrate multiple instances of same DBMS over a network -> all databases are in the same model(all oracle/mysql)
    • Heterogeneous: Integrate different types of DBMSs -> can have different vendors(oracle, mysql,sqlite) but must follow the same relational model
    • Fully heterogeneous: Support different DBMSs, each supporting different data model (some maybe noSQL data/graph/kv pairs)

Restrictions of DDBMS#

  • Remote access is provided on a read-only basis
  • Restrictions on the number of remote tables that may be accessed in a single transaction
  • Restrictions on the number of distinct databases that may be accessed
  • Restrictions on the database model that may be accessed

Distributed Requests and Distributed Transactions#

  • Remote request
    • Single SQL statement accesses data processed by a single remote database processor
  • Remote transaction
    • Accesses data at single remote site composed of several requests
  • Distributed transaction
    • Requests data from several different remote sites on network
  • Distributed request
    • Single SQL statement references data at several DP sites

Distributed Concurrency Control (2PC: two-phase commit)#

  • Concurrency control is important in distributed databases environment
    • Due to multi-site multiple-process operations that create inconsistencies and deadlocked transactions

Two-Phase Commit Protocol(2PC)#

  • Guarantees if a portion of a transaction operation cannot be committed, all changes made at the other sites will be undone

    • To maintain a consistent database state
  • Requires that each DP’s transaction log entry be written before database fragment is updated

  • DO-UNDO-REDO protocol: Roll transactions back and forward with the help of the system’s transaction log entries

  • Write-ahead protocol: Forces the log entry to be written to permanent storage before actual operation takes place

  • Defines operations between coordinator and subordinates

  • Phases of implementation

      1. Preparation -> ask all the others whether they can commit or not (failures can happen)
      2. The final COMMIT -> tell all the others they can commit
  • Details:

    Phase 1: Preparation

    The coordinator sends a PREPARE TO COMMIT message to all subordinates.
    Each subordinate receives the message, writes its transaction log using the write-ahead protocol, and sends an acknowledgment (YES / PREPARED TO COMMIT, or NO / NOT PREPARED) to the coordinator.
    The coordinator makes sure all nodes either commit or abort.
    If all nodes answer PREPARED TO COMMIT, the transaction moves to phase 2. If one or more nodes answer NO or NOT PREPARED, the coordinator sends an ABORT to all subordinates.

    Phase 2: The final COMMIT

    The coordinator broadcasts a COMMIT message to all subordinates and waits for the replies.
    Each subordinate receives the COMMIT message and updates the database using the DO protocol.
    Each subordinate replies with a COMMITTED or NOT COMMITTED message to the coordinator.
    If one or more subordinates did not commit, the coordinator sends an ABORT message, forcing all of them to undo all of their operations.

    The objective of two-phase commit is to ensure that every node commits its part of the transaction; otherwise, the transaction is aborted. If one node fails to commit, the information in the transaction log is used to recover the database, via the DO-UNDO-REDO protocol (remember: the log is updated with the write-ahead protocol).

  • CUBRID -> a good example of how a distributed transaction manager works

Transparency (hiding things from users)#

  • Distribution transparency: users don't know the database is distributed
    • Allows management of a physically dispersed database as if it were centralized
    • Levels
      • Fragmentation transparency
      • Location transparency
      • Local mapping transparency
  • Transaction transparency: users don't know which parts of the transaction happen where
    • Ensures database transactions will maintain distributed database’s integrity and consistency
    • Ensures transaction completed only when all database sites involved complete their part
    • Distributed database systems require complex mechanisms to manage transactions
  • Failure transparency: if a query goes to a site that is down, the user will not notice; the request is rerouted to other copies
    • Ensures the system will operate in case of network failure
    • Considerations for resolving requests in a distributed data environment
      • Data distribution
      • Data replication
        • Replica transparency: DDBMS’s ability to hide multiple copies of data from the user
  • Performance transparency: Hide latency.
    • Allows a DDBMS to perform as if it were a centralized database
  • Network and node availability
    • Network latency: delay imposed by the amount of time required for a data packet to make a round trip
    • Network partitioning: delay imposed when nodes become suddenly unavailable due to a network failure
  • Heterogeneity transparency: users don't know what operating system / database system is underneath

Distributed Database Design#

  • Data fragmentation: How to partition database into fragments
  • Data replication: Which fragments to replicate
  • Data allocation: Where to locate those fragments and replicas

Data Fragmentation#

  • Breaks a single object into many segments
    • Information is stored in the distributed data catalog (DDC)
  • Strategies
    • Horizontal fragmentation: Division of a relation into subsets (fragments) of tuples (rows)
    • Vertical fragmentation: Division of a relation into attribute (column) subsets
    • Mixed fragmentation: Combination of horizontal and vertical strategies
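
A sketch of the first two strategies on a hypothetical CUSTOMER table (CREATE TABLE ... AS SELECT is widely supported, though the exact syntax is vendor-dependent):

```sql
-- Horizontal fragmentation: each fragment is a subset of the rows.
CREATE TABLE CUSTOMER_WEST AS
    SELECT * FROM CUSTOMER WHERE cust_region = 'WEST';  -- stored at site 1
CREATE TABLE CUSTOMER_EAST AS
    SELECT * FROM CUSTOMER WHERE cust_region = 'EAST';  -- stored at site 2

-- Vertical fragmentation: each fragment is a subset of the columns;
-- both keep the PK so the original relation can be rebuilt with a join.
CREATE TABLE CUSTOMER_CONTACT AS
    SELECT cust_id, cust_phone, cust_email FROM CUSTOMER;  -- site 1
CREATE TABLE CUSTOMER_BILLING AS
    SELECT cust_id, cust_balance FROM CUSTOMER;            -- site 2
```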

Data Replication#

  • Data copies stored at multiple sites served by a computer network
  • Mutual consistency rule: Replicated data fragments should be identical
  • Styles of replication -> how to make changes propagate
    • Push replication -> send changes to others
    • Pull replication -> pull others’ changes
  • Helps restore lost data

Data Replication Scenarios#

  • Fully replicated database
    • Stores multiple copies of each database fragment at multiple sites
  • Partially replicated database
    • Stores multiple copies of some database fragments at multiple sites
  • Unreplicated database
    • Stores each database fragment at a single site

Data Allocation Strategies#

  • Centralized data allocation
    • Entire database stored at one site
  • Partitioned data allocation
    • Database is divided into two or more disjoint fragments and stored at two or more sites
  • Replicated data allocation
    • Copies of one or more database fragments are stored at several sites

The CAP Theorem (that's what you need to consider when you design a database)#

  • CAP stands for:
    • Consistency: do all partitions/replicas of the data stay the same
    • Availability: if some nodes are down, the system can still work
    • Partition tolerance: the cluster continues to function even if there is a “partition” (communication break) between two nodes (both nodes are up, but can’t communicate). -> Continue to work even some nodes fail
  • Basically available, soft state, eventually consistent (BASE)
    • Data changes are not immediate but propagate slowly through the system until all replicas are consistent

Database Connectivity#

Database middleware: Provides an interface between the application program and the database
Data repository/source - Data management application used to store data generated by an application program
Universal Data Access (UDA): Collection of technologies used to access any type of data source and manage the data through a common interface
ODBC, OLE-DB, ADO.NET form the backbone of MS UDA architecture

Various connectivity options#

  • Native SQL(provided by vendors)

    • Connection interface provided by database vendors, which is unique to each vendor
    • Interfaces are optimized for particular vendor’s DBMS
      • Maintenance is a burden for the programmer
  • M'soft: 1. ODBC (Open Database Connectivity, Microsoft's implementation of a superset of the SQL Access Group standard), DAO (Data Access Objects) + Jet, RDO (Remote Data Objects)

    • ODBC
      • Open Database Connectivity (ODBC):Microsoft’s implementation of a superset of SQL Access Group Call Level Interface (CLI) standard for database access
      • Widely supported database connectivity interface
      • Allows Windows application to access relational data sources by using SQL via standard application programming interface (API)
    • Data Access Objects(DAO)
      • Object-oriented API used to access MS Access, MS FoxPro, and dBase databases from Visual Basic programs
      • Provides an optimized interface that exposes the functionality of the Jet data engine to programmers
      • The DAO interface can also be used to access other relational-style data sources
      • DAO acts as a wrapper over the Jet engine
    • Remote Data Objects (RDO)
      • Higher-level object-oriented application interface used to access remote database servers
    • Dynamic-link libraries (DLLs)
      • Implements ODBC, DAO, and RDO as shared code that is dynamically linked to the Windows operating environment
  • M'soft: OLE-DB (Object Linking and Embedding for Databases)

  • M'soft: ADO.NET (ActiveX Data Objects for .NET)

  • JDBC (from Sun)

Components of ODBC Architecture#

  • High-level ODBC API through which application programs access ODBC functionality -> faces the application program
  • Driver manager that is in charge of managing all database connections -> faces the drivers
  • ODBC driver that communicates directly with the DBMS

Object Linking and Embedding for Database (OLE-DB) Database becomes object#

Before: OMG (Object Management Group)
CORBA: Common Object Request Broker Architecture -> object exchange between different operating systems regardless of memory layout differences

  • Database middleware that adds object-oriented functionality for access to data
  • Series of COM(component object model) objects provides low-level database connectivity for applications
  • Types of objects based on functionality
    • Consumers(request data)
    • Providers(produce data)
  • Does not provide support for scripting languages
  • ActiveX Data Objects (ADO): Provides:
    • A high-level application-oriented interface to interact with OLE-DB, DAO, and RDO
    • A unified interface to access data from any programming language that uses the underlying OLE-DB objects
  • ADO.NET
    • Data access component of Microsoft’s .NET application development framework
  • Microsoft’s .NET framework
    • Component-based platform for developing distributed, heterogeneous, interoperable applications
    • Manipulates any type of data using any combination of network, operating system, and programming language
  • Features critical for the development of distributed applications
    • DataSet: Disconnected memory-resident representation of database
    • XML support
      • DataSet is internally stored in XML format
      • Data in DataSet is made persistent as XML documents

Database Internet Connectivity#

  • Allows new innovative services that:
    • Permit rapid response by bringing new services and products to market quickly
    • Increase customer satisfaction through creation of web-based support services
    • Allow anywhere, anytime data access using mobile smart devices via the Internet
    • Yield fast and effective information dissemination through universal access

Web-to-Database Middleware#

  • Web server is the main hub through which Internet services are accessed
  • Server-side extension: Program that interacts directly with the web server
    • Known as web-to-database middleware
    • Provides its services to the web server in a way that is totally transparent to the client browser
  • Middleware must be well integrated

Client-Side Extensions#

  • Add functionality to Web browser
  • Types
    • Plug-in: External application automatically invoked by the browser when needed
    • Java and JavaScript: Embedded in web page
      • Downloaded with the Web page and activated by an event
    • ActiveX and VBScript: Embedded in web page
      • Downloaded with the page and activated by an event
      • Oriented to Windows applications

Performance Tuning and Query Optimization#

Tuning(Database Performance-Tuning Concepts)#

  • Goal of database performance is to execute queries as fast as possible
  • Database performance tuning: Set of activities and procedures that reduce response time of database system
  • Fine-tuning the performance of a system requires that all factors must operate at optimum level with minimal bottlenecks

Performance Tuning: Client and Server#

Client side#

  • SQL performance tuning: Generating the SQL query that returns the correct answer in the least amount of time. For example, with a condition A AND B AND C, place first the condition most likely to fail, so the whole expression can be rejected early; re-arranging the SQL query this way reduces work
    • Uses the minimum amount of resources at the server

Server side#

  • DBMS performance tuning: DBMS environment configured to respond to clients’ requests as fast as possible
    • Optimum use of existing resources

DBMS Architecture#

  • All data in a database are stored in data files
    • Data files automatically expand in predefined increments known as extents
  • Data files are grouped in file groups or table spaces
    • Table space or file group: Logical grouping of several data files that store data with similar characteristics
  • Data cache or buffer cache: Shared, reserved memory area
    • Stores most recently accessed data blocks in RAM
  • SQL cache or procedure cache: Stores most recently executed SQL statements or PL/SQL procedures
  • The DBMS retrieves data from permanent storage and places them in RAM
  • Input/output request: Low-level data access operation that reads or writes data to and from computer devices
  • Working with the data cache is much faster than working with data files
  • Majority of performance-tuning activities focus on minimizing I/O operations
  • Listener: The listener process listens for clients’ requests and hands the SQL requests to other DBMS processes. Once a request is received, the listener passes it to the appropriate user process
  • User: The DBMS creates a user process to manage each client session. When you log on to the DBMS, you are assigned a user process, which handles all requests you submit to the server. There are many user processes, at least one per logged-in client
  • Scheduler: The scheduler process organizes the concurrent execution of SQL requests
  • Lock Manager: This process manages all locks placed on database objects, including disk pages
  • Optimizer: The optimizer analyzes SQL queries and finds the most efficient way to access the data // How can we make it better?

Database Query Optimization Modes#

  • Algorithms proposed for query optimization are based on:
    • Selection of the optimum order to achieve the fastest execution time // The order to visit each table
    • Selection of sites to be accessed to minimize communication costs
  • Evaluated on the basis of:
    • Operation mode
    • Timing of its optimization
    • Type of information(used for optimization)

Classification of Operation Modes#

  • Automatic query optimization: DBMS finds the most cost-effective access path without user intervention
  • Manual query optimization: Requires that the optimization be selected and scheduled by the end user or programmer

Based on Timing of Optimization#

  • Static query optimization: best optimization strategy is selected when the query is compiled by the DBMS
    • Takes place at compilation time
  • Dynamic query optimization: Access strategy is dynamically determined by the DBMS at run time, using the most up-to-date information about the database
    • Takes place at execution time, which adds processing overhead at run time

Type of Information Used to Optimize#

  • Statistically based query optimization algorithm: Statistics are used by the DBMS to determine the best access strategy
  • Statistical information is generated by DBMS through:
    • Dynamic statistical generation mode: the DBMS evaluates and updates the statistics automatically
    • Manual statistical generation mode: the statistics are refreshed by the user/DBA (see the sketch after this list)
  • Rule-based query optimization algorithm: based on a set of user-defined rules to determine the best query access strategy
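
A hedged sketch of manual statistics generation (MySQL syntax; other DBMSs differ, e.g. Oracle's DBMS_STATS package or SQL Server's UPDATE STATISTICS; the EMPLOYEE table is an assumed example):

    -- Recompute the table's statistics so the optimizer has fresh input
    ANALYZE TABLE EMPLOYEE;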

Sample Database Statistics Measurements#

Advantages and Disadvantages of Storing Derived Attributes#

Query Processing#

  • Parsing
    • DBMS parses the SQL query and chooses the most efficient access/execution plan
  • Execution
    • DBMS executes the SQL query using the chosen execution plan
  • Fetching
    • DBMS fetches the data and sends the result set back to the client

SQL Parsing Phase#

  • The query is broken down into smaller units
  • The original SQL query is transformed into a slightly different but fully equivalent and more efficient version
  • Query optimizer: Analyzes the SQL query and finds the most efficient way to access the data
  • Access plans: DBMS-specific; they translate the client’s SQL query into a series of complex I/O operations
  • If an access plan already exists for the query in the SQL cache, the DBMS reuses it
  • If not, the optimizer evaluates various plans and chooses one to be placed in the SQL cache for reuse
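
Most DBMSs let you inspect the plan the optimizer chose. A minimal sketch (MySQL's EXPLAIN; Oracle uses EXPLAIN PLAN FOR, SQL Server uses SET SHOWPLAN; the CUSTOMER table and columns are assumptions):

    -- Show the access plan for a query, e.g. whether an index is used
    EXPLAIN SELECT CUS_NAME
    FROM CUSTOMER
    WHERE CUS_STATE = 'CA';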

SQL Execution Phase#

  • All I/O operations indicated in the access plan are executed
    • Locks are acquired
    • Data are retrieved and placed in data cache
    • Transaction management commands are processed

SQL Fetching Phase#

  • Rows of resulting query result set are returned to client
    • DBMS may use temporary table space to store temporary data
    • Database server coordinates the movement of the result set rows from the server cache to the client cache

Query Processing Bottlenecks#

  • Delay introduced in the processing of an I/O operation that slows the system
  • Caused by the:
    • CPU
    • RAM
    • Hard disk
    • Network
    • Application code

Indexes and Query Optimization#

  • Indexes
    • Help speed up data access
    • Facilitate searching, sorting, using aggregate functions, and join operations
    • Ordered set of values that contain the index key and pointers
    • More efficient than a full table scan

Indexes and Query Optimization#

  • Data sparsity: The number of different values a column could have; with low sparsity (few distinct values) an index may be useless
  • Data structures used to implement indexes:
    • Hash indexes (analogy: Bitly generating a short link via a hash function)
    • B-tree indexes
    • Bitmap indexes
  • DBMSs determine best type of index to use

Index Selectivity#

  • Measure of the likelihood that an index will be used in query processing. So we strive to create indexes that have high selectivity.
  • Indexes are used when a subset of rows from a large table is to be selected based on a given condition
  • Index cannot always be used to improve performance
  • Function-based index: Based on a specific SQL function or expression (e.g., EMP_SALARY + EMP_COMMISSION)
  • Indexes are useful when:
    • an indexable column occurs in a WHERE or HAVING search expression
    • an indexable column appears in a GROUP BY or ORDER BY clause
    • MAX or MIN is applied to an indexable column
    • there is high data sparsity on an indexable column
  • It is worth creating indexes on single columns that appear in WHERE, HAVING, ORDER BY, GROUP BY, and join conditions, as in the sketch below.
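
A minimal sketch of creating such indexes (standard SQL plus Oracle's function-based form; the EMPLOYEE table and columns are assumptions):

    -- Plain index on a column used in WHERE / join conditions
    CREATE INDEX EMP_LNAME_IDX ON EMPLOYEE (EMP_LNAME);

    -- Function-based index on an expression (Oracle syntax)
    CREATE INDEX EMP_PAY_IDX ON EMPLOYEE (EMP_SALARY + EMP_COMMISSION);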

Optimizer Choices -> statistics-based#

  • Rule-based optimizer: Uses preset rules and point values to determine the best approach to execute a query
  • Cost-based optimizer: Uses algorithms based on statistics about the objects being accessed to determine the best approach to execute a query (collects all kinds of costs and chooses the plan with the lowest total cost)

Using Hints to Affect Optimizer Choices#

  • The optimizer might not choose the best execution plan
    • It makes decisions based on existing statistics, which might be stale
    • It might therefore make less-efficient decisions
  • Optimizer hints: Special instructions for the optimizer, embedded in the SQL command text
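A hedged example of embedding such a hint (Oracle syntax; the EMPLOYEE table and EMP_LNAME_IDX index are assumptions):

    -- Ask the optimizer to use a specific index via an embedded hint
    SELECT /*+ INDEX(EMPLOYEE EMP_LNAME_IDX) */ EMP_LNAME, EMP_SALARY
    FROM EMPLOYEE
    WHERE EMP_LNAME LIKE 'S%';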

Conditional Expressions#

  • Equality comparisons are faster than inequality comparisons
  • Transform conditional expressions to use literals
  • Write equality conditions first when using multiple conditional expressions
  • When using multiple AND conditions, write the condition most likely to be false first
  • When using multiple OR conditions, put the condition most likely to be true first
  • Avoid the use of the NOT logical operator; prefer the equivalent positive condition (see the sketch below)
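
A small sketch of these rewrites (the PRODUCT table and columns are illustrative assumptions):

    -- Use a literal and an equality instead of an expression:
    -- rather than  WHERE P_PRICE - 10 = 7  write:
    SELECT P_CODE FROM PRODUCT WHERE P_PRICE = 17;

    -- Avoid NOT: rather than  WHERE NOT (P_QOH > 0)  write:
    SELECT P_CODE FROM PRODUCT WHERE P_QOH <= 0;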

Query Formulation#

  • Identify what columns and computations are required
  • Identify the source tables
  • Determine how to join tables
  • Determine what selection criteria are needed
  • Determine the order in which to display the output

DBMS Performance Tuning#

  • Managing DBMS processes in primary memory and the structures in physical storage
  • DBMS performance tuning at server end focuses on setting parameters used for:
    • Data cache
    • SQL cache
    • Sort cache
    • Optimizer mode
  • In-memory database: Store large portions of the database in primary storage
  • Recommendations for physical storage of databases:
    • Use RAID (Redundant Array of Independent Disks) to provide a balance between performance improvement and fault tolerance
    • Minimize disk contention
    • Put high-usage tables in their own table spaces
    • Assign separate data files in separate storage volumes for indexes, system, and high-usage tables
  • Take advantage of the various table storage organizations in the database
  • Index-organized table or clustered index table: Stores the end-user data and the index data in consecutive locations in permanent storage
  • Partition tables based on usage
  • Use denormalized tables where appropriate
  • Store computed and aggregate attributes in tables

Business Intelligence#

Business Intelligence(BI)#

  • Comprehensive, cohesive, integrated set of tools and processes
    • Captures, collects, integrates, stores, and analyzes data
  • Generates and presents information to support business decision making
  • Allows a business to transform:
    • Data into information
    • Information into knowledge
    • Knowledge into wisdom
  • ETL: Extraction, Transformation, and Loading
  • Query & Reporting: includes ML algorithms
  • Governance: how often to update the warehouse
  • Management: how the ETL happens

Business Intelligence Benefits#

  • Improve decision making
  • Integrating architecture
  • Common user interface for data reporting and analysis
  • Common data repository fosters single version of company data
  • Improve organizational performance

Decision Support Data#

  • Effectiveness of BI depends on quality of data gathered at operational level
  • Operational data
    • Seldom well-suited for decision support tasks
    • Stored in relational database with highly normalized structures
    • Optimized to support transactions representing daily operations
  • Decision support data differ from operational data in:
    • Time span
    • Granularity
      • Drill down: Decomposing data to a lower level of detail
      • Roll up: Aggregating data into a higher level
    • Dimensionality

Decision Support Database Requirements#

  • Database schema
    • Must support complex, non-normalized data representations
    • Data must be aggregated and summarized
    • Queries must be able to extract multidimensional time slices
  • Data extraction and loading
    • Allow batch and scheduled data extraction
    • Support different data sources and check for inconsistent data or data validation rules
    • Support advanced integration, aggregation, and classification
  • Database size should support
    • Very large databases (VLDBs)
    • Advanced storage technologies
    • Multiple-processor technologies

Data Marts#

  • Small, single-subject data warehouse subset
  • Provide decision support to a small group of people
  • Benefits over data warehouses:
    • Lower cost and shorter implementation time
    • Technologically advanced
    • Easier to handle the inevitable people issues

Components of Star Schemas#

  • Facts
    • Numeric values that represent a specific business aspect
  • Dimensions
    • Qualifying characteristics that provide additional perspectives to a given fact
  • Attributes
    • Used to search, filter, and classify facts
    • Slice and dice: Ability to focus on slices of the data cube for more detailed analysis.
  • Attribute hierarchy
    • Provides a top-down data organization (a minimal schema sketch follows)
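
A minimal star-schema sketch (illustrative table and column names, not the course's example):

    -- Dimension tables: qualifying characteristics for the facts
    CREATE TABLE DIM_TIME    (TIME_ID INTEGER PRIMARY KEY, SALE_YEAR INTEGER, SALE_MONTH INTEGER);
    CREATE TABLE DIM_PRODUCT (PROD_ID INTEGER PRIMARY KEY, PROD_NAME VARCHAR(50));

    -- Fact table: numeric measures keyed by the dimensions
    CREATE TABLE FACT_SALES (
        TIME_ID      INTEGER REFERENCES DIM_TIME(TIME_ID),
        PROD_ID      INTEGER REFERENCES DIM_PRODUCT(PROD_ID),
        UNITS_SOLD   INTEGER,
        SALES_AMOUNT DECIMAL(10,2),
        PRIMARY KEY (TIME_ID, PROD_ID)
    );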

Performance-Improving Techniques for the Star Schema#

  • Normalizing dimensional tables
    • Snowflake schema: Dimension tables can have their own dimension tables
  • Maintaining multiple fact tables to represent different aggregation levels
  • Denormalizing fact tables

Data Analytics#

  • Encompasses a wide range of mathematical, statistical, and modeling techniques to extract knowledge from data
    • A subset of BI functionality
  • Classification of tools:
    • Explanatory analytics: Focuses on discovering and explaining data characteristics and relationships based on existing data
    • Predictive analytics: Focuses on predicting future outcomes with a high degree of accuracy

OLAP-> Online Analytical Processing#

  • Advanced data analysis environment that supports decision making, business modeling, and operations research
  • Characteristics:
    • Multidimensional data analysis techniques
    • Advanced database support
    • Easy-to-use end-user interfaces

Relational Online Analytical Processing (ROLAP)#

  • Provides OLAP functionality using relational databases and familiar relational tools to store and analyze multidimensional data
  • Extensions added to traditional RDBMS technology
  • Multidimensional data schema support within the RDBMS
  • Data access language and query performance optimized for multidimensional data
  • Support for very large databases (VLDBs)

Multidimensional Online Analytical Processing (MOLAP): uses pointer-to-pointer structures to form a multidimensional database#

  • Extends OLAP functionality to multidimensional database management systems (MDBMSs)
  • MDBMS: Uses proprietary techniques to store data in matrix-like n-dimensional arrays
  • End users visualize the stored data as a 3D data cube; cubes can grow to n dimensions, becoming hypercubes
  • Cubes are held in memory in a cube cache to speed access
  • Sparsity: Measures the density of the data held in the data cube

SQL Extensions for OLAP#

The ROLLUP extension (generates subtotals at each higher level of the listed dimensions)#

  • Used with GROUP BY clause to generate aggregates by different dimensions
  • Enables subtotal for each column listed except for the last one, which gets a grand total
  • The order of the column list is important (see the sketch below)
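
A hedged ROLLUP sketch (Oracle and MySQL 8+ support this syntax; the SALES table and columns are assumptions):

    -- Subtotals for each VENDOR_ID plus a grand total row
    SELECT   VENDOR_ID, PROD_CODE, SUM(SALE_UNITS) AS TOTAL_UNITS
    FROM     SALES
    GROUP BY ROLLUP (VENDOR_ID, PROD_CODE)
    ORDER BY VENDOR_ID, PROD_CODE;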

The CUBE extension (generates subtotals for every combination of the listed dimensions)#

  • Used with GROUP BY clause to generate aggregates by the listed columns
  • Unlike ROLLUP, it also aggregates over the last column, covering every combination of dimensions (see the sketch below)
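
The corresponding CUBE sketch over the same assumed table:

    -- Subtotals for VENDOR_ID alone, PROD_CODE alone, each pair,
    -- and the grand total: every combination of the listed columns
    SELECT   VENDOR_ID, PROD_CODE, SUM(SALE_UNITS) AS TOTAL_UNITS
    FROM     SALES
    GROUP BY CUBE (VENDOR_ID, PROD_CODE)
    ORDER BY VENDOR_ID, PROD_CODE;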

Data lake#

  • A traditional data warehouse is an ETL-based historical record of transactions, very RDB-like
  • A ‘modern’ alternative is the ‘data lake’, which offers a more continuous form of analytics, driven by the rise of unstructured data, streaming, cloud storage, etc. In a data lake, data is NOT ETL’d; rather, it is stored in its ‘raw’ (“natural”) form [even incomplete, untransformed…].

Discussion1#

  • Data - origins?
    • copper
    • parchment
    • leaves
    • stone
  • Data - tech?
    • hologram
  • Data vs info vs knowledge
    • information comes from searching, sorting, calculating: computation over data (collected from somewhere) that yields exact answers
    • knowledge is useful information
  • Modeling
    • Why model? To leave out the parts we do not want (abstraction) and keep the essential features!
  • Relational -> relating different entity
  • Another example of a ‘unary’ relationship? (an employee table where an employee manages other employees)
  • Another example of a ternary relationship? (manufactory)
  • Instead of TinyCollege, consider USC - more entities?
  • Why is full functional dependence a good thing?
  • Diff between entity and referential integrity, in terms of…
  • Examples of ‘blind’ (non-identifying) keys in RL?
  • How can we use tables to model unary relationships? (a SQL sketch follows this list)
    • On the same table, add a manager column that points to another row’s employee_id; drawback: if many people do not have a manager, it becomes a sparse column
    • Use two tables: create a manager_employee table with only two columns (employee1, employee2), where employee1 is the employee and employee2 is the manager
  • Closure - where else?
  • Why is closure a good property?
  • Set diff is non-commutative - what else is?
  • What is the conceptual shift, from a ‘spreadsheet DB’ to a relational one?
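
A sketch of both options for the unary (employee/manager) relationship above; the table and column names are illustrative assumptions:

    -- Option 1: self-referencing foreign key. MGR_ID is NULL for
    -- employees with no manager (the possibly-sparse column noted above).
    CREATE TABLE EMPLOYEE (
        EMP_ID   INTEGER PRIMARY KEY,
        EMP_NAME VARCHAR(50),
        MGR_ID   INTEGER REFERENCES EMPLOYEE(EMP_ID)
    );

    -- Option 2: a separate two-column relationship table.
    CREATE TABLE MANAGER_EMPLOYEE (
        EMPLOYEE1 INTEGER PRIMARY KEY REFERENCES EMPLOYEE(EMP_ID), -- the employee
        EMPLOYEE2 INTEGER NOT NULL    REFERENCES EMPLOYEE(EMP_ID)  -- the manager
    );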

Discussion 2#

  • SQL: ____ Oriented Programming - what? What else?

    • Imperative programming: uses loops etc. to spell out how to accomplish the goal. SQL is not imperative, but Java, C, and C++ are
    • Object-oriented: takes data and the methods on that data and packages them together
    • Functional programming: functions can be passed in as input and returned as output. SQL is function-oriented in this sense
      • Function-oriented programming language
      • Set-oriented language -> the SQL engine figures out how to do it
      • Declarative: you tell it what you want, giving up fine control over how
    • Aspect-oriented programming -> AOP
      • An aspect is a cross-cutting task a program needs to do: draw, email, send to printer
      • Gather all the task classes together and make an interface
      • The classes then implement the interface, called the aspect
    • SQL is also a domain-specific language (DSL)
  • Pattern matching - https://www.rexegg.com/regex-quickstart.html ; in SQL LIKE patterns, _ matches a single character and % matches any sequence of characters (including none)

  • More datatypes?

    • Color type
    • Currency type
  • Breaking closure -> the result is no longer a relation, so it cannot be chained; further chaining of operations stops

  • RL (real-life) situations (not DB) where transactions are done sequentially?

    • Dating: one cannot date many people at once
    • Robert's Rules of Order: how a meeting makes sure everybody gets a chance to speak

  • DB cases in RL where interleaving is inevitable?

    • Booking travel tickets
    • Buying books from Amazon
    • LinkedIn job applications
  • Database state, during lost updates vs inconsistent analysis?

    • Lost update: T2 overwrites T1’s modification, so data are destroyed; the table is not always fine
    • Inconsistent analysis: a nonrepeatable read; the table’s integrity is not harmed and its structure does not change
  • Serial schedules - how many exist for n serializable transactions? (n! orderings)

    • Rearranging many transactions matters a great deal for performance
    • Transaction isolation levels -> 4 levels, defined by which anomalies they permit, for example allowing dirty reads or nonrepeatable reads ...
      • avoid dirty reads
      • avoid nonrepeatable reads
      • avoid phantom reads
  • Why not make all schedules serial? -> performance would suffer; concurrency is needed

  • 2PL (static/conservative) vs Strict 2PL ("S2PL": holds write locks until commit but releases read locks earlier) vs Rigorous 2PL (stricter/strong-strict/"SS2PL": holds all locks until the commit completes)

  • Newton and his 2 cats

  • Replication - three reasons?

    • tolerating failures
    • load balancing across the network
  • Replication tradeoffs with consistency? -> e.g., Amazon

  • 2PC - how does it fail? A participant may have to roll back, or the protocol can hang (block) waiting for the coordinator

These are personal notes from an algorithms course.

Definition#

In a sequence of operations, the worst case does not occur often for each operation; some operations may be cheap (even O(1)), some may be expensive. Therefore, a traditional worst-case per-operation analysis can give an overly pessimistic bound. When the same operation takes different times, amortized analysis can accurately calculate the runtime complexity: it gives the average performance of each operation in the worst case. Please notice that amortized analysis is not average-case analysis. In average-case analysis we compute the expected cost of each operation. To describe it with a general example: suppose an algorithm performs 100 operations, where 99 operations take $O(1)$ and one operation takes $O(n^2)$. A per-operation worst-case analysis would bound every operation by $O(n^2)$. However, the algorithm does not actually need that much time, since the single time-consuming operation can be amortized over the other 99 operations. So amortized analysis can give us a tighter upper bound on an algorithm. There are generally three methods to perform amortized analysis. (We only focus on the first two methods.)

  1. The aggregate method computes the upper bound $T(n)$ on the total cost of n operations. The amortized cost is given by $T(n)/n$. In this method each operation will get the same amortized cost, even if there are several types of operations in the sequence.

  2. The accounting method (or the banker’s method) computes the individual cost of each operation. We assign different charges to each operation; some operations may be charged more or less than they actually cost. The amount we charge an operation is called its amortized cost.

  3. The potential method(or the physicist’s method).

Unbounded Array#

The aggregate method#

We implement the idea of java.util.ArrayList in Java. We maintain an array of a fixed length limit and an internal index size. When we insert a value, if the current index reaches array.length, we double the capacity and then copy our elements to the new array. The table below records the current size of the array, its new size, and the number of copies for each insert.

| Insert | Old size | New size | Copies |
| ------ | -------- | -------- | ------ |
| 1      | 1        | -        | -      |
| 2      | 1        | 2        | 1      |
| 3      | 2        | 4        | 2      |
| 4      | 4        | -        | -      |
| 5      | 4        | 8        | 4      |
| 6      | 8        | -        | -      |
| 7      | 8        | -        | -      |
| 8      | 8        | -        | -      |
| 9      | 8        | 16       | 8      |

This table shows that 9 inserts require $1 + 2 + 4 + 8 = 15$ copy operations. Therefore, the amortized cost of a single insert is the total cost $(9 + 15 = 24)$ over the $9$ inserts, which is about $2.67$.

When we generalize this pattern, assume we start with an array of size $1$ and make $2^n + 1$ inserts. These inserts require $\sum\limits_{k=0}^{n}{2^k} = 2^{n+1} - 1$ copy operations. So the total work is $T(n) = (2^n + 1) + (2^{n+1} - 1) = 3\cdot 2^n$. Then we compute the average cost per insert as a limit as the input size tends to infinity:
$$\lim_{n\to\infty}\frac{3\cdot2^n}{1+2^n} = 3$$
So the amortized cost of insertion is constant, namely $3$ per insert, i.e., $O(1)$. This method is called the aggregate method: it seeks an upper bound on the total running time of a sequence of operations.

Reference#

[1] Algorithms in Action, by V. Savvich, First Edition, 2019.

These are personal notes from an algorithms course.

Optimization problem#

We are to make change for $0.40 using US currency, assuming an unlimited supply of coins.

As in the tree of choices described above, the height of the tree is $h$, and exploring all choices takes $O(4^h)$ (four coin denominations). But we only need the optimal choice, so we always pick the largest coin that still fits. Our choice then becomes $40 = 25 + 10 + 5$. The greedy idea is that instead of going through the whole tree, we follow a single branch, so the time complexity drops to $O(h)$.

Suboptimal solution#

These are personal notes from an algorithms course.

Definition#

A graph G is a pair (V, E), where V is a set of vertices (also called nodes) and E is a set of edges connecting the vertices. Graphs can be directed or undirected, and weighted or unweighted (weights are always assigned to the edges).

  • A self-loop is an edge that connects to the same vertex twice.
  • A multi-edge is a set of two or more edges that have the same two endpoints.
  • A simple graph is a graph with no multi-edges or self-loops.
  • In a graph G, every two vertices are connected by a unique path (no cycle) <-> G is a tree, i.e., G is connected and $V = E + 1$.
  • In an undirected simple graph $G = (V, E)$, there are at most $V(V-1)/2$ edges. In asymptotic notation, $E = O(V^2)$.
  • A graph G is dense if $E = \Omega(V^2)$.
  • A graph G is sparse if $E = O(V)$.

Graph Traversals#

Graph traversal means visiting all vertices in a systematic order. Since a graph may contain cycles, we need a boolean visited array to mark whether a node has been visited. Moreover, there are two common ways to represent a graph: the adjacency matrix and the adjacency list.

DFS#

DFS uses a stack data structure, or recursion, for backtracking.

  • Recursive:

        class Solution {
            public void backTracking(int[][] adj) {
                int n = adj.length;
                boolean[] visited = new boolean[n];  // defaults to all false
                // launch a DFS from every still-unvisited vertex,
                // so all connected components are covered
                for(int i = 0; i < n; i++) {
                    if(!visited[i]) {
                        traversal(i, adj, visited);
                    }
                }
            }
            public void traversal(int cur, int[][] adj, boolean[] visited) {
                visited[cur] = true;
                // visit cur here
                for(int i = 0; i < adj[cur].length; i++) {
                    if(!visited[i] && adj[cur][i] == 1) {
                        traversal(i, adj, visited);
                    }
                }
            }
        }
  • Stack:

        import java.util.Stack;

        class Solution {
            public void backTracking(int[][] adj) {
                int n = adj.length;
                boolean[] visited = new boolean[n];
                Stack<Integer> stack = new Stack<>();
                for(int i = 0; i < n; i++) {
                    if(!visited[i]) {
                        visited[i] = true;
                        // visit i here
                        stack.push(i);
                    }
                    while(!stack.isEmpty()) {
                        int k = stack.peek();  // peek, so k's remaining neighbors are not lost
                        boolean advanced = false;
                        for(int j = 0; j < n; j++) {
                            if(adj[k][j] == 1 && !visited[j]) {
                                visited[j] = true;
                                // visit j here
                                stack.push(j);
                                advanced = true;
                                break;  // go deeper first, mirroring the recursion
                            }
                        }
                        if(!advanced) stack.pop();  // k is fully explored
                    }
                }
            }
        }

BFS#

BFS uses a FIFO queue for book-keeping.

    import java.util.LinkedList;
    import java.util.Queue;

    class Solution {
        public void bfs(int[][] adj) {
            int n = adj.length;
            boolean[] visited = new boolean[n];
            Queue<Integer> queue = new LinkedList<>();
            for(int i = 0; i < n; i++) {
                if(!visited[i]) {
                    visited[i] = true;
                    // visit i here
                    queue.offer(i);
                }
                while(!queue.isEmpty()) {
                    int k = queue.poll();
                    for(int j = 0; j < n; j++) {
                        if(adj[k][j] == 1 && !visited[j]) {
                            visited[j] = true;
                            // visit j here
                            queue.offer(j);
                        }
                    }
                }
            }
        }
    }

Complexity Analysis#

Please notice that the time complexity depends on how the graph is stored.

If the graph is stored as an adjacency matrix, the time complexity is $O(V^2)$.

However, if the graph is stored as an adjacency list, the time complexity drops to $O(V+E)$.

Properties of a traversal:

  1. It visits all the vertices in the connected component;
  2. The edges labeled by the traversal form a spanning tree of the connected component.

Topological Sort#

In a DAG (directed acyclic graph), each vertex represents a task that must be completed, and a directed edge $(u,v)$ indicates that task v depends on task u. A valid order in which to complete the tasks is called a topological order, and finding one is called topological sort. There are two ways to get a proper topological order:

1. DFS-based:

    1. Select a vertex.
    2. Run DFS and output a vertex once it has no undiscovered outgoing edges.
    3. Run DFS several times if needed (each DFS uses linear time).
    4. We obtain the vertices in reverse topological order.

2. Indegree-based (Kahn's algorithm):

    1. Select a vertex whose indegree = 0.
    2. Output that vertex, remove its outgoing edges, and continue searching for vertices whose indegree = 0.

Planar Graph#

A connected graph is planar if it can be drawn in the plane with each edge drawn as a continuous curve such that no two edges cross. A planar graph, in addition to vertices and edges, also has disjoint faces.

Theorem1.#

(Euler’s formula) If G is a connected planar graph with V vertices, E edges, and F faces, then
$$V - E + F = 2$$

Theorem2.#

In any simple connected planar graph with at least 3 vertices,
$$E\leq3V-6$$

remaining proof left for later :D

Applications#

Example1#

Description#

Finding the longest path in a DAG in linear time.

Solution#

We use a topological order to solve this problem. We attach a new attribute to every node: the length of the longest path from a source to that node. Whenever we find a longer path to a node, we update this attribute.
Here is the code:

    import java.util.ArrayDeque;
    import java.util.Queue;

    class Solution {
        public int[] topologicalOrder(int[][] adj) {
            int n = adj.length;
            int[] indegree = new int[n];
            int[] longestPath = new int[n];
            Queue<Integer> queue = new ArrayDeque<>();
            // count incoming edges of every vertex
            for(int i = 0; i < n; i++) {
                for(int j = 0; j < n; j++) {
                    if(adj[i][j] == 1) indegree[j]++;
                }
            }
            // start from the vertices that depend on nothing
            for(int i = 0; i < n; i++) {
                if(indegree[i] == 0) {
                    queue.add(i);
                }
            }
            while(!queue.isEmpty()) {
                int j = queue.remove();
                for(int k = 0; k < n; k++) {
                    if(adj[j][k] == 1) {
                        indegree[k] -= 1;
                        // relax: a path through j may be longer
                        longestPath[k] = Math.max(longestPath[k], 1 + longestPath[j]);
                        if(indegree[k] == 0) queue.add(k);
                    }
                }
            }
            return longestPath;  // longest path (in edges) ending at each node
        }
    }

Reference#

[1] Algorithms in Action, by V. Savvich, First Edition, 2019.

These are personal notes from an algorithms course.

Definition#

  1. We measure the running time of an algorithm by counting the number of steps, and therefore define algorithmic complexity as a numerical function T(n), where n is the input size.
  2. Constant time means the time is independent of the input size.
  3. We describe the runtime of an algorithm using $\Theta$, $\Omega$, and $O$.

Upper Bound (Big-O)#

For two positive monotonic functions f and g, we say $f(n) = O(g(n))$ (or $f(n)\in O(g(n))$) if $g(n)$ eventually dominates $f(n)$. Formally, there exist a positive real number $c>0$ and a real number $n_0$ such that $f(n) \leq c \cdot g(n)$ for all $n \geq n_0$. (Simply put: as $n$ goes to infinity, $f(n)$ grows no faster than $g(n)$.) We call $g(n)$ an upper bound of $f(n)$.

Lower Bound (Big-Omega)#

For two positive monotonic functions f and g, we say $f(n) = \Omega(g(n))$ (or $f(n)\in \Omega(g(n))$) if $f(n)$ eventually dominates $g(n)$. Formally, there exist a positive real number $c>0$ and a real number $n_0$ such that $f(n) \geq c \cdot g(n)$ for all $n \geq n_0$. (Simply put: as $n$ goes to infinity, $f(n)$ grows at least as fast as $g(n)$.)

Exact Bound (Big-Theta)#

For two positive monotonic functions f and g, we say $f(n) = \Theta(g(n))$ (or $f(n)\in \Theta(g(n))$) if $f(n) = \Omega(g(n))$ and $f(n) = O(g(n))$. Formally, there exist two positive real numbers $c_1$ and $c_2$ and a real number $n_0$ such that $c_1 \cdot g(n) \leq f(n) \leq c_2 \cdot g(n)$ for all $n \geq n_0$. (Simply put: as $n$ goes to infinity, $f(n)$ grows at the same rate as $g(n)$.)

Lower Bound for Sorting#

Any sorting algorithm based on comparisons must take at least $\Omega(n\log n)$ time to sort an array of n elements in the worst case. A decision tree model is introduced to illustrate this bound.

(Figure: decision tree for comparison-sorting $n = 3$ elements, with $3! = 6$ leaves.)

Suppose the input size is $n = 3$: the decision tree has $6 = 3!$ leaves, and in general a decision tree for sorting n elements has $n!$ leaves. A binary tree of height $h$ has at most $2^h$ leaves, hence $2^h \geq n!$; taking logs on both sides gives $h \geq \log(n!)$. Since $h$ (the height of the decision tree) represents the number of comparisons the sorting algorithm makes in the worst case, $T(n) = h \geq \log(n!)$. To simplify $\log(n!)$:
$$\log(n!) = \log(n(n-1)(n-2)\cdots 1)$$
$$\log(n!) \geq \log\big(n(n-1)(n-2)\cdots(n/2)\big)$$
$$\log(n!) \geq \log\big((n/2)^{n/2}\big) = (n/2)\log(n/2) = \Omega(n\log n)$$

Some tips to compare time complexity#

Example 1#

Arrange the following functions in increasing order of growth rate, with $g(n)$ following $f(n)$ in your list if and only if $f(n) = O(g(n))$.

$\log{n^n}$, $n^2$,$n^{\log n}$,$n\log\log{n}$,$2^{\log{n}}$,$\log^2{n}$,$n^{\sqrt{2}}$

Tips:

  • $2^{\log{n}}$ = $n$
  • $\log{n^n}$ = $n\cdot\log{n}$
  • TRICK: set n = 1024

Example 2#

Suppose that f(n) and g(n) are two positive non-decreasing functions such that $f(n) = O(g(n))$. Is it true that $2^{f(n)} = O(2^{g(n)})$?

Tip: Suppose $f(n) = 2n$ and $g(n) = n$. Then $2^{f(n)} = 4^n$, which is not $O(2^n)$, so the claim is false.

Example 3#

Arrange the following functions in increasing order of growth rate, with $g(n)$ following $f(n)$ in your list if and only if $f(n) = O(g(n))$.

$4^{\log n}$,$\sqrt{\log n}$,$n^{\log\log n}$,$(\sqrt{2})^{\log n}$,$2^{2\sqrt{\log n}}$,$n^{1/\log{n}}$,$(\log{n})!$

Tips:

  • $\because$ $a^{\log_a{N}} = N$ $\And$ $(a^m)^n = a^{mn} = {a^n}^m$

    $\therefore$ $4^{\log n} = n^2$

  • $\because$ $\log_a{N} = \frac{\log_m{N}}{\log_m{a}}$

    $\therefore$ $\log_{a^m}{b^n} = \frac{n}{m}\log_a{b}$ $\And$ $\log_a{b} = \frac{1}{\log_b{a}}$

    $\therefore$ $n^{1/\log{n}} = n^{log_n{2}} = 2$

  • $\because$ $n! \approx \sqrt{2\pi{n}}(\frac{n}{e})^n$ (Stirling's approximation)

    $\therefore$ $n! = O(\sqrt{n}\cdot{n^n})$

    $\therefore$ $(\log{n})! = O(\sqrt{\log{n}}\cdot(\log n)^{\log n})$

  • Set $\log{n} = t$

    $n^{\log\log n} = n^{\log t} = n^{\frac{\log_n{t}}{\log_n{2}}} = (n^{log_n{t}})^\frac{1}{\log_n{2}} = t^\frac{1}{log_n{2}} = (\log{n})^{\log{n}}$

Example 4#

Suppose that f(n) and g(n) are two positive non-decreasing functions such that $f(n) = \Omega(g(n))$. Is it true that $2^{f(n)} = \Omega(2^{g(n)})$?

Tip: Suppose $f(n) = n$ and $g(n) = 5n$. Then $f(n) = \Omega(g(n))$ with $c = 1/5$, but $2^n$ is not $\Omega(2^{5n})$, so the claim is false.

Reference#

[1] Algorithms in Action, by V. Savvich, First Edition, 2019.

This is a note on the MySQL database for future review.

1. Basic Concept#

  1. DB: database, a container used to store data.
  2. DBMS: database management system, a system used to create or manage databases.
  3. SQL: structured query language, a way to communicate with the database.

2. DQL (Data query language)#

1. Syntax#

select col_name
from table_name;

2. Character#

  1. col_name can be a single column or multiple columns, a constant value, an expression, or a function
  2. the result is a virtual table

3. Example#

  1. Query a single column
    • select col_name from table_name;
  2. Query multiple columns
    • select col_name1, col_name2 from table_name;
  3. Query all columns
    • select * from table_name;
  4. Query a constant value
    • select cons_val;
    • constant values of char and date types must be quoted with single quotes.
  5. Query a function
    • select function_name(param_list);
  6. Query an expression
    • select 100/1234;
  7. Alias
    • as
    • space
  8. distinct -> removes duplicates
    • select distinct col_name from table_name;
  9. + // addition operator only
    • select number1 + number2; // arithmetic addition
    • select character + number; // tries to convert the character to a number; if the conversion succeeds, the calculation continues; if it fails, the character is treated as 0
    • select null + any_value; // the result is always null
  10. concat function // string concatenation
    • select concat(char1, char2, char3, …);
  11. ifnull function // checks whether a value is null; if it is, returns the fallback (here 0), otherwise returns the original value
    • select ifnull(xxx, 0) from employees;
  12. isnull function // checks whether a column value is null; returns 1 if null, otherwise 0 (see the combined sketch below)
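
A small combined sketch of items 10-12 (first_name, last_name, and commission_pct are assumed columns of the employees table mentioned above):

    select concat(first_name, ' ', last_name) as full_name,    -- item 10
           ifnull(commission_pct, 0) as commission_or_zero,    -- item 11
           isnull(commission_pct)    as commission_is_null     -- item 12 (MySQL)
    from employees;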

This is a record of my cracking process for 剑指Offer (Coding Interviews).

1. Search in a 2D Array#

1.1 Problem#

In a 2D array (each row has the same length), every row is sorted in increasing order from left to right, and every column is sorted in increasing order from top to bottom. Write a function that, given such a 2D array and an integer, determines whether the array contains that integer.

1.2 Solution#

Start at the top-right corner and compare that value with the target: if the target is larger, move down; otherwise move left.

    public boolean Find(int target, int [][] array) {
        int m = array.length - 1;
        int n = array[0].length - 1;
        // start at the top-right corner
        for(int i = 0, j = n; i <= m && j >= 0;) {
            if(array[i][j] == target) return true;
            if(array[i][j] < target) i++;   // target is larger: move down
            else j--;                       // target is smaller: move left
        }
        return false;
    }

2 Replace Spaces#

2.1 Problem#

Implement a function that replaces each space in a string with "%20". For example, the string "We Are Happy." becomes "We%20Are%20Happy.".

2.2 Solution#

One pass over str is enough; the key is the StringBuffer API.

    public String replaceSpace(StringBuffer str) {
        // str.length() is re-evaluated every iteration, so the loop
        // keeps up as replacements grow the buffer
        for(int i = 0; i < str.length(); i++)
            if(str.charAt(i) == ' ') str.replace(i, i + 1, "%20");
        return str.toString();
    }

3 Print a Linked List from Tail to Head#

3.1 Problem#

Given a linked list, return an ArrayList of its values in tail-to-head order.

3.2 Solution#

Traverse the list once, inserting each value at index 0 of the ArrayList.

    /**
    * public class ListNode {
    *     int val;
    *     ListNode next = null;
    *
    *     ListNode(int val) {
    *         this.val = val;
    *     }
    * }
    */
    import java.util.ArrayList;
    public class Solution {
        public ArrayList<Integer> printListFromTailToHead(ListNode listNode) {
            ArrayList<Integer> res = new ArrayList<>();
            while(listNode != null) {
                res.add(0, listNode.val);  // insert at the front
                listNode = listNode.next;
            }
            return res;
        }
    }

4 Rebuild a Binary Tree#

4.1 Problem#

Given the preorder and inorder traversals of a binary tree (no duplicate values), rebuild the tree and return it. For example, the preorder sequence {1,2,4,7,3,5,6,8} and inorder sequence {4,7,2,1,5,3,8,6} determine a unique tree.

4.2 Solution#

The first element of the preorder traversal is the root. Find the root's position in the inorder traversal: the segment before it is the root's left subtree and the segment after it is the right subtree. Recurse on both segments.
Example: the first preorder element is 1, so the root is 1; in the inorder array, the left subtree of 1 is {4,7,2} and the right subtree is {5,3,8,6}.

    /**
     * Definition for binary tree
     * public class TreeNode {
     *     int val;
     *     TreeNode left;
     *     TreeNode right;
     *     TreeNode(int x) { val = x; }
     * }
     */
    public class Solution {
        public TreeNode reConstructBinaryTree(int [] pre, int [] in) {
            TreeNode root = new TreeNode(pre[0]);
            for(int i = 0; i < in.length; i++) {
                if(pre[0] == in[i]) {
                    root.left = helper(pre, in, 1, i, 0, i - 1);
                    root.right = helper(pre, in, i + 1, pre.length - 1, i + 1, in.length - 1);
                    return root;
                }
            }
            return null;
        }
        private TreeNode helper(int[] pre, int[] in, int lpre, int rpre,
                                int lin, int rin) {
            if(lpre == rpre) return new TreeNode(pre[lpre]);
            for(int i = lin; i <= rin; i++) {
                if(in[i] == pre[lpre]) {
                    TreeNode root = new TreeNode(pre[lpre]);
                    root.left = helper(pre, in, lpre + 1, lpre + i - lin, lin, i - 1);
                    root.right = helper(pre, in, lpre + i - lin + 1, rpre, i + 1, rin);
                    return root;
                }
            }
            return null;
        }
    }

5 Implement a Queue with Two Stacks#

5.1 Problem#

Implement a queue with two stacks, supporting Push and Pop; the elements are of type int.

5.2 Solution#

The first stack only receives newly enqueued values. When popping from the "queue", first check whether the second stack is empty: if it is, pop every value off the first stack, push them onto the second stack, then pop one value from the second stack. If the second stack is not empty, pop from it directly.

The core idea is to use the two stacks to store the input in reverse order in the second stack.

    import java.util.Stack;

    public class Solution {
        Stack<Integer> stack1 = new Stack<Integer>();
        Stack<Integer> stack2 = new Stack<Integer>();

        public void push(int node) {
            stack1.push(node);
        }

        public int pop() {
            if(stack2.size() != 0) return stack2.pop();
            while(!stack1.empty()) {
                stack2.push(stack1.pop());  // reverse the order
            }
            return stack2.pop();
        }
    }

6 Minimum Number in a Rotated Array#

6.1 Problem#

Moving some leading elements of an array to its end is called a rotation of the array.
Given a rotation of a non-decreasing sorted array, output the minimum element.
For example, {3,4,5,1,2} is a rotation of {1,2,3,4,5}, and its minimum is 1.

NOTE: all elements are greater than 0; if the array size is 0, return 0.

6.2 Solution#

Scan the array from left to right; it is non-decreasing by default, so the first element that is smaller than its predecessor must be the minimum.

    import java.util.ArrayList;
    public class Solution {
        public int minNumberInRotateArray(int [] array) {
            if(array.length == 0) return 0;
            for(int i = 1; i < array.length; i++) {
                if(array[i] < array[i - 1]) return array[i];
            }
            return array[0];  // not rotated at all
        }
    }

7 Fibonacci#

7.1 Problem#

Everyone knows the Fibonacci sequence. Given an integer n, output the n-th term (counting from 0, with term 0 equal to 0). n <= 39.

7.2 Solution#

=.= no explanation needed

    public class Solution {
        public int Fibonacci(int n) {
            if( n == 0) return 0;
            if( n == 1) return 1;
            if( n == 2) return 1;
            int num1 = 1, num2 = 1;
            for(int i = 3; i <= n; i++) {
                int temp = num1 + num2;
                num2 = num1;
                num1 = temp;
            }
            return num1;
        }
    }

8 Jumping Stairs#

8.1 Problem#

A frog can jump up 1 step or 2 steps at a time. How many distinct ways can the frog jump up a staircase of n steps (different orders count as different results)?

8.2 Solution#

Each jump covers one or two steps; solve directly with recursion.

    public class Solution {
        public int JumpFloor(int target) {
            if(target == 1) return 1;
            if(target == 2) return 2;
            if(target == 0) return 0;
            return JumpFloor(target - 1) + JumpFloor(target - 2);
        }
    }

9 Advanced Jumping Stairs#

9.1 Problem#

A frog can jump up 1 step, 2 steps, ... , or up to n steps at a time. How many ways can the frog jump up a staircase of n steps?

9.2 Solution#

The frog's jumps can be grouped by its first landing step:

jump the remaining n-1 steps from step 1

jump the remaining n-2 steps from step 2

jump the remaining n-3 steps from step 3

jump the remaining n-4 steps from step 4

jump the remaining n-5 steps from step 5

...

jump the remaining 1 step from step n-1

So the total count is f(n) = f(1) + f(2) + … + f(n - 1).

Since f(n - 1) = f(1) + f(2) + … + f(n - 2),

we get f(n) = 2 * f(n - 1) [the recurrence].

    public class Solution {
        public int JumpFloorII(int target) {
            if(target == 0) return 0;
            if(target == 1) return 1;
            if(target == 2) return 2;
            return 2 * JumpFloorII(target - 1);
        }
    }

10. Rectangle Covering#

10.1 Problem#

We can use 2*1 small rectangles, laid horizontally or vertically, to cover a larger rectangle. How many ways are there to cover a 2*n rectangle with n 2*1 rectangles, without overlap?

10.2 Solution#

Covering 2*1 (width * length): only one way.

Covering 2*2: two ways, two horizontal pieces or two vertical pieces.

Covering 2*3: the ways to cover 2*2 plus the ways to cover 2*1.

Covering 2*n: the ways to cover 2*(n-1) plus the ways to cover 2*(n-2).

    public class Solution {
        public int RectCover(int target) {
            if(target == 0) return 0;
            if(target == 1) return 1;
            if(target == 2) return 2;
            return RectCover(target - 1) + RectCover(target - 2);
        }
    }

11 Number of 1s in Binary#

11.1 Problem#

Given an integer, output the number of 1s in its binary representation. Negative numbers use two's complement.

11.2 Solution#

Only look at the lowest set bit of n at a time.

If the last bit is 1, then n - 1 has a 0 there, so one n & (n - 1) clears it; then keep looking further up.

If the last bit is 0, skip it and look at the second-to-last bit; if it is 1, the same n & (n - 1) clears it and the counter increases by one.

Repeat until the number becomes 0: the number of n & (n - 1) operations performed equals the number of 1s in the original number.

    public class Solution {
        public int NumberOf1(int n) {
            int count = 0;
            while(n != 0) {
                n &= n - 1;  // clears the lowest set bit
                count++;
            }
            return count;
        }
    }

12 Integer Power of a Number#

12.1 Problem#

Given a double base and an int exponent, compute base raised to the power exponent.

base and exponent are guaranteed not to both be 0.

12.2 Solution#

2^4 can be decomposed as (2^2)^(4/2), and 3^4 as (3^2)^(4/2).
Take care of negative and odd exponents.

    public class Solution {
        public double Power(double base, int exponent) {
            if(exponent == 0) return 1.0;
            if(base == 0) return 0.0;
            if(exponent >= 0)
                return PositivePower(base, exponent);
            else
                return 1 / PositivePower(base, 0 - exponent);
        }
        public double PositivePower(double base, int exponent){
            if(exponent == 1 || exponent == 0)
                return base;
            else if(exponent % 2 == 0)
                return Power(base * base, exponent / 2);
            else
                return Power(base * base, exponent / 2) * base;
        }
    }

13 Reorder an Array so Odd Numbers Precede Even Numbers#

13.1 Problem#

Given an integer array, reorder the numbers so that all odd numbers are in the first half of the array and all even numbers in the second half, preserving the relative order among the odd numbers and among the even numbers.

13.2 Solution#

Use two indices p and q: p finds the first even number; q starts at p + 1 and finds the next odd number, so everything between p and q must be even. Shift the elements from p to q - 1 one position back, then place the odd number from position q at the front.

    public class Solution {
        public void reOrderArray(int [] array) {
            int p = 0, q = 0;
            while(p < array.length) {
                while(p < array.length && array[p] % 2 != 0) p++;  // first even
                q = p + 1;
                while(q < array.length && array[q] % 2 != 1) q++;  // next odd
                if(q >= array.length) break;
                int tmp = array[q];
                for(int i = q - 1; i >= p; i--) {
                    array[i + 1] = array[i];  // shift the evens back
                }
                array[p++] = tmp;
            }
        }
    }

14 K-th Node from the End of a Linked List#

14.1 Problem#

Given a linked list, output the k-th node counted from the end.

14.2 Solution#

Nothing special: two pointers moving together, with the second one k nodes ahead. Note the test cases include an empty list and k = 0.

    /*
    public class ListNode {
        int val;
        ListNode next = null;

        ListNode(int val) {
            this.val = val;
        }
    }*/
    public class Solution {
        public ListNode FindKthToTail(ListNode head, int k) {
            if(head == null) return head;
            if(k == 0) return null;
            ListNode p = head;
            ListNode q = head;
            for(int i = 0; i < k; i++) {
                if(q == null) return null;  // list shorter than k
                q = q.next;
            }
            while(q != null) {
                p = p.next;
                q = q.next;
            }
            return p;
        }
    }

15 Reverse a Linked List#

15.1 Problem#

Given a linked list, reverse it and output the head of the new list.

15.2 Solution#

Define a virtual head node (dummy) and do the usual head-insertion reversal.

    /*
    public class ListNode {
        int val;
        ListNode next = null;

        ListNode(int val) {
            this.val = val;
        }
    }*/
    public class Solution {
        public ListNode ReverseList(ListNode head) {
            if(head == null || head.next == null) return head;
            ListNode dummy = new ListNode(0);
            dummy.next = head;
            ListNode p = head;
            while(p.next != null) {
                // move p.next to the front, right after dummy
                ListNode point = p.next.next;
                p.next.next = dummy.next;
                dummy.next = p.next;
                p.next = point;
            }
            return dummy.next;
        }
    }

16 Merge Two Sorted Linked Lists#

16.1 Problem#

Given two monotonically increasing linked lists, merge them into one non-decreasing linked list.

16.2 Solution#

The merge step of merge sort.

    /*
    public class ListNode {
        int val;
        ListNode next = null;

        ListNode(int val) {
            this.val = val;
        }
    }*/
    public class Solution {
        public ListNode Merge(ListNode list1, ListNode list2) {
            ListNode p = list1;
            ListNode q = list2;
            ListNode dummy = new ListNode(0);
            ListNode ans = dummy;
            while(p != null && q != null) {
                if(p.val <= q.val) {
                    dummy.next = p;
                    p = p.next;
                } else {
                    dummy.next = q;
                    q = q.next;
                }
                dummy = dummy.next;
            }
            // attach whatever remains
            if(p == null) dummy.next = q;
            else dummy.next = p;
            return ans.next;
        }
    }

17 Substructure in a Tree#

17.1 Problem#

Given two binary trees A and B, determine whether B is a substructure of A. (Convention: an empty tree is not a substructure of any tree.)

17.2 Solution#

Just compare recursively~

    /**
    public class TreeNode {
        int val = 0;
        TreeNode left = null;
        TreeNode right = null;

        public TreeNode(int val) {
            this.val = val;
        }
    }
    */
    public class Solution {
        public boolean HasSubtree(TreeNode root1, TreeNode root2) {
            if(root2 == null) return false;
            if(root1 == null) return false;
            return helper(root1, root2) ||
                   HasSubtree(root1.left, root2) ||
                   HasSubtree(root1.right, root2);
        }
        public boolean helper(TreeNode root1, TreeNode root2) {
            if(root2 == null) return true;   // B exhausted: match
            if(root1 == null) return false;  // A exhausted first: no match
            return root1.val == root2.val &&
                   helper(root1.left, root2.left) &&
                   helper(root1.right, root2.right);
        }
    }

18 Mirror of a Binary Tree#

18.1 Problem#

Transform the given binary tree into its mirror image.
Input description:

Definition of the mirror of a binary tree: source tree

       8
      /  \
     6   10
    / \  / \
   5  7 9  11
   mirrored tree
       8
      /  \
     10   6
    / \  / \
   11 9 7  5

18.2 Solution#

First swap the root's left and right children, then recurse on both children.

    /**
    public class TreeNode {
        int val = 0;
        TreeNode left = null;
        TreeNode right = null;

        public TreeNode(int val) {
            this.val = val;
        }
    }
    */
    public class Solution {
        public void Mirror(TreeNode root) {
            if(root == null) return;
            TreeNode tmp = root.left;
            root.left = root.right;
            root.right = tmp;
            Mirror(root.left);
            Mirror(root.right);
        }
    }

19 Print a Matrix Clockwise#

19.1 Problem#

Given a matrix, print its elements in clockwise spiral order from the outside in. For example, for the 4 x 4 matrix
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
print 1,2,3,4,8,12,16,15,14,13,9,5,6,7,11,10.

19.2 Solution#

Print while rotating the direction: right, down, left, up, ...

    import java.util.ArrayList;
    public class Solution {
        public ArrayList<Integer> printMatrix(int [][] matrix) {
            ArrayList<Integer> ans = new ArrayList<Integer>();
            if(matrix.length == 0) return ans;
            int m = matrix.length;
            int n = matrix[0].length;
            int i = 0, j = 0;
            int direction = 0;  // direction / 4 is the current layer
            for(int count = 0; count < m * n; count++) {
                if(direction % 4 == 0) {          // moving right
                    ans.add(matrix[i][j]);
                    if(j == n - 1 - direction / 4) {
                        direction += 1;
                        i++;
                    } else {
                        j++;
                    }
                } else if(direction % 4 == 1) {   // moving down
                    ans.add(matrix[i][j]);
                    if(i == m - 1 - direction / 4) {
                        direction += 1;
                        j--;
                    }
                    else
                        i++;
                } else if(direction % 4 == 2) {   // moving left
                    ans.add(matrix[i][j]);
                    if(j == 0 + direction / 4) {
                        direction += 1;
                        i--;
                    }
                    else
                        j--;
                } else {                          // moving up
                    ans.add(matrix[i][j]);
                    if(i == 1 + direction / 4) {
                        direction += 1;
                        j++;
                    }
                    else
                        i--;
                }
            }
            return ans;
        }
    }

20 Stack with a min Function#

20.1 Problem#

Design a stack data structure and implement a min function that returns the smallest element in the stack in O(1) time.

20.2 Solution#

Keep two stacks: the first does the normal push/pop work; the second records the current minimum. When the minimum is about to be popped from the first stack, pop it from the second stack as well.

    import java.util.Stack;

    public class Solution {

        Stack<Integer> stack1 = new Stack<>();
        Stack<Integer> stack2 = new Stack<>();

        public void push(int node) {
            stack1.push(node);
            // <= keeps duplicated minima tracked correctly
            if(stack2.empty() || node <= stack2.peek()) {
                stack2.push(node);
            }
        }

        public void pop() {
            // equals(), not ==: boxed Integers outside the cache range
            // would compare unequal by reference
            if(stack1.peek().equals(stack2.peek())) stack2.pop();
            stack1.pop();
        }

        public int top() {
            return stack1.peek();
        }

        public int min() {
            return stack2.peek();
        }
    }

21 Stack Push and Pop Sequences#

21.1 Problem#

Given two integer sequences, where the first is the push order of a stack, determine whether the second could be a pop order of that stack. Assume all pushed numbers are distinct. For example, if 1,2,3,4,5 is the push sequence, then 4,5,3,2,1 is a possible pop sequence, but 4,3,5,1,2 is not. (The two sequences have the same length.)

21.2 Solution#

Simulate the original stack's pushes and pops against popA:

  • If the top of the stack differs from the current element of popA, keep pushing
  • If the top of the stack equals the current element of popA, pop the stack and advance popA by one position
  • Once all of pushA has been pushed, the pop order can no longer change: pop the stack one element at a time and compare against the remaining elements of popA

    import java.util.Stack;

    public class Solution {
        public boolean IsPopOrder(int [] pushA, int [] popA) {
            if(pushA.length == 0) return false;
            if(pushA.length == 1) return pushA[0] == popA[0];
            Stack<Integer> stack = new Stack<>();
            int i = 0, j = 0;
            while(i < pushA.length) {
                if(stack.empty() || stack.peek() != popA[j]) {
                    stack.push(pushA[i++]);
                } else {
                    stack.pop();
                    j++;
                }
            }
            if(j == popA.length - 1) return true;
            else {
                for(int k = j; k < popA.length; k++) {
                    if(popA[k] != stack.pop()) return false;
                }
                return true;
            }
        }
    }

22 Print a Binary Tree from Top to Bottom#

22.1 Problem#

Print every node of a binary tree from top to bottom, printing nodes of the same level from left to right.

22.2 Solution#

Simple queue-style operation: take each node's value out, push its children into a new list, then replace the old list with the new one.

    import java.util.ArrayList;
    /**
    public class TreeNode {
        int val = 0;
        TreeNode left = null;
        TreeNode right = null;

        public TreeNode(int val) {
            this.val = val;
        }
    }
    */
    public class Solution {
        public ArrayList<Integer> PrintFromTopToBottom(TreeNode root) {
            if(root == null) return new ArrayList<Integer>();
            ArrayList<TreeNode> helperArr = new ArrayList<>();
            ArrayList<Integer> ans = new ArrayList<>();
            if(root.left == null && root.right == null) {
                ans.add(root.val);
                return ans;
            }
            helperArr.add(root);
            while(helperArr.size() != 0) {
                ArrayList<TreeNode> temp = new ArrayList<>();
                for(int i = 0; i < helperArr.size(); i++) {
                    TreeNode node = helperArr.get(i);
                    ans.add(node.val);
                    if(node.left != null) {
                        temp.add(node.left);
                    }
                    if(node.right != null) {
                        temp.add(node.right);
                    }
                }
                helperArr = temp;  // move down one level
            }
            return ans;
        }
    }

23 Postorder Traversal Sequence of a BST#

23.1 Problem#

Given an integer array, determine whether it could be the postorder traversal of some binary search tree. Output Yes if so, otherwise No. Assume all numbers in the array are distinct.

23.2 Solution#

Recursive: using the BST property, find the boundary between the left and right subtrees, then recurse on both parts.

    public class Solution {
        public boolean VerifySquenceOfBST(int [] sequence) {
            if(sequence.length == 0) return false;
            return verify(sequence, 0, sequence.length - 1);
        }
        private boolean verify(int [] sequence, int l, int r) {
            if(l >= r) return true;
            int i = r;
            // skip the right-subtree part (values >= the root)
            while(i > l && sequence[i] >= sequence[r]) --i;
            // everything to the left must be smaller than the root
            for(int j = i - 1; j >= l; j--)
                if(sequence[j] > sequence[r])
                    return false;
            return verify(sequence, l, i - 1) && verify(sequence, i, r - 1);
        }
    }

24 Paths in a Binary Tree with a Given Sum#

24.1 Problem#

Given the root of a binary tree and an integer, print all paths along which the node values sum to the given integer. A path runs from the root down to a leaf. (In the returned list, longer arrays come first.)

24.2 Solution#

Standard depth-first traversal; mind the corner cases.

    import java.util.ArrayList;
    /**
    public class TreeNode {
        int val = 0;
        TreeNode left = null;
        TreeNode right = null;

        public TreeNode(int val) {
            this.val = val;
        }
    }
    */
    public class Solution {
        public ArrayList<ArrayList<Integer>> FindPath(TreeNode root, int target) {
            if(root == null) return new ArrayList<>();
            ArrayList<ArrayList<Integer>> ans = new ArrayList<>();
            helper(ans, new ArrayList(), root, target);
            return ans;
        }
        private void helper(ArrayList<ArrayList<Integer>> ans, ArrayList<Integer> tempList, TreeNode node, int target) {
            if(target == node.val && node.left == null && node.right == null) {
                tempList.add(node.val);
                ans.add(new ArrayList<Integer>(tempList));
                tempList.remove(tempList.size() - 1);  // backtrack
            } else {
                if(node.left != null) {
                    tempList.add(node.val);
                    helper(ans, tempList, node.left, target - node.val);
                    tempList.remove(tempList.size() - 1);
                }
                if(node.right != null) {
                    tempList.add(node.val);
                    helper(ans, tempList, node.right, target - node.val);
                    tempList.remove(tempList.size() - 1);
                }
            }
        }
    }

N Utility Template#

N.1 Problem#

xxx

N.2 Solution#

xxx

    System.out.println("placeholder");

This is a record of JAVA questions from Nowcoder.

!!!
Most of the original answers belong to the original authors listed on the Nowcoder website; if anything is wrong, please contact me as soon as possible so I can delete it. [xjm.good@gmail.com]

Java Basic#

Basic grammar#

  1. When the ternary operator's operands can be converted to numbers, it performs automatic numeric type promotion.

    Object o1 = (false) ? new Double(1.0) : new Integer(2);
    System.out.println(o1);

    Output: 2.0

    Type conversion rules for the ternary operator:

    • If the two operands cannot be converted to one another, no conversion happens and the return type is Object
    • If the two operands are expressions of definite types (such as variables), normal binary numeric promotion applies: int converts to long, long converts to float, and so on
    • If one operand is a numeric literal S and the other is an expression whose type is T, then if S is within T's range, S converts to type T; if S exceeds T's range, T converts to S's type
    • If both operands are numeric literals, the return type is the one with the larger range
  2. Java's primitive types are built-in; arrays are reference types rather than primitives and can be viewed as objects.

  3. synchronized guarantees all three properties: atomicity, ordering, and visibility; volatile guarantees ordering and visibility but cannot guarantee atomicity.

  4. The JVM's memory is split into two parts: the permanent space and the heap space.
    The permanent generation mainly stores static type data such as Java Class and Method metadata, and has little to do with the Java objects the garbage collector collects.
    The heap space is split into a young generation and an old generation.
    Garbage collection of the young generation is called Young GC; collection of the old generation is called Full GC.
    Objects in the young generation that survive N (configurable) garbage collections are copied into the old generation, so the old generation can be assumed to hold long-lived objects.
    The old generation overflows because of things like string processing repeated tens of thousands of times in a loop, creating tens of millions of objects, or allocating hundreds of MB or even GB of memory within one block of code.
    The permanent generation overflows when a huge number of Java classes are loaded dynamically.

  5. Java inner classes, member classes, local classes, anonymous classes, etc.

    • "Class" by itself refers to the outer, top-level class. Its modifiers can be public (the class can be imported by all classes in the project), default (the class can only be used inside the same package), abstract, and final
    • An inner (member) class is a named class inside a class but not inside a block, constructor, or method. It can take the access modifiers public, private, and protected, as well as static and final. public and private are straightforward: one means accessible to all classes, the other only to the class itself. A protected member class is accessible to classes in the same package and to subclasses; a default member class is accessible only to classes in the same package
    • A local class is a named class inside a block, constructor, or method; it can at most be modified with final

    Reference Link

  6. Common Java tools

    | Program | Usage |
    | ------- | ----- |
    | java.exe | the Java virtual machine |
    | javadoc.exe | generates Java documentation |
    | jdb.exe | the Java debugger |
    | javaprof.exe | profiling tool |
    | jps.exe | lists local Java process information |
    | jstack | prints thread stack information, producing a thread dump |
    | jmap | prints memory mappings, producing a heap dump |
    | jstat | performance monitoring tool |
    | jhat | memory analysis tool |
    | jconsole | simple visual console |
    | jvisualvm | full-featured visual console |
  7. Java's shift operators:

    • >> is arithmetic right shift: if the number is positive, the high bits are filled with 0; if negative, with 1
    • >>> is unsigned (logical) right shift: the high bits are filled with 0 whether the number is positive or negative
  8. An abstract class can have ordinary member variables; an interface cannot.

  9. HttpServletResponse is responsible for setting HTTP headers, setting cookies, setting the returned content type, and writing the response data; reading path information is done by HttpServletRequest.

  10. ArrayList's wasted space mainly shows up as reserved capacity at the end of the list, while LinkedList's space cost shows up as the considerable extra space each element consumes. // Correct: ArrayList grows its capacity by a factor of 1.5, so some trailing capacity is likely never used and therefore wasted; as for LinkedList, every node needs extra pointers.

  11. Both of the following correctly obtain a result set:

    Statement sta = con.createStatement();
    ResultSet rst = sta.executeQuery("select * from book");

    PreparedStatement pst = con.prepareStatement("select * from book");
    ResultSet rst = pst.executeQuery();
  12. Java allows bitwise AND, OR, and XOR on boolean values, but not bitwise NOT. On booleans the bitwise operations give the same results as the logical ones, but without short-circuiting.

  13. A foreach loop over an array can be exited directly with break.

  14. Java initialization order

    • Across classes: parent class before child class
    • Within a class: static variables > static blocks > instance variables > instance blocks > constructor
  15. (b1 + b4) is still promoted to int // b1 and b4 are both byte

  16. Copy performance in Java: System.arraycopy > clone > Arrays.copyOf > a for loop; test it yourself if interested. The System class source shows arraycopy is a native method, so it is certainly the fastest. Arrays.copyOf (note: the Arrays class, not Array) internally calls System.arraycopy, adding one extra step, so it is certainly not the fastest.

  17. Two-dimensional array definition: the first dimension's length must be given; the second can be defined later

    float f[][] = new float[6][6];
    float []f[] = new float[6][6];
    float [][]f = new float[6][6];
    float [][]f = new float[6][];
  18. Java Collection and Map

    • Collection
      • List
        • LinkedList: not synchronized
        • ArrayList: not synchronized; implements a resizable array of elements
        • Vector: synchronized
          • Stack
      • Set: does not allow duplicate elements
    • Map
      • HashTable: synchronized; implements a key-value hash table
      • HashMap: not synchronized
      • WeakHashMap: an improved HashMap with "weak references"; if a key is no longer referenced, it can be reclaimed by GC
  19. Ways to load a JDBC driver

    • Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
    • DriverManager.registerDriver(new com.mysql.jdbc.Driver());
    • System.setProperty("jdbc.drivers", "com.mysql.jdbc.Driver");
  20. Arrays.asList(num1, num2, num3)

  21. Arrays.sort(intervals, (x, y) -> {
          return Integer.compare(x[0], y[0]);
      });

  22. map.getOrDefault(t.charAt(i), 0);
      map.computeIfAbsent() // V computeIfAbsent(K key, Function<? super K, ? extends V> mappingFunction)

Exception#

  1. A try in Java does not have to be followed by a catch block; a finally block also works, but there must be at least one of the two (either catch or finally).
  2. An exception is an abnormal situation or error that occurs while the program runs (not at compile time); when the program violates the language's rules, the JVM represents the error as an exception and throws it.
    Exceptions are Java objects as well; the base class java.lang.Throwable is defined as the parent of all exceptions. The exception classes fall into two big categories: Error and Exception.
    Error-type problems mainly result from runtime logic errors; a correct program should never produce an Error, and when one occurs the JVM generally terminates.
    Exception denotes recoverable problems and includes checked exceptions and runtime exceptions. Checked exceptions are the most common kind, such as IO exceptions and SQL exceptions; they are enforced at compile time and are caught with try/catch.
    For runtime exceptions, the compiler does not force catching or handling; they are generally thrown upward until handling code is reached, and if there is none they propagate to the top level: a worker thread throws out of Thread.run(), a single-threaded program out of main(). Common runtime exceptions include null pointer, class cast, array index out of bounds, array store, buffer overflow, and arithmetic exceptions.

OOP#

  1. Advantages of object-oriented program design include:
    • Reusability
    • Extensibility
    • Ease of management and maintenance

Design Pattern#

  1. Java中四种线程安全的单例模式实现方式

    • 饿汉式(线程安全)
      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      public class Single2 {

      private static Single2 instance = new Single2();

      private Single2(){
      System.out.println("Single2: " + System.nanoTime());
      }

      public static Single2 getInstance(){
      return instance;
      }
      }
    • Lazy initialization (thread-safe only because getInstance is declared synchronized; without that keyword this version is unsafe)

      public class Single3 {

          private static Single3 instance = null;

          private Single3(){
              System.out.println("Single3: " + System.nanoTime());
          }

          public static synchronized Single3 getInstance(){
              if(instance == null){
                  instance = new Single3();
              }
              return instance;
          }
      }
    • Improved lazy initialization (thread-safe; uses double-checked locking, i.e. check, lock, check again, to reduce synchronization overhead; note the volatile on the instance field)

      public class Single4 {

          private volatile static Single4 instance = null;

          private Single4(){
              System.out.println("Single4: " + System.nanoTime());
          }

          public static Single4 getInstance(){
              if(instance == null){
                  synchronized (Single4.class) {
                      if(instance == null){
                          instance = new Single4();
                      }
                  }
              }
              return instance;
          }
      }
    • Private static inner holder (factory) class (thread-safe; the inner class could also be replaced by an inner interface, but then the holder field's scope would have to become public)

      public class Singleton {

          private Singleton(){
              System.out.println("Singleton: " + System.nanoTime());
          }

          public static Singleton getInstance(){
              return SingletonFactory.singletonInstance;
          }

          private static class SingletonFactory{
              private static Singleton singletonInstance = new Singleton();
          }
      }

Networking#

  1. To listen on TCP port 9000, how should the server create its socket?

    ServerSocket serverProcess = new ServerSocket(9000);
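    A minimal sketch of using the listening socket to accept one connection (the class name and the single read are illustrative):

      import java.io.BufferedReader;
      import java.io.IOException;
      import java.io.InputStreamReader;
      import java.net.ServerSocket;
      import java.net.Socket;

      public class Listen9000 {
          public static void main(String[] args) throws IOException {
              try (ServerSocket server = new ServerSocket(9000);  // bind and listen on port 9000
                   Socket client = server.accept()) {             // block until a client connects
                  BufferedReader in = new BufferedReader(
                          new InputStreamReader(client.getInputStream()));
                  System.out.println("received: " + in.readLine());
              }
          }
      }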
  2. The seven layers of the OSI model

    1. Physical layer: transmits raw bit streams -> the NIC sends and receives bits
    2. Data link layer: physical addressing; turns the raw bit stream into a logical transmission link; how data is framed, how access to the physical layer is controlled, error checking
    3. Network layer: delivery from network address to network address; the IP protocol
    4. Transport layer: accepts data from the layer above, splits it when necessary, and guarantees the segments reach the other end; flow control; TCP/UDP
    5. Session layer: establishes and manages application-level communication sessions
    6. Presentation layer: resolves syntax differences between communicating systems; formats the information
    7. Application layer: message headers and bodies; defines the format in which data is exchanged
  3. Data is encapsulated layer by layer going down the stack and de-encapsulated going up

  4. The four-layer TCP/IP model

    1. Application layer: file transfer, data formatting, establishing or releasing connections with other nodes
    2. Transport layer: provides the end-to-end interface
    3. Network layer: selects routes for packets
    4. Link layer: transmits data over the physical medium in binary form
  5. TCP three-way handshake

    1. TCP in brief

      1. A connection-oriented, reliable, byte-stream-based transport-layer protocol
      2. Splits the application layer's data stream into segments and sends them to the target node's TCP layer
      3. Every packet carries a sequence number; the peer sends an ACK on receipt, and unacknowledged packets are retransmitted
      4. Uses a checksum to verify that the data was not corrupted in transit
    2. TCP header fields

      1. Source Port: 2 bytes; identifies the sending process on the host
      2. Destination Port: 2 bytes
      3. Sequence Number: 4 bytes; numbers each segment in order
      4. Acknowledgment Number: the sequence number of the first data byte expected in the peer's next segment
      5. Offset: because the TCP header contains optional fields, the offset says how far the data lies from the start of the header
      6. Reserved field
      7. TCP flags
        1. URG: urgent-pointer flag; when 1 the urgent pointer is valid
        2. ACK: acknowledgment flag; 1 means the acknowledgment number is valid, 0 means the segment carries no acknowledgment and the field is ignored
        3. PSH: push flag; 1 means the receiver should hand the data to the application layer as soon as possible rather than queue it in the buffer
        4. RST: reset flag; resets a connection broken by a host crash or other error, and rejects illegal segments or connection requests
        5. SYN: synchronizes sequence numbers; used while establishing a connection
        6. FIN: finish flag; used to release a connection; 1 means the sender has no more data and is about to close
      8. Window: window size, used to control the transmission rate
      9. Checksum
      10. Urgent Pointer: the number of bytes of urgent data in this segment
    3. Three handshakes to establish a connection

      1. A and B both start in CLOSED; the actively opening side is the client and the passively opening side is the server. The server creates a TCB (transmission control block), gets ready to accept requests, and enters LISTEN.
      2. A creates a TCB, actively opens the connection, and sends a connection request segment: SYN, seq = x, where x is an arbitrary positive integer. A then enters SYN-SENT. This segment may not carry data.
      3. If B agrees to the request, it sends an acknowledgment segment containing SYN = 1, ACK = 1, seq = y, ack = x+1, and the server enters SYN-RCVD. This segment may not carry data either.
      4. A sends ACK = 1, seq = x+1, ack = y+1 and enters ESTABLISHED. This segment may carry data.
      5. Upon receipt, B enters ESTABLISHED.
    4. Why three handshakes are needed to establish a connection

      1. To initialize the Sequence Number on both sides; TCP uses these sequence numbers to reassemble the data
      2. SYN timeout: if the Server receives the Client's SYN but gets no acknowledgment after replying SYN-ACK, the Server keeps retrying until it times out; Linux defaults to 5 retries, starting at 1 second and doubling each time
      3. Defenses against SYN flood attacks
        1. Once the SYN queue is full, reply with a SYN cookie via the tcp_syncookies parameter
        2. A legitimate Client will send the SYN cookie back and the connection is established directly
    5. What if the Client fails after the connection is established?

      Keepalive mechanism

      After a period called the keep-alive time during which the connection stays inactive, the side with keepalive enabled sends the peer a keep-alive probe segment. If the sender gets no response, it keeps probing every keep-alive interval until the number of attempts reaches the keep-alive probe count; if there is still no response, the peer host is considered unreachable and the connection is dropped.

  6. TCP four-way handshake

    The four waves terminate the connection; assume the client actively closes it.

    1. Teardown flow

      1. At the start both client A and server B are in ESTABLISHED; client A actively closes the connection.
      2. A sends a connection release segment and stops sending data: FIN = 1, seq = u. A then enters FIN-WAIT-1. Even though a FIN segment carries no data, it still consumes one sequence number.
      3. B receives the release segment and replies with an acknowledgment: ACK = 1, seq = v, ack = u+1, where v is B's own sequence number. B then enters CLOSE-WAIT, a half-closed state in which the server may still send its last data.
      4. After receiving the acknowledgment of the first wave, A enters FIN-WAIT-2 and waits for the server's release segment; A can still receive data from the server.
      5. B sends its connection release segment to A: FIN = 1, ACK = 1, seq = w, ack = u+1, and enters LAST-ACK, awaiting the final acknowledgment.
      6. After receiving B's release segment, A must reply with an acknowledgment: ACK = 1, seq = u+1, ack = w+1. The client enters TIME-WAIT and moves to CLOSED after waiting 2 MSL.
      7. Upon receiving A's acknowledgment, B enters CLOSED immediately.
    2. Why the TIME_WAIT state exists

      1. To ensure the peer has enough time to receive the final ACK
      2. To avoid confusing old and new connections
    3. Why four waves are needed to close a connection

      Because TCP is full duplex: each direction needs its own FIN segment and its own ACK segment.

    4. Causes of many CLOSE_WAIT states on the server

      The client keeps sending requests while the information the server returns is abnormal or wrong.

      The peer closed its socket while our side was busy reading or writing and failed to close the connection in time.

      Check the code, especially the resource-releasing code.

      Check the configuration, especially unreasonable settings for the request-handling threads.

  7. UDP in brief

    1. UDP header fields
      1. Source Port
      2. Destination Port
      3. Length
      4. Checksum
    2. Characteristics
      1. Connectionless: the two sides establish no connection
      2. Keeps no connection state; supports sending the same message to multiple clients at once
      3. The packet header is only 8 bytes, so the extra overhead is much smaller
      4. Throughput is limited only by the data generation rate, the transmission rate, and machine performance
      5. Best-effort delivery: reliable delivery is not guaranteed and no complex link-state tables need to be maintained
      6. Message-oriented: messages handed down by the application are neither split nor merged
  8. Differences between TCP and UDP

    1. Connection-oriented vs connectionless
    2. Reliable vs unreliable
    3. Ordered vs unordered
    4. Slower vs faster
    5. Heavyweight vs lightweight
  9. TCP sliding window

    1. RTT and RTO

      1. RTT (round-trip time): the time from sending a packet to receiving the corresponding ACK
      2. RTO (retransmission timeout): the interval before a retransmission
    2. TCP uses the sliding window for flow control and for reordering out-of-order segments, which

      1. guarantees TCP's reliability
      2. guarantees TCP's flow-control property
    3. Window calculation

      The sliding window spans the sequence numbers of data already sent but unacknowledged plus data that may still be sent; data outside the window cannot be sent.

      The receiver computes its AdvertisedWindow with the first formula below.

      The sender must keep LastByteSent - LastByteAcked within that value, and computes the EffectiveWindow to know how much more it may send (a worked example follows the formulas):

      $$AdvertisedWindow = MaxRcvBuffer - (LastByteRcvd - LastByteRead)$$

      $$EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked)$$
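      A quick worked example under assumed numbers: MaxRcvBuffer = 65536, the receiver has received up to byte 20000 but the application has only read up to byte 12000, and the sender has sent up to byte 24000 with bytes up to 20000 acknowledged.

      $$AdvertisedWindow = 65536 - (20000 - 12000) = 57536$$

      $$EffectiveWindow = 57536 - (24000 - 20000) = 53536$$

      So the sender may still push 53536 more bytes before it must wait for further ACKs.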

  10. HTTP in brief

    1. Main characteristics of HTTP, the hypertext transfer protocol
      1. Supports the client/server model
      2. Simple and fast
      3. Flexible: allows data objects of any type to be transferred
      4. Each connection handles only one request and is closed once it is served; since HTTP/1.1, persistent connections are enabled by default and are closed only after a certain idle period
      5. Stateless: the protocol has no memory of earlier transactions, so if later processing needs earlier information, that information must be retransmitted

I/O#

  1. Which I/O class can append to or update a file?

    RandomAccessFile: its seek(long pos) method moves the file pointer, so it can append or update in place. OutputStream is an abstract class and cannot be instantiated directly for writing. DataOutputStream cannot append either. A sketch follows.
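    A minimal sketch, assuming a hypothetical file "log.txt": RandomAccessFile opened in "rw" mode can seek to the end to append, or to any offset to update in place.

      import java.io.IOException;
      import java.io.RandomAccessFile;

      public class AppendDemo {
          public static void main(String[] args) throws IOException {
              try (RandomAccessFile raf = new RandomAccessFile("log.txt", "rw")) {
                  raf.seek(raf.length());            // jump to the end of the file
                  raf.writeBytes("appended line\n"); // append without truncating
                  raf.seek(0);                       // jump back to the beginning
                  raf.writeBytes("X");               // overwrite the first byte in place
              }
          }
      }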


Multi-threading#

  1. Differences between volatile and synchronized (visibility sketch below):

    • volatile essentially tells the JVM that the value of the variable cached in working memory/registers is uncertain and must be re-read from main memory; synchronized locks the variable so that only the current thread can access it while other threads are blocked.
    • volatile can only be applied at the variable level; synchronized can be applied to variables and methods.
    • volatile only guarantees the visibility of modifications, without atomicity; synchronized guarantees both visibility and atomicity.
    • volatile never blocks a thread; synchronized may cause threads to block.
    • Variables marked volatile are not optimized by the compiler; variables used under synchronized can be optimized.
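    A minimal sketch of volatile's visibility guarantee (class name is illustrative): without volatile, the worker thread might never observe the updated flag and could spin forever.

      public class VolatileDemo {
          private static volatile boolean running = true;

          public static void main(String[] args) throws InterruptedException {
              Thread worker = new Thread(() -> {
                  while (running) { }              // re-reads the flag from main memory
                  System.out.println("worker stopped");
              });
              worker.start();
              Thread.sleep(100);
              running = false;                     // visible to the worker because of volatile
              worker.join();
          }
      }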
  2. Multiprocessing vs multithreading

    • A thread belongs to exactly one process, while a process can have multiple threads, but at least one (commonly called the main thread).
    • Resources are allocated to processes; all threads of the same process share all of that process's resources.
    • Threads must cooperate and synchronize during execution; threads in different processes synchronize via message passing.
    • The processor is allocated to threads; what actually runs on the processor is a thread.
    • A thread is an execution unit within a process, and also the schedulable entity inside the process.

Java Web#

Java Web Basic#

  1. The Servlet life cycle has five phases (see the sketch after this list)
    • Loading: the container loads the servlet class file through the class loader
    • Creation: a servlet object is created by calling the servlet constructor
    • Initialization: the init method is called to initialize the servlet
    • Handling client requests: for each client request the container creates a thread to handle it
    • Unloading: the destroy method is called so the servlet releases the resources it occupies
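    A minimal sketch of the life-cycle hooks, assuming the javax.servlet API (newer containers use jakarta.servlet); the class name is illustrative and it needs a servlet container to actually run:

      import java.io.IOException;
      import javax.servlet.ServletException;
      import javax.servlet.http.HttpServlet;
      import javax.servlet.http.HttpServletRequest;
      import javax.servlet.http.HttpServletResponse;

      public class LifecycleServlet extends HttpServlet {
          @Override
          public void init() throws ServletException {
              System.out.println("init: called once after construction");
          }

          @Override
          protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                  throws ServletException, IOException {
              // called on a container-managed thread for every GET request
              resp.setContentType("text/plain");
              resp.getWriter().println("hello");
          }

          @Override
          public void destroy() {
              System.out.println("destroy: called once before unloading");
          }
      }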

Spring Framework#
