
What is Data Science - Data Literacy for Data Science (2)

떼닝 2023. 12. 5. 23:11

Data Literacy for Data Science

Data Literacy

Data Collection and Organization

Introduction

- Data Repository is a general term used to refer to data that has been collected, organized, and isolated

- can be used in business operations or mined for reporting and data analysis

 

Databases

- Collection of data for input, storage, search, retrieval, and modification of data

- Database Management System(DBMS) is a set of programs for creating and maintaining the database, and storing, modifying, and extracting information from the database using a function called Querying

- even though a database and DBMS mean different things, the terms are often used interchangeably

- factors governing choice of database include : data type, data structure, querying mechanisms, latency requirements, transaction speeds, intended use of data

 

Relational Databases

- data is organized into a tabular format with rows and columns

- well-defined structure and schema

- optimized for data operations and querying

- use SQL as the standard querying language

 

Non-Relational Databases

- emerged in response to the volume, diversity, and speed at which data is being generated today

- built for speed, flexibility, and scale

- data can be stored in a schema-less form

- widely used for processing big data

 

Data Warehouse

- consolidates data through the extract, transform, and load process, also known as the ETL process, into one comprehensive database for analytics and business intelligence (consolidate : to combine into a single, unified whole)

- ETL helps extract data from different data sources

- Transform the data into a clean usable state

- Load the data into a data repository

 

Big Data Stores

- distributed computational and storage infrastructure to store, scale, and process very large data sets

 

Summary

- Data repositories help to isolate data and make reporting and analytics more efficient and credible while also serving as a data archive

 

Relational Database Management System

What is a Relational Database?

- a relational database is a collection of data organized into a table structure, where the tables can be linked, or related, based on data common to each

- table rows are called records; table columns are called attributes

- the capability of relating tables based on common data makes it possible to retrieve an entirely new table from data in one or more tables with a single query (see the sketch after this list)

- allows you to understand the relationships among all available data and gain new insights for making better decisions

- relational databases use structured query language, or SQL, for querying data
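
The course itself does not include code, but here is a minimal sketch of that single-query capability, using Python's built-in sqlite3 module. The table and column names (customers, orders) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Two tables related through common data: customers.id <-> orders.customer_id
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Kim"), (2, "Lee")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 25.0), (2, 1, 40.0), (3, 2, 15.0)])

# A single SQL query relates the tables and returns an entirely new table:
# total order amount per customer.
cur.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""")
print(cur.fetchall())  # [('Kim', 65.0), ('Lee', 15.0)]
conn.close()
```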

 

Similarities between Relational databases and Spreadsheets

- relational databases build on the organizational principles of flat files such as spreadsheets, with data organized into rows and columns following a well-defined structure and schema

Differences between Relational databases and Spreadsheets

- ideal for the optimized storage, retrieval, and processing of large volumes of data

- each table has a unique set of rows and columns

- relationships can be defined between tables

- fields can be restricted to specific data types and values

- can retrieve millions of records in seconds using SQL for querying data (retrieve : to search for and fetch back)

- security architecture of relational databases provides greater access control and governance (governance : management, oversight)

 

Example of RDBMS

Relational Databases can be:

- open-source with internal support

- open-source with commercial support

- commercial closed-source

- IBM DB2, MSSQL, MySQL, Oracle Database, PostgreSQL

 

Cloud-based Relational Databases, or Database-as-a-Service:

- Amazon RDS, Google Cloud SQL, IBM DB2 on Cloud, Oracle Cloud, Azure SQL

 

Advantages of the Relational Database Approach

- create meaningful information by joining tables

- flexibility to make changes while the database is in use

- minimize data redundancy by allowing relationships to be defined between tables (redundancy : unnecessary duplication)

- offer export and import options that provide ease of backup and disaster recovery

- are Atomicity, Consistency, Isolation, Durability (ACID) compliant, ensuring accuracy and reliability in database transactions (compliant : conforming to, adhering to)
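
A minimal sketch of what ACID compliance buys you in practice, again with Python's sqlite3; the accounts table and the simulated failure are made up. Either every statement in a transaction takes effect or none do (atomicity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100.0), ("B", 0.0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'A'")
        raise RuntimeError("power failure mid-transfer")  # simulated crash
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'B'")  # never reached
except RuntimeError:
    pass

# The debit was rolled back along with the failed transfer: balances unchanged.
print(conn.execute("SELECT * FROM accounts").fetchall())  # [('A', 100.0), ('B', 0.0)]
```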

 

Use Cases for RDBMS

- Online Transaction Processing (OLTP) applications support transaction-oriented tasks that run at high rates, accommodate large numbers of users, manage small amounts of data, and support frequent queries with fast response times

- Data Warehouses can be optimized for online analytical processing(OLAP)

- IoT Solutions provide the speed and ability to collect and process data from edge devices

 

Limitations of RDBMS

- does not work well with semi-structured and unstructured data

- migration between two RDBMSs is possible only when the source and destination tables have identical schemas and data types

- entering a value greater than the defined length of a data field results in loss of information

 

NoSQL

What is a NoSQL Database?

- NoSQL ("not only SQL", or non-SQL) is a non-relational database design that provides flexible schemas for the storage and retrieval of data

- gained greater popularity due to the emergence of cloud computing, big data, and high-volume web and mobile applications

- chosen for their attributes around scale, performance, and ease of use

- built for specific data models

- has flexible schemas that allow programmers to create and manage modern applications

- do not use a traditional row/column/table database design with fixed schemas

- do not typically use the structured query language (SQL) to query data

- allows data to be stored in a schema-less or free-form fashion

 

Four Different types of NoSQL Databases

Key-value store :

- data in a key-value database is stored as a collection of key-value pairs

- a key represents an attribute of the data and is a unique identifier

- both keys and values can be anything from simple integers or strings to complex JSON documents

- great for storing user session data, user preferences, real-time recommendations, targeted advertising, in-memory data caching

- not a great fit if you want to : query data on specific data value, need relationships between data values, need multiple unique keys
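
A minimal sketch of the key-value model, using a plain Python dict as a stand-in for a real key-value store such as Redis; the session IDs and values are made up:

```python
store = {}

# Each entry is stored under a unique key (here, a session ID) ...
store["session:8f3a"] = {"user": "ttening", "cart": ["sku-123"], "theme": "dark"}

# ... and retrieval is a single lookup by key.
print(store["session:8f3a"]["cart"])  # ['sku-123']

# What the model does NOT give you: querying by value or relationships.
# Finding every session with "sku-123" in the cart means scanning all keys.
matches = [k for k, v in store.items() if "sku-123" in v["cart"]]
print(matches)  # ['session:8f3a']
```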

 

Document based:

- document databases store each record and its associated data within a single document

- enable flexible indexing, powerful ad hoc queries, and analytics over collections of documents (ad hoc : improvised, done for the immediate purpose)

- preferred for eCommerce platforms, medical records storage, CRM platforms, and analytics platforms

- not a great fit if you want to : run complex search queries, perform multi-operation transactions
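
A minimal sketch of the document model, with plain Python dicts standing in for a document store such as MongoDB; the collection and field names are made up:

```python
# A "collection" of documents: each record and its associated data
# live together in one self-contained document.
orders = [
    {"_id": 1, "customer": "Kim", "items": [{"sku": "sku-123", "qty": 2}], "status": "shipped"},
    {"_id": 2, "customer": "Lee", "items": [{"sku": "sku-456", "qty": 1}], "status": "pending"},
]

# Documents in the same collection need not share a schema:
orders.append({"_id": 3, "customer": "Park", "gift_note": "Happy birthday!"})

# An ad hoc query can filter on any field, with no predefined schema.
pending = [doc for doc in orders if doc.get("status") == "pending"]
print(pending)  # [{'_id': 2, 'customer': 'Lee', ...}]
```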

 

Column based:

- data is stored in cells grouped as columns of data instead of rows

- a logical grouping of columns is referred to as a column family

- all cells corresponding to a column are saved as a continuous disk entry, making access and search easier and faster

- great for systems that require heavy write requests, storing time-series data, weather data, and IoT data

- not a great fit if you want to : run complex queries, change querying patterns frequently
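
A minimal sketch contrasting the row layout with the column-oriented idea, in plain Python; the sensor readings are made up. Because each column's cells sit together, scanning one column touches only that data:

```python
# Row-oriented: each record's fields are stored together.
rows = [
    {"time": "2023-12-05T23:00", "sensor": "a1", "temp_c": 3.2},
    {"time": "2023-12-05T23:10", "sensor": "a1", "temp_c": 3.0},
]

# Column-oriented: each column's cells form one contiguous sequence.
columns = {
    "time":   ["2023-12-05T23:00", "2023-12-05T23:10"],
    "sensor": ["a1", "a1"],
    "temp_c": [3.2, 3.0],
}

# Aggregating one column reads a single contiguous list ...
print(sum(columns["temp_c"]) / len(columns["temp_c"]))  # 3.1

# ... whereas the row layout must walk every record.
print(sum(r["temp_c"] for r in rows) / len(rows))       # 3.1
```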

 

Graph based:

- use a graphical model to represent and store data

- useful for visualizing, analyzing, and finding connections between different pieces of data

- excellent choice for working with connected data

- great for social networks, product recommendations, network diagrams, fraud detection

- not a great fit if you want to : process high volumes of transactions
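
A minimal sketch of the graph model, using an adjacency dict in plain Python as a stand-in for a graph database such as Neo4j; the people and "follows" edges are made up:

```python
# Nodes and their outgoing "follows" relationships.
follows = {
    "amy": ["ben", "cho"],
    "ben": ["cho"],
    "cho": ["dan"],
    "dan": [],
}

# Finding connections is graph traversal, e.g. "who is reachable from amy?"
def reachable(start, graph):
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(reachable("amy", follows))  # {'ben', 'cho', 'dan'} (order may vary)
```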

 

Advantages of NoSQL

- the ability to handle large volumes of structured, semi-structured, and unstructured data

- the ability to run as a distributed system scaled across multiple data centers

- an efficient and cost-effective scale-out architecture that provides additional capacity and performance with the addition of new nodes

- simple design, better control over availability, and improved scalability, making it agile, flexible, and able to support quick iterations

 

Key Differences

| Relational Databases | Non-Relational Databases |
| --- | --- |
| RDBMS schemas rigidly define how all data inserted into the database must be typed and composed | NoSQL databases can be schema-agnostic, allowing unstructured and semi-structured data to be stored and manipulated (schema-agnostic : able to function without any knowledge of the schema) |
| Maintaining high-end, commercial relational database management systems can be expensive | Specifically designed for low-cost commodity hardware |
| Support ACID compliance, which ensures reliability of transactions and crash recovery | Most NoSQL databases are not ACID compliant |
| A mature and well-documented technology, which means the risks are more or less perceivable | A relatively newer technology |

 

 

Data Marts, Data Lakes, ETL, and Data Pipelines

Data Warehouses

- a data warehouse works like multi-purpose storage for different use cases

- by the time the data comes into the warehouse, it has already been modeled and structured for a specific purpose, meaning it is analysis ready

- serves as a single source of truth : storing current and historical data that has been cleansed, conformed, and categorized

- data warehouse is a multi-purpose enabler of operational and performance analytics

 

Data Mart

- sub-section of the data warehouse, built specifically for a particular business function, purpose, or community of users

- the idea is to provide stakeholders with the data that is most relevant to them, when they need it

- provide analytical capabilities for a restricted area of the data warehouse

- offer isolated security and isolated performance

- the most important role is business-specific reporting and analytics

 

Data Lakes

- storage repository that can store large amounts of structured, semi-structured, and unstructured data in their native format, classified and tagged with metadata

- while a data warehouse stores data processed for a specific need, a data lake is a pool of raw data where each data element is given a unique identifier and is tagged with metatags for further use

- data from a data lake is selected and organized based on the use case you need it for

- a data lake retains all source data without exclusions

- the most important role is in predictive and advanced analytics

 

Extract, Transform, and Load Process (ETL)

- ETL is an automated process which includes : gathering raw data, extracting information needed for reporting and analysis, cleaning, standardizing and transforming data into usable format, loading data into a data repository

- increasingly being used for real-time streaming event data

 

Extract:

- step where data from source locations is collected for transformation

- can be through batch processing : large chunks of data moved from source to destination at scheduled intervals

- can be through stream processing : data pulled in real-time from source, transformed in transit, and loaded into data repository

 

Transform:

- involves the execution of rules and functions that convert raw data into data that can be used for analysis

- e.g. standardizing date formats and units of measurement, removing duplicate data, filtering out data that is not required, enriching data, establishing key relationships across tables, applying business rules and data validations

 

Loading:

- the transportation of processed data into a data repository

- can be initial loading : populating all of the data in the repository

- can be incremental loading : applying updates and modifications periodically

- can be full-refresh : erasing a data table and reloading fresh data

- Load Verification includes checks for : missing or null values, server performance, load failures
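
Putting the Extract, Transform, and Load steps above together, here is a minimal ETL sketch in plain Python; the CSV layout, field names, and target table are made up:

```python
import csv, io, sqlite3

raw = "date,amount\n05/12/2023,100\n05/12/2023,100\n06/12/2023,250\n"

# Extract: collect raw records from the source (here, an in-memory CSV).
records = list(csv.DictReader(io.StringIO(raw)))

# Transform: standardize the date format and remove duplicate rows.
seen, clean = set(), []
for r in records:
    day, month, year = r["date"].split("/")
    row = (f"{year}-{month}-{day}", float(r["amount"]))  # ISO dates, numeric amounts
    if row not in seen:
        seen.add(row)
        clean.append(row)

# Load: populate the data repository (an initial load into SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean)

# Load verification: e.g. check for missing or null values.
print(conn.execute("SELECT COUNT(*) FROM sales WHERE date IS NULL").fetchone())  # (0,)
print(conn.execute("SELECT * FROM sales").fetchall())  # [('2023-12-05', 100.0), ('2023-12-06', 250.0)]
```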

 

Data Pipeline

- a broader term that encompasses the entire journey of moving data from one system to another, including the ETL process (encompass : to include, to surround)

- can be used for both batch and streaming data

- supports both long-running batch queries and smaller interactive queries

- typically loads data into a data lake but can also load data into a variety of target destinations - including other applications and visualization tools
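
A minimal sketch of a data pipeline as chained stages, with Python generators standing in for the queues and stream processors a real pipeline would use; the event records are made up:

```python
def source():                      # stage 1: events arrive one by one
    yield from [{"clicks": 3}, {"clicks": None}, {"clicks": 7}]

def transform(events):             # stage 2: clean records while in transit
    for e in events:
        if e["clicks"] is not None:
            yield e

def load(events, sink):            # stage 3: deliver to the target destination
    sink.extend(events)

lake = []                          # stand-in for a data lake / target application
load(transform(source()), lake)
print(lake)                        # [{'clicks': 3}, {'clicks': 7}]
```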

 

**

Seeing topics I'd only memorized as rote text for things like the Engineer Information Processing exam (정처기),

it feels kind of familiar... and it turns out this is what they actually meant...

(I should have studied harder back then)

**

Considerations for Choice of Data Repository

- a number of different factors influence the selection of the right data repository

- type of data - structured, semi-structured, or unstructured

- schema of the data

- performance requirements

- whether you're working with data at rest or streaming data (data in motion)

- data encryption needs

- volume of data and whether you need a big data system

- storage requirements

- frequency of data access : frequently updated data versus data kept in a vault for a long time

- standards set by your organization on the databases and data repositories that can be used

 

- capacity the data repository is required to handle

- type of access : frequent queries at short intervals, or long-running queries

- purpose of data repository : transactional, analytical, archival, data warehousing

- compatibility of the data repository with the existing ecosystem of programming languages, tools, and processes

- security features of the data repository

- scalability from a long-term perspective

 

- very few organizations use just one data repository

- most have a preferred enterprise relational database, an open-source relational database, and an unstructured data source

- important to think about the skills you have or want to foster

- cost of various solutions

- the hosting platform is an important consideration - AWS RDS, Amazon's Aurora, Google's relational offerings

 

**

... anyway, there's a wide variety of options

the rest is omitted...

**

 

Data Integration Platforms

Data Integration

- usage scenarios : data consistency across applications, master data management, data sharing between enterprises, data migration and consolidation

- includes : accessing, queueing, or extracting data from operational systems, transforming and merging extracted data either logically or physically, data quality and governance, delivering data through an integrated approach for analytics purposes

- to make customer data available for analytics : need to provide a unified view of the combined data, so that users can access, query, and manipulate this data from a single interface to derive statistics, analytics, and visualizations

 

Data Integration Workflow

- data pipeline covers the entire data movement journey from source to destination systems

- a data pipeline can be used to perform data integration, while ETL is a process within data integration

 

Capabilities of a Data Integration Platform

- pre-built connectors and adapters

- open-source architecture

- optimization for batch processing of large-scale data, for continuous data streams, or for both

- integration with big data sources

- additional functionalities for data quality and governance, compliance, and security

- portability between on-premises and different types of cloud environments

 

Conclusion

- Data Integration space continues to evolve as businesses embrace newer technologies and as data grows, be it in the variety of sources or its use in business decision-making

 

Practice Quiz : Data Integration Platforms

Q. The term "data repositories" exclusively refers to RDBMSes and NoSQL databases that are used to collect, organize, and isolate data for analytics

A. False

 

Q. In use cases for RDBMS, what is one of the reasons that relational databases are so well suited for OLTP applications?

A. Support the ability to insert, update, or delete small amounts of data

 

Q. Which NoSQL database type stores each record and its associated data within a single document and also works well with Analytics platforms?

A. Document-based

 

Q. What type of data repository is used to isolate a subset of data for a particular business function, purpose, or community of users?

A. Data Mart

 

Q. _ is ideal for data lakes where transformations on data are applied after raw data is loaded into the data lake.

A. ELT (Extract-Load-Transform) Process

 

Q. Which one of these statements explains what data integration is?

A. Data Integration includes extracting, transforming, merging, and delivering quality data for analytical purposes

 

Lesson Summary : Welcome to Data Literacy

Data repositories

- find and return data in a usable format

- data type helps you choose the type of repository : structured, semi-structured, unstructured

- Relational or NoSQL

- data warehouse, data mart, data lake

 

Relational databases

- structure data in tables

- each table relates to a topic

- schema connects the tables

 

Structured Query Language (SQL)

- search for and retrieve data you need

- use to manipulate the data

 

RDBMS Advantages

- visualization, analysis, finding connections

- linking tables creates meaningful information

- restrict fields by data type

- easy import and export

 

RDBMS Limitations

- do not work well with unstructured data

- slow to query big data

- limits field length

 

NoSQL Databases

- data diversity

- no schema needed

- house semi- and unstructured data

 

Types of NoSQL databases

- document-based : documents grouped into collections

- key-value : each data entry has a key to access it

- columnar : data stored in columns rather than rows

- graph : data stored in nodes with relationships and properties

 

Big Data Storage

- Data warehouse : multi-purpose, analysis-ready storage for massive amounts of data

- Data mart : sub-section of a data warehouse with restricted access

- Data lake : handles a combination of structured, semi- and unstructured data

 

Data Pipelines

- address the need for a data-handling process : collect, transform, move

- Extract / Transform / Load (ETL) is a type of data pipeline

 

The big picture

- repository choice depends on needs for : storage, organization, management, retrieval

 

Practice Quiz : Data Literacy

Q. What is the primary function of a Database Management System (DBMS), as explained in the video?

A. To allow for the input, storage, search, retrieval, and modification of data

 

Q. What is a data integration platform's primary role in analytics and data science, as explained in the text?

A. To extract, transform, and combine data from various sources for analytics

 

Q. What is the Extract, Transform, and Load (ETL) process's primary purpose in data management?

A. To convert raw data into analysis-ready data by extracting, cleaning, standardizing, and transforming it

 
