Coursera/IBM Data Science

What is Data Science - Data Science Topics (1)

떼닝 2023. 11. 29. 21:57

Data Science Topics

Big Data and Data Mining

How Big Data is Driving Digital Transformation

Movies once watched only on DVD are now streamed on Netflix;
the NBA's Houston Rockets used overhead cameras to find the most efficient way to play.

  • video tracking system
  • analyzed big data to find where shots produced the highest scores
  • changed the way the team plays
  • attempted more 3-point shots than any other team
  • organizations all around us are changing

Most organizations and industries around us are already transforming or preparing to transform.

 

**

Lately there's been a lot of talk about "DX engineers" and the like, but I never quite grasped what DX actually means.

After watching this short video, though, I finally got it: so this is what people call DX these days!!

**

Introduction to Cloud

Cloud Computing

- Delivery of on-demand computing resources (networks, servers, storage, ...)

- Applications and data that users access over the internet rather than locally (online web apps, storing personal files)

 

Cloud Computing User Benefits

- no need to purchase applications and install them on a local computer

- use online versions of applications and pay a monthly subscription

- cost-effective

- save local storage space

- work collaboratively in real time

- access the most current software versions

 

5 Characteristics of Cloud Computing

- On Demand Self-Service

- Broad Network Access

- Resource Pooling

- Rapid Elasticity (elasticity? or maybe this just loosely means flexibility...)

- Measured Service

 

Cloud Computing is about using technology as a service by leveraging remote systems on-demand over the internet, and it has changed the way the world consumes compute services.

(leverage : using others' capital or resources to gain an advantage)

 

Cloud Deployment Models

- public cloud : Usage is shared by others

- private cloud : cloud infrastructure accessible only to authorized users

- hybrid cloud : public + private cloud

 

Cloud Service Models

- IaaS : Infrastructure as a Service. Can access the infrastructure and physical computing resources

- PaaS : Platform as a Service. Can access the platform comprising the hardware and software tools usually needed for development

- SaaS : Software as a Service. A software licensing and delivery model in which software and applications are centrally hosted and licensed on a subscription basis. Also referred to as "on-demand software"

 

**

The term "on-demand" keeps coming up. What does it actually mean?

On-Demand : as soon as or whenever required (from Oxford Languages... honestly, just from a Google search lol)

Thinking of it as "made to order" makes it a bit more intuitive.

 

On-demand service : goods and services delivered to wherever the consumer is; products or services provided immediately at the user's request

Put simply, "the product or service comes to where I am"... (https://yslab.kr/63)

Going to a Chinese restaurant to eat jjajangmyeon (X) -> ordering jjajangmyeon from a delivery flyer (O)

**

Cloud for Data Science

- Allows you to bypass the physical limitations of the computers and systems you're using

- Allows you to deploy the analytics and storage capacities of advanced machines

- Accessible anywhere, anytime, by anyone (who is authorized)

- Some big tech companies offer cloud platforms with familiar, pre-built environments (AWS, Google Cloud, Jupyter Notebook)

Foundations of Big Data

- refers to the dynamic, large and disparate volumes of data being created by people, tools, and machines

- requires new, innovative, and scalable technology to collect, host and analytically process the vast amount of data 

- Alternative tools such as Apache Spark and Hadoop provide ways to analyze data with scalable computing resources

 

The Vs of Data

- Velocity

- Volume

- Variety

- Veracity (reliable, accurate, categorized, visualized, analyzed)

- Value

Data Science and Big Data

- Data science wasn't always this popular; it has only recently become hot

- How you define big data depends on who you are

    -> To statisticians, big data is anything that can't fit on a thumb drive (* thumb drive : a USB flash drive)

    -> To the professor, big data started with Google, with the question of how to rank the pages returned by a search

- The biggest question in big data now is how to analyze it

 

**

Honestly this part felt a bit like a professor's pep talk...
But it was kind of fun... (guess that's what graduating from college does to you...)

**

What is Hadoop?

- Combining traditional CS, probability, statistics, and mathematics into one field is called Decision Science...

 

**

Um... I think the notes I'd been writing so carefully got deleted........ sigh....

Rewatching the video and taking notes again from here....... T_T

**

 

Big Data Processing Tools

- provide ways to work with large sets of structured, semi-structured, and unstructured data

- to derive value from big data

 

Apache Hadoop

- a collection of tools that provides distributed storage and processing of big data

- allows distributed storage and processing of large datasets across clusters of computers

- a node is a single computer, and a collection of nodes forms a cluster

- provides a reliable, scalable, and cost-effective solution for storing data with no format requirements

 

Benefits include:

- better real-time data-driven decisions

- improved data access and analysis

- data offloading and consolidation (partitioning and consolidating data...?)

 

HDFS (Hadoop Distributed File System)

- storage system for big data that runs on multiple commodity hardware connected through a network

- provides scalable and reliable big data storage by partitioning files over multiple nodes

- splits large files across multiple computers, allowing parallel access to them

- replicates file blocks on different nodes to prevent data loss

 

Benefits include:

- fast recovery from hardware failures

- access to streaming data 

- accommodation of large data sets

- portability
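The splitting and replication described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the HDFS API: all names here (`split_into_blocks`, `place_blocks`, the tiny `BLOCK_SIZE`) are made up for the demo, though the replication factor of 3 matches HDFS's default.

```python
# Toy sketch of HDFS-style block splitting and replica placement.
# Illustrative only: names and sizes are invented (real HDFS blocks
# default to 128 MB); replication factor 3 matches the HDFS default.
from itertools import cycle

BLOCK_SIZE = 8        # bytes, tiny so the demo is visible
REPLICATION = 3       # each block is stored on 3 different nodes

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split raw bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    node_cycle = cycle(range(len(nodes)))
    for block_id, _ in enumerate(blocks):
        start = next(node_cycle)
        placement[block_id] = [nodes[(start + r) % len(nodes)]
                               for r in range(replication)]
    return placement

data = b"the quick brown fox jumps over the lazy dog"
blocks = split_into_blocks(data)
placement = place_blocks(blocks, nodes=["node1", "node2", "node3", "node4"])
print(len(blocks), placement[0])   # → 6 ['node1', 'node2', 'node3']
```

Because every block lives on three different nodes, losing any single node never loses data, which is exactly the "fast recovery from hardware failures" benefit listed above.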

 

Apache Hive

- a data warehouse for data query and analysis

- open-source data warehouse software for reading, writing, and managing large data set files that are stored directly in either HDFS or other data storage systems such as Apache HBase

- queries have high latency : not suitable for applications that need fast response times

- read-based : not suitable for transaction processing that involves a high percentage of write operations

- better suited for : data warehousing tasks (such as ETL, reporting, and data analysis) and easy access to data via SQL
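The "easy access to data via SQL" point is the heart of Hive. The sketch below uses Python's built-in sqlite3 only as a stand-in to show the kind of read-heavy, reporting-style aggregation Hive (via HiveQL) is suited for; it is not Hive itself, and the table and data are invented.

```python
# Illustration of warehouse-style SQL access (the kind of query Hive
# serves over HDFS data). sqlite3 is a stand-in, not Hive; table and
# rows are made up for the demo.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# A typical reporting query: total sales per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)   # → [('east', 150.0), ('west', 250.0)]
```

Note the query only reads; as the bullets above say, Hive's high latency and read-based design make it a poor fit for write-heavy transaction processing.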

 

Apache Spark

- a distributed analytics framework for complex, real-time data analytics

- general-purpose data processing engine designed to extract and process large volumes of data for a wide range of applications (including Interactive Analytics, Streams Processing, Machine Learning, Data Integration, ETL)

 

Key Attributes:

- has in-memory processing, which significantly increases the speed of computations

- provides interfaces for major programming languages such as Java, Scala, Python, R, and SQL

- can run using its standalone clustering technology

- can run on top of other infrastructures, such as Hadoop

- can access data in a large variety of data sources, including HDFS and Hive

- processes streaming data fast

- performs complex analytics in real-time
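Spark's core model is a map stage over in-memory partitions followed by a reduce stage that merges partial results. The pure-Python word count below mimics that shape in a single process (each line stands in for a partition); real Spark would express roughly the same thing as `sc.textFile(...).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)`.

```python
# Pure-Python sketch of the map/reduce pattern Spark parallelizes
# across a cluster. Here both stages run in one process, in memory;
# the input lines are invented sample data.
from collections import Counter
from functools import reduce

lines = ["spark runs in memory", "spark runs fast", "hadoop stores data"]

# "map" stage: each partition (line) produces a local word count
partial_counts = [Counter(line.split()) for line in lines]

# "reduce" stage: merge the partial counts into one result
totals = reduce(lambda a, b: a + b, partial_counts)
print(totals["spark"], totals["runs"])   # → 2 2
```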

 

**

What exactly is ETL, and why does it come up so often?

ETL : Extract, Transform, Load

An accepted method for combining data from an organization's many systems into a single database, data store, data warehouse, or data lake

**
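A minimal end-to-end ETL sketch, under invented data and names: extract records from raw CSV text, transform them (uppercase names, a hypothetical 10% salary adjustment), and load them into a single store (sqlite3 standing in for a warehouse).

```python
# Minimal ETL sketch: Extract from raw CSV text, Transform the
# records, Load into one store. Data, field names, and the 10%
# adjustment are all invented for illustration.
import csv
import io
import sqlite3

raw = "name,salary\nkim,48000\nlee,52000\n"           # Extract: source data
records = list(csv.DictReader(io.StringIO(raw)))

transformed = [(r["name"].upper(), int(r["salary"]) * 1.1)   # Transform
               for r in records]

conn = sqlite3.connect(":memory:")                    # Load
conn.execute("CREATE TABLE staff (name TEXT, salary REAL)")
conn.executemany("INSERT INTO staff VALUES (?, ?)", transformed)
loaded = conn.execute("SELECT COUNT(*) FROM staff").fetchone()[0]
print(loaded)   # → 2
```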

Data Mining

Establishing Data Mining Goals

- the first step in data mining is to set up goals for the exercise

- must identify the key questions that need to be answered

- must determine the expected level of accuracy and usefulness of the results obtained from data mining

- high levels of accuracy from data mining would cost more, and vice versa (vice versa : the other way around)

- the cost-benefit trade-offs for the desired level of accuracy are important considerations for data mining goals

 

Selecting Data

- the output of data mining largely depends upon the quality of the data being used

- data are readily available for further processing

- must identify other sources of data or even plan new data collection initiatives, including surveys

- identifying the right kind of data that could answer the questions at a reasonable cost is critical

 

Preprocessing Data

- identify the irrelevant attributes of data and expunge such attributes from further consideration at this stage (expunge : to erase)

- identifying the erroneous aspects of the data set and flagging them as such is necessary

- data should be subject to checks to ensure integrity

- must develop a formal method of dealing with missing data and determine whether the data are missing randomly or systematically

- must consider in advance whether observations or variables containing missing data should be excluded from the entire analysis or parts of it
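A "formal method" for missing and erroneous data, as the bullets above call for, can be as simple as an explicit pass that flags rather than silently fixes. The records, field names, and rules below are all invented to illustrate the idea.

```python
# Sketch of a formal preprocessing pass: flag erroneous values and
# exclude records with missing fields, explicitly. Records, fields,
# and rules are illustrative inventions.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},   # missing age
    {"age": 29, "income": None},      # missing income
    {"age": -5, "income": 61000},     # erroneous: negative age
]

def clean(recs):
    kept, flagged = [], []
    for r in recs:
        if r["age"] is not None and r["age"] < 0:
            flagged.append(r)   # flag erroneous values, don't silently fix
        elif None in r.values():
            flagged.append(r)   # policy decided in advance: exclude missing
        else:
            kept.append(r)
    return kept, flagged

kept, flagged = clean(records)
print(len(kept), len(flagged))   # → 1 3
```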

 

Transforming Data

- important consideration in data mining is to reduce the number of attributes needed to explain the phenomena

- data reduction algorithms, such as principal component analysis, can reduce the number of attributes without a significant loss of information

- variables may need to be transformed to help explain the phenomenon being studied

- it may be prudent to transform variables from one type to another, e.g. transforming the continuous variable for income into a categorical variable where each record in the database is identified as a low-, medium-, or high-income individual (prudent : careful, sensible)
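The income example above is a direct one to code. The cut points here are invented for illustration; real thresholds would come from the data or the domain.

```python
# Transforming a continuous variable (income) into a categorical one
# (low / medium / high). The cut points are illustrative assumptions.
def income_band(income: float) -> str:
    if income < 30000:
        return "low"
    elif income < 80000:
        return "medium"
    return "high"

incomes = [12000, 45000, 95000]
bands = [income_band(x) for x in incomes]
print(bands)   # → ['low', 'medium', 'high']
```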

 

Storing Data

- transformed data must be stored in a format that makes it conducive to data mining (conducive : favorable to)

- must be stored in a format that gives unrestricted and immediate read/write privileges to the data scientist

- important to store data on servers or storage media that keeps the data secure and also prevents the data mining algorithm from unnecessarily searching for pieces of data scattered on different servers or storage media

 

Mining Data

- the data mining step covers data analysis methods, including parametric and non-parametric methods, and machine-learning algorithms (parametric : involving parameters)

- multidimensional views of the data using the advanced graphing capabilities of data mining software are very helpful in developing a preliminary understanding of the trends hidden in the data set

 

Evaluating Mining Results

- formal evaluation could include testing the predictive capabilities of the models on observed data to see how effective and efficient the algorithms have been in reproducing data

- the results are shared with the key stakeholders for feedback, which is then incorporated in the later iterations of data mining to improve the process

 

Data mining and evaluating the results becomes an iterative process such that the analysts use better and improved algorithms to improve the quality of results generated in light of the feedback received from the key stakeholders.
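Testing a model's "predictive capabilities on observed data", as described above, reduces to comparing predictions against observed outcomes and scoring the match. The values below are invented; accuracy is just one possible score.

```python
# Sketch of the evaluation step: score how well a model's predictions
# reproduce observed data. Observations and predictions are invented.
observed    = [1, 0, 1, 1, 0, 1]
predictions = [1, 0, 0, 1, 0, 1]   # from some fitted model (illustrative)

accuracy = sum(o == p for o, p in zip(observed, predictions)) / len(observed)
print(round(accuracy, 2))   # → 0.83
```

In the iterative process described above, a score like this would go back to stakeholders, and a low one would trigger another pass with improved algorithms.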

 

**

"Stakeholders" — a word that also came up a lot in the next lecture.

stakeholder : an interested party

An individual or group that has an interest in a company.

Beyond shareholders and bondholders, workers, consumers, and subcontractors are also stakeholders of a company.

**

Practice Quiz : Data Mining

Q. What is one of the key considerations when setting up goals for data mining?

A. The level of accuracy expected from the results

 

Q. What is the purpose of data preprocessing in data mining?

A. To ensure the integrity of data, deal with missing data, and remove irrelevant attributes

 

Q. What is the purpose of evaluating data mining results?

A. To conduct an "in-sample forecast" to test the predictive capabilities of models

 

Lesson Summary : Big Data and Data Mining

Instigating Fundamental Changes (instigate : to bring about; to incite)

- causing transformation of business and industry

- requiring changes to business

- impacting every aspect of the organization

 

Data Availability

- handling massive data sets requires new technologies

- derive insights related to consumers, risk, profit, productivity

- enhancing business value

 

Big Data Characteristics

- Value : investment in big data creates value

- Volume : scale of the data

- Velocity : speed it is collected

- Variety : comes from a variety of sources

- Veracity : conforms to facts

 

Cloud

- enables us to work with big data

- on-demand computing resources

- pay-for-use basis

 

Cloud Characteristics

- On-demand : access to processing, storage, and network

- network access : resources access via the internet

- resource pooling : shared resources dynamically assigned

- elasticity : automatically scales resource access

- measured service : only pay for what you use or reserve

 

Cloud Benefits to Data Science

- addresses computing challenges related to : scalability, collaboration, accessibility, maintenance

- gives instant access to latest technology and tools

 

Open-Source Big Data Computing Tools

- apache hadoop : distributed storage and processing

- apache hive : provides large data set management

- apache spark : data processing engine

 

Data Mining Process

1. goal set : identify key questions

2. select data : identify data sources

3. preprocess : clean the data

4. transform : determine storage needs

5. data mine : determine methods and analyze

6. evaluate : assess outcomes, share results

 

Practice Quiz : Big Data and Data Mining

Q. What was the key discovery made by the Houston Rockets NBA team through the analysis of video tracking data?

A. Two-point dunks from inside the two-point zone and three-point shots from outside the three-point line provided the best opportunities for high scores

 

Q. What is one of the key advantages of using the Cloud for data scientists?

A. It allows collaboration among multiple teams on the same data

 

Q. What is Hadoop primarily known for in the context of handling data?

A. Data analysis