떼닝로그
Tools for Data Science - Data Science Tools
Tools for Data Science
Data Science Tools
Categories of Data Science Tools
Data Science Categories
- raw data passes through a series of data science task categories : Data Management, Data Integration and Transformation, Data Visualization, Model Building, Model Deployment, and Model Monitoring and Assessment
- these tasks are supported by data asset management, code asset management, execution environments, and development environments
Data Management
- Collecting, persisting, and retrieving data securely, efficiently, and cost-effectively
- Data is collected from many sources
Data Integration and Transformation
- Extract, Transform, and Load (ETL)
- Extract the data and save it in a central repository
- Data Transformation is the process of transforming the values, structure and format of data
- Transformed data is loaded back to the data warehouse
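The three ETL steps above can be sketched in a few lines. This is only a minimal illustration using the Python standard library: the CSV content, the column names, and the `weather` table are made up for the example, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a source system.
RAW_CSV = """name,temp_f
Seoul,68
Busan,77
Jeju,
"""

def extract(text):
    """Extract: read rows from the raw CSV text."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop incomplete rows and convert Fahrenheit to Celsius."""
    cleaned = []
    for row in rows:
        if not row["temp_f"]:
            continue  # skip rows with a missing value
        celsius = round((float(row["temp_f"]) - 32) * 5 / 9, 1)
        cleaned.append((row["name"], celsius))
    return cleaned

def load(records, conn):
    """Load: write the transformed records into a central table."""
    conn.execute("CREATE TABLE IF NOT EXISTS weather (name TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO weather VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stands in for a data warehouse
load(transform(extract(RAW_CSV)), conn)
result = conn.execute("SELECT name, temp_c FROM weather ORDER BY name").fetchall()
print(result)
```

The transform step changes both the values (unit conversion) and the structure (dropping the incomplete row) before anything is loaded.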
Data Visualization
- Graphical representation of data and information
- In the form of charts, plots, maps, and animations
- Data visualization conveys data more effectively
Data Visualization Examples
- Bar chart : compares the size of each component
- Treemap : displays hierarchical data
- Line chart : plots a series of data points over time
- Map chart : displays data by location; it can also be applied to other kinds of locations, such as websites
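To make the bar chart idea concrete, here is a rough text-only sketch that needs no plotting library. The `sales` values and the `text_bar_chart` helper are hypothetical; a real chart would normally be drawn with a visualization tool.

```python
# Hypothetical category sizes, used only for illustration.
sales = {"A": 12, "B": 5, "C": 8}

def text_bar_chart(data, width=24):
    """Return one line per category, with bar length proportional to its value."""
    largest = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(value * width / largest)
        lines.append(f"{label} | {bar} {value}")
    return lines

chart = text_bar_chart(sales)
print("\n".join(chart))
```

Each bar is scaled against the largest value, so the relative sizes of the components can be compared at a glance.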
Model Building
- the step where you train a model on the data and analyze patterns using suitable machine learning algorithms
- eg. IBM Watson Machine Learning can be used to create machine learning models
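As a very small picture of what "training" means, the sketch below fits a straight line to made-up data with ordinary least squares. The data and the function names are assumptions for illustration; this is not how Watson Machine Learning works internally, just the simplest possible model-building step.

```python
# Hypothetical training data: hours studied -> exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(hours, scores)

def predict(x):
    """Use the learned parameters to predict a new value."""
    return slope * x + intercept

print(slope, intercept, predict(6.0))
```

The learned pattern here is simply "each extra hour adds a fixed number of points"; real models capture far richer patterns, but the train-then-predict shape is the same.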
Model Deployment
- process of integrating a model into a production environment
- the model is made available through APIs so that other applications can make data-based decisions
- eg. SPSS Collaboration and Deployment Services
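The "model behind an API" idea can be sketched as a function that takes a JSON request and returns a JSON response. Everything here is hypothetical (the `MODEL` parameters, the `hours` field, the `predict_endpoint` name); in a real deployment a web server or a deployment service would sit in front of this logic.

```python
import json

# Hypothetical model parameters, assumed to come from an earlier training step.
MODEL = {"slope": 2.0, "intercept": 50.0}

def predict_endpoint(request_body):
    """Handle one prediction request: JSON text in, JSON text out."""
    try:
        payload = json.loads(request_body)
        hours = float(payload["hours"])
    except (ValueError, KeyError, TypeError):
        return json.dumps({"error": "expected a JSON object with a numeric 'hours' field"})
    score = MODEL["slope"] * hours + MODEL["intercept"]
    return json.dumps({"predicted_score": score})

response = predict_endpoint('{"hours": 6}')
print(response)
```

The caller never sees the model itself, only the request and response format, which is what lets other applications use the prediction.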
Model Monitoring and Assessment
- Model Monitoring tracks deployed models
- Model Assessment checks for accuracy, fairness, and robustness
- IBM Watson OpenScale is a popular Model Monitoring and Assessment tool
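One simple way to picture monitoring is to track accuracy over the most recent predictions and raise a flag when it drops. The class below is a hypothetical sketch (the window size and threshold are arbitrary), and it only covers accuracy, not fairness or robustness.

```python
from collections import deque

class AccuracyMonitor:
    """Track accuracy over the most recent predictions of a deployed model."""

    def __init__(self, window=5, threshold=0.6):
        self.results = deque(maxlen=window)  # True for each correct prediction
        self.threshold = threshold

    def record(self, predicted, actual):
        self.results.append(predicted == actual)

    def accuracy(self):
        if not self.results:
            return None  # nothing observed yet
        return sum(self.results) / len(self.results)

    def needs_attention(self):
        acc = self.accuracy()
        return acc is not None and acc < self.threshold

monitor = AccuracyMonitor(window=5, threshold=0.6)
# Hypothetical (predicted, actual) pairs arriving over time.
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1), (1, 0)]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_attention())
```

Because the window only keeps recent results, a model that used to be accurate but has since degraded still gets flagged.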
Code Asset Management
- provides a unified view where you manage an inventory of assets
- developers use versioning to track and manage changes to a software project's code
- collaboration allows diverse people to share and update the same project together
- eg. GitHub
Data Asset Management
- Platform for organizing and managing the data
- supports replication, backup, and access right management
Development Environments
- Integrated development environments (IDEs) provide a workspace and tools to work on source code
- IDEs like IBM Watson Studio provide testing and simulation tools to emulate the real world so you can see how your code will behave after it is deployed
Execution Environment
- has libraries for code compiling and system resources to execute and verify code
- cloud-based execution environments aren't tied to specific hardware or software
- IBM Watson Studio has tools for data preprocessing, model training, and deployment
Fully-Integrated visual tools
- cover all tooling components, and can be used to develop deep learning and machine learning models
- eg. Watson Studio by IBM, IBM Cognos Dashboard Embedded by IBM
Recap
- data science task categories : data management, data integration and transformation, data visualization, model building, model deployment, model monitoring and assessment
- data science tasks are supported by : data asset management, code asset management, execution environments, development environments
**
The parts I mark in bold are partly "this is important!!",
but I also tend to bold things that are simply new to me...
**
Open Source Tools for Data Science - Part 1
Data Management
- Open-source data management tools include relational databases (eg. PostgreSQL), NoSQL databases (eg. MongoDB), text search engines (eg. Elasticsearch), and distributed file systems (eg. Hadoop HDFS), ...
Data integration and transformation
- in the classic data warehousing world, data integration and transformation is called ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform)
- also termed data refinery and cleansing
- eg. Apache Airflow, Kubeflow, Apache Kafka, Node-RED
Data Visualization
open source tools fall into two groups :
- programming libraries, where you need to write code
- tools containing a user interface
Model monitoring and assessment
- tools that keep track of a machine learning model's prediction performance, so that outdated models can be detected and updated
- eg. ModelDB, Prometheus, IBM AI Fairness 360 Open Source Toolkit
Code Asset Management
- tools for code asset management, also known as version management or version control
- eg. Git, GitHub, GitLab, Bitbucket
Data Asset Management
- tools for data asset management, also known as data governance or data lineage
- eg. Apache Atlas, ODPi Egeria
**
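This was just a list of "these things exist"...
I probably could have skipped this lecture
It's kind of funny how the explanations get much longer for the IBM tools than for anything else, lol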
**
Open Source Tools for Data Science - Part 2
Jupyter
- supports more than a hundred different programming languages through "kernels"
- kernels encapsulate the execution environments for the different programming languages
- key property of Jupyter Notebook is to unify documentation, code, output from code, shell commands, and visualizations in a single document
RStudio
- one of the oldest development environments for statistics and data science
- exclusively runs R and its associated libraries
- Python development is nevertheless possible from within the tool
- R is tightly integrated into the tool, which provides an optimal user experience
- unifies : programming, execution, debugging, remote data access, data exploration, visualization
Apache Spark
- provides cluster execution environment
- provides linear scalability : the more servers in a cluster, the higher the performance
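The basic idea behind cluster execution is to split the data into partitions, process each partition independently, and combine the partial results. The sketch below imitates that with local threads standing in for servers. It does not use the Spark API and says nothing about real performance; the helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, parts):
    """Split data into roughly equal partitions, one per worker."""
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_sum_of_squares(partition):
    """Work done independently on one partition."""
    return sum(x * x for x in partition)

data = list(range(1, 101))
partitions = split(data, parts=4)

# Each worker handles one partition; the partial results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum_of_squares, partitions))
total = sum(partials)
print(len(partitions), total)
```

Since no partition depends on another, adding more workers (servers, in a real cluster) lets more partitions be processed at the same time.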
**
This one is also a list of this and that...
I guess I won't really know until I try them myself :(
**
Commercial Tools for Data Science
Data Management
- eg. Oracle Database, MSSQL, IBM DB2
- commercial support is delivered by : software vendors, influential partners, support networks
Data Integration and Transformation
- Extract, Transform, and Load (ETL) tools : Informatica, IBM InfoSphere DataStage
- support the design and deployment of ETL data processing pipelines through a graphical interface
- bring along connectors to most of the commercial and open-source target information systems
Data Visualization
- focus of these tools is to create visual reports and live dashboards
- eg. Tableau, MS PowerBI, IBM Cognos Analytics
- visualization can show relationships between different columns in a table
Model Building
- to build models with a commercial tool, you should consider a data mining product
- eg. SPSS Modeler, SAS Enterprise Miner
Model Deployment
- tightly integrated into the model-building process
- commercial software can export models in an open format
Model monitoring and code asset management
- open-source tools are the usual choice for both model monitoring and code asset management
Data asset management
- functions include : data governance, data versioning and annotation, data dictionary, data lineage, data privacy and retention
Development environments (Watson Studio Desktop)
- fully integrated development environment for data scientists
- combines Jupyter Notebooks with graphical tools
- fully integrated visual tools
Cloud Based Tools for Data Science
Fully integrated visual tools and platforms
Large-scale execution of data science workflows happens in compute clusters :
- composed of multiple server machines
- Watson Studio and Watson OpenScale together cover the complete development life cycle for all data science, machine learning, and AI tasks
- a fully cloud-hosted offering supports the complete development life cycle of all data science, machine learning, and AI tasks
H2O Driverless AI:
- a product that you download and install
- one-click deployment for the common cloud service providers
- cloud provider does not do operations and maintenance
Data Management
- comprises software as a service (SaaS) versions of existing open source and commercial tools
- eg. Amazon DynamoDB : allows storing and retrieving data in a key-value or a document store format
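The difference between a key-value format and a document format can be shown with plain Python dictionaries. This is not the DynamoDB API, only an illustration of the two data shapes, and the keys and values are made up.

```python
import json

# Key-value style: an opaque value looked up by a single key.
key_value_store = {}
key_value_store["user:42"] = "Kim"

# Document style: each value is a structured, JSON-like document.
document_store = {}
document_store["user:42"] = {
    "name": "Kim",
    "courses": ["Tools for Data Science"],
    "address": {"city": "Seoul"},
}

# Documents serialize naturally to JSON and keep their nested structure.
serialized = json.dumps(document_store["user:42"])
restored = json.loads(serialized)
print(key_value_store["user:42"], restored["address"]["city"])
```

In the document style, nested fields such as the city can be read directly from the stored item, whereas a key-value store only hands back the whole value.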
Data Integration and Transformation
- Comprises ETL and ELT tools
- transformation steps are pushed toward the domain of the data scientist or data engineer
Data Refinery:
- is part of IBM Watson Studio
- allows transforming large amounts of raw data into consumable, quality information in a spreadsheet-style user interface
Data Visualization
- 3D bar chart : visualizes a target value on the vertical dimension, which depends on two other values in the horizontal dimensions
- Hierarchical edge bundling : depicts correlations and affiliations between entities
- 2D scatter plot with a heat map : shows two dependent data fields on the y-axis with different color intensities
- Tree map : shows the distribution of subsets within a set
- Word cloud : makes significant terms in a document corpus stand out
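Behind a word cloud there is essentially a term count: the more often a term appears in the corpus, the larger it is drawn. A minimal sketch of that counting step, with a made-up corpus and stop-word list, might look like this (drawing the cloud itself is left to a visualization tool).

```python
from collections import Counter
import re

# Hypothetical mini corpus and stop-word list, for illustration only.
corpus = [
    "Data science tools support data management",
    "Open source tools support model building",
    "Cloud tools support data visualization",
]
STOP_WORDS = {"the", "a", "and", "of"}

def term_frequencies(documents):
    """Count how often each non-stop-word term appears across all documents."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

frequencies = term_frequencies(corpus)
print(frequencies.most_common(3))
```

The terms with the highest counts are the ones a word cloud would render most prominently.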
Model Deployment
- tightly integrated into the model-building process
- commercial software can export models in an open format, such as Predictive Model Markup Language (PMML)
Model monitoring and assessment
- a cloud tool to monitor deployed machine learning and deep learning models continuously
- eg. Amazon SageMaker Model Monitor, IBM Watson OpenScale
**
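There was more overlap with the material above than I expected...
And the conclusion seems to be steered toward "IBM's own software is the best!!!"
Well, it is a course they made themselves...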
**
Practice Quiz - Data Science Tools
Q. In which Data Science category do you extract, transform, and load data?
A. Data Integration and Transformation
Q. Which open-source tool is the standard used for code asset management?
A. Git
Q. Which open source tool supports more than a hundred different programming languages through "kernels"?
A. Jupyter
Q. Which commercial tool can be used to define and execute data integration processes in a spreadsheet-style?
A. Watson Studio Desktop
Q. Which service offers a document data structure format?
A. JSON