떼닝로그

Tools for Data Science - Data Science Tools

Coursera/IBM Data Science


떼닝 2023. 12. 12. 05:58

Tools for Data Science

Data Science Tools

Categories of Data Science Tools

Data Science Categories

- raw data passes through a series of Data Science task categories : Data Management, Data Integration and Transformation, Data Visualization, Model Building, Model Deployment, and Model Monitoring and Assessment

- to do these tasks, data asset management and code asset management, execution environments, and development environments are needed

 

Data Management

- Collecting, persisting, and retrieving data securely, efficiently, and cost-effectively

- Data is collected from many sources

 

Data Integration and Transformation

- Extract, Transform, and Load (ETL)

- Extract the data and save it in a central repository

- Data Transformation is the process of transforming the values, structure and format of data

- Transformed data is loaded back to the data warehouse
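
The ETL steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular ETL tool works: the CSV text, the `orders` table, and the function names are all made up, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a source system.
RAW_CSV = """order_id,amount,country
1, 19.90 ,kr
2,5.00,US
3,,us
"""

def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: fix values, structure, and format; drop incomplete rows."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip rows with a missing amount
        cleaned.append((int(row["order_id"]), float(amount), row["country"].strip().upper()))
    return cleaned

def load(rows, conn):
    """Load: write the transformed rows into the central repository."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
load(transform(extract(RAW_CSV)), conn)
totals = conn.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(totals)
```

Keeping extract, transform, and load as separate functions makes it easy to swap the order into ELT (load the raw rows first, transform inside the database).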

 

Data Visualization

- Graphical representation of data and information

- In the form of charts, plots, maps, and animations

- Data visualization conveys data more effectively

 

Data Visualization Examples

- Bar chart : compares the size of each component

- Treemap : displays hierarchical data

- Line chart : plots a series of data points over time

- Map chart : displays data by location, and can also be applied to other kinds of location, such as websites
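
As a rough idea of what a bar chart does (comparing the size of each component), here is a plain-text sketch. Real work would normally use a plotting library; the `text_bar_chart` function and the sales numbers are made up for illustration.

```python
def text_bar_chart(values, width=20):
    """Return one line per category, scaled so the largest value fills `width`."""
    largest = max(values.values())
    lines = []
    for label, value in values.items():
        bar = "#" * round(value / largest * width)
        lines.append(f"{label:<8}|{bar} {value}")
    return lines

sales = {"Seoul": 40, "Busan": 20, "Incheon": 10}  # made-up numbers
chart = text_bar_chart(sales)
print("\n".join(chart))
```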

 

Model Building

- the step where you train a model on the data and analyze patterns using suitable machine learning algorithms

- eg. machine learning models can be created with IBM Watson Machine Learning
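
To make "training a model" a bit more concrete, the sketch below fits a straight line with ordinary least squares using only the standard library. It is not tied to Watson or any other tool from the course; the `fit_line` function and the hours/scores data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

hours = [1, 2, 3, 4]
scores = [52, 54, 56, 58]  # made-up data following score = 2 * hours + 50

slope, intercept = fit_line(hours, scores)

def predict(x):
    """Use the learned pattern on a new input."""
    return slope * x + intercept

print(slope, intercept, predict(5))
```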

 

Model Deployment

- process of integrating a model into a production environment

- uses APIs to enable data-based decisions

- eg. SPSS Collaboration and Deployment Services
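
A deployed model usually sits behind an API that takes a request and returns a prediction. The sketch below shows only the request-handling part, with no web server; the `MODEL` coefficients, the `hours` field, and `handle_predict` are hypothetical names, not part of any product mentioned above.

```python
import json

# Hypothetical trained model; coefficients would normally be loaded from a file.
MODEL = {"slope": 2.0, "intercept": 50.0}

def handle_predict(request_body):
    """Handle one JSON request and return a JSON response string."""
    try:
        payload = json.loads(request_body)
        x = float(payload["hours"])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing field, or a non-numeric value.
        return json.dumps({"error": 'expected JSON like {"hours": 3}'})
    score = MODEL["slope"] * x + MODEL["intercept"]
    return json.dumps({"predicted_score": score})

response = handle_predict('{"hours": 3}')
print(response)
```

In a real deployment this function would be called by a web framework or a managed service, so that other applications can use the prediction for data-based decisions.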

 

Model Monitoring and Assessment

- Model Monitoring tracks deployed models

- Model Assessment checks for accuracy, fairness, and robustness

- IBM Watson OpenScale is a popular Model Monitoring and Assessment tool
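
A very small version of model monitoring is to recompute accuracy on each new batch of labeled data and flag the batches where it drops. The sketch below covers accuracy only (not fairness or robustness); the batch names, data, and the 0.8 threshold are made up.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def flag_degraded(batches, threshold=0.8):
    """Return the names of batches whose accuracy is below the threshold."""
    return [name for name, (preds, labels) in batches.items()
            if accuracy(preds, labels) < threshold]

batches = {
    "week1": ([1, 0, 1, 1, 0], [1, 0, 1, 1, 0]),  # 5 of 5 correct
    "week2": ([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]),  # 4 of 5 correct
    "week3": ([1, 1, 1, 1, 0], [0, 0, 1, 0, 0]),  # 2 of 5 correct
}
degraded = flag_degraded(batches)
print(degraded)
```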

 

Code Asset Management

- is a unified view where you manage an inventory of assets

- developers use versioning to track and manage changes to a software project's code

- collaboration allows diverse people to share and update the same project together

- eg. GitHub

 

Data Asset Management

- Platform for organizing and managing the data

- supports replication, backup, and access right management

 

Development Environments

- Integrated development environments (IDEs) provide a workspace and tools to work on source code

- IDEs like IBM Watson Studio provide testing and simulation tools to emulate the real world so you can see how your code will behave after it is deployed

 

Execution Environment

- has libraries for compiling code and system resources for executing and verifying it

- cloud-based execution environments aren't tied to specific hardware or software

- IBM Watson Studio has tools for data preprocessing, model training, and deployment

 

Fully-Integrated visual tools

- cover all tooling components, and can be used to develop deep learning and machine learning models

- eg. IBM Watson Studio, IBM Cognos Dashboard Embedded

 

Recap

- Data Science task categories : data management, data integration and transformation, data visualization, model building, model deployment, model monitoring and assessment

- data science tasks are supported by : data asset management, code asset management, execution environments, development environments

 

**

Some of the bolded parts do mean "this is important!!", but

I also tend to bold things that are simply... new to me?

**

 

Open Source Tools for Data Science - Part 1

Data Management

- Open-source data management tools include relational databases (eg. PostgreSQL), NoSQL databases (eg. MongoDB), text search indexes (eg. Elasticsearch), and file-based storage (eg. Hadoop HDFS), ...

 

Data integration and transformation

- in the classic data warehousing world, data integration and transformation is called ETL or ELT

- also termed Data Refinery and Cleansing

- eg. Apache Airflow, Kubeflow, Apache Kafka, Node-RED

 

Data Visualization

open-source tools come in two kinds : 

- programming libraries, where you need to write code

- tools with a user interface

 

Model monitoring and assessment

- tools that keep track of a machine learning model's prediction performance so that outdated models can be detected and updated

- eg. ModelDB, Prometheus, IBM AI Fairness 360 Open Source Toolkit

 

Code Asset Management

- tools for code asset management, also known as version management or version control

- eg. Git, GitHub, GitLab, Bitbucket

 

Data Asset Management

- tools for data asset management, also known as data governance or data lineage

- eg. Apache Atlas, ODPi Egeria

 

**

Just a list of "these tools exist"...

I probably could have skipped this one

It's kind of funny how much longer the explanations get for the IBM tools than for anything else lol

**

 

Open Source Tools for Data Science - Part 2

Jupyter

- supports more than a hundred different programming languages through "kernels"

- encapsulates the execution environment for the different programming languages

- key property of Jupyter Notebook is to unify documentation, code, output from code, shell commands, and visualizations in a single document
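
The "single document" idea is easier to see in the file itself: a notebook is a JSON document whose metadata names the kernel and whose cells hold the documentation and code. The sketch below builds a simplified notebook-like structure by hand; it follows the general shape of the version 4 notebook format, but the content is made up and it has not been checked against a specific Jupyter release.

```python
import json

notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {
        # The kernelspec names the kernel that will execute the code cells.
        "kernelspec": {"name": "python3", "display_name": "Python 3", "language": "python"}
    },
    "cells": [
        # Documentation and code live side by side in the same document.
        {"cell_type": "markdown", "metadata": {}, "source": "# Sales report"},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": "print(1 + 1)"},
    ],
}

text = json.dumps(notebook, indent=1)  # roughly what an .ipynb file contains
cell_types = [cell["cell_type"] for cell in json.loads(text)["cells"]]
print(cell_types)
```

Switching to another language would mean pointing `kernelspec` at a different kernel; the rest of the document structure stays the same.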

 

RStudio

- one of the oldest development environments for statistics and data science

- primarily runs R and its associated libraries

- Python development is also possible

- R is tightly integrated into the tool, which gives the best user experience

- unifies : programming, execution, debugging, remote data access, data exploration, visualization

 

Apache Spark

- provides cluster execution environment

- provides linear scalability : the more servers in a cluster, the higher the performance
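
The idea behind cluster execution can be sketched without Spark: split the data into partitions, let each server work on its own partition, then combine the partial results. The code below runs everything in one process and uses made-up numbers, so it only illustrates the pattern and the idealized "linear" arithmetic, not real Spark behavior.

```python
def split_into_partitions(data, workers):
    """Deal the records out into `workers` roughly equal partitions."""
    return [data[i::workers] for i in range(workers)]

def partial_sum(partition):
    """Work done independently on one server."""
    return sum(partition)

data = list(range(1, 101))
partitions = split_into_partitions(data, workers=4)
total = sum(partial_sum(p) for p in partitions)  # combine the partial results
print([len(p) for p in partitions], total)

# Idealized linear scaling: runtime shrinks in proportion to the server count.
single_server_minutes = 120  # made-up number
for servers in (1, 2, 4):
    print(servers, single_server_minutes / servers)
```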

 

**

Another list of this and that...

I guess I won't really know until I actually try them :(

**

 

Commercial Tools for Data Science

Data Management

- eg. Oracle Database, Microsoft SQL Server, IBM Db2

- commercial support is delivered by : software vendors, influential partners, support networks

 

Data Integration and Transformation

- Extract, Transform, and Load (ETL) tools : Informatica, IBM InfoSphere DataStage

- support the design and deployment of ETL data processing pipelines through a graphical interface

- bring along connectors to most of the commercial and open-source target information systems

 

Data Visualization

- focus of these tools is to create visual reports and live dashboards

- eg. Tableau, MS PowerBI, IBM Cognos Analytics

- visualization can show relationships between different columns in a table

 

Model Building

- to build a model with a commercial tool, consider a data mining product

- eg. SPSS Modeler, SAS Enterprise Miner

 

Model Deployment

- tightly integrated into the model-building process

- commercial software can export models in an open format

 

Model monitoring and code asset management

- open source is the first choice for both model monitoring and code asset management

 

Data asset management

- functions include : data governance, data versioning and annotation, data dictionary, data lineage, data privacy and retention

 

Development environments (Watson Studio Desktop)

- fully integrated development environment for data scientists

- combines Jupyter Notebooks with graphical tools

- fully integrated visual tools

 

Cloud Based Tools for Data Science

Fully integrated visual tools and platforms

Large-scale execution of data science workflows happens in compute clusters : 

- composed of multiple server machines

- Watson Studio and Watson OpenScale are a fully cloud-hosted offering that covers the complete development life cycle for all data science, machine learning, and AI tasks

 

H2O Driverless AI:

- a product that you download and install

- one-click deployment for the common cloud service providers

- cloud provider does not do operations and maintenance

 

Data Management

- comprises software as a service (SaaS) versions of existing open source and commercial tools

- eg. Amazon DynamoDB : allows storing and retrieving data in a key-value or a document store format
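
To show what "key-value or document store" means in practice, here is an in-memory stand-in. It does not use DynamoDB or its API; `TinyDocumentStore`, the key format, and the stored document are all hypothetical.

```python
import json

class TinyDocumentStore:
    """In-memory stand-in for a key-value / document store (hypothetical class)."""

    def __init__(self):
        self._items = {}

    def put(self, key, document):
        # Store a JSON string so that what comes back is a copy of the document.
        self._items[key] = json.dumps(document)

    def get(self, key):
        raw = self._items.get(key)
        return None if raw is None else json.loads(raw)

store = TinyDocumentStore()
store.put("user#42", {"name": "Kim", "courses": ["Tools for Data Science"]})
user = store.get("user#42")
print(user["courses"][0])
```

The key is all the store needs to find an item; the document behind it can be nested and does not have to follow a fixed table schema.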

 

Data Integration and Transformation

- Comprises ETL and ELT tools

- transformation steps are pushed toward the domain of the data scientist or data engineer

Data Refinery:

- is part of IBM Watson Studio

- allows transforming large amounts of raw data into consumable, quality information in a spreadsheet-like user interface

 

Data Visualization

- 3D bar chart : visualizes a target value on the vertical dimension, which is dependent on two other values in the horizontal dimensions

- Hierarchical edge bundling : depicts correlations and affiliations between entities

- 2D scatter plot with a heat map : shows two dependent data fields on the y-axis with different color intensities

- Tree map : shows the distribution of subsets within a set

- Word cloud : pops out significant terms in a document corpus
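
Behind a word cloud is a simple term count: words that appear more often in the corpus are drawn larger. The sketch below does only the counting step; the stop-word list and the three example documents are made up, and a real tool would do more careful text processing.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "and", "of", "is", "to", "in", "for"}  # tiny illustrative list

def term_frequencies(corpus):
    """Count how often each non-stop word appears across all documents."""
    counts = Counter()
    for document in corpus:
        words = re.findall(r"[a-z]+", document.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

corpus = [
    "Data science tools for data management",
    "Model building and model deployment",
    "Data visualization of the model output",
]
frequencies = term_frequencies(corpus)
top = frequencies.most_common(2)
print(top)
```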

 

Model Deployment

- tightly integrated into the model-building process

- commercial software can export models in an open format, such as Predictive Model Markup Language (PMML)

 

Model monitoring and assessment

- a cloud tool to monitor deployed machine learning and deep learning models continuously

- eg. Amazon SageMaker Model Monitor, IBM Watson OpenScale

 

**

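There was more overlap with the sections above than I expected...

And the conclusion seems to be steered toward "IBM's own software is the best!!!"

Well, it is their own course, after all...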

**

 

Practice Quiz - Data Science Tools

Q. In which Data Science category do you extract, transform, and load data?

A. Data Integration and Transformation

 

Q. Which open-source tool is the standard used for code asset management?

A. Git

 

Q. Which open source tool supports more than a hundred different programming languages through "kernels"?

A. Jupyter

 

Q. Which commercial tool can be used to define and execute data integration processes in a spreadsheet-style?

A. Watson Studio Desktop

 

Q. Which service offers a document data structure format?

A. JSON
