Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Data science is related to data mining, machine learning, and big data.
Data science has emerged as one of the most popular and highest-paying fields of the 21st century.
Data scientists use various tools for extracting, manipulating, and pre-processing data and for generating predictions from it. In this article, we share some of the most popular data science tools used by data scientists today. The list draws on Kaggle's State of Machine Learning and Data Science 2019 survey, which is said to be the most comprehensive dataset available on the state of machine learning and data science today.
Python
Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. It is the language of choice for Data Scientists all over the world and is one of the most popular programming languages.
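As a small illustration of the significant whitespace mentioned above, the sketch below averages values per label. The function name `mean_by_label` and the sample data are made up for this example; the point is only that indentation, not braces, marks where the loop and function bodies begin and end.

```python
def mean_by_label(rows):
    """Average the numeric values for each label in (label, value) pairs."""
    totals = {}
    counts = {}
    for label, value in rows:
        # Indentation alone marks these two lines as the loop body.
        totals[label] = totals.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    # Dedenting back to the function level ends the loop.
    return {label: totals[label] / counts[label] for label in totals}


# Hypothetical sample data, used only for illustration.
sample = [("a", 1.0), ("b", 4.0), ("a", 3.0)]
means = mean_by_label(sample)
print(means)
```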
R
R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
Jupyter
Project Jupyter is a nonprofit organization created to "develop open-source software, open-standards, and services for interactive computing across dozens of programming languages". Project Jupyter has developed and supported the interactive computing products Jupyter Notebook, JupyterHub, and JupyterLab, the next-generation version of Jupyter Notebook. Jupyter Notebook is the de facto standard interactive environment for data scientists today.
RStudio
RStudio is an integrated development environment for R. It is available in two formats: RStudio Desktop is a regular desktop application while RStudio Server runs on a remote server and allows accessing RStudio using a web browser.
PyCharm
PyCharm is an integrated development environment used in computer programming, specifically for the Python language. It provides code analysis, a graphical debugger, an integrated unit tester, integration with version control systems, and supports data science with Anaconda.
Visual Studio/Visual Studio Code
Visual Studio is an integrated development environment from Microsoft. It includes a code editor supporting IntelliSense as well as code refactoring.
Visual Studio Code is a free source-code editor made by Microsoft. Its features include support for debugging, syntax highlighting, intelligent code completion, snippets, code refactoring, and embedded Git.
In both Visual Studio and Visual Studio Code users can change the theme, keyboard shortcuts, preferences, and install extensions that add additional functionality.
In the Stack Overflow 2019 Developer Survey, Visual Studio Code was ranked the most popular developer environment tool, with 50.7% of 87,317 respondents reporting that they use it.
Scikit-learn
Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
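To make the description more concrete, here is a minimal sketch that uses two of the algorithms named above, a random forest for classification and k-means for clustering, on synthetic data. The dataset, the cluster centers, and the hyperparameters are assumptions chosen for illustration, not recommendations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 2-D data with three well-separated groups (illustrative only).
X, y = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 0), (5, 5)],
    cluster_std=0.5,
    random_state=0,
)

# Supervised learning: fit a random forest and score it on held-out data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"Random forest accuracy: {accuracy:.2f}")

# Unsupervised learning: group the same points with k-means, ignoring y.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)
print(f"Clusters found: {len(set(cluster_labels))}")
```

Both estimators follow the same pattern of constructing an object and then calling `fit`, and `X` is a plain NumPy array, which reflects the interoperability with NumPy described above.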
Keras
Keras is an open-source library that provides a Python interface for artificial neural networks. Keras contains numerous implementations of commonly used neural-network building blocks such as layers, objectives, activation functions, and optimizers, along with a host of tools that make working with image and text data easier. Together these simplify the code needed to write deep neural networks.
XGBoost
XGBoost is an open-source software library which provides a gradient boosting framework for C++, Java, Python, R, Julia, Perl, and Scala. From the project description, it aims to provide a "Scalable, Portable and Distributed Gradient Boosting (GBM, GBRT, GBDT) Library". It has gained much popularity and attention recently as the algorithm of choice for many winning teams of machine learning competitions.
TensorFlow
TensorFlow is a free and open-source software library for machine learning. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks.
TensorFlow is a symbolic math library based on dataflow and differentiable programming. It was developed by the Google Brain team for internal Google use and was released under the Apache License 2.0 in 2015.
Amazon Web Services (AWS)
Amazon Web Services (AWS) is a subsidiary of Amazon providing on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered pay-as-you-go basis. These cloud computing web services provide a variety of basic abstract technical infrastructure and distributed computing building blocks and tools.
Google Cloud Platform (GCP)
Google Cloud Platform (GCP), offered by Google, is a suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search, Gmail, file storage, and YouTube. Alongside a set of management tools, it provides a series of modular cloud services including computing, data storage, data analytics and machine learning. Google Cloud Platform provides infrastructure as a service, platform as a service, and serverless computing environments.
Microsoft Azure
Microsoft Azure is a cloud computing service created by Microsoft for building, testing, deploying, and managing applications and services through Microsoft-managed data centers. It provides software as a service (SaaS), platform as a service (PaaS) and infrastructure as a service (IaaS) and supports many different programming languages, tools, and frameworks, including both Microsoft-specific and third-party software and systems.
Matplotlib
Matplotlib is a plotting library for the Python programming language. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+. There is also a procedural "pylab" interface based on a state machine, designed to closely resemble that of MATLAB, though its use is discouraged.
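The object-oriented API mentioned above might be used as in the following sketch, which works with explicit `Figure` and `Axes` objects. The data and labels are invented for illustration, and the non-interactive `Agg` backend and in-memory buffer are assumptions made so the script runs without a display or file system access.

```python
import io

import matplotlib

matplotlib.use("Agg")  # Assumption: render off-screen, no GUI toolkit needed.
import matplotlib.pyplot as plt

# Illustrative data: the first few square numbers.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

# Create the figure and axes explicitly, then call methods on them.
fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(xs, ys, marker="o", label="squares")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Object-oriented Matplotlib sketch")
ax.legend()

# Write the rendered PNG to an in-memory buffer instead of a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

To save to disk instead, pass a file name such as `"squares.png"` to `fig.savefig`.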
ggplot2
ggplot2 is a data visualization package for the statistical programming language R. ggplot2 is an implementation of Leland Wilkinson's Grammar of Graphics, a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers. ggplot2 can serve as a replacement for the base graphics in R and contains a number of defaults for web and print display of common scales. Since 2005, ggplot2 has grown in use to become one of the most popular R packages.
Weka
Waikato Environment for Knowledge Analysis (Weka) is free software. It contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to these functions.
pandas
pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
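A short sketch of both uses mentioned above, tabular data and time series, is shown below. The `sales` table, its column names, and its values are hypothetical and exist only to illustrate the calls.

```python
import pandas as pd

# Hypothetical sales table; the column names and values are illustrative.
sales = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2019-01-01", "2019-01-02", "2019-01-08", "2019-01-09"]
        ),
        "region": ["north", "south", "north", "south"],
        "units": [10, 7, 12, 5],
    }
)

# Tabular operation: total units sold per region.
units_by_region = sales.groupby("region")["units"].sum()
print(units_by_region)

# Time-series operation: index by date, then sum units per calendar week.
weekly_units = sales.set_index("date")["units"].resample("W").sum()
print(weekly_units)
```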
Scrapy
Scrapy is a free and open-source web-crawling framework written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general-purpose web crawler. It is widely used for data mining.
SQL
SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS). It is particularly useful in handling and manipulating structured data.
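As a rough illustration of handling structured data with SQL, the sketch below runs a few statements against an in-memory SQLite database using Python's standard `sqlite3` module. The `measurements` table and its contents are invented for this example, and SQLite stands in for whichever RDBMS is actually in use.

```python
import sqlite3

# An in-memory SQLite database stands in for a real RDBMS here.
conn = sqlite3.connect(":memory:")

# Define a table and load a few hypothetical rows.
conn.execute("CREATE TABLE measurements (sensor TEXT, reading REAL)")
conn.executemany(
    "INSERT INTO measurements (sensor, reading) VALUES (?, ?)",
    [("a", 1.5), ("a", 2.5), ("b", 4.0)],
)

# Aggregate with SQL: the average reading for each sensor.
rows = conn.execute(
    "SELECT sensor, AVG(reading) FROM measurements "
    "GROUP BY sensor ORDER BY sensor"
).fetchall()
conn.close()

print(rows)
```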
MATLAB
MATLAB is a proprietary multi-paradigm programming language and numerical computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages.