Data science has permeated every industry thanks to its promise of more innovative solutions and the ready availability of its fuel: data. Irrespective of the complexity of the business problem, if apt data is available, there is no limit to the solutions you can think up. But as you know, not all datasets are of the same scale: they range from small sets that fit comfortably in memory to massive ones that exceed the capacity of a single machine.
While small datasets can be easily processed using Pandas, NumPy, and scikit-learn, problems surface when the data cannot be contained within the system RAM or persistent storage. Paging and parallel computing can help, but as a Python programmer you would know that the language is not well suited to parallelism out of the box: the Global Interpreter Lock (GIL) prevents threads from running Python bytecode concurrently. As your dataset grows beyond the capacity of the computing system, which is pretty common these days, the golden trio of Pandas, NumPy, and scikit-learn no longer remains the best choice. Dask was introduced to answer this problem and bring scalable computing to the Python stack.
What is Dask?
Dask is a lightweight open-source library that enables parallel execution, scaling the golden trio in situations where it would otherwise fail to operate. The scaling is possible because a Dask data frame is made up of many smaller Pandas data frames. Similarly, a Dask array consists of NumPy arrays, and Dask-ML mirrors scikit-learn's interfaces. Moreover, a Dask type called a Dask bag scales the Python list and offers functionality like mapping, filtering, folding, and grouping. The smaller data frames and arrays that result from Dask's splitting can be moved between machines, facilitating parallel execution on larger datasets. Understanding its architecture will give a clearer picture of how Dask works.
The Dask framework consists of three components: the scheduler, which coordinates execution; the low-level APIs, which expose Dask objects directly; and the high-level APIs, which abstract those objects behind the familiar interfaces of the golden trio and Python lists.
Dask can operate on a standalone system as well as in cluster mode. In both cases, the scheduler orchestrates the tasks and acts as the interface between the user and the workers. Locally, each CPU core corresponds to a Dask worker, while in a Dask cluster the workers are flexibly configurable elements in the form of different machines. In cluster mode, the scheduler and the workers belong to the same network and may be protected by a cluster firewall; this setup can be hosted on a platform like the AWS cloud. A user submits a job to the Dask scheduler through a script run on a notebook server like Jupyter. The scheduler then splits the job and allocates the resulting chunks to the cluster's workers. Once each worker completes its tasks, the scheduler consolidates the partial results and communicates the final result back to the user.
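The split-execute-consolidate flow described above can be sketched locally with `dask.delayed`, Dask's low-level API for building task graphs. The `square` and `total` functions here are hypothetical toy tasks, not part of Dask itself; the scheduler (here the default local one) decides how to run the independent pieces in parallel:

```python
from dask import delayed

@delayed
def square(x):
    # An independent chunk of work; the scheduler decides where it runs.
    return x * x

@delayed
def total(values):
    # The consolidation step, analogous to the scheduler gathering results.
    return sum(values)

# Build a task graph of five independent tasks plus one combining task...
tasks = [square(i) for i in range(5)]
result = total(tasks)

# ...then let the scheduler execute them in parallel and combine the result.
print(result.compute())  # 30
```

Nothing runs until `.compute()` is called; swapping the default local scheduler for a distributed one would run the same graph across a cluster without changing this code.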
As seen in the image, the client can be any interface that allows a user to submit jobs and view the results. It should also be remembered that such scalability requires workers to have similar development environments. While manual configuration is acceptable, Docker images can be defined as a blueprint for uniform settings among workers, especially when there are hundreds or thousands of them.
Implementing Dask using an Example
Let us dive right into some code examples of this popular tool.
Dask can be easily installed from pip or conda, or built from its source repository on GitHub. You can find detailed installation instructions in the Dask docs. Once installed, it is ready to use without any additional configuration.
Pandas is the primary choice for analyzing structured data among data scientists but stumbles when it comes to scalability. The Dask data frame API provides a wrapper that splits large data frames into smaller partitions. These smaller data frames can then be distributed across CPUs to leverage Pandas functionality. For those unfamiliar with the concept, a data frame is a structured data format consisting of rows and columns. While a row corresponds to a single observation, for example, an individual in the case of personal records, the columns capture various attributes, also known as features. The columns of personal records may capture an individual's name, age, occupation, and so on.
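As a quick illustration of the rows-and-columns structure, here is a toy "personal records" data frame (the names and values are invented for this example):

```python
import pandas as pd

# Each row is one individual; each column is an attribute (feature).
records = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [34, 45, 29],
    "occupation": ["engineer", "teacher", "doctor"],
})
print(records)
```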
To show how to use Dask and the seamless transition between Dask and Pandas, we generated a people dataset using Dask's built-in dataset generators. You may also generate a time series the same way; the Dask docs cover these artificial datasets and their handling in more detail.
A Dask bag is a parallelized Python list specific to Dask and can be easily converted to a Dask data frame. Readers who have worked with Pandas before will find Dask's data types very familiar. Since the generated Dask bag for this example was large, we can use the built-in option to split it and write it across multiple .csv files. The extracted data frame can be written to other formats like JSON and Excel as well.
Now, specific files can be read back into either a Pandas or a Dask data frame.
As you can see, the two outputs are nearly identical. As mentioned earlier, Dask also offers the advantage of efficient memory use, since data is parsed directly from the files on demand. So it is wiser to use a Pandas data frame when you need faster calculations and the data is small enough to fit in system RAM; otherwise, you can save on RAM usage at the cost of some computation time. To better appreciate the Dask vs. Pandas comparison, let us run a simple groupby operation on the two types of data frames to find the average age for each occupation.
Recommended Reading: Dask vs Pandas vs Apache Spark
The two code runs show that the Pandas operation is faster since it loads the dataset into RAM. Still, it is only suitable if the data fits within the system's capacity. The Dask data frame, on the other hand, works through a similar API and gives the same result while saving on RAM usage. This behavior is highly desirable when the data size exceeds the system's capacity.
Thank you for reading. I hope we have stoked your curiosity about Dask and given you a starting point to delve deeper into this powerful library.