Ten Research Challenge Areas In Data Science

Overview

We propose 10 challenge areas for researchers to focus on in order to drive data science research forward. Because data science draws on methods from computer science and statistics and serves disciplines and applications across all areas of society, these challenge areas reflect the breadth of problems the field faces. Our enumeration begins with meta-questions about whether data science is a discipline in its own right; each of the 10 challenge areas is then described in turn. This article aims to stimulate discussion about what might constitute a foundation for data science research, while recognizing that the field is still in its infancy.

Although data science builds on knowledge from other disciplines such as computer science, engineering, statistics, and mathematics, it is a field of study in its own right, with many mysteries still to be solved: fundamental scientific questions and pressing societal issues.

Before we begin the enumeration, we ask, but do not attempt to answer, a meta-question: Is data science a discipline? This meta-question continues to be debated in this journal, and this article proposes additional meta-questions to frame that discussion.

Is Data Science a Discipline?

Data science is a field of study: one can earn a degree in data science, be hired as a data scientist, and be funded to do data science research. But is data science a discipline, or will it evolve into one? Below are some meta-questions to consider in deciding whether data science should be regarded as a separate discipline.

– Does data science have deep driving questions? Each scientific discipline typically has one or several “deep” questions driving its research agenda: What is the origin of life (biology)? What is computable (computer science)? Are data science’s deep questions its own, or are they inherited from the disciplines it draws on?

– What role does the domain play in data science? Many argue that what distinguishes data science from other fields is that it is not just about methods, but about using those methods in the context of a domain: the domain is where the data are collected and analyzed. Can data science be defined without reference to a domain? Methods-based disciplines such as statistics, mathematics, and computer science are often pursued without regard to the contexts in which their methods are used, even though problems encountered in those contexts influence the disciplines in turn. Can one study data science in the same domain-independent way, or does the more integral role of the domain make data science distinctive? Is there something unique about how domains figure in data science research?

– What makes data science data science? What is not data science? When is a set of methods, analyses, or results considered data science, rather than just results or methods in statistics (or mathematics or computer science)? Or should any method, analysis, or result from these disciplines be considered part of data science?

Data science is a young field of study, and it is too early to know the answers to these meta-questions. The answers will likely evolve as the field matures and as members of the established disciplines contribute their knowledge and perspectives. We leave these questions open for data scientists to debate as they address the more concrete scientific, technological, and societal challenges raised by the abundance of data and by the methods and applications of data science.

Ten Research Areas

Let us now ask a question that can be asked of any field of study: What are the research challenges that drive the study of data science? Here is a list of 10. They are called challenge areas, not challenge questions, because each area suggests many questions. They may not be the “top 10,” but they are a good start for the community in considering what a broad research agenda for data science might look like.
The areas are not in priority order, and several overlap with problems in statistics, computer science, and the social sciences. They are written from the perspective of a computer scientist, reflecting the author’s background. The challenges span science, technology, and society.

1. Scientific Understanding of Learning, Particularly Deep Learning Algorithms.
While deep learning has been marvelously successful, scientists still do not understand why it works so well. We do not understand the mathematical properties of deep learning models. We do not know how to explain why a deep learning model produces one result and not another. We do not know how robust or fragile models are to shifts in the input data distribution. We do not understand how to verify that deep learning will perform its intended task well on new input data. We do not know how to measure and characterize the uncertainty in a model’s results. Nor do we know deep learning’s computational limits (Thompson et al., 2020): at what point does more data or more computation stop helping? Deep learning is an example where experimentation is far ahead of any kind of theoretical understanding, but it is not the only one. Random forests (Biau & Scornet, 2015) and high-dimensional sparse statistics (Johnstone & Titterington, 2009) are also widely applied to large-scale data, with gaps between their performance in practice and what theory can explain.
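
As a toy illustration of the fragility to distribution shift mentioned above, consider the following sketch (synthetic data; it assumes NumPy and scikit-learn are available): a linear model fit on one input range looks accurate there but degrades sharply when the test distribution shifts outside it.

```python
# Train a linear model where the true relationship is quadratic.
# It fits well on the training range but fails under distribution shift.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
f = lambda x: x**2  # true (nonlinear) relationship

X_train = rng.uniform(0.0, 1.0, size=(500, 1))
y_train = f(X_train).ravel() + rng.normal(scale=0.05, size=500)
model = LinearRegression().fit(X_train, y_train)

for lo, hi in [(0.0, 1.0), (1.0, 2.0), (2.0, 4.0)]:
    X_test = rng.uniform(lo, hi, size=(200, 1))
    y_test = f(X_test).ravel()
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"test inputs in [{lo}, {hi}]: MSE = {mse:.3f}")
```

The error grows rapidly as the test inputs move away from the training range, even though nothing about the model itself changed.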

2. Making Conclusions Based on Cause and Effect.
Machine learning is a powerful tool for discovering patterns and exploring correlations in large data sets, and it has been a boon for many fields, including medicine, sociology, and public health. But it does not, by itself, answer causal questions. Causal inference in the presence of large quantities of data is a rich and growing area of research. Economists are innovating new ways to use the large amounts of data now available to enhance their mainstay causal reasoning techniques, for example, using instrumental variables to make causal inference estimation more flexible and efficient (Athey, 2016; Taddy, 2019). Data scientists are exploring multiple causal inference, not only to relax some of the assumptions of univariate causal inference, but also because many real-world observations arise from multiple interacting factors (Wang & Blei, 2019). And as more commercial and government data become available, data scientists are exploring natural experiments and synthetic controls as alternatives to randomized trials in economics and the social sciences.
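
To make the instrumental-variables idea concrete, here is a minimal two-stage least squares (2SLS) sketch on synthetic data (it assumes only NumPy; the variables and coefficients are illustrative): an unobserved confounder biases the naive estimate, while the instrument recovers the true effect.

```python
# Minimal two-stage least squares (2SLS) sketch on synthetic data.
# An unobserved confounder u biases naive OLS; an instrument z
# (correlated with treatment x but not directly with outcome y)
# recovers the true causal effect.
import numpy as np

rng = np.random.default_rng(1)
n, true_effect = 100_000, 2.0

u = rng.normal(size=n)            # unobserved confounder
z = rng.normal(size=n)            # instrument: affects x, not y directly
x = 0.8 * z + u + rng.normal(size=n)                 # treatment
y = true_effect * x + 3.0 * u + rng.normal(size=n)   # outcome

def ols_slope(a, b):
    # slope of b regressed on a (with intercept)
    A = np.column_stack([np.ones_like(a), a])
    return np.linalg.lstsq(A, b, rcond=None)[0][1]

print("naive OLS estimate:", ols_slope(x, y))   # biased upward by u
x_hat = ols_slope(z, x) * z                     # stage 1: project x onto z
print("2SLS estimate:", ols_slope(x_hat, y))    # close to true_effect = 2.0
```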

3. Precious Data
Data can be precious for three reasons: they are expensive to collect; they contain a rare event (a low signal-to-noise ratio); or they are artisanal, that is, small, task-specific, and/or targeted at a narrow audience. Expensive data come, for example, from large, one-off scientific instruments. Rare-event data come from sensors monitoring physical infrastructure such as tunnels and bridges: the sensors produce a great deal of raw data, but the catastrophic event they are meant to predict is, thankfully, rare. Artisanal data include the tens of thousands of Chinese court decisions that China has posted online since 2014 (Liebman, 2018) and the two-plus million declassified U.S. government documents collected by History Lab (Connelly, 2019). For each type of precious data, we need to develop new data science methods and algorithms that take into account the data’s intended uses and users.
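
As a small illustration of the rare-event setting, the following sketch (synthetic data; it assumes scikit-learn) shows why plain accuracy is misleading when the event of interest occurs in well under 1% of samples, and how class weighting together with precision/recall gives a more honest view.

```python
# Sketch: rare-event data. Accuracy is misleading when the positive
# class is <1% of samples; class weighting and precision/recall help.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 50_000
X = rng.normal(size=(n, 5))
# Rare event: positive class occurs in well under 1% of samples.
p = 1 / (1 + np.exp(-(X[:, 0] + X[:, 1] - 6.0)))
y = rng.binomial(1, p)
print("event rate:", y.mean())

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
# class_weight='balanced' reweights the rare positives during training.
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```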

4. Multiple, Heterogeneous Data Sources
For some problems, combining data from multiple sources yields better models. For example, to predict the efficacy of a specific cancer treatment for a given patient, we might combine cell-line data and mouse-model data, neither of which on its own is an ideal proxy for humans.
Climate simulations, likewise, combine multiscale spatiotemporal models of multiple interacting physical systems, each drawing on different data sources, and those sources may themselves be precious data (see Challenge No. 3). State-of-the-art data science methods cannot yet combine heterogeneous data sources into a single, accurate model. Building a model from multiple data sources also makes it harder to bound the model’s uncertainty. Standardizing data types and formats can reduce undesirable or unnecessary heterogeneity, but realizing the extraordinary potential of combining multiple data sources will require focused research.
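
As a small, concrete illustration of reducing unnecessary heterogeneity, the sketch below (the column names, units, and values are hypothetical; it assumes pandas) harmonizes the schemas and units of two sources before combining them into one table.

```python
# Sketch: harmonizing two heterogeneous sources before modeling.
# Column names, units, and data are hypothetical.
import pandas as pd

# Source A: temperatures in Fahrenheit, dates as strings
source_a = pd.DataFrame({
    "station": ["s1", "s2"],
    "date": ["2023-01-01", "2023-01-01"],
    "temp_f": [41.0, 32.0],
})
# Source B: temperatures in Celsius, different column names
source_b = pd.DataFrame({
    "site_id": ["s1", "s2"],
    "day": pd.to_datetime(["2023-01-02", "2023-01-02"]),
    "temp_c": [6.0, -1.0],
})

# Standardize schema and units, then stack into a single table.
a = pd.DataFrame({
    "station": source_a["station"],
    "date": pd.to_datetime(source_a["date"]),
    "temp_c": (source_a["temp_f"] - 32.0) * 5.0 / 9.0,
})
b = source_b.rename(columns={"site_id": "station", "day": "date"})
combined = pd.concat([a, b], ignore_index=True).sort_values(["station", "date"])
print(combined)
```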

5. Inferring From Noisy and/or Incomplete Data
In the real world, we rarely have complete and clean data, yet data scientists must infer and predict from the data they have. This is a long-standing problem in statistics, and it has recently been reformulated in a compelling way: for the 2020 U.S. Census, noise is deliberately added to query results to preserve the privacy of census participants (Abowd, 2018; Hawes, 2020). Researchers studying small geographical areas, such as census blocks, must cope with this “deliberate” noise, which can make the data much less informative at fine levels of aggregation. How should social scientists draw inferences from such noisy data? Machine learning’s ability to separate signal from noise could make these inferences more accurate and efficient.
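
To make the idea of deliberate noise concrete, here is a minimal sketch of the Laplace mechanism from differential privacy (the data and epsilon values are illustrative; it assumes NumPy): noise scaled to the query’s sensitivity is added to a count before release.

```python
# Minimal sketch of the Laplace mechanism from differential privacy:
# noise is deliberately added to a count query so that any one person's
# presence changes the answer distribution only slightly.
import numpy as np

rng = np.random.default_rng(3)
ages = rng.integers(0, 100, size=10_000)  # stand-in for census records

def private_count(true_hits: int, epsilon: float) -> float:
    # A count query has sensitivity 1 (one person changes it by at most 1),
    # so Laplace noise with scale 1/epsilon gives epsilon-DP.
    return true_hits + rng.laplace(scale=1.0 / epsilon)

true_count = int((ages < 18).sum())
for eps in [0.1, 1.0, 10.0]:
    print(f"epsilon={eps}: true={true_count}, "
          f"noisy={private_count(true_count, eps):.1f}")
```

Smaller epsilon means stronger privacy and noisier answers, which is exactly the trade-off researchers working at fine levels of aggregation must confront.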

6. Trustworthy Artificial Intelligence
Systems that use artificial intelligence and machine learning are being rapidly deployed in critical domains such as autonomous vehicles, criminal justice, health care, hiring, housing, human resource management, and law enforcement. Decisions made by AI agents directly affect human lives, so it is increasingly important to trust that these decisions are fair and ethical (see Challenge No. 10), interpretable, privacy-preserving (see Challenge No. 9), and secure, reliable, and robust. Many of these properties build on long-standing research in trustworthy computing (National Research Council, 1999), but AI raises the stakes (Wing, 2020). Reasoning about a machine learning model seems inseparable from reasoning about the data it was trained on, and even about unseen data; machine learning models are also inherently probabilistic. One way to build trust is interpretability: when users can understand how a model produces its results, they are more likely to trust it (Adadi & Berrada, 2018; Chen et al., 2018; Murdoch et al., 2019; Turek, 2016). Formal methods offer another route, verifying that a model satisfies a given property, and new trust properties of machine learning models, such as robustness and fairness, call for new verification techniques. Trustworthiness also matters to multiple audiences: model developers, human and machine users of the model, and the model’s customers.
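
As one concrete, model-agnostic route toward interpretability, the following sketch (synthetic data; it assumes scikit-learn) uses permutation importance: it measures how much a model’s performance drops when each feature is shuffled, exposing which inputs the model actually relies on.

```python
# Sketch: permutation importance as a model-agnostic interpretability aid.
# It asks how much performance drops when each feature is shuffled.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 4))
y = (2.0 * X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # features 2, 3 irrelevant

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance = {imp:.3f}")
```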

7. Computing Systems For Data-Intensive Applications
Traditionally, computing systems have been designed for speed: the faster the processor, the faster applications run. Today, however, data, not computation, sits at the core of most applications, particularly in the sciences (e.g., astronomy and climate science). Large data centers already host a variety of special-purpose processors such as GPUs, FPGAs, and TPUs, and domain-specific accelerators optimized for deep learning offer performance improvements of up to ten times over general-purpose processors (Dally et al., 2020). Even so, building an accurate predictive model can still take weeks, while applications in both science and industry increasingly demand real-time prediction.
Distributing data, computation, or models helps with scale, reliability, and privacy, but runs into practical limits of latency and bandwidth; deep learning, for instance, is notoriously data-hungry. And performance should be measured not only in time and space but also in energy consumed. We need to rethink the design of computer systems from first principles, with data (not computation) as the primary focus: new systems should consider heterogeneous processing, efficient layout of massive data volumes for rapid access, communication and networking capability, energy efficiency, and the target domain, application, or even task.
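
As a small example of putting data rather than computation first, the sketch below (the file and column names are hypothetical; it assumes pandas) bounds memory use by streaming over a data set in chunks instead of loading it whole.

```python
# Sketch: processing data too large for memory by streaming in chunks,
# one small example of designing computation around the data.
# The file name and column name are hypothetical.
import pandas as pd

total, count = 0.0, 0
# Read 1M rows at a time instead of loading the whole file.
for chunk in pd.read_csv("observations.csv", chunksize=1_000_000):
    total += chunk["value"].sum()
    count += len(chunk)

print("mean of 'value' over all rows:", total / count)
```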

8. Automating the Front-End Stages in the Data Lifecycle
Deep learning has made data science exciting, but machine learning is only one stage of the data life cycle. Before any learning algorithm can be applied, the data must be prepared for analysis, and these front-end stages of the data life cycle (Wing, 2019), including data collection, cleaning, and wrangling, remain tedious and labor-intensive. Data scientists, drawing on both statistical and computational tools, should aim to automate these stages while preserving desirable properties of the final model, such as accuracy, precision, and robustness. The Data Analysis Baseline Library, for example, is a framework that automates data cleaning and visualization as well as model building and interpretation. Snorkel addresses the tedium of data labeling (Ratner et al., 2018). Trifacta, a university spin-out, specializes in data wrangling, and commercial services address adjacent needs, including automating the building of machine learning models.
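
As a small illustration of what could be automated, the following sketch (the data frame contents are illustrative; it assumes pandas and scikit-learn) performs a few routine cleaning and wrangling steps: normalizing column names, parsing numbers, dropping duplicates, and imputing missing values.

```python
# Sketch: a few routinely automatable cleaning and wrangling steps.
# The data frame contents are illustrative.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

raw = pd.DataFrame({
    "Age ": [34, np.nan, 29, 34],                    # stray space, missing value
    "income": ["52,000", "61,500", None, "52,000"],  # strings with commas
})

df = raw.copy()
df.columns = df.columns.str.strip().str.lower()      # normalize column names
df["income"] = pd.to_numeric(df["income"].str.replace(",", ""))  # parse numbers
df = df.drop_duplicates()                            # remove exact duplicates
df[["age", "income"]] = SimpleImputer(strategy="median").fit_transform(df)
print(df)
```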

9. Privacy
For many applications, more data yields a better model, and one way to get more data is to share it: multiple parties pool their data sets to build, together, a better model than any one party could build alone. In many situations, however, whether due to privacy concerns or regulation, each party must keep its data set confidential. Consider building a model to determine whether someone has a given disease: sharing patient records across multiple hospitals would yield a better model, but HIPAA privacy regulations prevent such sharing. Researchers are now exploring practical and scalable techniques that let multiple parties compute over their combined data while keeping each party’s data private, and both industry and government already use such techniques (Abowd, 2018; Ion et al., 2017; Kamara, 2014). The same ideas apply in the simpler setting where a single entity’s data must remain private while still being analyzed.
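
As a toy illustration of computing on data that stays private, here is a sketch of additive secret sharing, one building block behind secure multiparty computation (the parties and values are illustrative; it assumes NumPy): three parties learn the sum of their values without revealing any individual value.

```python
# Toy sketch of additive secret sharing, a building block of secure
# multiparty computation: parties learn the sum of their private values
# without revealing the values themselves. Illustrative only.
import numpy as np

rng = np.random.default_rng(5)
Q = 2**31 - 1  # arithmetic is done modulo a public prime

def share(value: int, n_parties: int):
    # Split `value` into n random shares that sum to it mod Q.
    shares = rng.integers(0, Q, size=n_parties - 1)
    last = (value - shares.sum()) % Q
    return list(shares) + [last]

private_values = [120, 45, 300]  # e.g., case counts at three hospitals
# Each party distributes shares of its value; each party sums the
# shares it receives, and only the combined total is revealed.
all_shares = [share(v, 3) for v in private_values]
partial_sums = [sum(col) % Q for col in zip(*all_shares)]
total = sum(partial_sums) % Q
print("joint sum:", total)  # 465, with no single value revealed
```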

10. Ethics
Data science raises new ethical issues, which can be framed along three axes: (1) the ethics of data: how data are recorded and shared; (2) the ethics of algorithms: how artificial intelligence, machine learning, and robots interpret data; and (3) the ethics of practices: devising responsible innovation and codes of ethics to guide this emerging science (Floridi & Taddeo, 2016) and to determine criteria for data-specific institutional review boards (IRBs) (Wing et al., 2018). The Belmont Report (1979) and the Menlo Report (Dittrich & Kenneally, 2011) provide a foundation for identifying the new ethical issues that data science technology raises. The principle of Respect for Persons suggests, for example, that a chatbot should tell people it is a machine. The principle of Beneficence requires a risk/benefit analysis of the decision a self-driving car makes about whom not to harm. The principle of Justice requires that risk-assessment tools used in hiring and in the courts be fair. These are the kinds of new ethical questions data scientists face.

Closing remarks

Many colleges and universities are creating data science schools, institutes, and centers (Wing et al., 2018), so it is worth considering data science as a whole. Will data science become a discipline in its own right, or a field that pervades all other disciplines? Computer science, mathematics, and statistics suggest the answer can be both: each is a discipline of its own, yet each applies to nearly every other discipline.

What will data science look like in 10, 20, or 50 years? The next generation of researchers and educators will answer that question. Studying and advancing data science requires a willingness to learn the languages, methods, and tools of multiple disciplines, and combining and applying them can be as rewarding as it is occasionally frustrating. Today’s undergraduates, graduate students, postdoctoral fellows, and researchers have the advantage: you will shape the field through the data science research problems you choose to pursue!

Author

  • julissabond

    Julissa Bond is an educational blogger and volunteer. She works as a content and marketing specialist for a software company and has been a full-time student for two years now. Julissa is a natural writer and has been published in several online magazines. She holds a degree in English from the University of Utah.
