8 Big Trends In Big Data Analytics

Big data technologies and practices are advancing rapidly. Here's how you can stay ahead of the curve.

Bill Loconzolo, vice president of data engineering at Intuit, jumped into a data lake with both feet. Dean Abbott, chief data scientist at Smarter Remarketer, headed straight for the cloud. Both say the leading edge of big data and analytics, including data lakes that hold vast amounts of data in its native format and cloud-based analytics, is where their businesses need to be. And even though the technology options remain largely unproven, there is no time to wait.

Loconzolo acknowledges that the tools are still emerging and that the promise of the [Hadoop] platform is not yet at the level businesses need it to be. Yet companies must keep pace with the rapid changes in big data and analytics or risk being left behind. In the past, he says, emerging technologies might have taken years to mature; now people iterate and drive solutions in a matter of months, or even weeks. So what are the top emerging technologies and trends you should keep an eye on, or test in your lab? Computerworld asked IT professionals, consultants and industry analysts to weigh in. Here's the list.

1. Big data analytics in the cloud

Hadoop, a framework and set of tools for processing very large data sets, was originally designed to work on clusters of physical machines. That has changed. Brian Hopkins, an analyst at Forrester Research, notes that a growing number of technologies now make it possible to process data in the cloud, among them Amazon's Redshift hosted BI data warehouse, Google's BigQuery data analytics service and IBM's Bluemix cloud platform. He believes the future of big data will be a hybrid of on-premises and cloud deployments.
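
Because Redshift speaks the PostgreSQL wire protocol, a standard Python database client can query a cluster much like any other SQL warehouse. The sketch below is illustrative only: the cluster endpoint, credentials and the clickstream_events table are placeholders, not details from any company mentioned here.

```python
# A minimal sketch of querying a cloud data warehouse such as Amazon Redshift
# from Python. Redshift uses the PostgreSQL wire protocol, so the standard
# psycopg2 client works; the endpoint, database, credentials and table below
# are placeholders, not real values.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,                # Redshift's default port
    dbname="analytics",
    user="analyst",
    password="example-password",
)

with conn.cursor() as cur:
    # Aggregate behavioral events by channel; the table and columns are
    # invented, standing in for the kind of retail data described above.
    cur.execute("""
        SELECT channel, COUNT(*) AS events
        FROM clickstream_events
        WHERE event_date >= CURRENT_DATE - 7
        GROUP BY channel
        ORDER BY events DESC
    """)
    for channel, events in cur.fetchall():
        print(channel, events)

conn.close()
```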

Smarter Remarketer, which provides SaaS-based retail analytics and segmentation services, recently moved from an in-house Hadoop and MongoDB infrastructure to Amazon Redshift, a cloud-based data warehouse. The Indianapolis-based company collects customer data from brick-and-mortar and online sales, along with real-time behavioral data, and analyzes that information to help retailers create targeted messaging designed to get the desired response from shoppers.

Abbott says Redshift was the more cost-effective option for Smarter Remarketer's extensive reporting needs, and as a hosted, cloud-based service it is easy to scale. He notes that virtual machines can be scaled up far more easily than physical machines.

Intuit, based in Mountain View, Calif., is moving cautiously toward cloud analytics because it needs a secure, stable environment that can be audited. For now, the company keeps everything within its own Intuit Analytics Cloud. Loconzolo says Intuit is collaborating with Cloudera and Amazon to build an analytics environment spanning public and private clouds that is highly available and secure. As a maker of cloud-based products, Intuit is bound to make the move eventually; he believes it will become cost-prohibitive to keep pushing all of that data into private clouds.

2. Hadoop is the new enterprise data operating system

Distributed analytic frameworks such as MapReduce are evolving into distributed resource managers that are gradually turning Hadoop into a general-purpose data operating system, Hopkins says. With these systems, many different data manipulation and analytics operations can be performed by plugging them into Hadoop as the distributed file storage layer.

What does this mean for enterprises? With SQL, MapReduce, in-memory and stream processing all able to run against the same data, more businesses will use Hadoop as their enterprise data hub. "Hadoop will be an inexpensive, general-purpose data hub that allows you to run many different kinds of [queries and] data operations against your data," Hopkins says.
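
To make the idea of "many kinds of data operations against the same hub" concrete, here is a minimal sketch of one classic workload: a MapReduce job written for Hadoop Streaming, so the mapper and reducer are plain Python scripts reading stdin and writing stdout. The clickstream input format and the product-view counting logic are assumptions for illustration, not anything the companies quoted here describe.

```python
# A minimal sketch of a MapReduce job for Hadoop Streaming. The mapper and
# reducer read lines on stdin and emit tab-separated key/value pairs on
# stdout; the "views per product" logic is purely illustrative.
import sys

def mapper():
    # Emit (product_id, 1) for each input line of the form "user_id,product_id,timestamp".
    for line in sys.stdin:
        fields = line.strip().split(",")
        if len(fields) >= 2:
            print(f"{fields[1]}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so all counts for a product arrive together.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.strip().split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")

if __name__ == "__main__":
    # Run as "mapper" or "reducer" depending on the first argument,
    # e.g. when submitted through the hadoop-streaming jar's -mapper and -reducer options.
    mapper() if sys.argv[1:] == ["mapper"] else reducer()
```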

Intuit is already building on a Hadoop foundation. Loconzolo says the company's long-term strategy is to leverage the Hadoop Distributed File System to enable all types of interactions with people and products.

3. Big data lakes

Traditional database theory dictates that you design the data set before entering any data. A data lake, also called an enterprise data lake or enterprise data hub, turns that model on its head, says Chris Curran, principal and chief technologist in PricewaterhouseCoopers' U.S. advisory practice. With a data lake, an organization takes all of its data sources and dumps them into a big Hadoop repository without attempting to build a data model beforehand. Instead, it provides tools that people can use to analyze the data, along with a high-level definition of what data exists in the lake. People then build views into the data as and when they need them. Curran calls it an incremental, organic way to build a large-scale database, though he cautions that it is not for everyone.
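
Here is a minimal sketch of that schema-on-read idea, assuming PySpark over a Hadoop file system: raw JSON files are landed in the lake untouched, and a view is defined only when an analyst has a question to ask. The HDFS path and field names are hypothetical.

```python
# A minimal sketch of "schema on read" in a data lake, using PySpark.
# Raw files are dumped into the lake as-is; a view over them is only
# defined when someone needs to ask a question.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-view").getOrCreate()

# Read raw JSON events exactly as they were landed; no upfront modeling.
# Spark infers a schema from the files at read time.
events = spark.read.json("hdfs:///lake/raw/clickstream/")

# A "view" is added only when an analyst needs one.
events.createOrReplaceTempView("clickstream")

top_pages = spark.sql("""
    SELECT page, COUNT(*) AS visits
    FROM clickstream
    GROUP BY page
    ORDER BY visits DESC
    LIMIT 10
""")
top_pages.show()
```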


The Intuit Analytics Cloud includes a data lake that holds both clickstream data and enterprise data, but the goal is to “democratize” the tools around it so that business people can use them effectively. One of Loconzolo's concerns with building a data lake on Hadoop is that the platform is not yet truly enterprise-ready. “We need the same capabilities that traditional enterprise database systems have had for decades,” he says.

4. More sophisticated predictive analytics

With big data, analysts have not only more data to work with but also the processing power to handle large numbers of records with many attributes, Hopkins says. Traditional machine learning relies on statistical analysis of a sample drawn from a larger data set. It is now possible to work with very large numbers of records and very large numbers of attributes per record, and that increases predictive power.

Big data combined with compute power also lets analysts explore new behavioral data collected throughout the day, such as the websites visited and locations. Hopkins calls this “sparse data,” because to find what you are interested in you must wade through a great deal of data that doesn't matter. Applying traditional machine-learning algorithms to this kind of data used to be computationally impossible; now, low-cost computational power can be brought to bear on the problem. Abbott says speed and memory are no longer critical constraints when formulating a problem: analysts can throw huge computing resources at a problem to find the best variables, and that is truly a game-changer.
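
Here is a minimal sketch of what modeling that kind of wide, sparse behavioral data can look like, using scikit-learn's FeatureHasher to turn arbitrary attributes (pages visited, locations) into a large, mostly empty feature matrix for a simple classifier. The visitor records and "responded" labels are invented for illustration.

```python
# A minimal sketch of "many attributes per record" modeling with scikit-learn.
# Each visitor's behavior is hashed into a large, mostly-empty feature vector
# (the "sparse data" Hopkins mentions) and a simple classifier predicts a
# response. The example data is invented.
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

# Each record: the raw behavioral attributes observed for one visitor.
visitors = [
    {"page:home": 1, "page:shoes": 3, "loc:indianapolis": 1},
    {"page:home": 1, "page:returns": 2, "loc:chicago": 1},
    {"page:shoes": 5, "page:checkout": 1, "loc:indianapolis": 1},
    {"page:home": 2, "loc:chicago": 1},
]
responded = [1, 0, 1, 0]  # did the visitor respond to the targeted message?

# Hash thousands of possible attributes into a fixed-size sparse matrix.
hasher = FeatureHasher(n_features=2**12)
X = hasher.transform(visitors)

model = LogisticRegression()
model.fit(X, responded)

new_visitor = hasher.transform([{"page:shoes": 2, "loc:indianapolis": 1}])
print(model.predict_proba(new_visitor))
```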

Loconzolo explains that what interests Intuit is being able to do real-time analysis and predictive modeling out of the same Hadoop core. The problem has been speed: Hadoop could take 20 times longer than other technologies to answer questions. So Intuit has been testing Apache Spark, a large-scale data processing engine, and its associated SQL query tool, Spark SQL. Spark offers fast interactive queries as well as graph services and streaming capabilities. It keeps the data within Hadoop, Loconzolo says, but delivers enough performance to bridge the gap.
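
A minimal sketch of the pattern Loconzolo describes, assuming PySpark: the data stays in Hadoop, but caching it in memory lets an analyst ask one interactive SQL question after another without waiting on a batch job each time. The HDFS path and column names are hypothetical.

```python
# A minimal sketch of interactive Spark SQL over data that stays in Hadoop.
# Caching the table in memory keeps repeated, exploratory queries fast.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("interactive-queries").getOrCreate()

# Load events from the Hadoop cluster and keep them cached for fast re-querying.
events = spark.read.parquet("hdfs:///lake/clickstream/")
events.createOrReplaceTempView("events")
spark.catalog.cacheTable("events")

# An analyst can now iterate: ask a question, look at the answer, ask another.
spark.sql("SELECT product, COUNT(*) AS views FROM events GROUP BY product").show()
spark.sql("SELECT hour(ts) AS hour, COUNT(*) AS views FROM events GROUP BY hour(ts)").show()
```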

5. SQL on Hadoop is faster and more efficient

If you are a smart coder and a mathematician, you can drop data into Hadoop and do an analysis on anything in it. That's the promise, and the problem, says Gartner analyst Mark Beyer. “I need someone to convert it into a format and language structure that I can understand,” he says. That is where SQL-on-Hadoop products come in, although any familiar language could work. Tools that support SQL-like queries let business users who already understand SQL apply similar techniques to Hadoop data. SQL on Hadoop “opens Hadoop's door,” Hopkins says, because businesses no longer have to invest in high-end data scientists and business analysts with JavaScript or Python skills, something Hadoop users have traditionally required.

These tools are not new. Apache Hive has offered a structured, SQL-like query language for Hadoop for some time, and commercial alternatives from Cloudera, Pivotal Software, IBM and others are faster still. The technology is well suited to “iterative analytics,” in which an analyst asks one question, receives an answer and then asks another. That type of work has traditionally required building a data warehouse. SQL on Hadoop will not replace data warehouses, Hopkins says, but it does offer an alternative to more costly software and appliances for certain types of analytics.
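
For a sense of what SQL on Hadoop looks like from the analyst's side, here is a minimal sketch that sends an ordinary SQL query to Hive from Python. PyHive is used here as one example client; the host, credentials and transactions table are placeholders.

```python
# A minimal sketch of issuing a familiar SQL query against Hive (one of the
# SQL-on-Hadoop options mentioned above) from Python, using the PyHive client.
# Host, port, username, database and table names are placeholders.
from pyhive import hive

conn = hive.Connection(host="hadoop-edge-node.example.com", port=10000,
                       username="analyst", database="sales")
cursor = conn.cursor()

# The same kind of question an analyst would ask of a data warehouse,
# expressed in plain SQL rather than a MapReduce program.
cursor.execute("""
    SELECT region, SUM(amount) AS revenue
    FROM transactions
    WHERE sale_date >= '2014-01-01'
    GROUP BY region
    ORDER BY revenue DESC
""")
for region, revenue in cursor.fetchall():
    print(region, revenue)

conn.close()
```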

6. NoSQL: More and better

NoSQL databases, alternatives to traditional SQL-based relational databases, are quickly gaining popularity as tools for specific kinds of analytic applications, and that momentum will only grow, Curran says. He estimates that there are roughly 15 to 20 open-source NoSQL databases available, each with its own specialization. For example, ArangoDB, an open-source NoSQL product with graph database capabilities, offers a faster, more direct way to analyze the network of relationships between customers and salespeople than a relational database does.

Although these databases are not new, Curran says, they are gaining momentum because of the kinds of analysis people now require. In one example, a PwC client has installed sensors on store shelving to monitor what products are there, how long customers handle them and how long shoppers stand in front of particular shelves. These sensors produce streams of data that will grow exponentially, Curran says, and a NoSQL key-value-pair database is a natural fit because it is lightweight, special-purpose and high-performance.
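
A minimal sketch of that key-value pattern, using Redis as a stand-in for a lightweight NoSQL key-value store: each sensor reading is written under its own key, with no predefined schema, and pulled back out by key prefix when needed. The key scheme and the shelf readings are invented for illustration.

```python
# A minimal sketch of storing shelf-sensor readings in a key-value store.
# Redis stands in for a lightweight NoSQL key-value database here; the key
# scheme and readings are invented.
import json
import time

import redis

r = redis.Redis(host="localhost", port=6379)

# Each reading is stored under a key of the form sensor:<shelf>:<timestamp>,
# so writes are cheap and the store never needs a predefined schema.
reading = {"shelf": "A12", "product": "sku-4471", "dwell_seconds": 38}
key = f"sensor:A12:{int(time.time())}"
r.set(key, json.dumps(reading))

# Later, a consumer can pull readings for a shelf back out by key prefix.
for k in r.scan_iter(match="sensor:A12:*"):
    print(json.loads(r.get(k)))
```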

7. Artificial intelligence using neural networks

Deep learning, a set of machine-learning techniques based on neural networks, is still evolving but shows great potential for solving business problems, Hopkins says. It enables computers to recognize items of interest and deduce relationships from large amounts of binary and unstructured data.

Hopkins points out that one deep learning algorithm, which used Wikipedia data as its source, discovered on its own that California is part of the U.S.A.

Using advanced analytic methods like deep learning, big data systems will be able to process large amounts of unstructured text in ways that help us better understand the world, Hopkins says. A deep learning model could, for example, recognize many different kinds of data, such as the shapes, colors and objects in an image, or whether an image contains a cat. This notion of cognitive engagement, advanced analytics and everything it implies is an important future trend, he says.
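
To show the mechanics behind such a system, here is a minimal sketch of a small convolutional neural network in Keras that learns to label images, for example whether the item of interest is present. The training data is random noise standing in for real images, so this illustrates the structure of a deep learning model rather than a working detector.

```python
# A minimal sketch of the kind of neural network behind "deep learning",
# using Keras. It learns labels directly from pixel data rather than
# hand-built features. The data is random noise standing in for real images.
import numpy as np
import tensorflow as tf

# Stand-in data: 200 tiny 32x32 RGB "images" with random binary labels.
images = np.random.rand(200, 32, 32, 3).astype("float32")
labels = np.random.randint(0, 2, size=(200,))

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability the item of interest is present
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(images, labels, epochs=3, batch_size=32)
```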

8. Analyzing data in RAM

The use of in-memory databases to speed up analytic processing is increasingly common and can be highly beneficial in the right setting, Beyer says. In fact, many businesses are already leveraging hybrid transaction/analytical processing (HTAP), which allows transactions and analytic processing to reside in the same in-memory database.
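
A minimal sketch of the HTAP idea, using SQLite's in-memory mode as a stand-in for a commercial in-memory database: transactional inserts and an analytic aggregation hit the same store, with no separate load into a warehouse. The orders table and figures are invented for illustration.

```python
# A minimal sketch of HTAP: transactions and analytics against the same
# in-memory database. SQLite's in-memory mode stands in for a real HTAP
# product; the orders table and figures are invented.
import sqlite3

conn = sqlite3.connect(":memory:")  # the whole database lives in RAM
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# Transactional side: new orders are written as they arrive.
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("midwest", 120.0), ("west", 75.5), ("midwest", 210.0), ("east", 42.0)],
)
conn.commit()

# Analytical side: the same store is queried immediately, with no separate
# extract-and-load step into a warehouse.
for region, revenue in conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY 2 DESC"
):
    print(region, revenue)
```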

But there is a lot of hype around HTAP, and businesses have been overusing it, Beyer says. For systems in which users simply need to see the same data in the same way many times during the day, paying for in-memory is a waste of money.

And while HTAP can speed up analytics, all of the transactions must reside in the same database. The problem, Beyer says, is that most analytics efforts today involve pulling together transactions from many different systems. Putting everything into one database falls back on the disproven belief that using HTAP for all of your analytics requires all of your transactions to be in one place. Diverse data still has to be integrated.

Moreover, moving to an in-memory database means there is yet another product to manage, secure, integrate and scale.

For Intuit, the arrival of Spark has made it less tempting to embrace in-memory systems. Loconzolo explains that if the Spark infrastructure can solve 70% of the company's use cases and an in-memory system could solve 100%, Intuit will go with the 70%. So the company will prototype and test Spark, and pause on internal in-memory systems for now.

Staying one step ahead

With so many emerging trends in big data and analytics, IT organizations need to create conditions that allow data scientists and analysts to experiment. “You need a way to evaluate and prototype these technologies before you integrate them into your business,” Curran says.

Beyer says IT managers and implementers cannot use the technology's lack of maturity as an excuse to halt experimentation. Initially, only a handful of people, the most skilled analysts and data scientists, will need to experiment. Then those advanced users and IT should jointly decide when to make new resources available to the rest of the organization. Nor should IT rein in analysts who are determined to go full-throttle; rather, Beyer says, IT should work with analysts to “put variable-speed throttles on these high-powered new tools.”

Author

  • julissabond

    Julissa Bond is an educational blogger and volunteer. She works as a content and marketing specialist for a software company and has been a full-time student for two years now. Julissa is a natural writer and has been published in several online magazines. She holds a degree in English from the University of Utah.
