2017 – Year of FAST Data

Knoldus Blogs

As we approach 2017, there is a strong focus on Fast Data. Fast Data combines data at rest with data in motion, and all of it has to be processed remarkably fast. In the deck that follows, we at Knoldus present how we implemented a complex, multi-scale solution for a large bank based on the Fast Data Architecture philosophy. As partners of Databricks, Lightbend, Confluent and DataStax, we bring in the best practices and tooling needed for the platform.

Just think about it: if Google only indexed and returned results from data at rest, you would never be able to Google for breaking or trending news!

As you enjoy the deck below, we are standing by to hear from you about your next fast data project. Have a wonderful New Year 2017!



Migration From Spark 1.x to Spark 2.x

Hello Folks,

As we know, Spark 2.0 is the latest release, with many enhancements and new features. If you are using Spark 1.x and want to move your application to Spark 2.0, you have to take care of some changes in the API. In this blog we give an overview of the common changes:

  1. SparkSession: Earlier, we created SparkContext and SQLContext separately, but Spark 2.0 has SparkSession, which is the entry point to programming Spark with the Dataset and DataFrame API. Both the SparkContext (sc) and the SQLContext are available from the SparkSession.

    E.g. SparkSession.builder().getOrCreate()

  2. DataFrame replaced with Dataset[Row]: DataFrame is no longer a separate class in Spark 2.0. In Scala it remains only as a type alias for Dataset[Row], while Java code must use Dataset<Row>. Wherever we were using DataFrame, we can replace it with Dataset[Row] for Spark SQL or Dataset[_] for MLlib.
    Eg. In Spark…
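
As a rough illustration of the two points above, here is a minimal sketch, assuming Spark 2.x is on the classpath and the job runs in local mode. The object name, the application name, and the helpers `localSession` and `rowCount` are hypothetical and exist only for this example; they are not taken from the original post.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, Dataset, Row, SQLContext, SparkSession}

object MigrationSketch {

  // Hypothetical helper: builds (or reuses) a session for local experiments.
  def localSession(): SparkSession =
    SparkSession.builder()
      .appName("migration-sketch") // assumed name, for illustration only
      .master("local[*]")          // assumption: local mode, not a cluster
      .getOrCreate()

  // Declared against Dataset[Row]; a Scala DataFrame can be passed as-is
  // because DataFrame is a type alias for Dataset[Row] in Spark 2.x.
  def rowCount(ds: Dataset[Row]): Long = ds.count()

  def main(args: Array[String]): Unit = {
    val spark = localSession()

    // The Spark 1.x entry points are still reachable from the session.
    val sc: SparkContext = spark.sparkContext
    val sqlContext: SQLContext = spark.sqlContext
    println(s"Application name: ${sc.appName}")
    println(s"Same underlying context: ${sqlContext.sparkContext eq sc}")

    // spark.range yields a typed Dataset; toDF turns it into a DataFrame.
    val df: DataFrame = spark.range(5).toDF("id")
    println(s"Rows: ${rowCount(df)}")

    spark.stop()
  }
}
```

Existing Scala code that mentions DataFrame keeps compiling because of the alias, so this part of the migration is mostly about new signatures and Java callers.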


Finding the Impact of a Tweet using Spark GraphX

Social Network Analysis (SNA), the process of investigating social structures using networks and graphs, has become a very hot topic. Using it, we can answer many questions, such as:

  • How many connections does an individual have?
  • What is the ability of an individual to influence a network?
  • and so on…

These answers can be used for conducting marketing research studies, running ad campaigns, and finding the latest trends. So it becomes crucial to measure the impact of an individual or individuals in a social network, so that we can identify the key individuals, or Alpha Users (a term used in SNA).

In this post we are going to see how to find the impact of an individual in a social network like Twitter, i.e., how many Twitter users an individual can influence via his/her tweet up to N levels, i.e., followers of followers of followers… and so on. For this…


KnolX: Introduction to Apache Spark 2.0

Knoldus organized a KnolX session on Friday, 23 September 2016. In that one-hour session we got an introduction to Apache Spark 2.0 and its APIs.

Spark 2.0 is a major release of Apache Spark that brought many changes to Spark's APIs and libraries. In this KnolX, we looked at some of the improvements made in Spark 2.0 and got an introduction to new features such as the SparkSession API and Structured Streaming.

The slides for the session are as follows:

Below is the YouTube video for the session.



Spark Session: New Entry point in Spark 2.0

Finally, after a long wait, Apache Spark 2.0 was released on Tuesday, 26 July 2016. This release builds on the feedback received from industry over the past two years regarding Spark and its APIs. This means it keeps what Spark developers loved to use, and what developers did not like has been removed.

Since Spark 2.0 is a major release of Apache Spark, it contains major changes to Spark's APIs and libraries. To understand the changes in Spark 2.0, we will look at them one by one. So, let's start with the SparkSession API.

For a long time, Spark developers were confused between SQLContext and HiveContext, i.e., when to use which. Since HiveContext was richer in features than SQLContext, many developers favored it, but HiveContext required many dependencies to run, so some favored SQLContext.

To end this confusion, the founders of Spark came up…


Boost Factorial Calculation with Spark

We all know that Apache Spark is a fast and general engine for large-scale data processing. It can process data up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

But is that the only task (i.e., MapReduce) for which Spark can be used? The answer is: no. Spark is not only a Big Data processing engine. It is a framework which provides a distributed environment to process data. This means we can perform any type of task using Spark.

For example, let's take factorial. We all know that calculating the factorial of large numbers is cumbersome in any programming language and, on top of that, the CPU takes a lot of time to complete the calculations. So, what can be the solution?

Well, Spark can be the solution to this problem. Let's see that in the form of code.

First, we will try to implement Factorial using only Scala in a Tail…
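
The original implementation is not shown in this excerpt. As a minimal sketch of what a plain-Scala, tail-recursive factorial could look like, the following uses BigInt so the result does not overflow; the names `FactorialSketch`, `factorial` and `loop` are assumptions made for this illustration, not the author's code.

```scala
import scala.annotation.tailrec

object FactorialSketch {

  // Computes n! with an accumulator so that the recursive call is in tail
  // position; @tailrec makes the compiler verify that it becomes a loop.
  def factorial(n: Int): BigInt = {
    require(n >= 0, "factorial is undefined for negative numbers")

    @tailrec
    def loop(remaining: Int, acc: BigInt): BigInt =
      if (remaining <= 1) acc
      else loop(remaining - 1, acc * remaining)

    loop(n, BigInt(1))
  }
}
```

For instance, `FactorialSketch.factorial(20)` evaluates to 2432902008176640000. The computation is a single sequential chain of multiplications on one core, which is the limitation the post goes on to address with Spark.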


A sample ML Pipeline for Clustering in Spark

Often a machine learning task contains several steps, such as extracting features out of raw data, creating learning models to train on features, and running predictions on trained models. With the help of the Pipeline API provided by Spark, it is easier to combine and tune multiple ML algorithms into a single workflow.

What is in the blog?

We will create a sample ML pipeline to extract features out of raw data and apply the K-Means clustering algorithm to group data points.

The code examples used in the blog can be executed on a spark-shell running Spark 1.3 or higher.

Basics of Spark ML pipeline API


DataFrame is a Spark SQL datatype which is used as the dataset in an ML pipeline. A DataFrame stores structured data in named columns. A DataFrame can be created from structured data files, Hive tables, external databases, or existing RDDs.


A Transformer converts a DataFrame into another DataFrame…
