Structured Streaming: Philosophy behind it

Knoldus Blogs

In our previous blogs:

  1. Structured Streaming: What is it? &
  2. Structured Streaming: How it works?

We got to know 2 major points about Structured Streaming –

  1. It is a fast, scalable, fault-tolerant, end-to-end, exactly-once stream processing API that helps users in building streaming applications.
  2. It treats the live data stream as a table that is being continuously appended/updated which allows us to express our streaming computation as a standard batch-like query as on a static table, whereas Spark runs it as an incremental query on the unbounded input table.

In this blog post, we will talk about the philosophy or the programming model of the Structured Streaming. So, let’s get started with the example that we saw in the previous blog post.

View original post 453 more words

Posted in Scala | 1 Comment

Structured Streaming: What is it?

Knoldus Blogs

spark-logo-croppedWith the advent of streaming frameworks like Spark Streaming, Flink, Storm etc. developers stopped worrying about issues related to a streaming application, like – Fault Tolerance, i.e., zero data loss, Real-time processing of data, etc. and started focussing only on solving business challenges. The reason is, the frameworks (the ones mentioned above) provided inbuilt support for all of them. For example:

In Spark Streaming, by just adding checkpoint directory path, like it is done in below code snippet, recovery from failure(s) became easy.

And in Flink, we just have to enable checkpointing in the execution environment, like it is done in below code snippet.

View original post 397 more words

Posted in Scala | Leave a comment

KnolX: Understanding Spark Structured Streaming

Knoldus Blogs

Hello everyone,

Knoldus organized a session on 05th January 2018. The topic was “Understanding Spark Structured Streaming”. Many people attended and enjoyed the session. In this blog post, I am going to share the slides & video of the session.


View original post 14 more words

Posted in Scala | Leave a comment

A Beginner’s Guide to Deploying a Lagom Service Without ConductR

Knoldus Blogs

lagomHow to deploy a Lagom Service without ConductR? This question has been asked and answered by many, on different forums. For example, take a look at this question on StackOverflow – Lagom without ConductR? Here the user is trying to know whether it is possible to use Lagom in production without ConductR or not. To which the best answer that came up was – “Yes, it is possible!”. Similarly, there are other forums too where we can find an answer to this question.

However, most of them just give us a hint about it or redirect us to Lagom’s documentation, i.e., Lagom’s Production Overview. But none of them provide a one-stop solution to it which is easy to use and is as simple as running a java program from a command line interface.

So, we decided to find a solution for it and share it. In this blog post, we…

View original post 285 more words

Posted in Scala | Leave a comment

Spark Structured Streaming: A Simple Definition

Knoldus Blogs

“Structured Streaming”, nowadays we are hearing this term in Apache Spark ecosystem quite a lot, as it is being preached as next big thing in scalable big data world. Although, we all know that Structured Streaming means a stream having structured data in it, but very few of us knows what exactly it is and where we can use it.

So, in this blog post we will get to know Spark Structured Streaming with the help of a simple example. But, before we begin with the example lets get to know it first.

Structured Streaming is a scalable and fault-tolerant stream processing engine built upon the strong foundation of Spark SQL. It leverages Spark SQL’s powerful APIs to provide a seamless query interface which allows us to express our streaming computation in the same way we would express a SQL query over our batch data. Also, it optimizes the execution…

View original post 454 more words

Posted in Scala | Leave a comment

Apache Spark: 3 Reasons Why You Should Not Use RDDs

Knoldus Blogs

Apache Spark, whenever we hear these two words, the first thing that comes to our mind is RDDs, i.e., Resilient Distributed Datasets. Now, it has been more than 5 years since Apache Spark came into existence and after its arrival a lot of things got changed in big data industry. But, the major change was dethroning of Hadoop MapReduce. I mean Spark literally replaced MapReduce and this happened because of easy to use API in Spark, lesser operational cost due to efficient use of resources, compatibility with a lot of existing technologies like YARN/Mesos, fault tolerance, and security.

Due to these reasons a lot of organizations have migrated their Big Data applications to Spark and the first thing they do is to learn – how to use RDDs. Which makes sense too as RDD is the building block of Spark and the whole idea of Spark is based on RDD. Also, it is a perfect replacement…

View original post 754 more words

Posted in Scala | Leave a comment

Partition-Aware Data Loading in Spark SQL

Knoldus Blogs

Data loading, in Spark SQL, means loading data in memory/cache of Spark worker nodes. For which we use to write following code:

val connectionProperties = new Properties()
connectionProperties.put("user", "username")
connectionProperties.put("password", "password")
val jdbcDF =
.jdbc("jdbc:postgresql:dbserver", "schema.table", connectionProperties)

In here we are using jdbc function of DataFrameReader API of Spark SQL to load the data from table into Spark Executor’s memory, no matter how many rows are there in table.

Here is an example of jdbc implementation:

val df ="jdbcUrl", "person", connectionProperties)

In the code above, the data will be loaded into Spark Cluster. But, only one worker will iterate over the table and try to load the whole data in its memory, as only one partition is created, which might work if table contains few hundred thousand records. This fact can be confirmed by following snapshot of Spark UI:


But, what will happen if the table, that needs to be…

View original post 293 more words

Posted in Scala | Leave a comment

2017 – Year of FAST Data

Knoldus Blogs

As we approach 2017, there is a strong focus on Fast Data. This is a combination of data at rest and data in motion and the speed has to be remarkably fast. In the deck that follows, we at Knoldus present to you how we have implemented a complex multi scale solution for a large bank on the Fast Data Architecture philosophy. As we partner with Databricks, Lightbend, Confluent and Datastax, we bring in the best practices and tooling needed for the platform.

Just think about it, if Google was always indexing and giving you results from the data at rest then you would never be able to Google for breaking or trending news!

As you enjoy the deck below, we would standby to listen from you on your next fast data project. Have a wonderful New Year 2017


View original post

Posted in Scala | Leave a comment

Migration From Spark 1.x to Spark 2.x

Knoldus Blogs

Hello Folks,

As we know that we have latest release of Spark 2.0, with to much enhancement and new features. If you are using Spark 1.x and now you want to move your application with Spark 2.0 that time you have to take care for some changes which happened in the API. In this blog we are going to get an overview of common changes:

  1. SparkSession : We were earlier developing SparkContext and SqlContext separately but in the Spark 2.0 we have SparkSession which is the entry point to programming Spark with the Dataset and DataFrame API. We can get SparkContext (sc) and SqlContext both in the SparkSession.

    Eg. SparkSession.builder().getOrCreate()

  2. DataFrame variable replace with Dataset[Row] : DataFrame is not available in the Spark 2.0, We are using Dataset[Row]. Where ever we are using DataFrame we will replace it with Dataset[Row] for Spark SQL or Dataset[_] for MLIB.
    Eg. In Spark…

View original post 164 more words

Posted in Scala | Leave a comment

Finding the Impact of a Tweet using Spark GraphX

Knoldus Blogs

Social Network Analysis (SNA), a process of investigating social structures using Networks and Graphs, has become a very hot topic nowadays. Using it, we can answer many questions like:

  • How many connections an individual have ?
  • What is the ability of an individual to influence a network?
  • and so on…

Which can be used for conducting marketing research studies, running ad campaigns, and finding out latest trends. So, it becomes very crucial to identify the impact of an individual or individuals in a social network, so that we can identify key individuals, or Alpha Users (term used in SNA), in a social network.

In this post we are going to see how to find the impact of an individual in a Social Network like Twitter, i.e., How many Twitter Users an individaul can influence via his/her Tweet upto N number of level, i.e., Followers of Followers of Followers… and so on. For, this…

View original post 572 more words

Posted in Scala | Leave a comment