Structured Streaming: Philosophy behind it

Knoldus Blogs

In our previous blogs:

  1. Structured Streaming: What is it? &
  2. Structured Streaming: How it works?

We got to know two major points about Structured Streaming –

  1. It is a fast, scalable, fault-tolerant, end-to-end, exactly-once stream processing API that helps users build streaming applications.
  2. It treats a live data stream as a table that is being continuously appended/updated. This allows us to express our streaming computation as a standard batch-like query, just as on a static table, while Spark runs it as an incremental query on the unbounded input table.
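This table-like model can be sketched in a few lines; below is a minimal word-count example, where the socket source on port 9999 and the local Spark session setup are illustrative assumptions, not from the original post:

```scala
import org.apache.spark.sql.SparkSession

object StreamAsTableSketch extends App {
  val spark = SparkSession.builder
    .appName("StreamAsTableSketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Lines arriving on the socket are treated as rows
  // continuously appended to an unbounded input table
  val lines = spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()

  // A standard batch-like query: split lines into words and count them
  val wordCounts = lines.as[String]
    .flatMap(_.split(" "))
    .groupBy("value")
    .count()

  // Spark executes this as an incremental query on the unbounded table
  val query = wordCounts.writeStream
    .outputMode("complete")
    .format("console")
    .start()

  query.awaitTermination()
}
```

The query itself is indistinguishable from one written against a static DataFrame; only the `readStream`/`writeStream` endpoints differ.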

In this blog post, we will talk about the philosophy, or the programming model, of Structured Streaming. So, let’s get started with the example that we saw in the previous blog post.

View original post 453 more words

Posted in Scala

Structured Streaming: What is it?

Knoldus Blogs

With the advent of streaming frameworks like Spark Streaming, Flink, and Storm, developers stopped worrying about issues related to streaming applications, like fault tolerance (i.e., zero data loss) and real-time processing of data, and started focusing only on solving business challenges. The reason is that these frameworks provide inbuilt support for all of them. For example:

In Spark Streaming, just adding a checkpoint directory path, as done in the code snippet below, makes recovery from failures easy.
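The referenced snippet is not included in this excerpt; a minimal sketch of the usual pattern follows, where the checkpoint path, batch interval, and app name are all illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical checkpoint path; in production this would be
// a fault-tolerant store such as HDFS or S3
val checkpointDirectory = "hdfs://namenode:8020/spark/checkpoint"

// Build a fresh context, registering the checkpoint directory
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedApp")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDirectory) // enables recovery from failures
  ssc
}

// On restart, recover the context from the checkpoint if one exists
val ssc = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
```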

And in Flink, we just have to enable checkpointing in the execution environment, as done in the code snippet below.
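Again, the snippet itself is not included in this excerpt; the standard way to do this in Flink is a one-liner on the execution environment (the 5-second interval below is illustrative):

```scala
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Take a consistent snapshot of the streaming job's state every 5 seconds,
// allowing the job to recover from failures from the last checkpoint
env.enableCheckpointing(5000)
```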

View original post 397 more words

Posted in Scala

KnolX: Understanding Spark Structured Streaming

Knoldus Blogs

Hello everyone,

Knoldus organized a session on 5th January 2018. The topic was “Understanding Spark Structured Streaming”. Many people attended and enjoyed the session. In this blog post, I am going to share the slides and video of the session.

Slides:

View original post 14 more words

Posted in Scala

A Beginner’s Guide to Deploying a Lagom Service Without ConductR

Knoldus Blogs

How to deploy a Lagom service without ConductR? This question has been asked and answered by many on different forums. For example, take a look at this question on StackOverflow – Lagom without ConductR? Here the user is trying to find out whether it is possible to use Lagom in production without ConductR. The best answer that came up was – “Yes, it is possible!”. Similarly, there are other forums where we can find an answer to this question.

However, most of them just give us a hint about it or redirect us to Lagom’s documentation, i.e., Lagom’s Production Overview. But none of them provide a one-stop solution that is easy to use and as simple as running a Java program from a command-line interface.

So, we decided to find a solution for it and share it. In this blog post, we…

View original post 285 more words

Posted in Scala

Spark Structured Streaming: A Simple Definition

Knoldus Blogs

“Structured Streaming” – nowadays we hear this term quite a lot in the Apache Spark ecosystem, as it is being preached as the next big thing in the scalable big data world. Although we all know that Structured Streaming means a stream having structured data in it, very few of us know what exactly it is and where we can use it.

So, in this blog post, we will get to know Spark Structured Streaming with the help of a simple example. But before we begin with the example, let’s get to know it first.

Structured Streaming is a scalable and fault-tolerant stream processing engine built upon the strong foundation of Spark SQL. It leverages Spark SQL’s powerful APIs to provide a seamless query interface which allows us to express our streaming computation in the same way we would express a SQL query over our batch data. Also, it optimizes the execution…
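To illustrate that seamless query interface, a plain SQL query can be run over a streaming DataFrame exactly as over a batch table; the JSON source directory and schema below are illustrative assumptions, not from the original post:

```scala
// Assuming `spark` is an active SparkSession and JSON files
// keep arriving in the (hypothetical) directory /data/readings/
val stream = spark.readStream
  .schema("device STRING, temperature DOUBLE")
  .json("/data/readings/")

// Register the stream like any table, then query it with ordinary SQL
stream.createOrReplaceTempView("readings")
val avgTemps = spark.sql(
  "SELECT device, AVG(temperature) AS avg_temp FROM readings GROUP BY device")

// Spark maintains the aggregation incrementally as new files arrive
avgTemps.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```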

View original post 454 more words

Posted in Scala

Apache Spark: 3 Reasons Why You Should Not Use RDDs

Knoldus Blogs

Apache Spark – whenever we hear these two words, the first thing that comes to our mind is RDDs, i.e., Resilient Distributed Datasets. It has been more than 5 years since Apache Spark came into existence, and after its arrival a lot of things changed in the big data industry. But the major change was the dethroning of Hadoop MapReduce. Spark literally replaced MapReduce, and this happened because of Spark’s easy-to-use API, lower operational cost due to efficient use of resources, compatibility with a lot of existing technologies like YARN/Mesos, fault tolerance, and security.

Due to these reasons, a lot of organizations have migrated their big data applications to Spark, and the first thing they do is learn how to use RDDs. This makes sense too, as the RDD is the building block of Spark and the whole idea of Spark is based on the RDD. Also, it is a perfect replacement…

View original post 754 more words

Posted in Scala

Partition-Aware Data Loading in Spark SQL

Knoldus Blogs

Data loading, in Spark SQL, means loading data into the memory/cache of the Spark worker nodes, for which we used to write the following code:

import java.util.Properties

// Credentials for the JDBC source
val connectionProperties = new Properties()
connectionProperties.put("user", "username")
connectionProperties.put("password", "password")

// Load the whole table into Spark via JDBC
val jdbcDF = spark.read
  .jdbc("jdbc:postgresql:dbserver", "schema.table", connectionProperties)

Here we are using the jdbc function of Spark SQL’s DataFrameReader API to load the data from the table into the Spark executors’ memory, no matter how many rows there are in the table.

Here is an example of the jdbc function in use:

val df = spark.read.jdbc("jdbcUrl", "person", connectionProperties)

In the code above, the data will be loaded into the Spark cluster. But only one worker will iterate over the table and try to load the whole table into its memory, as only one partition is created. This might work if the table contains a few hundred thousand records. This fact can be confirmed by the following snapshot of the Spark UI:

[Spark UI snapshot: a single stage]

But what will happen if the table that needs to be…
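The rest of the post is elided here, but for context, the DataFrameReader also exposes a partition-aware overload of jdbc, which is what the post’s title refers to; the partition column and bounds below are illustrative:

```scala
// Partition-aware variant of the same read: Spark splits the table into
// numPartitions range queries on the given numeric column, so multiple
// workers load slices of the data in parallel instead of one worker
// loading everything.
val partitionedDF = spark.read.jdbc(
  url = "jdbc:postgresql:dbserver",
  table = "person",
  columnName = "id",      // numeric column to partition on (illustrative)
  lowerBound = 1L,        // min value of the partition column
  upperBound = 1000000L,  // max value of the partition column
  numPartitions = 10,     // number of parallel range queries
  connectionProperties = connectionProperties
)
```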

View original post 293 more words

Posted in Scala