Spark Structured Streaming: A Simple Definition

Knoldus

“Structured Streaming”: nowadays we hear this term quite a lot in the Apache Spark ecosystem, as it is being preached as the next big thing in the scalable big data world. We all know that Structured Streaming means a stream containing structured data, but very few of us know what exactly it is and where we can use it.

So, in this blog post we will get to know Spark Structured Streaming with the help of a simple example. But before we begin with the example, let's get to know it first.

Structured Streaming is a scalable and fault-tolerant stream processing engine built upon the strong foundation of Spark SQL. It leverages Spark SQL’s powerful APIs to provide a seamless query interface that allows us to express our streaming computation the same way we would express a SQL query over batch data. It also optimizes the execution…
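
The original post continues with its own example; for a quick sense of the API, here is a minimal sketch (not from the post) of the canonical streaming word count, assuming a SparkSession named spark and a text stream arriving on a local socket (for example, via netcat on port 9999):

import spark.implicits._

// Read a stream of lines from a socket source
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Express the computation exactly like a batch DataFrame query
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Start the query; Spark runs it incrementally as new data arrives
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()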

View original post 454 more words


Apache Spark: 3 Reasons Why You Should Not Use RDDs

Knoldus

Apache Spark: whenever we hear these two words, the first thing that comes to mind is RDDs, i.e., Resilient Distributed Datasets. It has now been more than 5 years since Apache Spark came into existence, and after its arrival a lot of things changed in the big data industry. But the major change was the dethroning of Hadoop MapReduce. Spark literally replaced MapReduce, and this happened because of Spark's easy-to-use API, lower operational cost due to efficient use of resources, compatibility with a lot of existing technologies like YARN/Mesos, fault tolerance, and security.

For these reasons a lot of organizations have migrated their big data applications to Spark, and the first thing they do is learn how to use RDDs. Which makes sense too, as RDD is the building block of Spark and the whole idea of Spark is based on RDDs. It is also a perfect replacement…

View original post 754 more words


Partition-Aware Data Loading in Spark SQL

Knoldus

Data loading, in Spark SQL, means loading data into the memory/cache of Spark worker nodes, for which we usually write the following code:

import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "username")
connectionProperties.put("password", "password")

val jdbcDF = spark.read
  .jdbc("jdbc:postgresql:dbserver", "schema.table", connectionProperties)

Here we are using the jdbc function of Spark SQL's DataFrameReader API to load the data from a table into the Spark executors' memory, no matter how many rows are in the table.

Here is an example of the jdbc call:

val df = spark.read.jdbc("jdbcUrl", "person", connectionProperties)

In the code above, the data will be loaded into the Spark cluster. But only one worker will iterate over the table and try to load the whole table into its memory, as only one partition is created. That might work if the table contains a few hundred thousand records. This fact can be confirmed by the following snapshot of the Spark UI:

[Image: Spark UI snapshot showing the load running as a single stage]

But, what will happen if the table, that needs to be…
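
The partition-aware loading that the rest of the post builds towards relies on the partitioned overload of the same jdbc function, where Spark splits a numeric column into numPartitions ranges so each worker loads its own slice. A minimal sketch, assuming the table has a numeric "id" column (the column name and bounds here are hypothetical):

import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "username")
connectionProperties.put("password", "password")

// Spark splits the range [0, 1000000) of the "id" column into 10 partitions,
// issuing one bounded query per partition, so the load is spread across workers.
val partitionedDF = spark.read.jdbc(
  url = "jdbc:postgresql:dbserver",
  table = "person",
  columnName = "id",          // hypothetical numeric column to partition on
  lowerBound = 0L,
  upperBound = 1000000L,
  numPartitions = 10,
  connectionProperties = connectionProperties
)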

View original post 293 more words


2017 – Year of FAST Data

Knoldus

As we approach 2017, there is a strong focus on Fast Data: a combination of data at rest and data in motion, where the speed has to be remarkably fast. In the deck that follows, we at Knoldus present how we have implemented a complex multi-scale solution for a large bank on the Fast Data architecture philosophy. As we partner with Databricks, Lightbend, Confluent and Datastax, we bring in the best practices and tooling needed for the platform.

Just think about it: if Google were always indexing and giving you results only from data at rest, you would never be able to Google breaking or trending news!

As you enjoy the deck below, we will stand by to hear from you about your next fast data project. Have a wonderful New Year 2017!


View original post


Migration From Spark 1.x to Spark 2.x

Knoldus

Hello Folks,

As you know, we now have the latest release, Spark 2.0, with many enhancements and new features. If you are using Spark 1.x and want to move your application to Spark 2.0, you have to take care of some changes that happened in the API. In this blog we are going to get an overview of the common changes:

  1. SparkSession: We were earlier creating SparkContext and SQLContext separately, but in Spark 2.0 we have SparkSession, which is the entry point for programming Spark with the Dataset and DataFrame API. We can get both the SparkContext (sc) and the SQLContext from the SparkSession.

    E.g., SparkSession.builder().getOrCreate()

  2. DataFrame replaced with Dataset[Row]: In Spark 2.0, DataFrame is no longer a separate class; in the Scala API it is just a type alias for Dataset[Row]. Wherever we were using DataFrame we will use Dataset[Row] for Spark SQL, or Dataset[_] for MLlib (see the sketch after this list).
    E.g., in Spark…
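
A minimal sketch illustrating both points, assuming a local Spark 2.x setup (the app name is hypothetical):

import org.apache.spark.sql.{Dataset, Row, SparkSession}

// 1. One entry point instead of separate SparkContext/SQLContext
val spark = SparkSession.builder()
  .appName("MigrationExample")
  .master("local[*]")
  .getOrCreate()

val sc = spark.sparkContext      // the old SparkContext is still reachable
val sqlContext = spark.sqlContext

// 2. In the Scala API, DataFrame is now just an alias for Dataset[Row]
val df: Dataset[Row] = spark.range(5).toDF("id")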

View original post 164 more words


Finding the Impact of a Tweet using Spark GraphX

Knoldus

Social Network Analysis (SNA), the process of investigating social structures using networks and graphs, has become a very hot topic nowadays. Using it, we can answer many questions, like:

  • How many connections does an individual have?
  • What is the ability of an individual to influence a network?
  • and so on…

Answers to these questions can be used for conducting marketing research studies, running ad campaigns, and finding the latest trends. So it becomes very crucial to identify the impact of an individual, or group of individuals, in a social network, so that we can identify key individuals, or Alpha Users (a term used in SNA).

In this post we are going to see how to find the impact of an individual in a social network like Twitter, i.e., how many Twitter users an individual can influence via his/her Tweet up to N levels (followers of followers of followers… and so on). For this…
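
The original post builds this with Spark GraphX. As a rough sketch of one way to compute reach up to N levels with the Pregel API (the user IDs and follower edges below are made up for illustration):

import org.apache.spark.graphx.{Edge, EdgeDirection, Graph, VertexId}

// Hypothetical follower graph: an edge (u, v) means a tweet by u reaches follower v.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(1L, 3L, 1),   // user 1 is followed by users 2 and 3
  Edge(2L, 4L, 1), Edge(3L, 5L, 1)    // their followers in turn
))
val graph = Graph.fromEdges(edges, defaultValue = Int.MaxValue)

val sourceUser: VertexId = 1L
val maxLevels = 2

// Breadth-first search with Pregel: each vertex value becomes its hop distance from the source.
val distances = graph
  .mapVertices((id, _) => if (id == sourceUser) 0 else Int.MaxValue)
  .pregel(Int.MaxValue, maxIterations = maxLevels, activeDirection = EdgeDirection.Out)(
    (_, dist, msg) => math.min(dist, msg),
    triplet =>
      if (triplet.srcAttr != Int.MaxValue && triplet.srcAttr + 1 < triplet.dstAttr)
        Iterator((triplet.dstId, triplet.srcAttr + 1))
      else Iterator.empty,
    (a, b) => math.min(a, b)
  )

// Impact = number of users reachable within maxLevels hops, excluding the source itself.
val impact = distances.vertices
  .filter { case (id, d) => id != sourceUser && d <= maxLevels }
  .count()
println(s"User $sourceUser can influence $impact users up to $maxLevels levels")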

View original post 572 more words


KnolX: Introduction to Apache Spark 2.0

Knoldus

Knoldus organized a KnolX session on Friday, 23 September 2016. In that one-hour session we got an introduction to Apache Spark 2.0 and its APIs.

Spark 2.0 is a major release of Apache Spark. This release brought many changes to Spark's APIs and libraries. So in this KnolX, we looked at some of the improvements made in Spark 2.0. We also got an introduction to some new features in Spark 2.0, like the SparkSession API and Structured Streaming.

The slides for the session are as follows:

Below is the YouTube video for the session.


View original post
