Partition-Aware Data Loading in Spark SQL

Knoldus

Data loading, in Spark SQL, means loading data into the memory/cache of the Spark worker nodes, for which we usually write the following code:

import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "username")
connectionProperties.put("password", "password")

// Read the whole table over JDBC into a DataFrame
val jdbcDF = spark.read
  .jdbc("jdbc:postgresql:dbserver", "schema.table", connectionProperties)

Here we are using the jdbc function of Spark SQL's DataFrameReader API to load the data from a table into the Spark executors' memory, no matter how many rows the table contains.

Here is an example of the jdbc method in action:

val df = spark.read.jdbc("jdbcUrl", "person", connectionProperties)

In the code above, the data will be loaded into the Spark cluster, but only one worker will iterate over the table and try to load the whole dataset into its memory, because only one partition is created. That might work if the table contains a few hundred thousand records. This single-partition behaviour can be confirmed by the following snapshot of the Spark UI:

[Spark UI snapshot showing a single stage]

But what will happen if the table that needs to be…
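For completeness, here is a hedged sketch (not the original post's code) of the partition-aware overload of jdbc that the post builds towards; the column name, bounds, and partition count below are illustrative and must correspond to a real numeric column in your table:

// Partition-aware read: Spark splits the "id" range into numPartitions
// slices and issues one JDBC query per partition, so the workers share
// the load instead of a single task reading the whole table.
val partitionedDF = spark.read.jdbc(
  url = "jdbc:postgresql:dbserver",
  table = "person",
  columnName = "id",          // numeric column to partition on (illustrative)
  lowerBound = 1L,            // minimum value of that column
  upperBound = 1000000L,      // maximum value of that column
  numPartitions = 10,         // number of partitions / parallel reads
  connectionProperties = connectionProperties
)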

View original post 293 more words

Posted in Scala

2017 – Year of FAST Data

Knoldus

As we approach 2017, there is a strong focus on Fast Data: a combination of data at rest and data in motion, where the processing has to be remarkably fast. In the deck that follows, we at Knoldus present how we have implemented a complex, multi-scale solution for a large bank on the Fast Data architecture philosophy. As we partner with Databricks, Lightbend, Confluent and Datastax, we bring in the best practices and tooling needed for the platform.

Just think about it: if Google were only indexing and serving results from data at rest, you would never be able to Google for breaking or trending news!

As you enjoy the deck below, we stand by to hear from you about your next fast data project. Have a wonderful New Year 2017!


View original post

Posted in Scala

Migration From Spark 1.x to Spark 2.x

Knoldus

Hello Folks,

As we know, we now have the latest release, Spark 2.0, with many enhancements and new features. If you are using Spark 1.x and want to move your application to Spark 2.0, you have to take care of some changes that happened in the API. In this blog we are going to get an overview of the common changes:

  1. SparkSession: Earlier we created a SparkContext and an SQLContext separately, but in Spark 2.0 we have SparkSession, which is the single entry point for programming Spark with the Dataset and DataFrame API. Both the SparkContext (sc) and the SQLContext can be obtained from the SparkSession (see the sketch after this list).

    E.g. SparkSession.builder().getOrCreate()

  2. DataFrame replaced by Dataset[Row]: In Spark 2.0, DataFrame is no longer a separate class; in the Scala API it is simply a type alias for Dataset[Row]. Wherever we were using DataFrame, we can use Dataset[Row] for Spark SQL, or Dataset[_] for MLlib.
    E.g. In Spark…
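As a minimal sketch of point 1 above (the app name and master are placeholders, not from the original post):

import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Single entry point in Spark 2.0
val spark = SparkSession.builder()
  .appName("MigrationExample")   // illustrative settings
  .master("local[*]")
  .getOrCreate()

// The old contexts are still reachable from the session
val sc = spark.sparkContext
val sqlContext = spark.sqlContext

// DataFrame is now just a type alias for Dataset[Row]
val ds: Dataset[Row] = spark.range(5).toDF("n")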

View original post 164 more words

Posted in Scala

Finding the Impact of a Tweet using Spark GraphX

Knoldus

Social Network Analysis (SNA), the process of investigating social structures using networks and graphs, has become a very hot topic nowadays. Using it, we can answer many questions, like:

  • How many connections does an individual have?
  • What is the ability of an individual to influence a network?
  • and so on…

These answers can be used for conducting marketing research studies, running ad campaigns, and finding out the latest trends. So it becomes very crucial to measure the impact of an individual or individuals in a social network, so that we can identify key individuals, or Alpha Users (a term used in SNA).

In this post we are going to see how to find the impact of an individual in a social network like Twitter, i.e., how many Twitter users an individual can influence via his/her Tweet up to N levels, i.e., followers of followers of followers… and so on. For this…
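The full walkthrough is in the original post; below is only a minimal, hedged sketch of the idea on toy data (assuming an existing SparkContext sc), using a hop-bounded BFS over the follower graph with GraphX's Pregel API:

import org.apache.spark.graphx.{Edge, Graph, VertexId}

// Toy "influences" edges: user 1 tweets, edges point from a user to a follower
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 4L, 1)
))
val graph = Graph.fromEdges(edges, defaultValue = Int.MaxValue)

val source: VertexId = 1L   // the tweeting user
val maxHops = 2             // N levels of followers

// Hop-bounded BFS: each vertex ends up holding its distance from the source
val distances = graph
  .mapVertices((id, _) => if (id == source) 0 else Int.MaxValue)
  .pregel(Int.MaxValue, maxIterations = maxHops)(
    (_, dist, incoming) => math.min(dist, incoming),   // keep the shortest distance
    triplet =>                                         // propagate distance + 1 to followers
      if (triplet.srcAttr != Int.MaxValue && triplet.srcAttr + 1 < triplet.dstAttr)
        Iterator((triplet.dstId, triplet.srcAttr + 1))
      else Iterator.empty,
    (a, b) => math.min(a, b)                           // merge concurrent messages
  )

// Impact of the tweet = users reachable within maxHops, excluding the source
val impact = distances.vertices
  .filter { case (id, d) => id != source && d <= maxHops }
  .count()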

View original post 572 more words

Posted in Scala

KnolX: Introduction to Apache Spark 2.0

Knoldus

Knoldus organized a KnolX session on Friday, 23 September 2016. In that one-hour session, we got an introduction to Apache Spark 2.0 and its APIs.

Spark 2.0 is a major release of Apache Spark that has brought many changes to Spark's APIs and libraries. So in this KnolX, we looked at some of the improvements made in Spark 2.0, and got an introduction to some of its new features, like the SparkSession API and Structured Streaming.
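As a small illustrative taste of one of those features (this example is not from the session itself), Structured Streaming expresses a streaming computation with the familiar DataFrame/Dataset API; here is the canonical socket word count:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("StructuredStreamingDemo")  // illustrative settings
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Treat lines arriving on a socket as an unbounded table
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Classic word count, written exactly like a batch query
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Continuously print the full updated counts to the console
counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()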

The slides for the session are as follows:

Below is the YouTube video for the session.


View original post

Posted in Scala

Spark Session: New Entry point in Spark 2.0

Knoldus

Finally, after a long wait, Apache Spark 2.0 was released on Tuesday, 26 July 2016. This release is built upon the feedback gathered from industry over the past two years regarding Spark and its APIs. This means it keeps everything Spark developers loved to use, while what developers did not like has been removed.

Since Spark 2.0 is a major release of Apache Spark, it contains major changes to Spark's APIs and libraries. In order to understand the changes in Spark 2.0, we will be looking at them one by one. So, let's start with the SparkSession API.

For a long time, Spark developers were confused between SQLContext and HiveContext, i.e., when to use which. Since HiveContext was richer in features than SQLContext, many developers were in favor of using it; but HiveContext required many dependencies to run, so some favored SQLContext.

To end this confusion, the founders of Spark came up…
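The excerpt is cut off here, but as the post's title says, the answer is SparkSession: it subsumes both contexts, with Hive functionality reduced to a single builder flag. A minimal sketch (settings illustrative):

import org.apache.spark.sql.SparkSession

// One entry point instead of choosing between SQLContext and HiveContext
val spark = SparkSession.builder()
  .appName("SparkSessionDemo")
  .enableHiveSupport()   // opt in to HiveContext-style features when needed
  .getOrCreate()

// The same SQL surface that both old contexts offered
val df = spark.sql("SELECT 1 AS one")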

View original post 180 more words

Posted in Scala

Boost Factorial Calculation with Spark

Knoldus

We all know that Apache Spark is a fast and general engine for large-scale data processing. It can process data up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

But is that (i.e., MapReduce-style processing) the only task for which Spark can be used? The answer is no. Spark is not only a Big Data processing engine; it is a framework which provides a distributed environment to process data. This means we can perform many kinds of tasks using Spark.

For example, let's take factorial. We all know that calculating the factorial of large numbers is cumbersome in any programming language, and on top of that, the CPU takes a lot of time to complete the calculation. So, what can be the solution?

Well, Spark can be the solution to this problem. Let's see that in the form of code.

First, we will try to implement factorial using only Scala in a Tail…
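The post's own code is truncated above; as a hedged sketch of both approaches (not the original code), here is a tail-recursive BigInt factorial in plain Scala next to a distributed version that multiplies a range of numbers across partitions:

import scala.annotation.tailrec

// Plain Scala: tail-recursive factorial; BigInt avoids overflow
@tailrec
def factorial(n: BigInt, acc: BigInt = 1): BigInt =
  if (n <= 1) acc else factorial(n - 1, acc * n)

// Spark (assuming an existing SparkContext sc): distribute 1..n across
// partitions; each partition multiplies its slice locally, then the
// partial products are combined, parallelising the expensive BigInt math
def sparkFactorial(n: Long, partitions: Int = 4): BigInt =
  sc.range(1L, n + 1, numSlices = partitions)
    .map(BigInt(_))
    .reduce(_ * _)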

View original post 97 more words

Posted in Scala