Thursday, May 24, 2018

Streaming moving averages with Apache Spark

Target audience: Intermediate
Estimated reading time: 5'

Processing data in real-time is not without challenges. Streaming component of
Apache Spark as seen significant improvements since its formal introduction in version 1.0.
The creation of models through supervised learning from a given training sets usually requires the application of some pre-processing algorithm such as smoothing, filtering or extrapolating existing data for missing values. 

Filtering by Moving Average

Computing the moving average of a time series is a simple and convenient technique used to filter out noise and reduce the impact of outliers on trend lines. For this post, we consider the simplest form of moving average over a period of p values, as defined in the following formula \[\tilde{x_{t}}= \frac{1}{p}\sum_{j=t-p+1}^{t}x_{j}\]
The iterative (or recursive) form is defined as \[\tilde{x_{t}}=\tilde{x_{t-1}}+\frac{1}{p}(x_{t}-x_{t-p})\]

Spark streaming