Data Visualization – Rukshar's Datascience Projects

EDA and NLP with Disaster Tweets

Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter. But it’s not always clear whether a person’s words are actually announcing a Continue reading “EDA and NLP with Disaster Tweets”

Vector Auto Regression for Multivariate Time Series on Air Quality Data

A multivariate time series has more than one time-dependent variable, where each variable depends both on its own past values and on the other variables. For example, a dataset may record perspiration percent, dew point, wind speed, cloud cover percentage, etc. along with temperature values for the past two years, so multiple variables are available for predicting temperature. For multivariate Continue reading “Vector Auto Regression for Multivariate Time Series on Air Quality Data”

Grid Search ARIMA Model for Best Parameters for US Border Crossing Data

GitHub code: https://github.com/rukshar69/Datascience-Projects/tree/master/border%20crossing In this article, we forecast the number of crossings into the USA across the Canadian and Mexican borders for a given period of time. The dataset used is the Bureau of Transportation Statistics (BTS) Border Crossing Data, which provides summary statistics for inbound crossings at the U.S.-Canada and the U.S.-Mexico border Continue reading “Grid Search ARIMA Model for Best Parameters for US Border Crossing Data”

Time Series Forecasting Covid-19 By Using ARIMA

GitHub Code: https://github.com/rukshar69/Datascience-Projects/tree/master/Covid%20Arima Here, we use the ARIMA model for time series analysis to predict the cumulative number of confirmed COVID-19 cases and fatalities in various countries, based on infection and fatality data gathered from 22nd January to 31st March. The dataset has 6 columns, of which we focus on ‘Country_Region’, ‘Date’, ‘ConfirmedCases’, and Continue reading “Time Series Forecasting Covid-19 By Using ARIMA”

Particle Level Prediction Using LSTM

GitHub: https://github.com/rukshar69/Datascience-Projects/tree/master/Air%20Particles%20Madrid Long Short-Term Memory (LSTM) networks, a type of Recurrent Neural Network (RNN) designed to capture long-term dependencies, work well on time series problems. They can pick up trends in the data and predict relevant outputs, so LSTMs can be used for predicting pollution levels and related trends. By implementing LSTM here Continue reading “Particle Level Prediction Using LSTM”

Econometric Time Series Analysis of Home Sales

GitHub: https://github.com/rukshar69/Datascience-Projects/tree/master/Home%20Sales%20Time%20Series%20Sarima We analyze a time series dataset called ‘City Home Price Sales’ that has trend and seasonal components. Time series data is a sequence of data points measured at regular time intervals (hourly, daily, weekly, etc.) and can be written as a function of time, y = f(t). The statistical properties of the data are modeled using an econometric approach. The objective is to forecast future sales for a given time period. We use SARIMA (Seasonal ARIMA) to model seasonality and trend together.
The Data: The data is a monthly home sales index collected from 20 major US cities over the course of 20 years, from 2000 to 2019. The link to the data: (https://fred.stlouisfed.org/series/SPCS20RPSNSA) There are 2 versions of the data: seasonally adjusted and not seasonally adjusted. The seasonally adjusted version has the seasonal component removed. We use the version that is not seasonally adjusted because we want to model both the seasonality and the trend.
Seasonally adjusted data is used in industry to find the true effect of economic events when that effect overlaps with seasonal patterns. Seasonality can hide the effect or cause it to be underestimated or overestimated. For instance, heating oil producers may wish to learn the effect of decreasing petrol prices on heating oil prices. Since heating oil is a by-product of petrol, a reduction in the petrol price should be reflected in a reduction of the heating oil price. Yet heating oil prices may still increase in the winter, which can be surprising until we account for the rise in heating demand in winter. If we remove the seasonal effect from the time series, we see only a decreasing trend for heating oil prices. The slight price increase in the winter is the seasonal effect.
A glimpse of the data: There are 247 entries in the dataset.
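The heating oil example can be illustrated with a small sketch in plain Python. Everything here is an assumption made for illustration: the price series is synthetic (a steady decline plus a fixed winter bump), and the adjustment is a simple classical additive method (centered moving average plus monthly averages), not necessarily the procedure used to produce the published seasonally adjusted index.

```python
def centered_moving_average(values, period=12):
    """2 x `period` centered moving average; None where the window does not fit."""
    half = period // 2
    trend = [None] * len(values)
    for t in range(half, len(values) - half):
        window = values[t - half : t + half + 1]
        # End points get half weight so every calendar month counts exactly once.
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    return trend


def seasonal_adjust(values, period=12):
    """Return (adjusted series, estimated seasonal effect per month)."""
    trend = centered_moving_average(values, period)
    # Average the detrended values separately for each calendar month.
    buckets = [[] for _ in range(period)]
    for t, (y, m) in enumerate(zip(values, trend)):
        if m is not None:
            buckets[t % period].append(y - m)
    seasonal = [sum(b) / len(b) for b in buckets]
    # Center the seasonal effects so they sum to zero over one year.
    mean_effect = sum(seasonal) / period
    seasonal = [s - mean_effect for s in seasonal]
    adjusted = [y - seasonal[t % period] for t, y in enumerate(values)]
    return adjusted, seasonal


# Hypothetical monthly heating oil prices over 5 years (t = 0 is a January):
# a falling trend plus a bump in December, January and February.
winter_months = {0, 1, 11}
raw = [100 - 0.5 * t + (4.0 if t % 12 in winter_months else 0.0) for t in range(60)]

adjusted, seasonal = seasonal_adjust(raw)

raw_has_increases = any(b > a for a, b in zip(raw, raw[1:]))
adjusted_only_falls = all(b < a for a, b in zip(adjusted, adjusted[1:]))
```

In this synthetic case the raw series rises every time winter begins, while the adjusted series only falls, which is the decreasing trend the paragraph describes.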
Observing the time series plot, we see an increasing trend from 2000 to 2006, a decreasing trend from 2007 to 2012 corresponding to the financial crisis, and an increasing trend again until 2019. Seasonality is also visible: the housing market is not very active at the beginning of the year, sales usually peak in mid-year, and sales slow down by the end of the year. From this seasonal pattern, we can state that spring and summer are good seasons for the housing market.
From the autocorrelation (ACF) plot, which shows the correlation of the time series with its own lagged values, we see that the autocorrelations at lags 12 and 24 are significant. The autocorrelation at lag 1 is corr(y_t, y_{t-1}), at lag 2 it is corr(y_t, y_{t-2}), and at lag 12 it is corr(y_t, y_{t-12}). Here, the data point at time t is highly correlated with the one at t-12, meaning a data point is correlated with the value for the same month of the previous year. The blue shaded area in the autocorrelation plot marks the significance band: correlation coefficients that fall within the shaded area are not statistically significant at those lags.
The partial autocorrelation function (PACF) shows the partial correlation of a time series with its own lagged values. It gives the pure correlation between the series and a given lag by removing the contribution of the intermediate lags. We use the ACF and PACF of the stationary time series to choose the orders of the MA(q) and AR(p) terms of the ARIMA model, respectively. For AR, we look at the PACF plot and choose the lag value p that has a significant correlation before the correlations become insignificant. For MA, we look at the ACF plot and do the same. There is a high correlation at lags 1 and 12, as can be seen from the scatterplots.
For our model, the training data contains home sales data from 2000 to 2018 and the test data contains data from 2018 to 2019. Random sampling isn’t possible because the temporal order of the time series must be kept intact.
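As a rough sketch of the lag correlations and the chronological split described above, the following plain-Python example computes sample autocorrelations on a synthetic monthly series. The series, the chosen lags, the 12-point holdout and the simple 1.96/sqrt(n) band are all assumptions for illustration; plotting libraries may draw a somewhat different band, and this is not the home sales data.

```python
import math
import random


def autocorrelation(values, lag):
    """Sample autocorrelation corr(y_t, y_{t-lag}) of a sequence."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((y - mean) ** 2 for y in values)
    numer = sum(
        (values[t] - mean) * (values[t - lag] - mean) for t in range(lag, n)
    )
    return numer / denom


# Synthetic monthly series (hypothetical): a yearly cycle plus a little
# noise, 20 years long. The seed only makes the example repeatable.
rng = random.Random(0)
series = [
    10 * math.sin(2 * math.pi * t / 12) + rng.gauss(0, 1) for t in range(240)
]

# Autocorrelation at a few lags; 12 and 24 line up with the yearly cycle,
# while 6 compares opposite halves of the year.
acf = {lag: autocorrelation(series, lag) for lag in (1, 6, 12, 24)}

# Approximate 95% band around zero for a series with no autocorrelation.
band = 1.96 / math.sqrt(len(series))
significant_lags = [lag for lag, r in acf.items() if abs(r) > band]

# Chronological split: the last 12 observations are held out for testing.
# Nothing is shuffled, so the temporal order stays intact.
train, test = series[:-12], series[-12:]
```

Here the correlations at lags 12 and 24 come out strongly positive and the one at lag 6 strongly negative, mirroring the "same month last year" pattern described for the home sales series.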
Time series components: Trend, Seasonality, Cycles, Residuals
Time series data Y is a combination of trend, cycles, seasonality and residuals. Trend – long-term upward or downward movement. Cycle – periodic variation due to economic movements; it differs from seasonal variation because cycles occur at longer intervals, such as every 6 to 10 years, whereas seasonal variation occurs over shorter time intervals. Continue reading “Econometric Time Series Analysis of Home Sales”

Titanic: Data Visualization and Prediction using XGBoost

Here, we visualize the Titanic dataset from Kaggle to understand how its attributes may affect a passenger’s chance of survival. We then use predictive machine learning models such as XGBoost to predict, given the attributes, whether a passenger would survive the sinking of the Titanic. GitHub Link: https://github.com/rukshar69/Datascience-Projects/tree/master/Titanic We draw the Continue reading “Titanic: Data Visualization and Prediction using XGBoost”

Sentiment Analysis with Transformers

GitHub Link: https://github.com/rukshar69/Transformers/blob/master/rukshar_transformer_sentiment_analysis.ipynb Here, we fine-tune a BERT model to build a sentiment classifier using a Google Play app reviews dataset and the Transformers library by Hugging Face. A glimpse at the dataset: we use the ‘content’ and ‘score’ columns for sentiment analysis. The dataset’s shape is (15746, 11), meaning it has Continue reading “Sentiment Analysis with Transformers”

Loan Dataset: Data Visualization and Classification

Visualizing a customer loan dataset and using logistic regression, decision tree and random forest models to build and compare predictive models based on this dataset.

Playing with Iris-Dataset

Visualizing the Iris dataset and implementing various classification models, with and without dimensionality reduction using PCA.