Big data is the collection of large datasets from traditional and digital sources to identify trends and patterns. The quantity and variety of computer data are growing exponentially for many reasons. For example, retailers are building vast databases of customer sales activity, organizations are working on logistics and financial services, and public social media are sharing a vast quantity of sentiments related to sales prices and products. Challenges of big data include volume and variety in both structured and unstructured data. In this paper, we implemented several machine learning models through Spark MLlib using PySpark, which is scalable, fast, and easily integrated with other tools, and which performs better than traditional models. We studied the stocks of 10 top companies, whose data include historical stock prices, with MLlib models such as linear regression, generalized linear regression, random forest, and decision tree. We also implemented naive Bayes and logistic regression classification models. Experimental results suggest that linear regression, random forest, and generalized linear regression provide an accuracy of 80%–98%. The decision tree model did not predict share price movements in the stock market well.

Big data consists of massive structured and unstructured information that can be processed by machine learning models, unlike earlier structures such as relational databases [

Spark provides a basic machine learning library that is designed for scalability and supports languages such as Scala, R, and Python. Using machine learning in a sequential environment has many weaknesses, such as long execution times with large datasets, task dependency, non-scalability, and limited memory. Spark provides tools to address these problems, and it facilitates data engineering and data science [

External factors such as social media [

Most investors analyze market and company information before buying stocks. Market-related information is available on social media and from news, blogs, and companies’ customer reviews. Many investors analyze such information to predict the movements of stock prices. The manual analysis of this information is prone to errors [

We present models to predict market trends. The machine learning algorithms take different approaches, but all are applied in the Spark framework using Python on the Databricks platform. We discuss both structured and unstructured datasets.

Machine learning models such as linear regression, decision tree, random forest, and generalized linear regression are applied to structured datasets. The main inputs are closing prices and adjusted closing prices, which are used to predict the behavior of stock prices. Logistic regression and naive Bayes models are applied to unstructured datasets such as messages and reviews of customers and investors.

We compare the results of the above models in predicting the intrinsic values of share prices. We use existing and new machine learning classification techniques on the modern platforms Databricks and PySpark with Python. We use a 10-year history dataset to predict stock price movements. We include daily news and social media (Yahoo! Finance, Twitter) datasets for use in sentiment analysis to predict the intrinsic values of share prices.

The remainder of this paper is organized as follows. Section 1 discusses the background and previous work. Section 2 discusses the proposed model and defines the datasets (structured and unstructured) and Spark machine learning libraries. Section 3 discusses the performance and results of the models and compares their results. Section 4 discusses our conclusions and future work.

Forecasting is a challenge as investors seek to accurately predict market values, and many models have been proposed. Every researcher tries to accurately predict the market. Stock market prediction is based on various techniques related to structured and unstructured data [

Khan [

Hong mentioned that scientific analysis applies to historical data where mathematical approaches are used to forecast the movements of the stock market [

Peng [

Misra et al. [

Seif et al. [

In this article, we describe artificial intelligence, neural network, and machine learning algorithms for predicting movements of stock prices. This article divides the methods into two parts: prediction and classification. Prediction techniques such as ANN, CNN, naive Bayes, NN, and decision support system (DSS) are used to build a predictive model. The second part describes classification methods such as filtering, fuzzy-based optimization, and KNN, which are applied to some datasets to evaluate the value of the stock market. The neural network method of DSS is used in the hybrid model to predict the movements of stock prices. Yang et al. [

Yang et al. proposed a model to forecast the market value with big data. The researchers collected data from real-time processing using social media [

In our study, we rely on different sources of data, both structured and unstructured, to predict the future movements of stock prices. We have employed several models that produced better results compared with the models applied in the literature.

We propose a model to help investors decide which shares to buy and sell. The model aims to predict stock price movements.

This model is applied to historical price data, Twitter messages, and news related to different companies. The historical data and news for each company are used to predict the future values of its stock.

We use two types of datasets. First, we collect the historical datasets of companies over the last 15 years. The second dataset is collected from news, blogs, Twitter, Yahoo! Finance, and reviews and messages about different companies, along with sources such as Google Dataset Search and

Yahoo! Finance [

Stock symbol | Company name |
---|---|
AAPL | Apple |
Yahoo | Yahoo! |
AMZN | Amazon |
Gold | Barrick Gold Corp. |
FB | |
IBM | International Business Machines |
DELL | Dell Technologies |
GOOG | Alphabet |
NFLX | Netflix |

The adjusted closing price (Adj-Close) is the price at the end of the trading day or session adjusted for companies’ actions such as stock dividends or stock splits. Researchers may use the Adj-Close value to examine expected movements of share prices. Many articles use Adj-Close to forecast the market’s exact value.

Date | Open | High | Low | Close | Adj-close | Volume |
---|---|---|---|---|---|---|
9/1/2019 | 50.0505 | 52.08208 | 48.02803 | 50.22022 | 50.22022 | 4465900 |
9/2/2019 | 50.5555 | 54.59459 | 50.3003 | 54.20921 | 54.20921 | 22824300 |

Data for sentiment analysis can be collected from Twitter, Yahoo! Finance messages,

Investors view companies’ profiles, historical trading data, news, analysts’ opinions, financial statements, and messages to identify which company share values have increased or decreased. Machine learning techniques are applied to these datasets to predict the movements of stock prices. Analysis of messages helps investors to plan their trading.

We collect data from Yahoo! Finance and

The X value is a better solution to fill the missing value in the movement of the market. A high-low price percentage is an important formula to forecast the stock market.

Both formulae feed the MLlib models used to predict the future values of stock prices.
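As an illustration only, the sketch below assumes that the high-low percentage is the day's trading range divided by its closing price, and it adds a second, equally assumed open-to-close percentage change. Neither definition is confirmed by the text, and the function names `high_low_pct` and `pct_change` are hypothetical. The two sample rows are taken from the historical-price table above.

```python
def high_low_pct(high, low, close):
    """Daily trading range as a percentage of the close (assumed definition)."""
    return (high - low) / close * 100.0


def pct_change(open_price, close):
    """Open-to-close move as a percentage of the open (assumed definition)."""
    return (close - open_price) / open_price * 100.0


# Rows from the sample historical data: (date, open, high, low, close)
rows = [
    ("9/1/2019", 50.0505, 52.08208, 48.02803, 50.22022),
    ("9/2/2019", 50.5555, 54.59459, 50.3003, 54.20921),
]

# One feature record per trading day.
features = [
    {"date": d, "hl_pct": high_low_pct(h, l, c), "pct_change": pct_change(o, c)}
    for d, o, h, l, c in rows
]

for row in features:
    print(row)
```

Derived columns of this kind are numeric, so they can sit beside the raw price columns as model inputs.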

We apply some text processing techniques to unstructured datasets such as social media and news. Machine learning models cannot consume raw unstructured text; hence, techniques such as tokenization and text cleaning are employed, and spam is removed. In Databricks, we used Spark with Python to process datasets. A dataset can be read using a Spark context. Datasets are converted to resilient distributed datasets (RDDs), a Spark structure whose data are partitioned across the nodes of a cluster.

We remove unwanted columns and convert non-numeric values to numeric ones.

We have employed several feature-selection and feature-extraction techniques to avoid unnecessary complexity in our model and to select features that increase its accuracy. We convert each RDD record to a dense vector, after which the data frame has two columns, label and features. The data frame is then divided into training and testing data frames. We apply MLlib in Spark with Python.
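The label/features layout and the train/test split can be sketched in plain Python, without Spark, as follows. The helper names, the 80/20 split, the seed, and the sample records are all assumptions made for illustration. In PySpark a comparable split is available through `DataFrame.randomSplit`.

```python
import random


def to_labeled_rows(records, label_key, feature_keys):
    """Turn dict records into (label, features) pairs with dense feature lists."""
    return [(r[label_key], [float(r[k]) for k in feature_keys]) for r in records]


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows with a fixed seed and cut it into two parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical records: predict the adjusted close from the other price columns.
records = [
    {"open": 50.0 + i, "high": 52.0 + i, "low": 48.0 + i, "adj_close": 50.2 + i}
    for i in range(10)
]

rows = to_labeled_rows(records, "adj_close", ["open", "high", "low"])
train, test = train_test_split(rows, train_fraction=0.8)
print(len(train), len(test))
```

For price series, a chronological cut (train on earlier dates, test on later ones) may be a safer choice than a random shuffle, because it avoids training on data from after the test period.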

Sentiment analysis can be applied to unstructured datasets such as messages, news, and data extracted from social media. Applied to stock-market news and messages, it helps investors identify share price movements.

ML algorithms can be used to forecast the next value of the stock market. Linear regression (LR), naive Bayes (NB), and decision tree (DT) are used to forecast the movement in stock prices. These models are applied in the Spark framework using Python on the Databricks platform, and their results are compared to find the models that give the most accurate values of the stock index.

Data are analyzed using models supported by MLlib, such as naive Bayes and the linear support vector machine. Before using a model, we must process the datasets.

We use Spark with Python to process datasets in Databricks. A dataset is read through a Spark context and converted to resilient distributed datasets (RDDs), a Spark structure whose data are partitioned across the nodes of a cluster. We then convert each RDD record to a vector and construct a data frame with two columns, label and message. The message column may hold news or social-media text relevant to the stock market.

Tokenization splits text into tokens on which the classification models operate. We apply the tokenization function to a dataset after removing unnecessary words.

We remove unnecessary colons and stop-words that can affect the model results. This can be done using an NLP toolkit, in which each word is compared against a dictionary of stop-words and matching words are removed.
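A minimal plain-Python sketch of tokenization followed by stop-word removal is given below. The regular expression, the tiny stop-word set, and the sample message are illustrative assumptions. Spark MLlib offers `Tokenizer` and `StopWordsRemover` for the same two steps on data frames.

```python
import re

# A tiny illustrative stop-word list; a real toolkit ships a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "on", "for", "its"}


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


# Hypothetical investor message; punctuation such as the colon is discarded
# by the tokenizer because it is not matched by the pattern.
message = "Apple is expected to raise its dividend: analysts are bullish on the stock"
tokens = remove_stop_words(tokenize(message))
print(tokens)
```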

Term frequency (TF) and inverse document frequency (IDF) tools have been used in the analysis. TF-IDF is a feature function that assigns a weight to each word, reflecting how important the word is to a document relative to the whole collection. After applying all these functions to the data frame, we split it into training and testing sets. We apply the machine learning library using Spark MLlib with Python. We next describe the machine learning models.
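The weighting can be illustrated with a short plain-Python sketch. It assumes raw counts for TF and the smoothed form log((N + 1) / (df + 1)) for IDF, and the three toy documents are hypothetical. In Spark MLlib the corresponding feature transformers are `HashingTF` (or `CountVectorizer`) and `IDF`.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    TF is the raw count of a term in a document. IDF is log((N + 1) / (df + 1)),
    where N is the number of documents and df is the number of documents
    that contain the term (an assumed smoothing choice).
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in documents
    ]


# Hypothetical tokenized messages.
docs = [
    ["apple", "stock", "rises"],
    ["apple", "stock", "falls"],
    ["dell", "stock", "rises"],
]
weights = tf_idf(docs)
print(weights[1])
```

A word such as "stock" that occurs in every document receives zero weight, whereas a word that occurs in only one document receives the largest weight.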

Linear regression (LR) is used to find the relationship between the independent and dependent labels, which helps to forecast values. LR can work with more than one independent label. We have employed multiple linear regression to examine the associations between the independent and dependent labels. Suppose that a and b are the independent and dependent labels, respectively; the following is the regression equation:
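Assuming the standard linear form, with $a$ as the independent label and $b$ as the dependent label (an assumption about the intended roles), the regression equation can be written as below. Here $\beta_0$ is the intercept, $\beta_1$ the slope, and $\varepsilon$ the error term; the second expression is the usual extension to $k$ independent labels $a_1, \dots, a_k$.

```latex
b = \beta_0 + \beta_1 a + \varepsilon
\qquad\text{and, with several independent labels,}\qquad
b = \beta_0 + \sum_{j=1}^{k} \beta_j a_j + \varepsilon
```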

The same concept is used by the LR model in Spark. LR is a supervised machine learning model that forecasts the value of the stock price: the target (dependent) value is predicted from the independent values. This model can be applied to different companies’ datasets to forecast the future values of stock prices. The result of Apple (AAPL) is presented in
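As a rough sketch of how such a model is fitted, the plain-Python code below applies ordinary least squares with a single independent variable. The price values are hypothetical, and the code only illustrates the idea; Spark MLlib's `LinearRegression` estimator handles multiple features and distributed data.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Evaluate the fitted line at x."""
    return intercept + slope * x


# Hypothetical data: closing price (x) against next-day adjusted close (y).
closes = [50.0, 51.0, 52.0, 53.0, 54.0]
next_adj_closes = [50.5, 51.5, 52.5, 53.5, 54.5]

intercept, slope = fit_simple_linear_regression(closes, next_adj_closes)
print(round(intercept, 6), round(slope, 6))
print(round(predict(intercept, slope, 55.0), 6))
```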

We have also used the decision tree (DT) model, a supervised machine learning algorithm that works on both regression and classification tasks. In this model, the data in each dataset are split into multiple classes according to their features. The DT model generally does not outperform the random forest. We use this model with Spark to make the data ready for analysis.

Generalized linear regression (GLR) is more adaptable than other LR models. Unlike linear regression, this model does not require data to follow a normal distribution. The GLR model with Spark provides more accurate estimates than a linear regression model.

Random forest (RF) is a supervised machine learning algorithm. The RF model is similar to the decision tree (DT) model. Nevertheless, it builds multiple trees on the same dataset, calculates the forecast of every single tree, and combines these forecasts.
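The idea of training many trees and combining their forecasts can be sketched with bagged one-split trees, as below. This is a simplified stand-in and not a full random forest: a real implementation grows deeper trees and also samples a subset of features at each split. All names and data are hypothetical.

```python
import random


def fit_stump(points):
    """Fit a depth-1 regression tree: one threshold on x, a mean on each side."""
    points = sorted(points)
    overall = sum(y for _, y in points) / len(points)
    best = (float("inf"), None, overall, overall)  # (error, threshold, left, right)
    for i in range(1, len(points)):
        if points[i][0] == points[i - 1][0]:
            continue  # no threshold fits between equal x values
        left = [y for _, y in points[:i]]
        right = [y for _, y in points[i:]]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((y - left_mean) ** 2 for y in left) + sum(
            (y - right_mean) ** 2 for y in right
        )
        if error < best[0]:
            threshold = (points[i - 1][0] + points[i][0]) / 2
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    """Return the mean of the side of the threshold on which x falls."""
    threshold, left_mean, right_mean = stump
    if threshold is None:  # degenerate sample: fall back to the overall mean
        return left_mean
    return left_mean if x <= threshold else right_mean


def fit_forest(points, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(points) for _ in points]) for _ in range(n_trees)]


def predict_forest(forest, x):
    """Average the forecasts of the individual trees."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)


# Hypothetical data: low feature values map to about 10, high values to about 20.
data = [(1, 10.0), (2, 10.5), (3, 9.5), (7, 20.0), (8, 20.5), (9, 19.5)]
forest = fit_forest(data)
print(predict_forest(forest, 2), predict_forest(forest, 8))
```

Averaging over trees trained on different resamples tends to reduce the variance of a single tree, which is one common explanation for why a forest often outperforms one decision tree.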

The naive Bayes (NB) classifier is a classification model that applies Bayes' theorem under the assumption that features are independent. We apply this model to analyze text data collected from the Yahoo! Finance database to predict the movements of stock prices based on investors’ reviews. The results of the NB model are more accurate than the results of the logistic regression model.
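A minimal sketch of a multinomial naive Bayes text classifier with additive (Laplace) smoothing is shown below in plain Python. The "up"/"down" labels, the training messages, and the function names are hypothetical. Spark MLlib provides a `NaiveBayes` classifier for the same task on data frames.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples, alpha=1.0):
    """Fit multinomial naive Bayes; samples is a list of (label, tokens) pairs."""
    class_counts = Counter(label for label, _ in samples)
    word_counts = defaultdict(Counter)
    for label, tokens in samples:
        word_counts[label].update(tokens)
    vocab = {t for _, tokens in samples for t in tokens}
    return class_counts, word_counts, vocab, alpha


def predict_naive_bayes(model, tokens):
    """Return the label with the highest log posterior for the tokens."""
    class_counts, word_counts, vocab, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        score = math.log(count / total)  # log prior
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][t] + alpha) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical labeled messages: did the price move up or down afterwards?
samples = [
    ("up", ["strong", "earnings", "beat"]),
    ("up", ["record", "sales", "growth"]),
    ("down", ["weak", "earnings", "miss"]),
    ("down", ["sales", "decline", "lawsuit"]),
]
model = train_naive_bayes(samples)
print(predict_naive_bayes(model, ["record", "growth"]))
print(predict_naive_bayes(model, ["weak", "decline"]))
```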

Logistic regression is a model that works on a discrete set of values and text. It is a simple model that works with probabilities and, like the NB model, is used for classification. We use this model to predict stock market prices using Spark MLlib. We apply text classification and tokenization before applying this model. The logistic model works better than other models in predicting the movements of stock prices.
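The probabilistic nature of logistic regression can be illustrated with a small plain-Python sketch trained by batch gradient descent. The two features (counts of positive and negative words in a message), the learning rate, and the number of epochs are assumptions made for the example. Spark MLlib's `LogisticRegression` estimator would normally be used instead.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=500):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(d):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict_proba(model, x):
    """Probability that the example belongs to class 1."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical features: [positive word count, negative word count] per message.
features = [[2, 0], [3, 1], [1, 0], [0, 2], [1, 3], [0, 1]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = price moved up, 0 = price moved down

model = train_logistic_regression(features, labels)
print(predict_proba(model, [3, 0]), predict_proba(model, [0, 3]))
```

Thresholding the predicted probability at 0.5 turns the output into an up/down class label.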

In this section, we compare the results of all models and highlight the model that produced the most accurate results in predicting the future values of stock prices.

We have employed several models in predicting the future values of stock prices.

The above results show which machine learning models are more accurate in predicting stock price movements. Accuracy is measured by the coefficient of determination (R2) and the root mean square error (RMSE), which help to identify the model that best predicts the movement of stock prices.
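Both metrics can be computed directly from actual and predicted prices, as in the plain-Python sketch below. The sample values are hypothetical. In Spark MLlib, `RegressionEvaluator` reports the same quantities with the metric names `rmse` and `r2`.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared residual."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical closing prices and model forecasts.
actual = [50.0, 52.0, 54.0, 56.0]
predicted = [51.0, 52.0, 53.0, 56.0]

print(round(rmse(actual, predicted), 4))
print(round(r_squared(actual, predicted), 4))
```

A lower RMSE and an R2 closer to 1 both indicate a better fit. RMSE is expressed in the units of the price itself, whereas R2 is unitless.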

| Model | R2 | RMSE | Accuracy (%) |
|---|---|---|---|
| | 0.098 | 3.143 | 95 |
| | 0.63 | 5.224 | 37 |
| | 0.995 | 6.677 | 89 |
| | 0.998 | 3.114 | |

The results show that Spark MLlib models using big data produce more accurate results than the other models, and that LR and GLR are more accurate than other models. We apply these techniques to larger historical datasets, and we obtain better accuracy.

| Models/companies | AAPL (%) | Yahoo (%) | Gold (%) | FB (%) | IBM (%) | MSFT (%) | DELL (%) | NFLX (%) | GOOG (%) |
|---|---|---|---|---|---|---|---|---|---|
| | 95 | 83 | 99 | 98 | 91 | 99 | 95 | 94 | 98 |
| | 37 | 17 | 48 | 65 | 75 | 54 | 68 | 53 | 86 |
| | 89 | 45 | 92 | 91 | 80 | 93 | 93 | 95 | 99 |
| | 97 | 81 | 99 | 97 | 91 | 99 | 93 | 93 | 97 |

We report the results obtained from messages and customer reviews about the AAPL stock price. These datasets help investors to identify increases and decreases in stock prices.

We can apply the naive Bayes and logistic regression classifier models to news and messages to predict the movement of stock prices. Naive Bayes gives approximately 60%–70% accuracy, and the two models give similar accuracy. Nevertheless, the logistic regression classifier is better at predicting stock price movements, as it provides 60% to 80% accuracy.

Model | AAPL (%) | DELL (%) |
---|---|---|
Naive Bayes | 80 | 79 |
Logistic Regression | 70 | 77 |

We employed several machine learning models to predict stock price movements through the Spark big data framework. We applied the Spark MLlib machine learning library to historical data for 10 companies. The results indicate that linear regression, random forest, and generalized linear regression produced more accurate results than the decision tree model. Naive Bayes and logistic regression, applied to text data, show approximately 77% to 80% accuracy. We suggest employing deep learning models such as LSTM in future research.