A Music Recommendation System Based on Logistic Regression and eXtreme Gradient Boosting
Haoye Tian
School of Big Data and Software Engineering, Chongqing University
Chongqing, China
haoyetian@cqu.edu.cn

Haini Cai
School of Big Data and Software Engineering, Chongqing University
Chongqing, China
hainicai@cqu.edu.cn

Junhao Wen
School of Big Data and Software Engineering, Chongqing University
Chongqing, China

Shun Li
School of Big Data and Software Engineering, Chongqing University
Chongqing, China
lishun@cqu.edu.cn

Yingqiao Li
School of Big Data and Software Engineering, Chongqing University
Chongqing, China
yingqiaoli@cqu.edu.cn
Abstract—With the rapid growth of music industry data, it is difficult for people to find their favorite songs in a music library. Therefore, people urgently need an efficient music recommendation system to help them retrieve music. Traditional collaborative filtering algorithms have been applied to the field of music recommendation. However, collaborative filtering does not handle the data sparsity problem very well when new items are introduced. To solve this problem, some researchers use the logistic regression method as a classifier to predict users' music preferences and recommend songs. Logistic regression is a linear model and cannot capture complex non-linear data features. In this paper, we propose a hybrid LX recommendation algorithm that integrates logistic regression and eXtreme Gradient Boosting (xgboost). A series of experiments is conducted on a real music dataset to evaluate the effectiveness of our proposed LX model. Our results show that the error and AUC of the LX model are better than those of other methods.
Keywords—recommendation system, logistic regression, xgboost, music preferences
I. INTRODUCTION
With the rapid development of big data on the Internet, the problem of information overload has become increasingly serious. For example, the Spotify music library already contains 30 million songs, and the collection grows at a rate of 20,000 songs per day. It is therefore difficult for people to find their favorite songs in such a rich library. To solve this problem, personalized music recommendation systems are designed to discover suitable music for users.
Traditional collaborative filtering algorithms are widely used in music recommendation [1]. For example, Last.fm, a professional music site, uses collaborative filtering to recommend songs to its users. In the music field, a large amount of new music is added to the library every day, so cold-start problems occur frequently. However, collaborative filtering is not good at handling cold-start issues. To address this problem, content-based recommendation methods have been proposed that extract music content features [2], [3]. Although these methods solve the cold-start problem to some extent, they still fail to make full use of high-dimensional data.
In order to extract powerful features from high-dimensional data, some advanced recommendation models have been designed that exploit user and item information. Bartz et al. applied logistic regression (LR) to recommend sponsored search terms [4]. Their work shows that the overall performance of LR is slightly better than traditional collaborative filtering. However, the LR model is a simple linear model and does not handle non-linear features very well. To solve this problem, tree models were introduced because of their non-linear properties. For example, the eXtreme Gradient Boosting (xgboost) model proposed by Chen and Guestrin is widely used [5]. Xu et al. improved Chen's work to recommend products to users on a shopping site. Their model outperforms the baseline with a higher F1-score and high time efficiency [6]. However, tree models overfit easily when there are few samples or too many features.
In this paper, we design an improved tree and LR fusion model to avoid the above problems. Our approach enhances the expressive power of LR by extracting non-linear features with a tree model. As the tree component, we choose the xgboost model because of its low complexity. In addition, in order to reduce the over-fitting of xgboost and improve the generalization ability of the fusion model, we optimize the fusion part of the model structure by shrinking the dimensionality of the xgboost output. At the feature level, we construct the user profile from his/her historical listening behavior and cross the profile with song features to learn more latent information. Finally, we use these features to train the xgboost and LR fusion model to predict users' music preferences.
II. RELATED WORK
A. Logistic regression
Since the LR method was proposed, it has been applied to many fields. Bender et al. employed LR models to analyze real
retinal lesion data in medical research [7]. The work in [8] reduces the empirical risk of SVM output by using the results of logistic regression analysis and successfully predicts corporate financial distress in the financial domain. Maranzato et al. applied logistic regression and stepwise optimization to fraud detection in reputation systems [9].
In the field of recommendation systems, LR is also one of the most popular models. Quevedo et al. proposed a tag candidate set ranking method based on the LR model; the results illustrate that it performs significantly better than previous methods [10]. Wang et al. merged LR with another model to increase the F1-score by 2%-36% on a dataset provided by Alibaba's mobile shopping department [11].
B. Xgboost
Tree models are often applied to overcome the weaknesses of linear models. Several studies solved prediction problems by building decision tree models [12], [13]. In 2001, the gradient boosting decision tree (GBDT) proposed by Friedman attracted widespread interest [14]. Since then, good results have been achieved in many areas by using or improving GBDT. For example, Xie and Coggeshall used GBDT as a classifier in a data mining competition to predict hospital transfer and mortality, and their team ranked second in two tasks [15]. Zhang et al. proposed a double-ensemble gradient boosting decision tree model to predict the gap between taxi demand and supply [16]. Experiments on a real large-scale dataset prove that this method is effective on sparse data and superior to other state-of-the-art methods.
Chen made several improvements based on GBDT [5]. In terms of algorithms, xgboost improves the generalization ability of the model by optimizing the loss function and adding regularization. In terms of engineering, xgboost supports parallel computing, which improves the speed of the model. These advantages have made xgboost shine in many recent competitions.
III. THE DATASET
In this paper, our dataset comes from the Million Song Dataset (MSD) [17]. In the field of music recommendation, MSD is a professional data collection created with the support of the National Science Foundation. It gathers a large number of audio features and metadata for a million contemporary popular music tracks. Moreover, it also includes supplementary datasets, such as the Last.fm dataset, the Taste Profile subset, etc. Therefore, this dataset offers rich music data, such as audio, lyrics, tags, user behavior and so on.
Because of copyright issues related to music, we cannot directly obtain the original audio. So MSD provides a releasable version by extracting the relevant features from the audio. Other contributors have also extracted richer features as a complement. For example, the musiXmatch dataset provides a feature mapping of lyrics so that we can get important information about song text. Thanks to the contributions of these researchers, we can run many interesting experiments on MSD. The datasets used in this paper are as follows:
- The Echo Nest Taste Profile Subset: 1 million users, 380,000 songs, 1.4 million user play records.
- The Last.fm Dataset: 500,000 song tags.
- The Millionsongsubset_full: 20,000 audio features.
In order to reduce the impact of missing audio features on the model, we train on the more than 20,000 samples that contain audio features.
IV. IMPROVING LR WITH XGBOOST
A. LR model
As mentioned earlier, LR has been widely used as a classic classifier in the field of recommendation systems. In this article, we first preprocess the data to get a sample format that can be trained by our method. Then, we extract the user and song features from the dataset and perform feature engineering, such as transformation and crossing, on these features. Finally, we use the LR model to predict users' music preferences and compare the results with our method.
The LR model is a generalized linear model. A continuous random variable X follows the logistic distribution if its distribution function F(x) and density function f(x) are as follows:

F(x) = P(X \le x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}    (1)

f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma \left(1 + e^{-(x-\mu)/\gamma}\right)^2}    (2)

where \mu is the location parameter and \gamma > 0 is the shape parameter. f(x) and F(x) are shown in Figure 1 and Figure 2, respectively.
Fig. 1. The density function of LR
Fig. 2. The distribution function of LR
We define a vector of n user or song features x = (x_1, x_2, \ldots, x_n). Let the conditional probability p = P(y = 1 \mid x) be the probability that the event (y = 1) occurs given x. Then the LR model can be expressed as

p = \frac{1}{1 + e^{-g(x)}}    (3)

where g(x) = w_0 + w_1 x_1 + \cdots + w_n x_n and w is the weight vector of x. Now, suppose there are m samples with labels y_1, y_2, \ldots, y_m. The probability that the user likes the song (y_i = 1) under the given conditions is defined as

p_i = P(y_i = 1 \mid x_i)    (4)

Similarly, the probability that the user does not like the song (y_i = 0) is 1 - p_i. So the label probability of sample i can be defined as

P(y_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}    (5)

Because the samples are independent of each other, their joint distribution is the product of the marginal distributions, which is defined as

L(w) = \prod_{i=1}^{m} p_i^{y_i} (1 - p_i)^{1 - y_i}    (6)

Then we apply the maximum likelihood estimation method to estimate the model parameters. Taking the logarithm of (6), we get

\ln L(w) = \sum_{i=1}^{m} \left[ y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \right]    (7)

The negative of this formula is our objective function. The derivation also demonstrates that the LR model is a linear expression of the data. In the next section, we use our method to improve the non-linear expressive ability of the LR model.
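To make the derivation concrete, the following is a minimal numerical sketch of equations (3) and (7); the feature values, weights, and labels are hypothetical toy data, not taken from the MSD experiments.

```python
import numpy as np

def sigmoid(g):
    # Equation (3): p = 1 / (1 + e^{-g(x)})
    return 1.0 / (1.0 + np.exp(-g))

def negative_log_likelihood(w, X, y):
    # g(x) = w0 + w1*x1 + ... + wn*xn, with a bias term w[0]
    g = w[0] + X @ w[1:]
    p = sigmoid(g)
    # Negative of equation (7): -sum(y*ln p + (1-y)*ln(1-p))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data: 4 samples, 2 features, binary "likes the song" labels.
X = np.array([[0.2, 1.0], [0.9, 0.1], [0.5, 0.5], [0.1, 0.8]])
y = np.array([1, 0, 1, 1])
w = np.zeros(3)  # bias + 2 weights
print(negative_log_likelihood(w, X, y))  # 4*ln(2) ~= 2.77 at w = 0
```

Minimizing this quantity over w (e.g. by gradient descent) yields the maximum likelihood estimate described above.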
B. LX model
In order to exploit non-linear features in the data, people often use more complex models such as decision trees. However, a single decision tree cannot reach a very high level of accuracy, so tree-based boosting algorithms were developed. Although these algorithms can learn the non-linear characteristics of the data, they are also subject to the over-fitting problem of tree models. To take advantage of both the tree model and the LR model, a fusion model was proposed [18]. This fusion model utilizes tree-based boosting algorithms to extract non-linear features and then feeds these features to the LR model for training. The fusion structure is shown in Figure 3.
Fig. 3. Fusion model structure
There are two trees in Figure 3, and s is an input sample. After traversing the two trees, sample s falls in one leaf node of each tree. Each leaf node corresponds to a one-dimensional LR feature, whose value in the new vector is 0 or 1. For example, suppose input s falls in the second leaf of the left tree, coded as [0,1,0], and in the first leaf of the right tree, coded as [1,0]; the overall encoding is then [0,1,0,1,0].
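As an illustration, the following sketch reproduces this leaf-index encoding with the xgboost and scikit-learn packages; the toy data is hypothetical. The `apply` method returns the index of the leaf each sample reaches in every tree, which the one-hot encoder then turns into [0,1,0,1,0]-style vectors.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for continuous user/song features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # a non-linear target

# A small boosted forest; each sample lands in one leaf per tree.
model = XGBClassifier(n_estimators=2, max_depth=2)
model.fit(X, y)

leaves = model.apply(X)  # shape (n_samples, n_trees): leaf index per tree
encoder = OneHotEncoder()
leaf_features = encoder.fit_transform(leaves).toarray()  # binary leaf codes
print(leaves[0], "->", leaf_features[0])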
In this structure, the authors use the gradient boosting decision tree (GBDT) to extract non-linear features. However, among tree-based boosting algorithms, Tianqi Chen's xgboost model is considered more effective. The main improvements of xgboost over GBDT are as follows:
Xgboost adds an explicit regularization term to the objective function to avoid over-fitting. In the following formula, L is the objective function of xgboost, which is defined as

L = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k)    (8)

where l(y_i, \hat{y}_i) is the loss of the model and \Omega is the regularization term, defined as

\Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2    (9)
where T is the number of leaves in a tree and w is the weight of the leaf nodes. When we minimize the objective function, \Omega reduces the complexity of the model by penalizing the number of leaves T and shrinking the weights of the leaf nodes w.
In GBDT, the first derivative of the objective function is used to calculate the pseudo-residual for generating the next tree. Xgboost uses not only the first derivative but also the second derivative of the objective function via a Taylor expansion, which allows the model to optimize the objective more quickly.
The criterion for finding the best split point in a CART regression tree is minimizing the mean square error. The criterion for xgboost is maximizing the information gain: xgboost splits a leaf node only if the gain is greater than a certain threshold, which further reduces over-fitting (both quantities are written out after this list).
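For reference, the two quantities named above can be made explicit. This is the standard formulation from Chen and Guestrin [5], reproduced here in the notation of equations (8) and (9) for the reader's convenience rather than derived in this paper:

```latex
% Second-order Taylor expansion of the objective at boosting round t,
% where g_i and h_i are the first and second derivatives of the loss
% with respect to the previous prediction \hat{y}_i^{(t-1)}:
L^{(t)} \simeq \sum_{i=1}^{m} \left[ g_i f_t(x_i)
        + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),
\qquad
g_i = \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)}), \quad
h_i = \partial^2_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})

% Information gain of splitting a leaf into left (L) and right (R)
% children, with G and H summing g_i and h_i over the samples in each
% child; the node is split only when this gain exceeds the threshold:
\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda}
              + \frac{G_R^2}{H_R + \lambda}
              - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
```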
We propose the LX model by integrating LR and xgboost. The framework of the LX model is as follows. First of all, based on the advantages of xgboost mentioned above, we choose xgboost instead of GBDT to extract the non-linear features of the data. Salgado et al. proposed a multistage model combining logistic regression and Takagi-Sugeno fuzzy models, which separates binary from numerical features during modeling to outperform a single model trained on both types of features together [19]. We use a similar approach when training our model. The decision trees in xgboost are regression trees, which are more suitable for processing continuous data. So we feed the continuous features of the data into xgboost and then splice the output of xgboost with the discrete features of the original data. However, the output layer of xgboost is a sparse matrix, which is not conducive to training the LR model. Therefore, after splicing the data features, we discard the dimensions with coverage below 90%. At the same time, in order to retain the information xgboost learned from the data, we add xgboost's predicted probability score to the feature set. Finally, we use this feature set as the input of the LR model. The process is shown in Figure 4.
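To make the framework concrete, the following is a minimal sketch of the LX pipeline under stated assumptions: the data consists of synthetic stand-ins for the continuous and discrete MSD features, "coverage" is read as the fraction of samples with a non-zero value in a dimension, and the threshold below is an illustrative value (the paper's 90% rule depends on its own definition of coverage). This mirrors Figure 4 but is not the authors' original code.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the MSD features (hypothetical shapes).
rng = np.random.default_rng(42)
X_cont = rng.normal(size=(2000, 8))          # continuous audio/profile features
X_disc = rng.integers(0, 2, size=(2000, 5))  # discrete 0/1 features
y = (X_cont[:, 0] * X_cont[:, 1] + X_disc[:, 0] > 0.5).astype(int)

Xc_tr, Xc_te, Xd_tr, Xd_te, y_tr, y_te = train_test_split(
    X_cont, X_disc, y, test_size=0.3, random_state=0)

# Step 1: train xgboost on the continuous features only.
xgb = XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
xgb.fit(Xc_tr, y_tr)

# Step 2: one-hot encode the leaf each sample reaches in every tree.
enc = OneHotEncoder(handle_unknown="ignore")
leaf_tr = enc.fit_transform(xgb.apply(Xc_tr)).toarray()
leaf_te = enc.transform(xgb.apply(Xc_te)).toarray()

# Step 3: splice leaf features with the discrete features and xgboost's
# predicted probability, then drop low-coverage (rarely non-zero) columns.
def splice(leaf, disc, prob):
    return np.hstack([leaf, disc, prob.reshape(-1, 1)])

feat_tr = splice(leaf_tr, Xd_tr, xgb.predict_proba(Xc_tr)[:, 1])
feat_te = splice(leaf_te, Xd_te, xgb.predict_proba(Xc_te)[:, 1])
keep = (feat_tr != 0).mean(axis=0) >= 0.10  # illustrative coverage threshold
feat_tr, feat_te = feat_tr[:, keep], feat_te[:, keep]

# Step 4: train the LR stage on the spliced feature set.
lr = LogisticRegression(max_iter=1000)
lr.fit(feat_tr, y_tr)
print("test accuracy:", lr.score(feat_te, y_te))
```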
V. EXPERIMENTS
In this section, we conduct experiments on the MSD dataset to evaluate our method. We compare our LX model with the LR and xgboost models. In Sections A and B, in order to show the best performance of these models, we do some feature engineering and study the hyperparameters. In the last section, we compare the experimental results of the three models and draw our conclusions.
A. Feature engineering
In order to extract non-linear features from the data, we construct the user's preference profile, which contains some important features. First, we build a profile that represents the user's preference for the artist and genre of songs. For example, we define user u_i's preference for rock as

\mathrm{Preference}(u_i, \mathrm{rock}) = \sum_{s_j \in S,\, g_j = \mathrm{rock}} T(u_i, s_j) \cdot C(g_j, \mathrm{rock})    (10)

where s_j is the j-th song of the song set S, g_j is the genre of song s_j, T(u_i, s_j) is the number of times u_i has listened to s_j, and C(g_j, \mathrm{rock}) is the confidence that g_j = \mathrm{rock}. Then, the user profile feature is crossed with the artist or genre of the current song. In this way, we obtain the user's artist and genre preference features for the current song: prefer_artist and prefer_genre.
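A small sketch of how equation (10) might be computed from a play log; the table layout and column names (user, song, plays, genre, conf) are hypothetical stand-ins for the Taste Profile and Last.fm data.

```python
import pandas as pd

# Hypothetical play log: listening counts plus each song's genre tag and
# the confidence C(g_j, genre) that the tag is correct.
plays = pd.DataFrame({
    "user":  ["u1", "u1", "u1", "u2"],
    "song":  ["s1", "s2", "s3", "s1"],
    "plays": [10, 3, 7, 2],          # T(u_i, s_j)
    "genre": ["rock", "rock", "jazz", "rock"],
    "conf":  [0.9, 0.6, 0.8, 0.9],   # C(g_j, genre)
})

def preference(user, genre):
    # Equation (10): sum of T(u_i, s_j) * C(g_j, genre) over songs of `genre`
    rows = plays[(plays["user"] == user) & (plays["genre"] == genre)]
    return (rows["plays"] * rows["conf"]).sum()

print(preference("u1", "rock"))  # 10*0.9 + 3*0.6 = 10.8
```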
We analyze all the features using the feature importance tool in xgboost. The result shows that the features we construct play an important role in the model, as shown in Figure 5, where the y-axis is the feature name and the x-axis is the feature importance score. In the future, we will try to mine more personalized user features, such as age, gender, and preference for rhythm.
Fig. 4. The framework of the LX model
Fig. 5. Xgboost feature importance
B. Model hyperparameters
The setting of the hyperparameters may have a relatively large impact on the performance of a model, so we apply the grid-search method to find the optimal hyperparameters. In the logistic regression model, the hyperparameter C is the reciprocal of the regularization coefficient and is used to control the complexity of the model. We first observe the impact of C on the LR model, as shown in Figure 6.
Fig. 6. Hyperparameter C vs. the AUC of the LR model
It can be seen that as the hyperparameter C changes, the AUC of the LR model does not change significantly. Therefore, we infer that the impact of C on the LR model is small. We also observe the effect of other hyperparameters on LR, and the results are similar. Therefore, the performance of the LR model cannot be greatly improved by tuning hyperparameters.
In addition, we examine the impact of the maximum depth of the tree on the xgboost model, as shown in Figure 7.
Fig. 7. The maximum depth of the tree vs. the AUC of xgboost
From the figure, we can clearly see that when the maximum depth of the tree is less than 5, the AUC of the model increases as the maximum depth increases. However, when the tree depth is greater than 5, the AUC starts to drop very quickly. This shows that the grid-search method is helpful for the xgboost model. We employ this method to traverse the various hyperparameters of the model in the open-source xgboost package and find the optimal set. Some important hyperparameter values are shown in Table I.
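A sketch of this grid search, using scikit-learn's GridSearchCV with the xgboost scikit-learn wrapper; the grid values are illustrative choices that bracket the optima reported in Table I, not the exact grid used in the paper.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in data for the MSD training samples.
rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

# Illustrative grid bracketing the Table I optima.
param_grid = {
    "max_depth": [3, 4, 5, 6, 7],
    "learning_rate": [0.05, 0.1, 0.2],
    "n_estimators": [100, 140, 200],
    "gamma": [0.1, 0.3, 0.5],
}
search = GridSearchCV(
    XGBClassifier(objective="binary:logistic"),
    param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```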
TABLE I. OPTIMAL HYPERPARAMETERS OF THE XGBOOST MODEL

Parameter        Value
objective        binary:logistic
n_estimators     140
learning_rate    0.1
max_depth        5
gamma            0.3
After that, we compare the error rate of xgboost on the training and test sets, as shown in Figure 8.
Fig. 8. The error of xgboost on the training and test sets
As the number of iterations (the number of trees) increases, the error rate of xgboost on both the training and test sets steadily declines. When the number of iterations reaches around 140, the model converges and the test error no longer drops. Xgboost achieves a good error rate by applying the early-stopping method. In addition, the gap between the training and test errors is small at the beginning of training. However, as the iterations increase, the gap becomes very large. This shows that xgboost over-fits and its generalization ability is weak.
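The early-stopping setup can be sketched with xgboost's native training API as follows; the data here is a hypothetical stand-in, and the 20-round patience is an illustrative choice. Training halts once the test-set error stops improving, which is how the convergence around 140 iterations in Figure 8 would be detected automatically.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] * X[:, 2] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dtest = xgb.DMatrix(X_te, label=y_te)
params = {"objective": "binary:logistic", "max_depth": 5,
          "eta": 0.1, "gamma": 0.3, "eval_metric": "error"}

# Track train/test error each round (as in Figure 8) and stop when the
# test error has not improved for 20 rounds.
booster = xgb.train(params, dtrain, num_boost_round=500,
                    evals=[(dtrain, "train"), (dtest, "test")],
                    early_stopping_rounds=20, verbose_eval=50)
print("best iteration:", booster.best_iteration)
```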
C. Model evaluation
Our experimental metrics are the error rate and AUC. The results are shown in Table II.
TABLE II. EXPERIMENTAL RESULTS OF THE THREE MODELS

Model      Error     AUC
LR         0.3062    0.7268
xgboost    0.2723    0.7663
LX         0.2376    0.8087
From the table, we can see that the AUC score of LR is lower than the scores of the other two models, which means the LR model achieves a poor prediction result. The reason behind this phenomenon is that it is difficult for LR to extract non-linear features without large-scale manual feature engineering. The error and AUC of xgboost are clearly better than those of the LR model, which shows that xgboost performs well under the current simple feature engineering and can handle non-linear data very well. However, as shown in Figure 8, xgboost over-fits, and its generalization ability on the test set is weak. Among the three models, the performance of the LX model is the best, and its AUC even exceeds 0.80. This demonstrates that the LX model not only extracts the non-linear features of the data but also overcomes the over-fitting problem of xgboost. Therefore, our LX model combines the advantages of the xgboost and LR models and is a strong classifier.
VI. CONCLUSION AND FUTURE WORK
This paper studies the prediction of users' music preferences in the music recommendation field. We adopt a fusion model of xgboost and LR as our classifier and optimize the fusion part. In terms of features, we perform feature engineering on the user's profile, and the constructed features are proven effective by xgboost's feature importance analysis. Comparing the overall performance of the different models on the test set, we find that our LX model achieves the best results, which proves that our method has strong practicability in the field of music recommendation.
This article only studies the user's behavior information and does not incorporate the user's social network information. Social network-based methods have proven effective at improving prediction accuracy in recommendation systems. So, in the future, we will try to mine users' social relationships and combine them with our LX model to recommend music to users.
ACKNOWLEDGMENT
This work was supported in part by the National Natural Science Foundation of China under Grants 6167060382 and 61602070, and in part by the New Academic Seedling Cultivation and Exploration Innovation Project under Grant [2017]5789-21.
REFERENCES
[1] Sánchez-Moreno D., González A. B. G., Vicente M. D. M., Batista, V.
F. L., and García M. N. M., “A collaborative filtering method for music
recommendation using playing coefficients for artists and users,” Expert
Systems with Applications, 2016, vol. 66, pp. 234-244.
[2] Li Q., Myaeng S. H., and Kim B. M., “A probabilistic music recommender considering user opinions and audio features,” Information Processing & Management, 2007, vol. 43, no. 2, pp. 473-487.
[3] Van den Oord A., Dieleman S., and Schrauwen B., “Deep content-based
music recommendation,” Advances in neural information processing
systems. 2013, pp. 2643-2651.
[4] Bartz K., Murthi V., and Sebastian S., “Logistic regression and
collaborative filtering for sponsored search term recommendation,”
Second workshop on sponsored search auctions. 2006, pp. 5.
[5] Chen T., Guestrin C., “Xgboost: A scalable tree boosting system,” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 785-794.
[6] Xu L., Liu J., and Gu Y., “A Recommendation System Based on
Extreme Gradient Boosting Classifier,” 2018 10th International
Conference on Modelling, Identification and Control (ICMIC). IEEE,
2018, pp. 1-5.
[7] Bender R., Grouven U., “Ordinal logistic regression in medical
research,” Journal of the Royal College of physicians of London, 1997,
vol. 31, no. 5, pp. 546-551.
[8] Hua Z., Wang Y., Xu X., Zhang B., and Liang L., “Predicting corporate
financial distress based on integration of support vector machine and
logistic regression,” Expert Systems with Applications, 2007, vol. 33,
no. 2, pp. 434-440.
[9] Maranzato R., Pereira A., do Lago A. P., and Neubert M., “Fraud
detection in reputation systems in e-markets using logistic regression,”
Proceedings of the 2010 ACM symposium on applied computing. ACM,
2010, pp. 1454-1455.
[10] Quevedo J. R., Montañés E., Ranilla J., and Díaz I., “Ranked tag
recommendation systems based on logistic regression,” International
Conference on Hybrid Artificial Intelligence Systems. Springer, Berlin,
Heidelberg, 2010, pp. 237-244.
[11] Wang Y., Feng D., Li D., et al. “A mobile recommendation system
based on logistic regression and Gradient Boosting Decision Trees,”
IJCNN. 2016, pp. 1896-1902.
[12] Medina F., Aguila S., Baratto M. C., et al. “Prediction model based on
decision tree analysis for laccase mediators,” Enzyme and microbial
technology, 2013, vol. 52, no. 1, pp. 68-76.
[13] Lee S., Lee S., and Park Y. “A prediction model for success of services
in e-commerce using decision tree: E-customer’s attitude towards online
service,” Expert Systems with Applications, 2007, vol. 33, no. 3, pp.
572-581.
[14] Friedman J. H. “Greedy function approximation: a gradient boosting
machine,” Annals of statistics, 2001, pp. 1189-1232.
[15] Xie J., Coggeshall S., “Prediction of transfers to tertiary care and
hospital mortality: A gradient boosting decision tree approach,”
Statistical Analysis and Data Mining: The ASA Data Science Journal,
2010, vol. 3, no. 4, pp. 253-258.
[16] Zhang X., Wang X., Chen W., et al. “A Taxi Gap Prediction Method via
Double Ensemble Gradient Boosting Decision Tree,” Big Data Security
on Cloud (BigDataSecurity), IEEE International Conference on High
Performance and Smart Computing (HPSC), and IEEE International
Conference on Intelligent Data and Security (IDS), 2017 IEEE 3rd
International Conference on. IEEE, 2017, pp. 255-260.
[17] Bertin-Mahieux T., Ellis D. P. W., Whitman B., and Lamere P., “The million song dataset,” Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR), 2011.
[18] He X., Pan J., Jin O., et al. “Practical lessons from predicting clicks on
ads at facebook,” Proceedings of the Eighth International Workshop on
Data Mining for Online Advertising. ACM, 2014, pp. 1-9.
[19] Salgado C. M., Fernandes M. P., Horta A, et al. “Multistage modeling
for the classification of numerical and categorical datasets,” IEEE
International Conference on Fuzzy Systems. IEEE, 2017, pp. 1-6.