Using the d-dimensional vectors obtained from the previous Matrix factorization result, I solved a logistic regression problem.
When implementing logistic regression in Python, I referred heavily to the blog below.
The final result is a graph showing how the F1 score improves with the proportion of training data. Because the dataset is small, the average of 10 runs was used for each split so that overfitting to a particular split would not dominate the result.
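For reference, the two stages can be summarized as below (a minimal sketch; $A$ is the $34\times34$ adjacency matrix, $P, Q \in \mathbb{R}^{d\times 34}$ are the embedding factors, and $w$, $b$ are the classifier parameters that appear later in the code):

$$\min_{P,Q}\ \lVert A - P^{\top}Q\rVert_F^{2} \quad\text{(network embedding)},\qquad \hat{y}_i = \sigma(x_i^{\top}w + b) \quad\text{(logistic regression on the node vectors } x_i\text{)}$$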
1. Matrix Factorization for Network Embedding (from previous report)
In [1]:
import pandas as pd
import numpy as np
import random
import networkx as nx
from matplotlib import pyplot as plt

np.random.seed(15)

# Load data
adjlist = nx.read_adjlist("karate_club.adjlist", nodetype=int)
karate_label = np.loadtxt("karate_label.txt")
Graph = nx.read_adjlist("karate_club.adjlist", nodetype=int)
node_number = nx.to_pandas_adjacency(Graph).columns  # node order used by the adjacency matrix
adj = nx.to_numpy_array(adjlist)
label = karate_label[:, -1]
print(adj.shape)
print(label.shape)

# Match the labels to the adjacency-matrix node order
fix_label = []
for i in node_number:
    tem = karate_label[i][-1]
    fix_label.append(tem)

# Define P, Q for matrix factorization (embedding dimension d)
d = 4
P = np.random.random((d, 34))
Q = np.random.random((d, 34))
zuzv = np.dot(P.T, Q)  # reconstructed adjacency P^T Q
zuzv.shape

# Loss function: squared reconstruction error
def loss(a, b):
    return np.sum((a - b) ** 2)

loss(zuzv, adj)

epoch = 1000
lr = 0.001

# Updating params by gradient descent
loss_list = [0 for _ in range(epoch)]
for i in range(epoch):
    P -= lr * np.dot(zuzv - adj, Q.T).T
    Q -= lr * np.dot(zuzv - adj, P.T).T
    loss_list[i] = loss(zuzv, adj)
    zuzv = np.dot(P.T, Q)

# Plotting the loss
plt.plot(loss_list)
(34, 34)
(34,)
Out[1]:
[Figure: matrix factorization training loss over 1000 epochs]
T-SNE (from previous report)
- Node numbers are placed close to each other when the nodes share many relationships.
- The layout differs quite a lot when the perplexity changes (see the perplexity-comparison sketch after the plotting cell below).
- Unlike what the figure might suggest, the label (expressed by the color) does not carry any meaning for the layout itself.
In [2]:
ans = np.dot(adj, P.T)  # node representations: each row aggregates the embeddings of a node's neighbours
In [3]:
node_number
Out[3]:
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 17, 19, 21, 31,
30, 9, 27, 28, 32, 16, 33, 14, 15, 18, 20, 22, 23, 25, 29, 24,
26],
dtype='int64')
In [4]:
# Re-order the labels to match the node order of the adjacency matrix
label_fix = []
for i in node_number:
    tem = label[i]
    label_fix.append(tem)
In [5]:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

model = TSNE(learning_rate=100, perplexity=5)
transformed = model.fit_transform(ans)
xs = transformed[:, 0]
ys = transformed[:, 1]

# annotate each point with its node number
for i in range(len(xs)):
    plt.text(xs[i], ys[i], node_number[i])
# colour the points by their club label
plt.scatter(xs, ys, c=label_fix)
plt.show()
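As noted in the bullet list above, the t-SNE layout changes noticeably with the perplexity. A minimal sketch for comparing a few perplexity values side by side (the values 2, 5 and 15 are arbitrary choices for illustration, not from the original experiment):

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perp in zip(axes, [2, 5, 15]):
    emb = TSNE(learning_rate=100, perplexity=perp).fit_transform(ans)
    ax.scatter(emb[:, 0], emb[:, 1], c=label_fix)
    ax.set_title(f"perplexity = {perp}")
plt.show()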
In [6]:
# define data_x, label y
x_data = ans
t_data = np.array(label_fix).reshape(34,1)
print(x_data.shape)
print(t_data.shape)
(34, 4)
(34, 1)
Functions
In [7]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss_func(x, t):
    # binary cross-entropy; delta avoids log(0)
    delta = 1e-7
    z = np.dot(x, w) + b
    y = sigmoid(z)
    return -np.sum(t * np.log(y + delta) + (1 - t) * np.log((1 - y) + delta))
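loss_func above is the standard binary cross-entropy, with the small constant delta added inside the logarithms for numerical stability:

$$z = xw + b,\qquad y = \sigma(z) = \frac{1}{1+e^{-z}},\qquad E(w,b) = -\sum_i\Big[t_i\log(y_i+\delta) + (1-t_i)\log(1-y_i+\delta)\Big]$$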
In [8]:
def numerical_derivative(f, x):
    # central-difference numerical gradient of f with respect to every entry of x
    delta_x = 1e-4
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        tmp_val = x[idx]
        x[idx] = float(tmp_val) + delta_x
        fx1 = f(x)
        x[idx] = tmp_val - delta_x
        fx2 = f(x)
        grad[idx] = (fx1 - fx2) / (2 * delta_x)
        x[idx] = tmp_val
        it.iternext()
    return grad

def predict(x):
    z = np.dot(x, w) + b
    y = sigmoid(z)
    if y > 0.5:  # above 0.5 -> 1 (True), otherwise -> 0 (False)
        result = 1
    else:
        result = 0
    return result
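A quick way to sanity-check numerical_derivative is to run it on a function whose gradient is known. This small check is not part of the original notebook; for f(v) = sum(v**2) the result should be close to 2v:

x_check = np.array([1.0, -2.0, 3.0])
print(numerical_derivative(lambda v: np.sum(v ** 2), x_check))  # expect roughly [ 2. -4.  6.]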
In [9]:
# Param setting
learning_rate = 1e-2
# f ignores its argument and reads the global x_data/t_data (and w, b) inside loss_func,
# so numerical_derivative(f, w) perturbs w in place and measures the change in the loss
f = lambda x: loss_func(x_data, t_data)
Running 10 times with different portions of labeled data
In [14]:
np.random.seed(2)

# Param setting
learning_rate = 1e-2

# Collect results over repeated runs
# define data x, label t
x_data = ans
t_data = np.array(label_fix).reshape(34, 1)
total = np.concatenate([x_data, t_data], axis=1)

# Portion setting: fraction of nodes used as labeled training data
portion_set = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85]
portion_f1 = []

for pro in portion_set:
    right_10 = []
    accuracy_10 = []
    precision_10 = []
    recall_10 = []
    f1_10 = []
    for _ in range(20):  # repeat the split/train/evaluate cycle
        # Shuffle data
        np.random.shuffle(total)
        x_data = total[:, :-1]
        t_data = total[:, -1]
        t_data = np.array(t_data).reshape(34, 1)
        # Data split
        portion = pro
        tem = int(34 * portion)
        x_test = x_data[tem:]
        t_test = t_data[tem:]
        x_data = x_data[:tem]
        t_data = t_data[:tem]
        # Initializing
        w = np.random.rand(4, 1)
        b = np.random.rand(1)
        loss = []
        for step in range(10000):
            w -= learning_rate * numerical_derivative(f, w)
            b -= learning_rate * numerical_derivative(f, b)
            if step % 10 == 0:
                l = loss_func(x_data, t_data)
                loss.append(l)
        # Make predictions on the held-out nodes
        pred = []
        for i in range(len(x_test)):
            tem = predict(x_test[i])
            pred.append(tem)
        # Evaluation
        y = t_test
        y = y.flatten()
        p = np.array(pred)
        accuracy = np.mean(np.equal(y, p))
        right = np.sum(y * p == 1)
        precision = right / np.sum(p)
        recall = right / np.sum(y)
        f1 = 2 * precision * recall / (precision + recall)
        accuracy_10.append(accuracy)
        right_10.append(right)
        precision_10.append(precision)
        recall_10.append(recall)
        f1_10.append(f1)
    portion_f1.append(np.mean(np.array(f1_10)))

print('accuracy', np.mean(np.array(accuracy)))
print('precision', np.mean(np.array(precision)))
print('recall', np.mean(np.array(recall)))
print('f1', np.mean(np.array(f1)))
<ipython-input-14-0687019041e6>:74: RuntimeWarning: invalid value encountered in long_scalars
precision = right / np.sum(p)
accuracy 1.0
precision 1.0
recall 1.0
f1 1.0
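The RuntimeWarning above is raised on splits where the classifier predicts no positive nodes, so np.sum(p) is 0 and the precision becomes 0/0 (NaN). A hedged way to avoid this is to guard the divisions (or to use sklearn.metrics.f1_score with zero_division=0 in recent scikit-learn versions); safe_f1 below is a sketch, not part of the original notebook:

def safe_f1(y_true, y_pred):
    # true-positive count; return 0 instead of NaN when a denominator is 0
    tp = np.sum((y_true == 1) & (y_pred == 1))
    precision = tp / np.sum(y_pred) if np.sum(y_pred) > 0 else 0.0
    recall = tp / np.sum(y_true) if np.sum(y_true) > 0 else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0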
In [15]:
#sample loss
plt.plot(loss)
Out[15]:
[Figure: logistic regression training loss for the last run]
Results
- The training loss decreases and settles at a stable value.
- The F1 score has high variance (see the error-bar sketch below).
- The variance seems to come from the random sampling of the train/test split, since the sample size is very small.
In [16]:
#Plotting f1 score with different portion
plt.plot(portion_set, portion_f1)
Out[16]:
[Figure: mean F1 score versus the portion of labeled training data]