[Python.Pandas] A Complete Guide to String Functions (str.upper, .replace, .isdigit, .contains, .match, .split, .rename, .get_dummies) + one hot encoding

2020. 12. 30. 05:00ㆍPython and Machine Learning/Pandas Data Analysis

0. Previous posts

2020/12/28 - [Python and Machine Learning/Pandas Data Analysis] - [Python.Pandas] Using Pivot Table and CrossTab
2020/12/29 - [Python and Machine Learning/Pandas Data Analysis] - [Python.Pandas] Joining Data with Merge and Concat

1. str.upper() : convert to uppercase

In [1]:import pandas as pd
In [2]:raw_data = {'subject_id':['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'], 
                   'first_name':['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung', 'Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 
                   'last_name' : ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches', 'Bonder', 'Black', 'Balwner', 'Brice', 'Btisan'], 
                   'desc': ['1-001', '2-003', '3-002', '2-004', '3-003', '3-001', '1-002', '2-001', '1-003', '2-002']} 
       df = pd.DataFrame(raw_data, columns=['subject_id', 'first_name', 'last_name', 'desc']) 
       df
Out[2]:

In [3]:df['first_name'].str[:2]
Out[3]:0 Al 
       1 Am 
       2 Al
       3 Al
       4 Ay
       5 Bi
       6 Br
       7 Br
       8 Br
       9 Be
       Name: first_name, dtype: object
       
In [4]:df['first_name'].str.upper()
Out[4]:0 ALEX
       1 AMY 
       2 ALLEN
       3 ALICE 
       4 AYOUNG
       5 BILLY
       6 BRIAN
       7 BRAN
       8 BRYCE
       9 BETTY
       Name: first_name, dtype: object

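If you have written Java or C++, you will already know the methods that the String class provides (they have become indispensable by now).
Pandas provides the same kind of functionality through the .str accessor: .str.upper() converts every string in the Series to uppercase, and as In[3] shows, .str[:2] slices each string (here, its first two characters).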
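
As a small sketch of the accessor pattern described above, the following applies a few element-wise string operations to a short Series. The sample names are hypothetical and not part of the original data set.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
names = pd.Series(["Alex", "Amy", "Billy"])

upper = names.str.upper()    # uppercase every element
lower = names.str.lower()    # lowercase every element
first_two = names.str[:2]    # slice applied to each element
lengths = names.str.len()    # length of each string

print(upper.tolist())      # ['ALEX', 'AMY', 'BILLY']
print(first_two.tolist())  # ['Al', 'Am', 'Bi']
print(lengths.tolist())    # [4, 3, 5]
```

Each call returns a new Series; the original `names` is left unchanged.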

2. str.replace('A','B') : replace A with B

In [5]:df['first_name'].str.replace("Al", "Ar")
Out[5]:0 Arex 
       1 Amy
       2 Arlen 
       3 Arice
       4 Ayoung
       5 Billy
       6 Brian
       7 Bran
       8 Bryce
       9 Betty
       Name: first_name, dtype: object

str.replace('A', 'B') looks for 'A' and, where it is found, replaces it with 'B'.
So rows 0, 2, and 3 (Alex, Allen, Alice) have Al changed to Ar, while the rest, including Amy at index 1, are unchanged.
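
The example above replaces a literal substring. str.replace can also take a regular expression, and since the default for the `regex` argument has differed between pandas versions, passing it explicitly is the safer habit. A minimal sketch with hypothetical sample names:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
names = pd.Series(["Alex", "Amy", "Allen", "Dale"])

# Literal replacement: only the exact substring "Al" is replaced.
literal = names.str.replace("Al", "Ar", regex=False)

# Regex replacement: "^A." matches a leading "A" plus the next character.
by_pattern = names.str.replace(r"^A.", "Ar", regex=True)

print(literal.tolist())     # ['Arex', 'Amy', 'Arlen', 'Dale']
print(by_pattern.tolist())  # ['Arex', 'Ary', 'Arlen', 'Dale']
```

Note that "Dale" is untouched in both cases, because the match is case-sensitive and "al" is not "Al".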

3. str.isdigit() : checks whether a string consists only of digits

In [6]:df['subject_id'].str.isdigit()
Out[6]:0 True
       1 True
       2 True
       3 True
       4 True
       5 True
       6 True
       7 True
       8 True
       9 True
       Name: subject_id, dtype: bool

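This function checks whether every character in the string is a digit, so it is True for non-negative integer strings but False for values such as '-1' or '1.5'.
Every subject_id consists of digits only, so it returns True for all rows.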
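
To make the limits of isdigit concrete, here is a short sketch with hypothetical values. It also shows one possible alternative, `pd.to_numeric` with `errors="coerce"`, for the broader question "can this string be parsed as a number"; that alternative is an addition here, not something the original post uses.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
values = pd.Series(["10", "-3", "1.5", "7a"])

# True only when every character is a digit.
is_digit = values.str.isdigit()
print(is_digit.tolist())  # [True, False, False, False]

# Alternative check: unparseable values become NaN, so notna() marks
# the strings that could be read as numbers (including "-3" and "1.5").
parsed = pd.to_numeric(values, errors="coerce")
is_numeric = parsed.notna()
print(is_numeric.tolist())  # [True, True, True, False]
```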

4. str.contains('A') : returns whether the string 'A' is contained in each original string.

In [7]:df['last_name'].str.contains('an')
Out[7]:0 False
       1 True
       2 False
       3 False
       4 False
       5 False
       6 False
       7 False
       8 False
       9 True
       Name: last_name, dtype: bool

In [8]:df[df['last_name'].str.contains('an')]
Out[8]:

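This checks whether each last name contains 'an'. The match is case-sensitive, which is why Anderson (capital 'An') comes out False.
Because it returns Boolean values, Boolean indexing extracts only the rows for students whose last name contains 'an', as in Out[8].
Boolean indexing is covered in the following post.
2020/12/13 - [Python and Machine Learning/NumPy Data Analysis] - [Python.NumPy] Boolean Index and Fancy Index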
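
Two optional arguments of str.contains are often useful in practice: `case=False` for a case-insensitive search and `na=False` so that missing values count as "no match" and the result can be used directly as a Boolean mask. A minimal sketch with hypothetical data that includes a missing value:

```python
import pandas as pd

# Hypothetical sample data for illustration only; None stands for a missing name.
last_names = pd.Series(["Anderson", "Ackerman", None, "Btisan"])

# case=False matches "An" as well as "an"; na=False treats missing values as False.
mask = last_names.str.contains("an", case=False, na=False)

print(mask.tolist())              # [True, True, False, True]
print(last_names[mask].tolist())  # ['Anderson', 'Ackerman', 'Btisan']
```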

5. str.match('A') : treats 'A' as a regular expression and finds the rows that match it

In [9]:df[df['last_name'].str.match(r'A\w+an')]
Out[9]:

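Once you have written a regular expression for the condition you want, you can pass it to .str.match() as a parameter.
.match tests each value against the regex, anchored at the start of the string, and returns True/False; used as an index, it extracts only the matching rows. Here the pattern matches only Ackerman.
Boolean indexing is covered in the following post.
2020/12/13 - [Python and Machine Learning/NumPy Data Analysis] - [Python.NumPy] Boolean Index and Fancy Index
Regular expressions are covered in the following post.
2020/11/13 - [Python and Machine Learning/Web Data Extraction] - [Python.Web] Parsing Web Data with Regular Expressions - urllib, regular expression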
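
The difference between str.contains, str.match, and str.fullmatch is where the pattern is allowed to match. The sketch below runs the same pattern through all three on hypothetical strings chosen to separate the cases:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
s = pd.Series(["Ackerman", "Ackermans", "BAckerman"])
pattern = r"A\w+an"

contains = s.str.contains(pattern)    # pattern may appear anywhere
match = s.str.match(pattern)          # pattern must start at the first character
fullmatch = s.str.fullmatch(pattern)  # pattern must cover the entire string

print(contains.tolist())   # [True, True, True]
print(match.tolist())      # [True, True, False]
print(fullmatch.tolist())  # [True, False, False]
```

Writing the pattern as a raw string (`r"..."`) keeps Python from interpreting `\w` as a string escape.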

6. str.split('-') : splits each string into separate strings at '-'.

In [10]:df['desc']
Out[10]:0 1-001
        1 2-003
        2 3-002
        3 2-004
        4 3-003
        5 3-001
        6 1-002
        7 2-001
        8 1-003
        9 2-002
        Name: desc, dtype: object

In [11]:df['desc'].str.split('-')
Out[11]:0 [1, 001]
        1 [2, 003]
        2 [3, 002]
        3 [2, 004]
        4 [3, 003]
        5 [3, 001]
        6 [1, 002]
        7 [2, 001] 
        8 [1, 003]
        9 [2, 002]
        Name: desc, dtype: object

In [12]:df['desc'].str.split('-', expand=True)
Out[12]:

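str.split('-') splits each string at every '-'.
Besides the dash, it is commonly used with delimiters such as . , / _.
With expand=True, as in In[12], the pieces are placed in separate columns (here 0 and 1) and returned as a single DataFrame.
The resulting DataFrame can be merged back into the original DataFrame so that everything is managed in one place, which the next step covers.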
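
When the delimiter can appear more than once, or not at all, the `n` argument limits the number of splits so that expand=True produces a predictable number of columns. A minimal sketch with hypothetical codes of uneven shape:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
codes = pd.Series(["1-001", "2-003-x", "3"])

# Without expand: each element becomes a list, and lengths may differ.
parts = codes.str.split("-")
first = parts.str[0]  # first piece of every list

# n=1 performs at most one split, so there are at most two columns;
# rows without a delimiter get a missing value in the second column.
limited = codes.str.split("-", n=1, expand=True)

print(first.tolist())  # ['1', '2', '3']
print(limited.shape)   # (3, 2)
print(limited)
```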

7. Merging the DataFrame after a String Split

In [13]:temp_df = df['desc'].str.split('-', expand=True).rename(columns={0:'grade', 1:'studentID'}) 
        temp_df
Out[13]:

In [14]:df = pd.merge(df, temp_df, left_index=True, right_index=True) 
        df
Out[14]:

In[13] uses split(expand=True) to build a DataFrame with columns 0 and 1.
The column names 0 and 1 carry no meaning, so .rename changes them to meaningful names; the result is shown in Out[13].
In[14] is the command that merges temp_df back into the original df.
Here left_index=True and right_index=True tell merge to use the index of each DataFrame as the join key.
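
Since both frames share the same index, the index-based merge can also be written with `DataFrame.join` or `pd.concat(..., axis=1)`. The sketch below rebuilds a two-row version of the example (hypothetical, shortened data) and compares the three forms; which one to use is largely a matter of taste.

```python
import pandas as pd

# Shortened, hypothetical version of the post's DataFrame.
df = pd.DataFrame({"first_name": ["Alex", "Amy"], "desc": ["1-001", "2-003"]})
temp_df = (
    df["desc"]
    .str.split("-", expand=True)
    .rename(columns={0: "grade", 1: "studentID"})
)

merged = pd.merge(df, temp_df, left_index=True, right_index=True)
joined = df.join(temp_df)                         # join aligns on the index by default
concatenated = pd.concat([df, temp_df], axis=1)   # column-wise concat, aligned on the index

print(merged.columns.tolist())  # ['first_name', 'desc', 'grade', 'studentID']
print(merged.values.tolist() == joined.values.tolist())
```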

8. One-hot encoding with the .get_dummies function

In [15]:df['grade'].str.get_dummies()
Out[15]:

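str.get_dummies one-hot encodes the column and returns a DataFrame like Out[15].
Categorical data usually cannot be used in numerical analysis as-is, so each category gets its own column: a first-year student has the value 1 (True) in column '1', and a third-year student has 1 in column '3'.
Encoding categorical data this way is what makes accurate per-grade analysis possible, so this function is used often.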
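
For comparison, the top-level `pd.get_dummies` function produces the same kind of encoding and can add a column prefix, which helps once the result is joined back to other columns. A minimal sketch with hypothetical grade values:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
grades = pd.Series(["1", "2", "3", "2"], name="grade")

# Accessor version: one 0/1 column per distinct value.
dummies_str = grades.str.get_dummies()

# Top-level version: prefix makes the column names self-describing,
# and dtype=int gives 0/1 integers instead of booleans.
dummies_pd = pd.get_dummies(grades, prefix="grade", dtype=int)

print(dummies_str.columns.tolist())  # ['1', '2', '3']
print(dummies_pd.columns.tolist())   # ['grade_1', 'grade_2', 'grade_3']
print(dummies_pd.iloc[0].tolist())   # [1, 0, 0]
```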
