Creating Your Own Data

Creating Series and DataFrames Manually

  
import pandas as pd

s = pd.Series(['banana', 42])
print(s)
-----------------------------
0    banana
1        42
dtype: object

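Passing a list to the pd.Series constructor creates a Series. Because the list mixes a string and an integer, the resulting dtype is object.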
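As a small illustrative sketch (the variable names mixed and numbers are made up for this example), the dtype that pandas picks depends on what the list contains:

```python
import pandas as pd

# Mixed types fall back to the generic object dtype.
mixed = pd.Series(['banana', 42])
print(mixed.dtype)    # object

# A list holding only integers gets an integer dtype.
numbers = pd.Series([1, 2, 3])
print(numbers.dtype)  # int64 on most platforms
```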

  
s = pd.Series(['Wes MC', 'Creator'], index=['person', 'who'])
print(s)
-------------------------------------------------------------
person     Wes MC
who       Creator
dtype: object

Pass a list of strings to the index argument to use them as the index labels.
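Once the index holds strings, values can be looked up by label. A minimal sketch, reusing the Series from the snippet above:

```python
import pandas as pd

s = pd.Series(['Wes MC', 'Creator'], index=['person', 'who'])

# Values can be looked up by label instead of by position.
print(s['person'])    # Wes MC
print(s.loc['who'])   # Creator
print(list(s.index))  # ['person', 'who']
```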

DataFrame

  
scientist = pd.DataFrame({
    'Name': ['Franklin', 'William'],
    'Occupation': ['chemist', 'statistician'],
    'Born': ['1920', '1876']
})
print(scientist)
----------------------------------------------
       Name    Occupation  Born
0  Franklin       chemist  1920
1   William  statistician  1876

  
scientist_col = pd.DataFrame(
    data={'Occupation': ['chemist', 'statistician'],
          'Born': ['1920', '1876']
          },
    index=['Rosaline', 'William'],
    columns=['Occupation', 'Born'])
print(scientist_col)
----------------------------------------------------
            Occupation  Born
Rosaline       chemist  1920
William   statistician  1876

The index and columns arguments set the row labels and the column order, respectively.
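The following sketch shows two effects of the columns argument. The names reordered and extra, and the Died column, are introduced only for this illustration:

```python
import pandas as pd

data = {'Occupation': ['chemist', 'statistician'],
        'Born': ['1920', '1876']}

# columns controls both which keys are used and in what order.
reordered = pd.DataFrame(data, index=['Rosaline', 'William'],
                         columns=['Born', 'Occupation'])
print(list(reordered.columns))     # ['Born', 'Occupation']

# A column name that is missing from the data is filled with NaN.
extra = pd.DataFrame(data, index=['Rosaline', 'William'],
                     columns=['Occupation', 'Born', 'Died'])
print(extra['Died'].isna().all())  # True
```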

  
from collections import OrderedDict

scientist_dict = pd.DataFrame(OrderedDict([
    ('Name', ['Rosaline', 'William']),
    ('Occupation', ['chemist', 'statistician']),
    ('Born', ['1920', '1876'])
]))
print(scientist_dict)
-----------------------------------------------
       Name    Occupation  Born
0  Rosaline       chemist  1920
1   William  statistician  1876

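Before Python 3.7, dictionaries did not guarantee the order of their keys, so OrderedDict was used to fix the column order.
From Python 3.7 onward a plain dict preserves insertion order, so OrderedDict is only needed for compatibility with older versions.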
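A quick way to check this, assuming Python 3.7 or later, is to build the same DataFrame from a plain dict and from an OrderedDict and compare the column order (plain and ordered are names chosen for this sketch):

```python
from collections import OrderedDict

import pandas as pd

plain = pd.DataFrame({'Name': ['Rosaline'], 'Born': ['1920']})
ordered = pd.DataFrame(OrderedDict([('Name', ['Rosaline']),
                                    ('Born', ['1920'])]))

# On Python 3.7+ both keep the insertion order of the keys.
print(list(plain.columns))    # ['Name', 'Born']
print(list(ordered.columns))  # ['Name', 'Born']
print(plain.equals(ordered))  # True
```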

Working with Series (Basics)

Selecting a Series from a DataFrame

  
scientist = pd.DataFrame(
    data={'Occupation': ['chemist', 'statistician'],
          'Born': ['1920', '1876'],
          'Died': ['1958', '1937']},
    index=['Rosaline', 'William'],
    columns=['Occupation', 'Born', 'Died'])

first_row = scientist.loc['William']
print(type(first_row))
print(first_row)
-------------------------------------------------
<class 'pandas.core.series.Series'>

Occupation    statistician
Born                  1876
Died                  1937
Name: William, dtype: object

The loc attribute selects William's row by its index label, and the result is a Series.
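For comparison, here is a sketch of the same selection by label and by integer position. The iloc call is not part of the original notes and is shown only to contrast with loc:

```python
import pandas as pd

scientist = pd.DataFrame(
    data={'Occupation': ['chemist', 'statistician'],
          'Born': ['1920', '1876'],
          'Died': ['1958', '1937']},
    index=['Rosaline', 'William'])

by_label = scientist.loc['William']  # select by index label
by_position = scientist.iloc[1]      # select by integer position

print(by_label.equals(by_position))  # True
print(by_label['Born'])              # 1876
```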

Using the index and values Attributes and the keys Method

  
print(first_row.index)
print(first_row.values)
print(first_row.keys())
------------------------
# index
Index(['Occupation', 'Born', 'Died'], dtype='object')
# values
['statistician' '1876' '1937']
# keys()
Index(['Occupation', 'Born', 'Died'], dtype='object')
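The attributes can also be indexed directly. A minimal sketch, with a hand-built Series named row standing in for first_row:

```python
import pandas as pd

row = pd.Series(['statistician', '1876', '1937'],
                index=['Occupation', 'Born', 'Died'], name='William')

print(row.index[0])                  # Occupation
print(row.values[1])                 # 1876
# keys() is an alias for the index attribute.
print(row.keys().equals(row.index))  # True
```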

Working with Series (Advanced)

  
scientist = pd.read_csv('../doit_pandas/data/scientists.csv')

ages = scientist['Age']
print(ages.max())
---------------------------------
90

Calling Series methods such as max(), min(), and std() returns summary statistics for that column.
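A short sketch of a few of these methods, using a hand-typed Series as a stand-in for the Age column so that it runs without the CSV file:

```python
import pandas as pd

# Stand-in for scientist['Age']; typed in by hand for this example.
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name='Age')

print(ages.min())     # 37
print(ages.max())     # 90
print(ages.mean())    # 59.125
print(ages.median())  # 58.5
print(ages.std())     # sample standard deviation
```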

Series and Broadcasting

  
print(ages)
print(ages + ages)
print(ages * ages)

Series of the same length can be added or multiplied element by element.
Adding or multiplying a scalar broadcasts it to every value in the Series.
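Both cases can be sketched with a three-element Series (a shortened, hand-typed stand-in for ages):

```python
import pandas as pd

ages = pd.Series([37, 61, 90])

# Series with Series: element-by-element.
print((ages + ages).tolist())  # [74, 122, 180]
print((ages * ages).tolist())  # [1369, 3721, 8100]

# Series with scalar: the scalar is broadcast to every element.
print((ages + 100).tolist())   # [137, 161, 190]
print((ages * 2).tolist())     # [74, 122, 180]
```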

  
print(ages + pd.Series([1, 100]))
----------------------------------
0     38.0
1    161.0
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
dtype: float64

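When Series of different lengths are combined, values are matched by index label: the computation runs for the labels both sides share, and every label missing from either side yields NaN.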
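A reduced sketch of the same behavior, with four hand-typed values in place of the full ages column:

```python
import pandas as pd

ages = pd.Series([37, 61, 90, 66])
short = pd.Series([1, 100])  # only has labels 0 and 1

result = ages + short
print(result.tolist())       # [38.0, 161.0, nan, nan]
print(result.dtype)          # float64, because NaN forces a float dtype
```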

  
rev_ages = ages.sort_index(ascending=False)
print(rev_ages)
-----------------------------------------
7    77
6    41
5    45
4    56
3    66
2    90
1    61
0    37

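Setting the ascending argument of sort_index to False sorts the data in reverse index order.
Operations between two Series are carried out between values with matching index labels, regardless of their position.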
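To see the label matching at work, the sketch below adds a Series to its own reversed copy. If the addition were positional the sums would differ, but matching by label gives the same result as doubling each value:

```python
import pandas as pd

ages = pd.Series([37, 61, 90])
rev_ages = ages.sort_index(ascending=False)  # index order 2, 1, 0

# Values are paired by label, so the reversed order does not matter.
total = ages + rev_ages
print(total.sort_index().tolist())           # [74, 122, 180]
```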

Processing Data in Series and DataFrames

  
born_datetime = pd.to_datetime(scientist['Born'], format='%Y-%m-%d')
died_datetime = pd.to_datetime(scientist['Died'], format='%Y-%m-%d')

scientist['born_dt'], scientist['died_dt'] = (born_datetime, died_datetime)
print(scientist.head())

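Assigning to scientist['born_dt'] creates a new column. Here the Born and Died strings are parsed into datetime values and stored as born_dt and died_dt.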
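One use for the parsed columns is date arithmetic. The sketch below builds a two-row frame by hand so that it runs without the CSV file; the date values and the age_days column name are assumptions made for this illustration:

```python
import pandas as pd

# Two hand-typed rows standing in for the CSV data.
scientist = pd.DataFrame({
    'Name': ['Rosaline', 'William'],
    'Born': ['1920-07-25', '1876-06-13'],
    'Died': ['1958-04-16', '1937-10-16'],
})

scientist['born_dt'] = pd.to_datetime(scientist['Born'], format='%Y-%m-%d')
scientist['died_dt'] = pd.to_datetime(scientist['Died'], format='%Y-%m-%d')

# Subtracting two datetime columns yields a timedelta column.
lifespan = scientist['died_dt'] - scientist['born_dt']
scientist['age_days'] = lifespan.dt.days
print(scientist['age_days'].tolist())  # [13779, 22404]
```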

  
scientist_dropped = scientist.drop(['Age'], axis=1)

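Use drop() to remove columns.
Passing axis=1 makes drop() look up Age among the column labels, so the Age column is removed. drop() returns a new DataFrame and leaves scientist unchanged.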
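A minimal sketch with a two-row stand-in frame, showing that the original is untouched and that the columns keyword is an equivalent spelling:

```python
import pandas as pd

scientist = pd.DataFrame({'Name': ['Rosaline', 'William'],
                          'Age': [37, 61]})

# drop returns a new DataFrame; the original keeps its columns.
dropped = scientist.drop(['Age'], axis=1)
print(list(dropped.columns))    # ['Name']
print(list(scientist.columns))  # ['Name', 'Age']

# columns= is an equivalent, more explicit spelling of axis=1.
same = scientist.drop(columns=['Age'])
print(dropped.equals(same))     # True
```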

Saving and Loading Data

Saving and Loading Data as Pickle, CSV, and TSV Files

  
names = scientist['Name']
names.to_pickle('output/scientist_name.pickle')

read_name = pd.read_pickle('PATH')

to_pickle writes the Series to a file with the .pickle extension. The output directory must already exist.
To load a pickle file back, use read_pickle.
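The notes above only show the pickle case. As a hedged sketch of all three formats named in the heading, the round trip below writes into a temporary directory; the file names and the two-row data are assumptions made for this example:

```python
import os
import tempfile

import pandas as pd

names = pd.Series(['Rosaline', 'William'], name='Name')
scientist = pd.DataFrame({'Name': names, 'Age': [37, 61]})

with tempfile.TemporaryDirectory() as tmp:
    # Pickle stores the pandas object itself, including its name.
    pickle_path = os.path.join(tmp, 'scientist_name.pickle')
    names.to_pickle(pickle_path)
    restored = pd.read_pickle(pickle_path)

    # CSV is comma-separated; TSV is the same call with sep='\t'.
    csv_path = os.path.join(tmp, 'scientist.csv')
    tsv_path = os.path.join(tmp, 'scientist.tsv')
    scientist.to_csv(csv_path, index=False)
    scientist.to_csv(tsv_path, sep='\t', index=False)
    from_csv = pd.read_csv(csv_path)
    from_tsv = pd.read_csv(tsv_path, sep='\t')

print(restored.equals(names))     # True
print(from_csv.equals(from_tsv))  # True
```

Passing index=False keeps the row index out of the file, so reading it back does not produce an extra unnamed column.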

Series and DataFrame (Introduction to Pandas for Data Analysis)
