Pandas Tricks
2021, May 01
to rapidly and better dealing with data viz problems.
In [1]:
import pandas as pd
import numpy as np
Advance Filtering¶
In [2]:
covid = pd.read_csv("datasets/covid.csv", parse_dates=['dateRep'])
covid[(covid.dateRep == '2020-05-07') & (covid.continentExp != 'America') & ~(covid.cases == 0)].head(2) # not is tilde
Out[2]:
| dateRep | day | month | year | cases | deaths | countriesAndTerritories | geoId | countryterritoryCode | popData2019 | continentExp | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-05-07 | 5 | 7 | 2020 | 348 | 7 | Afghanistan | AF | AFG | 38041757.0 | Asia |
| 178 | 2020-05-07 | 5 | 7 | 2020 | 67 | 2 | Albania | AL | ALB | 2862427.0 | Europe |
Transpose and Reset_index¶
In [3]:
df = pd.read_excel("datasets/census.xlsx")
df.drop(columns=["Unnamed: 23"], inplace=True)
df.iloc[0:2, 0:5]
Out[3]:
| ANNI | Piemonte | Valle d'Aosta-Vallée d'Aoste | Liguria | Lombardia | |
|---|---|---|---|---|---|
| 0 | 1951 | 3518.00 | 94.000 | 1567.000 | 6566.000 |
| 1 | 1961 | 3914.25 | 100.959 | 1735.349 | 7406.152 |
In [4]:
df = df.transpose()
df.columns = df.iloc[0] # first row is the old first column
df.reset_index(inplace=True) # pop the index as column
df.columns = ['Regione']+ ['year_'+str(int(i)) for i in df.columns[1:]] # just for this df rename the year
df = df.iloc[1:, ] # skip the first raw because is the old ANNI!
df.iloc[0:2, 0:5]
Out[4]:
| Regione | year_1951 | year_1961 | year_1971 | year_1981 | |
|---|---|---|---|---|---|
| 1 | Piemonte | 3518.0 | 3914.250 | 4432.0 | 4479.0 |
| 2 | Valle d'Aosta-Vallée d'Aoste | 94.0 | 100.959 | 109.0 | 112.0 |
Remove Unicode and Types¶
In [5]:
dem = pd.read_csv("datasets/demography.csv")
try: dem['Area (km²)'].astype('float')
except Exception as e: print(e)
dem.head(2) # inside pop there is a '\ux0'
could not convert string to float: '23\xa0864'
Out[5]:
| Region | Capital | Pop (ab) | Area (km²) | Density (ab./km²) | Provinces | Admin divisions | |
|---|---|---|---|---|---|---|---|
| 0 | Lombardia | Milano | 10 103 969 | 23 864 | 423 | Bergamo, Brescia, Como, Cremona, Lecco, Lodi, ... | 1 506 |
| 1 | Lazio | Roma | 5 865 544 | 17 232 | 340 | Frosinone, Latina, Rieti, Roma, Viterbo | 378 |
In [6]:
dem["Pop (ab)"]=dem["Pop (ab)"].str.replace(u'\xa0', '')
dem["Admin divisions"]=dem["Admin divisions"].str.replace(u'\xa0', '')
dem["Area (km²)"]=dem["Area (km²)"].str.replace(u'\xa0', '')
In [7]:
float_df = dem.loc[:, ["Pop (ab)","Area (km²)", "Density (ab./km²)", "Admin divisions"]].astype('float')
dem.loc[:, ["Pop (ab)","Area (km²)", "Density (ab./km²)", "Admin divisions"]]= float_df
dem.head(2)
Out[7]:
| Region | Capital | Pop (ab) | Area (km²) | Density (ab./km²) | Provinces | Admin divisions | |
|---|---|---|---|---|---|---|---|
| 0 | Lombardia | Milano | 10103969.0 | 23864.0 | 423.0 | Bergamo, Brescia, Como, Cremona, Lecco, Lodi, ... | 1506.0 |
| 1 | Lazio | Roma | 5865544.0 | 17232.0 | 340.0 | Frosinone, Latina, Rieti, Roma, Viterbo | 378.0 |
Unroll as Long Dataframe¶
In [8]:
df = pd.read_excel("datasets/census.xlsx",)
df.drop(columns=["Unnamed: 23"], inplace=True)
df = df.transpose() # so to have the year as column to show!
df.columns = df.iloc[0]
df.reset_index(inplace=True)
df.columns = ['Regione']+ ['year_'+str(int(i)) for i in df.columns[1:]]
df = df.iloc[1:, ]
print(df.shape)
df.head(2)
(22, 8)
Out[8]:
| Regione | year_1951 | year_1961 | year_1971 | year_1981 | year_1991 | year_2001 | year_2011 | |
|---|---|---|---|---|---|---|---|---|
| 1 | Piemonte | 3518.0 | 3914.250 | 4432.0 | 4479.0 | 4303.0 | 4214.677 | 4363.916 |
| 2 | Valle d'Aosta-Vallée d'Aoste | 94.0 | 100.959 | 109.0 | 112.0 | 116.0 | 119.548 | 126.806 |
In [9]:
res = []
for columns in df.columns[1:]:
res.append(pd.DataFrame(dict(Region=df.Regione.values.tolist(),
Value=df[columns].values.tolist(),
Year=[str(columns) for i in range(len(df))])))
final = pd.concat(res)
print(final.shape)
final.head(3)
(154, 3)
Out[9]:
| Region | Value | Year | |
|---|---|---|---|
| 0 | Piemonte | 3518.0 | year_1951 |
| 1 | Valle d'Aosta-Vallée d'Aoste | 94.0 | year_1951 |
| 2 | Liguria | 1567.0 | year_1951 |
From Comma to Dot and Types¶
In [10]:
df = pd.read_csv("datasets/countries.csv")
df.head(1)
Out[10]:
| Country | Region | Population | Area (sq. mi.) | Pop. Density (per sq. mi.) | Coastline (coast/area ratio) | Net migration | Infant mortality (per 1000 births) | GDP ($ per capita) | Literacy (%) | Phones (per 1000) | Arable (%) | Crops (%) | Other (%) | Climate | Birthrate | Deathrate | Agriculture | Industry | Service | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | ASIA (EX. NEAR EAST) | 31056997 | 647500 | 48,0 | 0,00 | 23,06 | 163,07 | 700.0 | 36,0 | 3,2 | 12,13 | 0,22 | 87,65 | 1 | 46,6 | 20,34 | 0,38 | 0,24 | 0,38 |
In [11]:
# replace comma with . and transorm type
for c in df.columns[2:]:
try: df.loc[:,c]=df.loc[:,c].astype('float')
except:
df.loc[:,c]= df.loc[:,c].str.replace(",",".")
df.loc[:,c]=df.loc[:,c].astype('float')
df.head(1)
Out[11]:
| Country | Region | Population | Area (sq. mi.) | Pop. Density (per sq. mi.) | Coastline (coast/area ratio) | Net migration | Infant mortality (per 1000 births) | GDP ($ per capita) | Literacy (%) | Phones (per 1000) | Arable (%) | Crops (%) | Other (%) | Climate | Birthrate | Deathrate | Agriculture | Industry | Service | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | ASIA (EX. NEAR EAST) | 31056997.0 | 647500.0 | 48.0 | 0.0 | 23.06 | 163.07 | 700.0 | 36.0 | 3.2 | 12.13 | 0.22 | 87.65 | 1.0 | 46.6 | 20.34 | 0.38 | 0.24 | 0.38 |
Count and Fill NA's¶
In [12]:
np.sum(df.isna()).values.tolist()
Out[12]:
[0, 0, 0, 0, 0, 0, 3, 3, 1, 18, 4, 2, 2, 2, 22, 3, 4, 15, 16, 15]
In [13]:
df.fillna(df.mean(), inplace=True) # df.median(), df.std(), 0, 1 .....
df.head(2)
Out[13]:
| Country | Region | Population | Area (sq. mi.) | Pop. Density (per sq. mi.) | Coastline (coast/area ratio) | Net migration | Infant mortality (per 1000 births) | GDP ($ per capita) | Literacy (%) | Phones (per 1000) | Arable (%) | Crops (%) | Other (%) | Climate | Birthrate | Deathrate | Agriculture | Industry | Service | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | ASIA (EX. NEAR EAST) | 31056997.0 | 647500.0 | 48.0 | 0.00 | 23.06 | 163.07 | 700.0 | 36.0 | 3.2 | 12.13 | 0.22 | 87.65 | 1.0 | 46.60 | 20.34 | 0.380 | 0.240 | 0.380 |
| 1 | Albania | EASTERN EUROPE | 3581655.0 | 28748.0 | 124.6 | 1.26 | -4.93 | 21.52 | 4500.0 | 86.5 | 71.2 | 21.09 | 4.42 | 74.49 | 3.0 | 15.11 | 5.22 | 0.232 | 0.188 | 0.579 |
Get Dummies variable¶
In [14]:
df = pd.read_csv("datasets/countries.csv")
pd.get_dummies(df.iloc[:,1]).head(2)
Out[14]:
| ASIA (EX. NEAR EAST) | BALTICS | C.W. OF IND. STATES | EASTERN EUROPE | LATIN AMER. & CARIB | NEAR EAST | NORTHERN AFRICA | NORTHERN AMERICA | OCEANIA | SUB-SAHARAN AFRICA | WESTERN EUROPE | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Date_range for time ticks¶
In [15]:
mths = pd.date_range('1948-01-01', periods=12, freq='Y')
months = pd.date_range(start='1949', end='1950', freq='MS')
mths,months
Out[15]:
(DatetimeIndex(['1948-12-31', '1949-12-31', '1950-12-31', '1951-12-31',
'1952-12-31', '1953-12-31', '1954-12-31', '1955-12-31',
'1956-12-31', '1957-12-31', '1958-12-31', '1959-12-31'],
dtype='datetime64[ns]', freq='A-DEC'),
DatetimeIndex(['1949-01-01', '1949-02-01', '1949-03-01', '1949-04-01',
'1949-05-01', '1949-06-01', '1949-07-01', '1949-08-01',
'1949-09-01', '1949-10-01', '1949-11-01', '1949-12-01',
'1950-01-01'],
dtype='datetime64[ns]', freq='MS'))