A Summary of 10 Commonly Used Pandas Functions

deephub

2021-09-26 08:37

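Most of us prefer Python for data-related work, and Pandas is the Python library we use most often. Before introducing the commonly used functions, we need to understand the two main data structures Pandas provides: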

Series: a one-dimensional data structure that holds key-value pairs (index labels and values). It is similar to a Python dictionary.

>>> import numpy as np

>>> import pandas as pd

>>> d = {'a': 1, 'b': 2, 'c': 3}

>>> ser = pd.Series(data=d, index=['a', 'b', 'c'])

>>> ser

a 1

b 2

c 3

dtype: int64
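To make the dictionary comparison concrete, here is a minimal sketch (the variable names are illustrative, not from the original article) showing where a Series behaves like a dictionary and where it goes beyond one:

```python
import pandas as pd

ser = pd.Series({'a': 1, 'b': 2, 'c': 3})

# Label-based lookup works like a dictionary
value_b = ser['b']

# Unlike a dictionary, arithmetic applies to every element at once
doubled = ser * 2

# Selecting several labels returns a new, smaller Series
subset = ser[['a', 'c']]

print(value_b)
print(doubled.to_dict())
print(subset.to_dict())
```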

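DataFrame: a two-dimensional data structure that is essentially a collection of one or more Series sharing the same index. It can also be thought of as a spreadsheet of data, and it is the data structure we use most often.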

>>> d = {'col1': [1, 2], 'col2': [3, 4]}

>>> df = pd.DataFrame(data=d)

>>> df

col1 col2

0 1 3

1 2 4
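The relationship between the two structures can be checked directly. The following small sketch, which reuses the same two-column frame, shows that selecting one column gives back a Series:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Selecting a single column returns a Series
col = df['col1']

print(type(col).__name__)
print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))
```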

Now that we know how the data is stored, let's start introducing the commonly used functions.

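Note: I have not covered basic arithmetic and statistical operations such as sqrt and corr, because I want this article to focus on functions that are more specific to Pandas.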

read_csv

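Let's start with reading data. Pandas can read many types of files, such as CSV, Excel, SQL, and JSON. Let's look at the most commonly used one. If we want to read a file named data.csv, Pandas offers many options, some of which are: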

# Simply read the file as is

>>> pd.read_csv('data.csv')

# Import specific columns only

>>> pd.read_csv('data.csv', usecols=['column_name1', 'column_name2'])

# Set a column as the index column

>>> pd.read_csv('data.csv', index_col='Name')

Similar functions: read_xxx, where xxx is the type of file you want to read, e.g. read_json, read_excel.
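Since data.csv is not included with the article, the sketch below reads the same kind of content from an in-memory string instead. The column names and values are made up for illustration; read_csv accepts a file-like object such as io.StringIO in place of a path.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = "Name,Age,City\nAnn,31,Oslo\nBob,45,Lima\n"

# Read everything
full = pd.read_csv(io.StringIO(csv_text))

# Read only two of the columns
subset = pd.read_csv(io.StringIO(csv_text), usecols=['Name', 'Age'])

# Use the Name column as the row index
indexed = pd.read_csv(io.StringIO(csv_text), index_col='Name')

print(full)
print(subset)
print(indexed.loc['Bob', 'Age'])
```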

select_dtypes

Let's see how Pandas helps when we need to work with columns of a specific data type.

# select all columns except float64 ones

>>> df.select_dtypes(exclude='float64')

# select non-numeric columns

>>> df.select_dtypes(exclude=[np.number])

>>> df = pd.DataFrame({'a': [1, 2] * 3,

... 'b': [True, False] * 3,

... 'c': [1.0, 2.0] * 3})

>>> df

a b c

0 1 True 1.0

1 2 False 2.0

2 1 True 1.0

3 2 False 2.0

4 1 True 1.0

5 2 False 2.0

>>> df.select_dtypes(include='bool')

b

0 True

1 False

2 True

3 False

4 True

5 False

Related function: value_counts, which returns the unique values together with the number of times each one occurs.
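As a small illustration of both functions on the frame used above, this sketch selects the numeric columns and then counts the values in one of them. Note that boolean columns are not treated as np.number.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2] * 3,
                   'b': [True, False] * 3,
                   'c': [1.0, 2.0] * 3})

# Keep only numeric columns; the boolean column 'b' is left out
numeric = df.select_dtypes(include=[np.number])

# Count how often each distinct value appears in column 'a'
counts = df['a'].value_counts()

print(list(numeric.columns))
print(counts)
```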

copy

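To copy an object in code we often write A = B, but in Pandas (as in Python generally) this only makes A another reference to the same object as B. So if we change B, the value of A changes as well. We therefore need the copy function, as shown below.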

>>> s = pd.Series([1, 2], index=["a", "b"])

>>> s

a 1

b 2

dtype: int64

>>> s_copy = s.copy()

>>> s_copy

a 1

b 2

dtype: int64

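To understand the subtleties of copying, you should also know the difference between a shallow copy and a deep copy.

A shallow copy shares its data and index with the original.

A deep copy creates a separate copy of the data and index.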

>>> s = pd.Series([1, 2], index=["a", "b"])

>>> deep = s.copy()

>>> shallow = s.copy(deep=False)

>>> s.iloc[0] = 3

>>> shallow.iloc[1] = 4

>>> s

a 3

b 4

dtype: int64

>>> shallow

a 3

b 4

dtype: int64

>>> deep

a 1

b 2

dtype: int64

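Note in the example above how shallow changes along with s, while deep stays unchanged. This is the behavior without Copy-on-Write; with Copy-on-Write enabled (opt-in in pandas 2.x and planned as the default in pandas 3.0), a shallow copy no longer reflects later changes to the original.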
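The difference between plain assignment and copy can be sketched as follows. The names alias and independent are illustrative only.

```python
import pandas as pd

s = pd.Series([1, 2], index=['a', 'b'])

alias = s                # no copy: both names refer to the same object
independent = s.copy()   # deep copy by default

# Modify the original through its label
s.loc['a'] = 100

print(alias is s)            # True: same object
print(alias.loc['a'])        # 100: the change is visible through the alias
print(independent.loc['a'])  # 1: the copy is unaffected
```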

map

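To change a set of values quickly, we can use map. It replaces each value in a Series with another value, which may come from a function, a dictionary, or another Series. The examples below are simple, but map is a real help in complex cases, because a single map call can translate many different values at once.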

>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])

>>> s

0 cat

1 dog

2 NaN

3 rabbit

dtype: object

>>> s.map({'cat': 'kitten', 'dog': 'puppy'})

0 kitten

1 puppy

2 NaN

3 NaN

dtype: object

>>> s.map('I am a {}'.format, na_action='ignore')

0 I am a cat

1 I am a dog

2 NaN

3 I am a rabbit

dtype: object
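The examples above cover a dictionary and a string-formatting function. The sketch below covers the remaining case, mapping through another Series, plus an ordinary built-in function. The lookup values are invented for illustration.

```python
import pandas as pd

s = pd.Series(['cat', 'dog', 'bird'])

# A Series used as a lookup table: its index holds the keys
lookup = pd.Series({'cat': 'mammal', 'dog': 'mammal', 'bird': 'avian'})
by_series = s.map(lookup)

# Any callable taking one value also works, here the built-in len
lengths = s.map(len)

print(by_series.tolist())
print(lengths.tolist())
```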

apply

A simpler way to apply a function to our dataset is to use apply, which lets us define even a complex lambda expression inline, directly in the function call.

>>> df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

>>> df

A B

0 4 9

1 4 9

2 4 9

>>> df.apply(np.sqrt)

A B

0 2.0 3.0

1 2.0 3.0

2 2.0 3.0

>>> df.apply(lambda x: [1, 2], axis=1)

0 [1, 2]

1 [1, 2]

2 [1, 2]

dtype: object

>>> df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)

foo bar

0 1 2

1 1 2

2 1 2

Similar function: applymap (renamed to DataFrame.map in pandas 2.1, where applymap is deprecated)
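The axis argument decides what the function receives. A minimal sketch on the same frame, with illustrative lambda expressions: by default each column is passed in as a Series, while axis=1 passes each row.

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

# axis=1: the lambda receives one row at a time
row_sum = df.apply(lambda row: row['A'] + row['B'], axis=1)

# default axis=0: the lambda receives one column at a time
col_range = df.apply(lambda col: col.max() - col.min())

print(row_sum.tolist())
print(col_range.to_dict())
```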

isna, isin

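isna and isin filter data by isolating NaN values or by checking whether values belong to a given set. They return True for data that meets the condition and False otherwise.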

>>> pd.isna('dog')

False

>>> pd.isna(pd.NA)

True

# display the rows where col1 is missing (data is any DataFrame with a col1 column)

>>> data[pd.isna(data['col1'])]

# count the number of missing values in each column

>>> data.isna().sum()

>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},

... index=['falcon', 'dog'])

>>> df

num_legs num_wings

falcon 2 2

dog 4 0

>>> df.isin([0, 2])

num_legs num_wings

falcon True True

dog False True

Similar functions: notna, fillna, isnull
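Because the data frame in the snippet above is not defined in the article, here is a self-contained sketch with a made-up frame that uses the boolean results of isna and isin to filter rows, and fillna to replace the missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in col1
data = pd.DataFrame({'col1': [1.0, np.nan, 3.0],
                     'col2': ['x', 'y', 'z']})

# Rows where col1 is missing
missing_rows = data[data['col1'].isna()]

# Number of missing values per column
missing_counts = data.isna().sum()

# Replace missing values with 0
filled = data['col1'].fillna(0)

# Rows whose col2 value is one of the listed values
selected = data[data['col2'].isin(['x', 'z'])]

print(missing_rows)
print(missing_counts)
print(filled.tolist())
print(selected)
```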

groupby

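A groupby operation involves some combination of splitting the data, applying a function, and combining the results. One typical use case is to identify identical values in a column and aggregate the rows that share them.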

>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',

... 'Parrot', 'Parrot'],

... 'Max Speed': [380., 370., 24., 26.]})

>>> df

Animal Max Speed

0 Falcon 380.0

1 Falcon 370.0

2 Parrot 24.0

3 Parrot 26.0

>>> df.groupby(['Animal']).mean()

Max Speed

Animal

Falcon 375.0

Parrot 25.0
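The apply step is not limited to a single function. As a sketch on the same data, agg can compute several statistics per group in one call:

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})

# Split by Animal, then compute three statistics for each group
summary = df.groupby('Animal')['Max Speed'].agg(['min', 'max', 'mean'])

print(summary)
```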

nsmallest, nlargest

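As the names suggest, we use these to get the rows with the n smallest or n largest values in a specific column. When we only need to select a few elements, these functions perform better than sorting the entire dataset.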

>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000,

... 434000, 434000, 337000, 337000,

... 11300, 11300],

... 'GDP': [1937894, 2583560 , 12011, 4520, 12128,

... 17036, 182, 38, 311],

... 'alpha-2': ["IT", "FR", "MT", "MV", "BN",

... "IS", "NR", "TV", "AI"]},

... index=["Italy", "France", "Malta",

... "Maldives", "Brunei", "Iceland",

... "Nauru", "Tuvalu", "Anguilla"])

>>> df

population GDP alpha-2

Italy 59000000 1937894 IT

France 65000000 2583560 FR

Malta 434000 12011 MT

Maldives 434000 4520 MV

Brunei 434000 12128 BN

Iceland 337000 17036 IS

Nauru 337000 182 NR

Tuvalu 11300 38 TV

Anguilla 11300 311 AI

>>> df.nsmallest(3, 'population')

population GDP alpha-2

Tuvalu 11300 38 TV

Anguilla 11300 311 AI

Iceland 337000 17036 IS

>>> df.nsmallest(3, 'population', keep='last')

population GDP alpha-2

Anguilla 11300 311 AI

Tuvalu 11300 38 TV

Nauru 337000 182 NR

>>> df.nsmallest(3, 'population', keep='all')

population GDP alpha-2

Tuvalu 11300 38 TV

Anguilla 11300 311 AI

Iceland 337000 17036 IS

Nauru 337000 182 NR
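The examples above only show nsmallest. Here is a short sketch of nlargest on a reduced version of the same data (with no ties), compared against the sort-then-head approach it replaces:

```python
import pandas as pd

df = pd.DataFrame({'population': [59000000, 65000000, 434000, 337000, 11300]},
                  index=['Italy', 'France', 'Malta', 'Iceland', 'Tuvalu'])

# The two rows with the largest population
top2 = df.nlargest(2, 'population')

# Equivalent result obtained by sorting everything first
top2_sorted = df.sort_values('population', ascending=False).head(2)

print(top2)
print(top2.equals(top2_sorted))
```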

merge

merge combines two DataFrames based on columns or indexes.

>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],

... 'value': [1, 2, 3, 5]})

>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],

... 'value': [5, 6, 7, 8]})

>>> df1

lkey value

0 foo 1

1 bar 2

2 baz 3

3 foo 5

>>> df2

rkey value

0 foo 5

1 bar 6

2 baz 7

3 foo 8

>>> df1.merge(df2, left_on='lkey', right_on='rkey')

lkey value_x rkey value_y

0 foo 1 foo 5

1 foo 1 foo 8

2 foo 5 foo 5

3 foo 5 foo 8

4 bar 2 bar 6

5 baz 3 baz 7

Similar functions: merge_ordered, merge_asof, join
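By default merge performs an inner join, keeping only keys found in both frames. The sketch below uses two invented frames to contrast that with how='left', and uses indicator=True to show where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar', 'qux'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['foo', 'bar', 'baz'], 'rval': [5, 6, 7]})

# Inner join (the default): only keys present in both frames survive
inner = left.merge(right, on='key')

# Left join: every row of `left` is kept; unmatched rows get NaN in rval.
# indicator=True adds a _merge column recording the source of each row.
left_join = left.merge(right, on='key', how='left', indicator=True)

print(inner)
print(left_join)
```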

to_csv

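Our final step is to save the data produced by all this processing. Just as with the read functions, there are matching write functions, as shown below. When no path is passed, to_csv returns the CSV content as a string.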

>>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'],

... 'mask': ['red', 'purple'],

... 'weapon': ['sai', 'bo staff']})

>>> df.to_csv(index=False)

'name,mask,weapon\nRaphael,red,sai\nDonatello,purple,bo staff\n'

Similar functions: to_xxx, where, as with reading, xxx is the type of file to write, e.g. to_json.
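To tie writing back to reading, this sketch writes the same frame into an in-memory buffer (used here in place of a file on disk, as an assumption to keep the example self-contained) and reads it back with read_csv:

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
                   'mask': ['red', 'purple'],
                   'weapon': ['sai', 'bo staff']})

# Write to a text buffer instead of a file path
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and load the data again
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```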

Summary

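Now that I have finished writing this article, I can say with certainty that 10 functions are far too few to show everything Pandas has to offer. But my aim was to get you comfortable with the library, so that from now on you use Pandas for all your data-related work.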

Author: Harsh Maheshwari
