Pandas | Notes

Reading files

import pandas as pd
df67 = pd.read_excel(r'./data/PE-数据总sjk.xlsx', skiprows=1, sheet_name=str(67))

Data processing

Selecting data

# loc: index by row/column names or labels
df = df.loc[:, ['EXP_BLSBH', 'ZY', 'XX']]

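# iloc: index by integer row/column position
# Slice along index and columns by position (the end of the slice is excluded)
# Select the 2nd and 3rd rows and the 3rd and 4th columns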
df = df.iloc[1:3, 2:4]
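To make the difference between the two indexers concrete, here is a minimal sketch on a toy DataFrame. The data, the row labels `r1`..`r4` and the `extra` column are made up for illustration; only the three column names come from the note above. It shows that a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position.

```python
import pandas as pd

# Hypothetical toy data; only the first three column names come from the note
df = pd.DataFrame(
    {
        "EXP_BLSBH": [101, 102, 103, 104],
        "ZY": ["a", "b", "c", "d"],
        "XX": [1.0, 2.0, 3.0, 4.0],
        "extra": [True, False, True, False],
    },
    index=["r1", "r2", "r3", "r4"],
)

# Label-based: the slice "r2":"r3" includes the end label "r3"
by_label = df.loc["r2":"r3", ["EXP_BLSBH", "ZY"]]

# Position-based: 1:3 covers positions 1 and 2, the end position 3 is excluded
by_position = df.iloc[1:3, 0:2]

# Both selections pick the same rows and columns
print(by_label.equals(by_position))
```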

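Converting categorical data to numeric codes

# List all distinct values of the variable
arr = data['YGeKY'].unique()
# Build a dict mapping index -> category
dict_data = {i: arr[i] for i in range(arr.shape[0])}
# Swap keys and values to get category -> index
new_dict = {value: key for key, value in dict_data.items()}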

data['YGeKY'] = data['YGeKY'].map(new_dict)
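As a possible shortcut, `pd.factorize` produces the same kind of encoding in one call: it numbers categories in order of first appearance, just like the dictionary built from `unique()`. The sketch below uses made-up category values and assumes the column has no missing values (by default `factorize` encodes missing values as -1, which the dictionary approach does not do).

```python
import pandas as pd

# Hypothetical category values, for illustration only
data = pd.DataFrame({"YGeKY": ["low", "high", "low", "mid", "high"]})

# Dictionary approach from the note: category -> index in order of appearance
arr = data["YGeKY"].unique()
new_dict = {value: idx for idx, value in enumerate(arr)}
mapped = data["YGeKY"].map(new_dict)

# factorize returns the integer codes and the distinct values in one step
codes, uniques = pd.factorize(data["YGeKY"])

print(mapped.tolist())
print(codes.tolist())
print(list(uniques))
```

Keeping `uniques` around makes it possible to map the codes back to the original labels later.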

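When a column mixes plain strings with numeric strings, and each value has to be handled according to its numeric value

# Strip irrelevant symbols (regex=False treats '[' as a literal character)
data['xxx'] = data['xxx'].str.replace('<', '', regex=False).str.replace('[', '', regex=False).str.replace(' ', '', regex=False)
# Custom function that rewrites each value based on a condition
# '阴性' = negative, '阳性' = positive
def custom_transform(value):
    try:
        numeric_value = float(value)
        return '阴性' if numeric_value < 1 else '阳性' if numeric_value >= 1 else str(numeric_value)
    except (ValueError, TypeError):
        # Not numeric (or None): leave the value unchanged
        return value

data['xxx'] = data['xxx'].apply(custom_transform)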
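A small self-contained run of this cleaning step, as a sketch: the sample values are invented, and the threshold of 1 and the labels '阴性' (negative) and '阳性' (positive) are taken from the note. Non-numeric strings pass through unchanged.

```python
import pandas as pd

# Hypothetical raw values mixing symbols, numeric strings and plain text
data = pd.DataFrame({"xxx": ["<0.5", "1.2", "[3", "abc", " 0.9"]})

# Remove the unwanted characters literally (regex=False avoids treating '[' as a pattern)
for symbol in ["<", "[", " "]:
    data["xxx"] = data["xxx"].str.replace(symbol, "", regex=False)


def custom_transform(value):
    """Label numeric strings by threshold; leave everything else as is."""
    try:
        numeric_value = float(value)
    except (ValueError, TypeError):
        return value  # not numeric: keep the original value
    if numeric_value < 1:
        return "阴性"  # negative
    if numeric_value >= 1:
        return "阳性"  # positive
    return str(numeric_value)  # only reached for NaN


result = data["xxx"].apply(custom_transform)
print(result.tolist())
```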

Filling NaN

data['TT'] = data['TT'].fillna('阴性')

Dropping NaN

pandas.DataFrame.dropna — pandas 2.1.3 documentation (pydata.org)

DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False, ignore_index=False)

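thresh: used with axis=0 to filter rows. A row is kept only if it has at least thresh non-NaN values, so thresh=5 drops every row with fewer than five non-NaN values. thresh cannot be combined with how.
subset: restricts the NaN check to the given columns. Used together with how: how='all' drops a row only when all of those columns are NaN, while how='any' drops it when any of them is NaN.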
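The following sketch tries these parameters on a small made-up DataFrame (the column names `a`, `b`, `c` and the values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 2 is entirely NaN
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, 4.0],
        "b": [1.0, 2.0, np.nan, np.nan],
        "c": [1.0, np.nan, np.nan, 4.0],
    }
)

# Keep rows that have at least 2 non-NaN values
at_least_two = df.dropna(thresh=2)

# Look only at columns a and b: drop a row if either one is NaN
any_in_subset = df.dropna(subset=["a", "b"], how="any")

# Look only at columns a and b: drop a row only if both are NaN
all_in_subset = df.dropna(subset=["a", "b"], how="all")

print(list(at_least_two.index))
print(list(any_in_subset.index))
print(list(all_in_subset.index))
```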

Removing duplicates

>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
    brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0
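The output above matches the example DataFrame used in the pandas documentation for `drop_duplicates`. A sketch that rebuilds that input (assuming the same sample values) and compares `keep='last'` with the default `keep='first'`:

```python
import pandas as pd

# Sample data assumed to match the pandas documentation example
df = pd.DataFrame(
    {
        "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie", "Indomie"],
        "style": ["cup", "cup", "cup", "pack", "pack"],
        "rating": [4, 4, 3.5, 15, 5],
    }
)

# Rows count as duplicates when both brand and style match;
# keep='last' retains the last row of each duplicate group
last_kept = df.drop_duplicates(subset=["brand", "style"], keep="last")

# keep='first' (the default) retains the first row of each group instead
first_kept = df.drop_duplicates(subset=["brand", "style"], keep="first")

print(last_kept)
print(first_kept)
```

The original index labels are preserved in the result; passing `ignore_index=True` would renumber them from 0.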

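Saving files

# With mode='a' the file must already exist; sheets are appended, and an existing sheet with the same name is replaced
# The with block saves and closes the file on exit (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('./data/s92-s101.xlsx', mode='a', engine='openpyxl', if_sheet_exists='replace') as writer:
    dftmp.to_excel(writer, sheet_name='s92', index=False)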
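Because append mode fails when the workbook does not exist yet, one possible workaround is to pick the mode at run time. The helper `write_sheet` below is hypothetical (not part of pandas), the file name and data are made up, and the sketch assumes `openpyxl` is installed and pandas is version 1.3 or later (for `if_sheet_exists`).

```python
import os
import tempfile

import pandas as pd


def write_sheet(df, path, sheet_name):
    """Hypothetical helper: write df to one sheet, creating the file if needed."""
    if os.path.exists(path):
        # Append to the existing workbook, replacing a sheet of the same name
        options = {"mode": "a", "if_sheet_exists": "replace"}
    else:
        # if_sheet_exists is only accepted in append mode
        options = {"mode": "w"}
    # The context manager saves and closes the workbook on exit
    with pd.ExcelWriter(path, engine="openpyxl", **options) as writer:
        df.to_excel(writer, sheet_name=sheet_name, index=False)


# Demo in a temporary directory with made-up data
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
write_sheet(pd.DataFrame({"x": [1, 2]}), path, "s92")  # creates the file
write_sheet(pd.DataFrame({"x": [3]}), path, "s93")     # appends a second sheet
write_sheet(pd.DataFrame({"x": [9]}), path, "s92")     # replaces the first s92

# sheet_name=None reads every sheet into a dict of name -> DataFrame
sheets = pd.read_excel(path, sheet_name=None)
print(sorted(sheets))
```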