Running this code:
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
    'key': ['key1', 'key2', 'key3', 'key1', 'key2'],
    'colA': ['value1A', 'value2A', 'value3A', 'value4A', 'value5A'],
    'colB': ['value1B', 'value2B', 'value3B', 'value4B', 'value5B'],
    'colC': ['value1C', 'value2C', 'value3C', 'value4C', 'value5C'],
    'colD': ['value1D', 'value2D', 'value3D', 'value4D', 'value5D']
}
df = pd.DataFrame(data)
# Build the conditions
conditions = [df['key'] == 'key1',
              df['key'] == 'key2',
              df['key'] == 'key3']
# Apply the selection to the relevant columns based on the condition
df['colA'] = np.select([df['key'] == 'key1'], [df['colA']], default='NA')
df['colD'] = np.select([df['key'] == 'key1'], [df['colD']], default='NA')
df['colB'] = np.select([df['key'] == 'key2'], [df['colB']], default='NA')
df['colC'] = np.select([df['key'] == 'key3'], [df['colC']], default='NA')
# Display the resulting DataFrame
print(df)
This code produces the following output:
    key     colA     colB     colC     colD
0  key1  value1A       NA       NA  value1D
1  key2       NA  value2B       NA       NA
2  key3       NA       NA  value3C       NA
3  key1  value4A       NA       NA  value4D
4  key2       NA  value5B       NA       NA
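Question: is there a more efficient way to rewrite this code so that I don't have to run numpy.select once per column? In this example, the "key" column essentially determines which columns in a row contain valid data. If the data is invalid, I want to mark it as NA; if it is valid, I want to keep the original value from the row. In my real dataset, the "key" column controls how multiple columns are mapped (for example, key1 controls the mapping of colA and colD). I lean toward numpy or other vectorized approaches, since they are usually faster than alternatives such as map, but I am open to all suggestions.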
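
To make the intent concrete, here is a minimal sketch of one possible direction, not a definitive answer. It assumes the key-to-column relationship can be written down as a dictionary; `valid_cols`, `value_cols`, `lookup`, and `mask` are hypothetical names introduced for illustration. The idea is to build one boolean mask with the same shape as the value columns and apply it in a single `DataFrame.where` call instead of one `numpy.select` per column.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['key1', 'key2', 'key3', 'key1', 'key2'],
    'colA': ['value1A', 'value2A', 'value3A', 'value4A', 'value5A'],
    'colB': ['value1B', 'value2B', 'value3B', 'value4B', 'value5B'],
    'colC': ['value1C', 'value2C', 'value3C', 'value4C', 'value5C'],
    'colD': ['value1D', 'value2D', 'value3D', 'value4D', 'value5D'],
})

# Assumption: each key lists the columns that hold valid data for it.
valid_cols = {
    'key1': ['colA', 'colD'],
    'key2': ['colB'],
    'key3': ['colC'],
}
value_cols = ['colA', 'colB', 'colC', 'colD']

# Boolean lookup table: one row per key, one column per value column.
lookup = pd.DataFrame(False, index=list(valid_cols), columns=value_cols)
for key, cols in valid_cols.items():
    lookup.loc[key, cols] = True

# Expand the lookup to one row per row of df; keys missing from
# valid_cols get an all-False row, so every value becomes 'NA'.
mask = lookup.reindex(df['key'], fill_value=False).to_numpy()

# Keep values where the mask is True, write 'NA' everywhere else.
result = df.copy()
result[value_cols] = df[value_cols].where(mask, 'NA')
print(result)
```

The loop runs once per key rather than once per row or per column, so the per-row work stays vectorized. Adding a new key or column only means editing `valid_cols` and `value_cols`.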