python – Reading a csv from Google Cloud Storage into a pandas DataFrame
Published: 2020-12-20 10:35:09 Category: Python Source: compiled from the web
I am trying to read a csv file from a Google Cloud Storage bucket into a pandas DataFrame.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline
    from io import BytesIO
    from google.cloud import storage

    storage_client = storage.Client()
    bucket = storage_client.get_bucket('createbucket123')
    blob = bucket.blob('my.csv')
    path = "gs://createbucket123/my.csv"
    df = pd.read_csv(path)

It shows the following error message:

    FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist

What am I doing wrong? I cannot find any solution that does not involve Google Datalab.

Solution
UPDATE
As of pandas 0.24, read_csv supports reading directly from Google Cloud Storage. Simply pass the gs:// link to the file in the bucket:

    df = pd.read_csv('gs://bucket/your_path.csv')

pandas relies on the gcsfs package for this, so it has to be installed.

For the sake of completeness, I am also leaving three other options:

> home-made code
> gcsfs
> Dask

I will cover them below.

The hard way: do-it-yourself code

I have written some convenience functions for reading from Google Storage. To make them more readable, I added type annotations. If you happen to be on Python 2, simply remove them and the code will work all the same.

It works equally well on public and private datasets, assuming you are authorised. With this approach you do not need to download the data to a local drive first.

How to use it:

    fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
    df = pd.read_csv(fileobj)

The code:

    from io import BytesIO

    from google.cloud import storage
    from google.oauth2 import service_account


    def get_byte_fileobj(project: str,
                         bucket: str,
                         path: str,
                         service_account_credentials_path: str = None) -> BytesIO:
        """
        Retrieve data from a given blob on Google Storage and pass it as a file object.
        :param path: path within the bucket
        :param project: name of the project
        :param bucket: name of the bucket
        :param service_account_credentials_path: path to credentials.
               TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
        :return: file object (BytesIO)
        """
        blob = _get_blob(bucket, path, project, service_account_credentials_path)
        byte_stream = BytesIO()
        blob.download_to_file(byte_stream)
        # Rewind so that the caller reads from the start of the buffer.
        byte_stream.seek(0)
        return byte_stream


    def get_bytestring(project: str,
                       bucket: str,
                       path: str,
                       service_account_credentials_path: str = None) -> bytes:
        """
        Retrieve data from a given blob on Google Storage and pass it as a byte-string.
        :param path: path within the bucket
        :param project: name of the project
        :param bucket: name of the bucket
        :param service_account_credentials_path: path to credentials.
               TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
        :return: byte-string (needs to be decoded)
        """
        blob = _get_blob(bucket, path, project, service_account_credentials_path)
        # Newer google-cloud-storage releases offer download_as_bytes() as the replacement.
        s = blob.download_as_string()
        return s


    def _get_blob(bucket_name, path, project, service_account_credentials_path):
        # Without an explicit credentials file, the client uses the default credentials.
        credentials = service_account.Credentials.from_service_account_file(
            service_account_credentials_path) if service_account_credentials_path else None
        storage_client = storage.Client(project=project, credentials=credentials)
        bucket = storage_client.get_bucket(bucket_name)
        blob = bucket.blob(path)
        return blob

gcsfs

gcsfs is a "Pythonic file-system for Google Cloud Storage".

How to use it:

    import pandas as pd
    import gcsfs

    fs = gcsfs.GCSFileSystem(project='my-project')
    with fs.open('bucket/path.csv') as f:
        df = pd.read_csv(f)

Dask

Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It is great when you need to process large volumes of data in Python. Dask tries to mimic much of the pandas API, which makes it easy for newcomers to pick up, and it has its own read_csv.

How to use it:

    import dask.dataframe as dd

    df = dd.read_csv('gs://bucket/data.csv')
    df2 = dd.read_csv('gs://bucket/path/*.csv')  # nice!

    # df is now a Dask dataframe, ready for distributed processing.
    # If you want to have the pandas version, simply:
    df_pd = df.compute()

(Editor: 李大同)
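The question starts from a single gs:// path, while get_byte_fileobj and get_bytestring take the bucket and the path within the bucket as separate arguments. A minimal sketch of bridging the two is shown here; split_gcs_uri is a hypothetical helper written for illustration, not part of pandas, gcsfs or google-cloud-storage, and it uses only plain string handling.

```python
from typing import Tuple

GCS_PREFIX = "gs://"


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/file.csv' into (bucket, path).

    Hypothetical helper for illustration only.
    """
    if not uri.startswith(GCS_PREFIX):
        raise ValueError("not a gs:// URI: %r" % uri)
    # Everything up to the first slash is the bucket, the rest is the object path.
    bucket, _, path = uri[len(GCS_PREFIX):].partition("/")
    if not bucket or not path:
        raise ValueError("URI must contain both a bucket and a path: %r" % uri)
    return bucket, path


bucket, path = split_gcs_uri("gs://createbucket123/my.csv")
print(bucket, path)  # prints: createbucket123 my.csv
```

The two values could then be handed to the functions above, for example get_byte_fileobj('my-project', bucket, path), where 'my-project' stands in for your own project name.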