Paths in AWS Lambda with Python NLTK

Published: 2020-12-20 11:58:06 | Category: Python | Source: compiled from the web


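I'm having trouble with the NLTK package in AWS Lambda, but I believe the problem has more to do with incorrectly configured paths in Lambda. NLTK cannot find its data files, which are stored locally and are not part of the module installation. Many of the solutions listed on SO are simple path configurations, which can be found here, but I think this problem is specific to paths in Lambda: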

How to config nltk data directory from code?

What to download in order to make nltk.tokenize.word_tokenize work?

I should also mention that this relates to a previous question I posted here:
Using NLTK corpora with AWS Lambda functions in Python

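But the issue seems more general, so I have chosen to redefine the question as one about how to correctly configure the path environment in Lambda for modules that depend on external resources, such as NLTK. NLTK stores much of its data in a local nltk_data folder, but even when this folder is included in the Lambda zip for upload, NLTK does not seem to find it.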

The Lambda function zip file also includes the following files and directories:

\nltk_data\taggers\averaged_perceptron_tagger\averaged_perceptron_tagger.pickle
\nltk_data\tokenizers\punkt\english.pickle
\nltk_data\tokenizers\punkt\PY3\english.pickle

According to the following site, /var/task/ appears to be the folder in which the Lambda function executes, and I have tried including this path. https://alestic.com/2014/11/aws-lambda-environment/

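From the documentation it also appears that there are a number of environment variables that can be used, but I don't know how to include them in a Python script (I come from Windows, not Linux): http://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html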
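For reference, environment variables are read the same way in Python on Windows and Linux, through os.environ. The following is a minimal sketch, assuming the Lambda runtime sets LAMBDA_TASK_ROOT to the directory holding the unpacked function code. The helper names and the fallback to the current working directory are illustrative assumptions, not part of the original code.

```python
import os


def lambda_task_root(environ=None):
    """Return the directory that holds the deployed function code.

    LAMBDA_TASK_ROOT is set by the Lambda runtime. The fallback to the
    current working directory is an assumption for local runs.
    """
    env = os.environ if environ is None else environ
    return env.get("LAMBDA_TASK_ROOT", os.getcwd())


def bundled_nltk_data_dir(environ=None):
    """Path of an nltk_data folder shipped at the top level of the zip."""
    return os.path.join(lambda_task_root(environ), "nltk_data")
```

Passing a plain dict as environ makes the helpers easy to try outside Lambda, for example bundled_nltk_data_dir({"LAMBDA_TASK_ROOT": "/var/task"}).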

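I'm putting the question out here in case anyone has experience configuring Lambda paths. Despite searching, I haven't seen many questions about this specific issue, so I hope resolving it will be useful.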

The code is here:

import nltk
import pymysql.cursors
import re
import rds_config
import logging
from boto_conn import botoConn
from warnings import filterwarnings
from nltk import word_tokenize

nltk.data.path.append("/nltk_data/tokenizers/punkt")
nltk.data.path.append("/nltk_data/taggers/averaged_perceptron_tagger")

logger = logging.getLogger()

logger.setLevel(logging.INFO)

rds_host = "nodexrd2.cw7jbiq3uokf.ap-southeast-2.rds.amazonaws.com"
name = rds_config.db_username
password = rds_config.db_password
db_name = rds_config.db_name

filterwarnings("ignore",category=pymysql.Warning)


def parse():

    tknzr = word_tokenize

    stopwords = ['i','me','my','myself','we','our','ours','ourselves','you','your','yours','yourself','yourselves','he','him','his','himself','she','her','hers','herself','it','its','itself','they','them','their','theirs','themselves','what','which','who','whom','this','that','these','those','am','is','are','was','were','be','been','being','have','has','had','having','do','does','did','doing','a','an','the','and','but','if','or','because','as','until','while','of','at','by','for','with','about','against','between','into','through','during','before','after','above','below','to','from','up','down','in','out','on','off','over','under','again','further','then','once','here','there','when','where','why','how','all','any','both','each','few','more','most','other','some','such','no','nor','not','only','own','same','so','than','too','very','s','t','can','will','just','don','should','now','d','ll','m','o','re','ve','y','ain','aren','couldn','didn','doesn','hadn','hasn','haven','isn','ma','mightn','mustn','needn','shan','shouldn','wasn','weren','won','wouldn']

    s3file = botoConn(None,1).getvalue()
    db = pymysql.connect(rds_host,user=name,passwd=password,db=db_name,connect_timeout=5,charset='utf8mb4',cursorclass=pymysql.cursors.DictCursor)
    lines = s3file.split('\n')

    for line in lines:

        tkn = tknzr(line)
        tagged = nltk.pos_tag(tkn)

        excl = ['the','i',"I'm",'Im','U','RT','RTs','its']  # Arg

        x = [i for i in tagged if i[0] not in stopwords]
        x = [i for i in x if i[0] not in excl]
        x = [i for i in x if len(i[0]) > 1]
        x = [i for i in x if 'https' not in i[0]]
        x = [i for i in x if i[1] == 'NNP' or i[1] == 'VB' or i[1] == 'NN']
        x = [(re.sub(r'[^A-Za-z0-9]+' + '()',r'',i[0])) for i in x]
        sql_dat_a,sql_dat = [],[]

The output log is here:

**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/sbx_user1067/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/nltk_data/tokenizers/punkt'
    - '/nltk_data/taggers/averaged_perceptron_tagger'
    - u''
**********************************************************************: LookupError
Traceback (most recent call last):
  File "/var/task/Tweetscrape_Timer.py",line 27,in schedule
    server()
  File "/var/task/Tweetscrape_Timer.py",line 14,in server
    parse()
  File "/var/task/parse_to_SQL.py",line 91,in parse
    tkn = tknzr(line)
  File "/var/task/nltk/tokenize/__init__.py",line 109,in word_tokenize
    return [token for sent in sent_tokenize(text,language)
  File "/var/task/nltk/tokenize/__init__.py",line 93,in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/var/task/nltk/data.py",line 808,in load
    opened_resource = _open(resource_url)
  File "/var/task/nltk/data.py",line 926,in _open
    return find(path_,path + ['']).open()
  File "/var/task/nltk/data.py",line 648,in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/sbx_user1067/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/nltk_data/tokenizers/punkt'
    - '/nltk_data/taggers/averaged_perceptron_tagger'
    - u''
**********************************************************************
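Reading the "Searched in" list in the log: the resource name tokenizers/punkt/english.pickle appears to be resolved relative to each listed directory, so the two appended entries sit one level too deep and also lack the /var/task prefix. The sketch below is a simplified model of that lookup, not NLTK's actual implementation, and candidate_files is a hypothetical helper.

```python
import posixpath


def candidate_files(search_dirs, resource):
    """Join a resource name onto each search directory (simplified model)."""
    return [posixpath.join(d, resource) for d in search_dirs]


resource = "tokenizers/punkt/english.pickle"
appended = ["/nltk_data/tokenizers/punkt"]  # entry taken from the log

for path in candidate_files(appended, resource):
    print(path)
# prints /nltk_data/tokenizers/punkt/tokenizers/punkt/english.pickle
```

Under this model, an entry only helps if it names the folder that directly contains tokenizers/ and taggers/.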

Solution

It looks like your current Python code runs from /var/task. I would suggest trying the following (I haven't tried it myself):

nltk.data.path.append("/var/task/nltk_data")
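A slightly more defensive variant of the same idea checks that the bundled files are present before registering the folder, so that a packaging mistake fails with a clearer message. This is an untested sketch. The REQUIRED list is an assumption taken from the zip listing and the error log in the question, and missing_resources and register_nltk_data are hypothetical helper names.

```python
import os

# Assumed from the zip listing and the error log in the question.
REQUIRED = (
    "tokenizers/punkt/english.pickle",
    "taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle",
)


def missing_resources(data_root, resources=REQUIRED):
    """Return the resources that are not present as files under data_root."""
    return [
        r for r in resources
        if not os.path.isfile(os.path.join(data_root, *r.split("/")))
    ]


def register_nltk_data(data_root):
    """Append data_root to NLTK's search path if the bundle looks complete."""
    missing = missing_resources(data_root)
    if missing:
        raise RuntimeError("nltk_data bundle incomplete: " + ", ".join(missing))
    import nltk  # imported lazily so the check above runs without NLTK
    if data_root not in nltk.data.path:
        nltk.data.path.append(data_root)


# Possible usage at the top of the handler module:
# register_nltk_data("/var/task/nltk_data")
```

The argument is the nltk_data folder itself rather than one of its subfolders, matching the suggestion above.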

(Editor: 李大同)
