python-3.x – Removing trailing whitespace from elements in a list
I have a Spark DataFrame in which a given column contains some text. I am trying to clean the text and split it on commas, which outputs a new column containing a list of words.
The problem I am running into is that some elements of that list contain trailing whitespace, which I would like to remove.

Code:

    # Libraries
    # Standard Libraries
    from typing import Dict, List, Tuple

    # Third Party Libraries
    import pyspark
    from pyspark.ml.feature import Tokenizer
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as s_function


    def tokenize(sdf, input_col="text", output_col="tokens"):
        # Remove email addresses
        sdf_temp = sdf.withColumn(
            colName=input_col,
            col=s_function.regexp_replace(
                s_function.col(input_col), r"[\w.-]+@[\w.-]+\.\w+", ""))
        # Remove digits
        sdf_temp = sdf_temp.withColumn(
            colName=input_col,
            col=s_function.regexp_replace(
                s_function.col(input_col), r"\d", ""))
        # Remove runs of characters that are not word characters, except for
        # commas (,), since we still want to split on commas (,)
        sdf_temp = sdf_temp.withColumn(
            colName=input_col,
            col=s_function.regexp_replace(
                s_function.col(input_col), r"[^a-zA-Z0-9,]+", " "))
        # Split the affiliation string based on a comma
        sdf_temp = sdf_temp.withColumn(
            colName=output_col,
            col=s_function.split(sdf_temp[input_col], ","))

        return sdf_temp


    if __name__ == "__main__":
        # Sample data
        a_1 = "Department of Bone and Joint Surgery,Ehime University Graduate"\
              " School of Medicine,Shitsukawa,Toon 791-0295,Ehime,Japan."\
              " shinyama@m.ehime-u.ac.jp."
        a_2 = "Stroke Pharmacogenomics and Genetics,Fundació Docència i Recerca"\
              " Mútua Terrassa,Hospital Mútua de Terrassa,08221 Terrassa,Spain."
        a_3 = "Neurovascular Research Laboratory,Vall d'Hebron Institute of Research,"\
              " Hospital Vall d'Hebron,08035 Barcelona,Spain;catycarrerav@gmail.com"\
              " (C.C.). catycarrerav@gmail.com."
        data = [(1, a_1), (2, a_2), (3, a_3)]

        spark = SparkSession\
            .builder\
            .master("local[*]")\
            .appName("My_test")\
            .config("spark.ui.port", "37822")\
            .getOrCreate()
        sc = spark.sparkContext
        sc.setLogLevel("WARN")

        af_data = spark.createDataFrame(data, ["index", "text"])
        sdf_tokens = tokenize(af_data)
        sdf_tokens.select("tokens").show(truncate=False)

which yields:

    |[Department of Bone and Joint Surgery, Ehime University Graduate School of Medicine, Shitsukawa, Toon , Ehime, Japan ]|
    |[Stroke Pharmacogenomics and Genetics, Fundaci Doc ncia i Recerca M tua Terrassa, Hospital M tua de Terrassa, Terrassa, Spain ]|
    |[Neurovascular Research Laboratory, Vall d Hebron Institute of Research, Hospital Vall d Hebron, Barcelona, Spain C C ]|

Desired output:

    |[Department of Bone and Joint Surgery, Ehime University Graduate School of Medicine, Shitsukawa, Toon, Ehime, Japan]|
    |[Stroke Pharmacogenomics and Genetics, Fundaci Doc ncia i Recerca M tua Terrassa, Hospital M tua de Terrassa, Terrassa, Spain]|
    |[Neurovascular Research Laboratory, Vall d Hebron Institute of Research, Hospital Vall d Hebron, Barcelona, Spain C C]|

So in row 1: 'Toon ' -> 'Toon' and 'Japan ' -> 'Japan'.

Note that the trailing spaces do not appear only in the last element of the list; they can appear in any element.

Solutions
UPDATE
The original solution below does not work, because trim only operates on the beginning and end of the whole string, while you need it to act on each token. @PatrickArtner's solution works (a sketch of that per-token idea follows the output below), but another approach is to use RegexTokenizer.

Here is an example of how to modify the tokenize() function:

    from pyspark.ml.feature import RegexTokenizer

    def tokenize(sdf, input_col="text", output_col="tokens"):
        # Remove email addresses
        sdf_temp = sdf.withColumn(
            colName=input_col,
            col=s_function.regexp_replace(
                s_function.col(input_col), r"[\w.-]+@[\w.-]+\.\w+", ""))
        # Remove digits
        sdf_temp = sdf_temp.withColumn(
            colName=input_col,
            col=s_function.regexp_replace(
                s_function.col(input_col), r"\d", ""))
        # Remove runs of characters that are not word characters, except commas
        sdf_temp = sdf_temp.withColumn(
            colName=input_col,
            col=s_function.regexp_replace(
                s_function.col(input_col), r"[^a-zA-Z0-9,]+", " "))
        # Call trim to remove any trailing (or leading) spaces
        sdf_temp = sdf_temp.withColumn(
            colName=input_col,
            col=s_function.trim(sdf_temp[input_col]))
        # Use RegexTokenizer to split on commas optionally surrounded
        # by whitespace
        myTokenizer = RegexTokenizer(
            inputCol=input_col, outputCol=output_col, pattern="( +)?,( +)?")
        sdf_temp = myTokenizer.transform(sdf_temp)
        return sdf_temp

Basically, trim is called on the string to handle any leading or trailing spaces, and the split is then done by RegexTokenizer using the pattern "( +)?,( +)?":

- ( +)? : matches between zero and unlimited spaces
- , : matches the comma on which we split

Here is the output:

    sdf_tokens.select("tokens", s_function.size("tokens").alias("size")).show(truncate=False)

You can see that the lengths of the arrays (the token counts) are now correct, but all tokens are lowercase (because that is what Tokenizer and RegexTokenizer do).

    +------------------------------------------------------------------------------------------------------------------------------+----+
    |tokens                                                                                                                        |size|
    +------------------------------------------------------------------------------------------------------------------------------+----+
    |[department of bone and joint surgery, ehime university graduate school of medicine, shitsukawa, toon, ehime, japan]         |6   |
    |[stroke pharmacogenomics and genetics, fundaci doc ncia i recerca m tua terrassa, hospital m tua de terrassa, terrassa, spain]|5   |
    |[neurovascular research laboratory, vall d hebron institute of research, hospital vall d hebron, barcelona, spain c c]       |5   |
    +------------------------------------------------------------------------------------------------------------------------------+----+
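For reference, the per-token idea mentioned above can also be done without RegexTokenizer. The following is only a minimal sketch of that approach under my own assumptions (the helper name trim_tokens and the lambda are mine, not code from @PatrickArtner's answer): it strips leading and trailing whitespace from every element of the tokens array that the original tokenize() produced.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    # Strip leading/trailing whitespace from each element of the array.
    # trim_tokens is a hypothetical name, not from the linked answer.
    trim_tokens = udf(lambda tokens: [t.strip() for t in tokens],
                      ArrayType(StringType()))

    sdf_tokens = sdf_tokens.withColumn("tokens", trim_tokens("tokens"))

Unlike the RegexTokenizer route, this keeps the original casing of the tokens, at the cost of a Python round trip for every row.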
Original answer

As long as you are using Spark 1.5 or later, you can use pyspark.sql.functions.trim(), which trims the spaces from both ends of the specified string column. So one way would be to add:

    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.trim(sdf_temp[input_col]))

at the end of the tokenize() function. But you may want to look at RegexTokenizer as well (see the update above).
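Not part of the answers above, but if you happen to be on Spark 2.4 or later, the transform higher-order function can trim each array element directly in SQL, avoiding both a Python UDF and the lowercasing done by RegexTokenizer. A minimal sketch, assuming the tokens column produced by tokenize():

    import pyspark.sql.functions as s_function

    # Assumes Spark 2.4+: apply trim() to every element of the tokens
    # array using the TRANSFORM higher-order function.
    sdf_tokens = sdf_tokens.withColumn(
        "tokens",
        s_function.expr("transform(tokens, t -> trim(t))"))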