Reading XML with Pig
Published: 2020-12-16 07:57:41 | Category: Encyclopedia | Source: compiled from the web
I am trying to read data from an XML file using Pig, but my output is incomplete.
Input file:

    <document>
      <url>htp://www.abc.com/</url>
      <category>Sports</category>
      <usercount>120</usercount>
      <reviews>
        <review>good site</review>
        <review>This is Avg site</review>
        <review>Bad site</review>
      </reviews>
    </document>

The code I am using:

    register 'Desktop/piggybank-0.11.0.jar';
    A = load 'input3' using org.apache.pig.piggybank.storage.XMLLoader('document') as (data:chararray);
    B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(data,'(?s)<document>.*?<url>([^>]*?)</url>.*?<category>([^>]*?)</category>.*?<usercount>([^>]*?)</usercount>.*?<reviews>.*?<review>\\s*([^>]*?)\\s*</review>.*?</reviews>.*?</document>')) as (url:chararray,catergory:chararray,usercount:int,review:chararray);

The output I get:

    (htp://www.abc.com/,Sports,120,good site)

This output is incomplete: only the first review comes back. Can someone tell me what I am missing?
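Why only one review comes back can be reproduced outside Pig. The sketch below is Python, not Pig, and assumes that Python's `re` behaves like the Java regex engine behind REGEX_EXTRACT_ALL for this pattern, and that REGEX_EXTRACT_ALL returns the capture groups of a single whole-string match. The names `extract_once` and `extract_reviews` are made up for illustration.

```python
import re

# The sample record from the question, roughly as XMLLoader('document') would deliver it.
DATA = (
    "<document> <url>htp://www.abc.com/</url> <category>Sports</category> "
    "<usercount>120</usercount> <reviews> <review>good site</review> "
    "<review>This is Avg site</review> <review>Bad site</review> </reviews> </document>"
)

# Same pattern as in the question: four capture groups, one of them for <review>.
DOC_PATTERN = re.compile(
    r"(?s)<document>.*?<url>([^>]*?)</url>.*?<category>([^>]*?)</category>"
    r".*?<usercount>([^>]*?)</usercount>.*?<reviews>.*?<review>\s*([^>]*?)\s*</review>"
    r".*?</reviews>.*?</document>"
)

# A pattern that targets a single <review> element and can be applied repeatedly.
REVIEW_PATTERN = re.compile(r"<review>\s*([^>]*?)\s*</review>")


def extract_once(record):
    """One whole-record match: each capture group yields exactly one value."""
    match = DOC_PATTERN.fullmatch(record)
    return match.groups() if match else None


def extract_reviews(record):
    """Scan the record repeatedly, collecting every <review> body."""
    return REVIEW_PATTERN.findall(record)


if __name__ == "__main__":
    print(extract_once(DATA))
    print(extract_reviews(DATA))
```

The pattern matches the record once, so its single review group can hold only one value, and the lazy `.*?` makes that value the first review. Getting all three reviews needs a second pass that matches each `<review>` element on its own.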
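Whew! I finally got it working using CROSS. I am using XPath here, but you can use a regular expression if you prefer. I find XPath simpler and cleaner than the regex, and I think you will see that too. Don't forget to replace testXML.xml with your own XML file.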
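What CROSS contributes can be pictured with a small Python sketch (again not Pig). The relation names B, D and E mirror the aliases used in this answer, the helper `cross` is a hypothetical stand-in for Pig's CROSS operator, and the tuples are the values from the sample input.

```python
from itertools import product


def cross(left, right):
    """Pair every tuple in `left` with every tuple in `right` and concatenate them."""
    return [l + r for l, r in product(left, right)]


# One row per <document>: url, category, usercount.
B = [("htp://www.abc.com/", "Sports", "120")]

# One row per <review> element.
D = [("good site",), ("This is Avg site",), ("Bad site",)]

E = cross(B, D)

if __name__ == "__main__":
    for row in E:
        print(row)
```

Each document row is repeated once per review, which is the shape the question asks for. Note that CROSS pairs every row with every row, so with more than one `<document>` in the file each document would also be paired with reviews that belong to other documents. The approach fits the single-document input shown in the question.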
The XPath way:

    DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
    A = LOAD 'testXML.xml' using org.apache.pig.piggybank.storage.XMLLoader('document') as (x:chararray);
    B = FOREACH A GENERATE XPath(x,'document/url'),XPath(x,'document/category'),XPath(x,'document/usercount');
    C = LOAD 'testXML.xml' using org.apache.pig.piggybank.storage.XMLLoader('review') as (review:chararray);
    D = FOREACH C GENERATE XPath(review,'review');
    E = cross B,D;
    dump E;

The regular expression way:

    A = LOAD 'testXML.xml' using org.apache.pig.piggybank.storage.XMLLoader('document') as (x:chararray);
    B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'(?s)<document>.*?<url>([^>]*?)</url>.*?<category>([^>]*?)</category>.*?<usercount>([^>]*?)</usercount>.*?</document>')) as (url:chararray,category:chararray,usercount:int);
    C = LOAD 'testXML.xml' using org.apache.pig.piggybank.storage.XMLLoader('review') as (review:chararray);
    D = FOREACH C GENERATE FLATTEN(REGEX_EXTRACT_ALL(review,'<review>([^>]*?)</review>'));
    E = cross B,D;
    dump E;

Output:

    (htp://www.abc.com/,Sports,120,Bad site)
    (htp://www.abc.com/,Sports,120,This is Avg site)
    (htp://www.abc.com/,Sports,120,good site)

Isn't this what you were expecting?

(Editor: 李大同)