[Translation] Trends in Big Data Processing: An Introduction to Five Open Source Technologies
Author: Yang Xinqi (杨鑫奇)

This post is a translation of an article that takes a forward-looking view of technologies in the big data field. I think the original is well written and the technologies it recommends are fairly representative, so I translated it for anyone who is interested. I have not been working with big data processing for long myself, and my first real project is still under development, but the field has drawn me in, which is what prompted me to write this.

Original article: http://techcrunch.com/2012/10/27/big-data-right-now-five-trendy-open-source-technologies/

But here’s where it gets interesting. Those initial investments will in turn trigger a domino effect of upgrades and new initiatives that are valued at $34 billion for 2013, per Gartner. Over a 5 year period, spend is estimated at $232 billion. What you’re seeing right now is only the tip of a gigantic iceberg.

Did you know that there are over 250K viable open source technologies on the market today? Innovation is all around us. The increasing complexity of systems, in fact, looks something like this:
We have a lot of…choices, to say the least. What’s on our own radar, and what’s coming down the pipe for Fortune 2000 companies? What new projects are the most viable candidates for production-grade usage? Which deserve your undivided attention?

We did all the research and testing so you don’t have to. Let’s look at five new technologies that are shaking things up in Big Data. Here is the newest class of tools that you can’t afford to overlook, coming soon to an enterprise near you.

STORM AND KAFKA

Born inside of Twitter, Storm is a “distributed real-time computation system”. Storm does for real-time processing what Hadoop did for batch processing. Kafka for its part is a messaging system developed at LinkedIn to serve as the foundation for their activity stream and the data processing pipeline behind it.

When paired together, you get the stream, you get it in real time, and you get it at linear scale.

Why should you care?

Stream processing solutions like Storm and Kafka have caught the attention of many enterprises due to their superior approach to ETL (extract, transform, load) and data integration. Storm and Kafka are also great at in-memory analytics and real-time decision support. Companies are quickly realizing that batch processing in Hadoop does not support real-time business needs. Real-time streaming analytics is a must-have component in any enterprise Big Data solution or stack, because of how elegantly it handles the “three V’s”: volume, velocity and variety.

Storm and Kafka are the two technologies on the list that we’re most committed to at Infochimps, and it is reasonable to expect that they’ll be a formal part of our platform soon.

DRILL AND DREMEL

Drill and Dremel put power in the hands of business analysts, and not just data engineers. The business side of the house will love Drill and Dremel.
Drill is the open source version of what Google is doing with Dremel (Google also offers Dremel-as-a-Service with its BigQuery offering). Companies are going to want to make the tool their own, which is why Drill is the thing to watch most closely. Although it’s not quite there yet, strong interest by the development community is helping the tool mature rapidly.

The Hadoop ecosystem worked very hard to make MapReduce an approachable tool for ad hoc analyses. From Sawzall to Pig and Hive, many interface layers have been built on top of Hadoop to make it more friendly and business-accessible. Yet, for all of the SQL-like familiarity, these abstraction layers ignore one fundamental reality: MapReduce (and thereby Hadoop) is purpose-built for organized data processing (read: running jobs, or “workflows”).

What if you’re not worried about running jobs? What if you’re more concerned with asking questions and getting answers, slicing and dicing, looking for insights? That’s “ad hoc exploration” in a nutshell: if you assume data that’s been processed already, how can you optimize for speed? You shouldn’t have to run a new job and wait, sometimes for considerable lengths of time, every time you want to ask a new question.

In stark contrast to workflow-based methodology, most business-driven BI and analytics queries are fundamentally ad hoc, interactive, low-latency analyses. Writing MapReduce workflows is prohibitive for many business analysts. Waiting minutes for jobs to start and hours for workflows to complete is not conducive to an interactive experience of data: the comparing and contrasting, and the zooming in and out, that ultimately create fundamentally new insights.

Some data scientists even speculate that Drill and Dremel may actually be better than Hadoop in the wider sense, and even a potential replacement. That’s a little too edgy a stance to embrace right now, but there is merit in an approach to analytics that is more query-oriented and low latency.
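To make the contrast with job-based workflows concrete, here is a minimal Python sketch of the ad hoc style described above: the data is loaded once, and each new question is just another query against it. It uses Python's built-in sqlite3 module purely as a small stand-in; it is not Drill or Dremel and says nothing about their APIs or their scale. The table, the column names, and the figures are all made up for illustration.

```python
import sqlite3

# Hypothetical, already-processed event data (all names and values are invented).
EVENTS = [
    ("2012-10-01", "web", "US", 120),
    ("2012-10-01", "mobile", "US", 80),
    ("2012-10-01", "web", "DE", 45),
    ("2012-10-02", "web", "US", 150),
    ("2012-10-02", "mobile", "DE", 30),
]


def load_events(rows):
    """Load the rows once into an in-memory table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (day TEXT, channel TEXT, country TEXT, visits INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    return conn


def ask(conn, sql, params=()):
    """Run one ad hoc question against the loaded data and return all rows."""
    return conn.execute(sql, params).fetchall()


conn = load_events(EVENTS)

# First question: total visits per channel.
by_channel = ask(
    conn,
    "SELECT channel, SUM(visits) FROM events GROUP BY channel ORDER BY channel",
)

# Follow-up question on the same data, with no new job to schedule:
# how do web visits break down by country?
by_country_web = ask(
    conn,
    "SELECT country, SUM(visits) FROM events WHERE channel = ? "
    "GROUP BY country ORDER BY country",
    ("web",),
)

print(by_channel)      # [('mobile', 110), ('web', 315)]
print(by_country_web)  # [('DE', 45), ('US', 270)]
```

The point of the sketch is only the interaction pattern: the second question reuses data that is already in place, which is the experience the article argues business analysts need at Big Data scale.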
At Infochimps we like the Elasticsearch full-text search engine and database for doing high-level data exploration, but for truly capable Big Data querying at the (relative) seat level, we think that Drill will become the de facto solution.

R

R performs complex data science at a much smaller price (both literally and figuratively). R is making serious headway in ousting SAS and SPSS from their thrones, and has become the tool of choice for the world’s best statisticians (and data scientists, and analysts too).

Why should you care?

Also, R works very well with Hadoop, making it an ideal part of an integrated Big Data approach.

GREMLIN AND GIRAPH

Gremlin and Giraph help empower graph analysis, and are often used coupled with graph databases like Neo4j or InfiniteGraph, or in the case of Giraph, working with Hadoop. Golden Orb is another high-profile example of a graph-based project picking up steam.

The common analogue for graph-based approaches is Google’s Pregel, of which Gremlin and Giraph are open source alternatives. In fact, here’s a great read on how mimicry of Google technologies is a cottage industry unto itself.

Why should you care?

Graph-based approaches are a killer app, so to speak, for anything that involves large networks with many nodes, and many linked pathways between those nodes.

Big picture, graph databases and analysis languages and frameworks are a great illustration of how the world is starting to realize that Big Data is not about having one database or one programming framework that accomplishes everything. The most innovative scientists and engineers know to apply the right tool for each job, making sure everything plays nice and can talk to each other (the glue in this sense becomes the core competence).

SAP HANA

Hana greatly benefits any application with unusually fast processing needs, such as financial modeling and decision support, website personalization, and fraud detection, among many other use cases.
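The biggest drawback of Hana is that “in-memory” means that it by definition leverages access to solid state memory, which has clear advantages, but is much more expensive than conventional disk storage. For organizations that don’t mind the added operational cost, Hana means incredible speed for very low-latency big data processing.

HONORABLE MENTION: D3

D3 is a JavaScript document visualization library that revolutionizes how powerfully and creatively we can visualize information, and make data truly interactive. It was created by Michael Bostock and came out of his work at the New York Times, where he is the Graphics Editor. With D3, programmers can create dashboards galore. Organizations of all sizes are quickly embracing D3 as a superior visualization platform to the heads-up displays of yesteryear.

Editor’s note: Tim Gasper is the Product Manager at Infochimps, the #1 Big Data platform in the cloud. He leads product marketing, product development, and customer discovery. Previously, he was co-founder and CMO at Keepstream, a social media curation and analytics company that Infochimps acquired in August of 2010. You should follow him on Twitter here.

It has been almost a year since I started using Hadoop in earnest. In that time I left Baidu, moved to Chujian (初见), and then to my current company, BitWare, solving problems with different technologies at each one. Fundamentally, though, the problems I run into are always the same few. Of course, many companies are now starting to try out Hadoop as well; that is simply the broader climate, and it is understandable.

Here are my own thoughts on the article:

Drill is an Apache open source project. I read Google's Dremel paper a while ago but, unfortunately, did not understand much of it. I have not yet run into a setting that calls for it, and the community has only just started to take off, so I have not had much time to follow it and have set it aside for now.

As for R, when I was at Baidu the colleagues sitting next to me did their work in R. This is probably an area that only large companies have the capacity to really dig into. Our current business has hardly used it, so R is still unfamiliar to me. Still, I personally think that different settings call for different technical means. The PhDs use sound, light, and electricity to blow the boxes away, while we just set up an electric fan, and the result is the same.

As for graph databases, I really have not come across a concrete application, and I have not had the chance to work at a company that uses them, so I will leave that on the shelf for now.

As for SAP, I have heard of the company but have never worked with its products. Selling solutions is probably not an easy business these days, so putting out something new to raise its profile is a must. The days of living off past achievements are over.

The last one, the JS visualization library, does not interest me much. Our business no longer does front-end work, so it is not a concern for me.

(Editor: Li Datong)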