In response to a number of requests from folks outside the Bay Area to have us record and post the Hadoop User Group presentations, here are the talks from the October meeting which was held this week at the Yahoo! Mission College campus. We had Jun Rao from IBM Almaden Research talk about “Exploiting database join techniques for analytics with Hadoop”. This was followed by an update on Jaql by Kevin Beyer from IBM, who informed us that Jaql is now available as Open Source. The last talk was a lively discussion with Sriram Rao from Quantcast about his “Experiences moving a Petabyte Data Center”. Bay Area Hadoop User Group meetings are usually held on the third Wednesday of each month at Yahoo! Mission College in Santa Clara.

Ajay Anand, Yahoo! Grid Computing

My notes on the talks from the October meeting:

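The talk started with the basic use of MapReduce, using Facebook as the example. After users interact with Facebook, two data sets are left behind: the user table (basic user data such as id, age, gender, country) and the click-through log. The two are combined and then mined with MapReduce. For example, you can track a single user in detail (which pages they visit at what times), or find out which pages people from a given country are interested in, and so on -_-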

Shortcomings of traditional MapReduce for this kind of data mining

Deficiency in Repartition Join

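Major: 1. Has to sort log 2. Has to move log across network (my understanding: 1. the log has to be sorted; 2. the log has to be pulled off one or more machines and redistributed to the individual clients)

Minor: 1. Popular key problem (skew) 2. Tagging overhead (a very popular key cannot be handled well by the sort step; too many tags?? cannot be processed?? I do not really understand the tagging overhead part -_-!!)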
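To make the sort, network, and tagging costs a bit more concrete, here is a minimal single-process sketch of a repartition join in Python. It is only my reading of the slide, not the speaker's code: the sample rows, the "U"/"L" tags, and the function names are all made up for illustration, and the shuffle() function merely stands in for Hadoop's sort-and-shuffle step.

```python
from collections import defaultdict

# Hypothetical sample data; the field layout is an assumption for illustration.
users = [(1, "US"), (2, "JP")]
log = [(1, "/home"), (2, "/news"), (1, "/photos")]

def map_phase(users, log):
    # Every record is tagged with its source table (the tagging overhead).
    for user_id, country in users:
        yield user_id, ("U", country)
    for user_id, url in log:
        yield user_id, ("L", url)

def shuffle(pairs):
    # Stand-in for the framework's sort-and-shuffle: the whole log is
    # grouped by key here, which on a cluster means sorting it and
    # moving it across the network.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    # Split the tagged records apart again, then pair log rows with the user row.
    user_rows = [v for tag, v in values if tag == "U"]
    log_rows = [v for tag, v in values if tag == "L"]
    for country in user_rows:
        for url in log_rows:
            yield key, country, url

def repartition_join(users, log):
    out = []
    for key, values in shuffle(map_phase(users, log)):
        out.extend(reduce_phase(key, values))
    return out
```

In this sketch a very popular user ID would put most of the log into a single reduce_phase call, which is probably what the skew problem refers to.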

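A database-style solution

To avoid the shortcomings of traditional MapReduce listed above:

1. Avoid sorting

2. Use the network as little as possible (put the other way around: the better the network, the better the processing performance, but since the network you can realistically get today is 1000M (gigabit), cutting network usage would still be a real win -_-!)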

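To solve the two problems above, the goals are:

1. Add MapReduce capabilities to the DB

2. Build a matching MapReduce framework that preserves fault tolerance, load balancing, and so on

The talk then worked through the solution step by step.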

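First, the mapper: init() builds a hash table keyed on user ID, while mapper() only reads the data and does no sorting. All data is read from HDFS.

Pluses: 1. No sorting 2. All data goes from HDFS to HDFS. Hmm, it looks like my earlier understanding of the network reduction was wrong??? Not sure yet???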

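Minuses, if the user data is large:

1. The complete user data has to be copied to every mapper

2. A hash table over the complete user data has to be built in every mapper

<<<< Question: what is the point and purpose of building the hash table?? >>>>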
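One possible answer to the hash table question: it lets each mapper look up the user row for a log record in constant time, so the join can happen inside the map step with no sort and no reduce. Below is a small Python sketch of that idea. The class name, the field layout, and the run() driver are my own assumptions; in real Hadoop the user table would be read from HDFS in the mapper's setup step.

```python
class BroadcastJoinMapper:
    def __init__(self, user_rows):
        # init(): load the complete user table into an in-memory hash table
        # keyed by user ID. Every mapper holds its own full copy, which is
        # exactly the cost listed under Minuses.
        self.users_by_id = {user_id: country for user_id, country in user_rows}

    def map(self, log_record):
        # mapper(): read one log record and probe the hash table; no sorting.
        user_id, url = log_record
        country = self.users_by_id.get(user_id)
        if country is not None:
            yield user_id, country, url

def run(user_rows, log_split):
    # Hypothetical driver: one mapper processing one split of the log.
    mapper = BroadcastJoinMapper(user_rows)
    return [row for record in log_split for row in mapper.map(record)]
```

The output keeps the order of the log split, since nothing is sorted along the way.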

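How to deal with user data that is too large?

It seems the point here is not that the user data is too large, but that the log data is too large, hmm

So the talk pointed out: Log may reference a small fraction of all users

Solution: slim down the user data. The method: semi-join (hmm, with a name like that, was it invented by someone Japanese?)

<<<<< For an introduction to semi-join see http://en.wikipedia.org/wiki/Relational_algebra; the three example figures there make it quite clear... >>>>>

How semi-join is used in this talk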

Phase 1: Extract
-extract unique user IDs referenced in Log

Phase 2: Filter
-filter User data with referenced user IDs

Phase 3: Join
-join Log with filtered User data
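The three phases above can be sketched in a few lines of plain Python. This is just my illustration on in-memory lists, with made-up function names and sample fields; in the talk each phase runs over data in HDFS.

```python
def extract_user_ids(log):
    # Phase 1: unique user IDs referenced in the log.
    return {user_id for user_id, _url in log}

def filter_users(users, referenced_ids):
    # Phase 2: keep only the user rows whose ID appears in the log.
    return [(uid, country) for uid, country in users if uid in referenced_ids]

def join(log, filtered_users):
    # Phase 3: join the log against the (now much smaller) user data.
    lookup = dict(filtered_users)
    return [(uid, lookup[uid], url) for uid, url in log if uid in lookup]

def semi_join(users, log):
    referenced = extract_user_ids(log)
    small_users = filter_users(users, referenced)
    return small_users, join(log, small_users)
```

With four users of which the log mentions only two, small_users has two rows, and that smaller table is what would get copied to every mapper.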

hmmm

Next, the talk described each phase in detail.

Phase 1 extract unique user IDs referenced in Log

A map-reduce job
-Mapper: Extract user IDs from log records
-Reducer: Accumulate all unique user IDs
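As a rough sketch of Phase 1 as a map-reduce job, here is a single-process Python simulation. The mapper and reducer signatures and the run_extract_job() driver are assumptions of mine, and the sort plus groupby only imitates what the framework does between map and reduce; the reducer in the talk may well accumulate the IDs differently.

```python
from itertools import groupby

def mapper(log_record):
    # Emit only the user ID; the rest of the log record is not needed here.
    user_id, _url = log_record
    yield user_id, None

def reducer(user_id, _values):
    # All duplicates of a user ID arrive in one group, so emit it once.
    yield user_id

def run_extract_job(log):
    # Hypothetical driver imitating map, shuffle (sort by key), and reduce.
    pairs = [kv for record in log for kv in mapper(record)]
    pairs.sort(key=lambda kv: kv[0])
    unique_ids = []
    for user_id, group in groupby(pairs, key=lambda kv: kv[0]):
        unique_ids.extend(reducer(user_id, [v for _k, v in group]))
    return unique_ids
```

Only bare user IDs are shuffled in this job, not whole log records, so it moves far less data than sending the full log across the network.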

I got this far and then the site went down... hmm, I will continue once it is back up, hmm
