September 30, 2004

Welcome Hang Chen!

This afternoon Hang Chen phoned me while I was in my Philosophy of Science and Technology class. He told me that he will be sent to Harbin for training on Oct 6.

Great! Hang Chen was one of my undergraduate classmates. He now works at Huawei. Last time we talked, he said he would be sent to several places for training. Last month he was in Shenyang. Unexpectedly, he will be in Harbin in October. Good. We can get together again!

September 29, 2004

Hidden Markov Model

The Hidden Markov Model is very famous. I first heard about it in Simply's presentation last year, but I did not fully understand it until this afternoon.

These days I have been studying the chapter on Markov models in Pattern Classification. This afternoon, in the Natural Language Processing class, Dr. Yi Guan introduced Markov models to us.

Just now I reviewed the material in Pattern Classification and Dr. Yi Guan's teaching materials. I think I now understand this model. Markov was so great.

September 28, 2004

Useful English Phrase

roger

interj.
Used especially in radio communications to indicate receipt of a message.
知道了,已收到了 (Chinese gloss: a reply used especially in radio communications to indicate that a message has been received)


Mid-Autumn Day!

Best wishes to all my friends! Happy Mid-Autumn Day!

On this traditional Chinese festival, I had dinner with my roommates this evening. It was the first time we had all gotten together. We were happy.

September 27, 2004

My birthday!

I am twenty-three years old today! I was born on this day twenty-three years ago. Thanks to my parents, who gave me life and brought me up.

I was very happy today! Lots of friends sent their blessings and good wishes: my parents, FangWang, Simply, Taozi, Zsq, Carl, Cr999, Heng Cao, Lou, Shuting Wang, Jinzhong Chen, Fan Xu, Heqing Rao, Ligang Long, Boqiang Luo, Chenlin Luo, Yongkun Zhang, Wenji Song, Qi Wang, Hang Chen, Xianjun Li, Xin Zhang, Qingren He, Xuan Hu, Yin Xiao, Kai Zhao, Jun Li, Jingwei Liu, Zhengjing Tang, Hao Wang, and so on. I was so happy!

I was very happy today! This afternoon I went to the interview for the continuous academic project that involves postgraduate and doctoral study. My doctoral study plan starts today, which makes this birthday special. I love my choice. For my doctoral degree and for my ideals, I should study harder, think more about my research direction, and learn more.

I was very happy today! I met a wonderful girl. She was so nice. It was my pleasure to have dinner with her.

Such a happy day!

A new year of my life begins. I can do better.

September 26, 2004

Next plan for Summarization Evaluation

This afternoon, in our lab's colloquium, I presented the summarization evaluation task, the work finished so far, and the next plan. Dr.Tliu and other senior students gave me lots of suggestions.

My task is very important for the Summarization project. I should try my best to achieve this goal.

Some useful translations

A AA制 Dutch treatment; to go Dutch
B 毕业答辩 thesis/dissertation defence 毕业设计 final project
博士生 a PhD Candidate 报销 to apply for reimbursement
博导 PhD student supervisor 班主任 class tutor
必修/选修课 compulsory/optional courses/modules
辩论队 debate team 辩论赛 debate contest 本命年 one's own Chinese zodiac year
C 成就感 sense of accomplishments/achievements
D 第三产业 the tertiary industry 导师 tutor, supervisor
独立思考能力 capacity for independent thinking 党支部 Party branch
党支部书记 Party branch secretary 调研 research; survey
E 厄尔尼诺现象 El Nino phenomenon 二等奖 the second prize
F 附中 affiliated (high or junior etc) school of ....
附件(email): attachment 房地产 real estate
G 公务员 civil servant (工作)单位 work unit
工学学士/硕士 Bachelor/Master of Science (B.S & M.S)
高考 National College Entrance Examination
国家重点实验室 state key laboratory
股份制 shareholding system; joint-stock system
股份有限公司 Co. Ltd; company/corporation limited: limited corporation
H 户口簿 residence booklet; household register; household registration booklet
获六级证书 obtain a certificate of CET-6
J 甲方乙方 Party A and Party B 基础设施 infrastructure
敬业精神 professional dedication; professional ethics
讲师 lecturer 高级讲师 senior lecturer 技术支持 technical support
精神文明建设 ideological and ethical progress
机电一体化 Electromechanical Integration
激烈的竞争 intense/fierce/bitter competition
九五攻关 The 9th 5-year plan 竞争力 competitiveness
K 可持续发展 sustainable development
考研 take part in the entrance exams for postgraduate schools
课代表 subject representative
L 理论联系实际 to link theory with practice
论文答辩 thesis defence 劳动密集型 labour-intensive
联系方式 contacts;contact details; how to contact;
M 民工 migrant workers/labourers 满分 full mark 面试 interview
P 平面设计 graphic design
Q 全职 full-time
R 人才 talent; talented people 理念 philosophy; value; doctrine
入世 China's accession to the WTO; China joins the WTO

S 三个代表(论) the Three Represents (Theory) 三等奖 the third prize
双刃剑 double-edged sword 上网 to get on the internet
适者生存 survival of the fittest 私营经济 private sector
事业单位 public institution 私/民营企业 private enterprise
三好学生 merit student; three good student(good in study, attitude and health)
师兄 no exact English equivalent; can be rendered as 'junior or senior (fellow) schoolmate/student'
双赢(局面) win-win; a win-win situation 实习 internship 实习生 intern
双学位 double degree/dual degree 手机短信 SMS/short message/instant message
上市 to go public; to be listed (in the stock market)
市场营销(活动) marketing (activities)
硕博联读 a continuous academic project that involves postgraduate and
doctoral study; a PhD programme
水平一/二 English Proficiency Test I/II (of Tsinghua University)
社会实践 social practice
社会实践优秀个人 excellent individual in social practice

T 团队精神 esprit de corps OR team spirit 特此证明 this is to certify that.
团支部书记 League branch secretary 团委 the Youth League committee
特等奖学金 top class/level scholarship
通过大学四级考试 pass the College English Test Band 4

W 物业管理 asset management, property management
物流 logistics
外联部 liaison department (a small one is called an "office")
In a company, the 外联部 is usually PR: Public Relations Division/Department

X 性价比 cost performance 学术交流 academic exchange
信息化 adj and n. information v. informatise/informationise
n. informatisation/informationisation
选修课 optional/selective courses/modules
学位课 degree course 学号 student number

Y 营销(学) marketing
优胜互补 (the two parties...) have complementary advantages
优胜劣汰,适者生存 survival of the fittest
院士 (see the 中科院 entry under Z)
与时俱进 to advance/progress with times 研究所 research institute
以人为本 people oriented; people foremost
研一生 first-year graduate student
一等奖学金 first class scholarship 一等奖 first prize
有限公司 limited company; Ltd.

Z
振兴xxx: to rejuvenate/revitalise xxx 准考证 admission ticket
知识经济 knowledge economy; knowledge-based economy
知识密集(性) knowledge-intensive
知识产权 intellectual property rights
中科院 the Chinese Academy of Sciences; Academia Sinica
(院士 member, academician)
中国工程院 the Chinese Academy of Engineering
正版 adj. authorised
综合国力 comprehensive national strength
政治面貌 political status
助教 teaching assistant (TA)
自强不息,厚德载物 Self-discipline and Social Commitment
自我评价 self-assessment; self-evaluation

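September 25, 2004

Cheng Junyi: "From the Three Kingdoms to Journey to the West: People-Oriented Management Wisdom in Traditional Chinese Culture"

This morning I was lucky enough to get into the third-floor hall of the main building and hear Cheng Junyi (成君忆) give his lecture "From the Three Kingdoms to Journey to the West: People-Oriented Management Wisdom in Traditional Chinese Culture" (《从三国到西游:中国传统文化中的人本管理智慧》).
The outline was as follows:

1. Six reasons for writing 《水煮三国》

2. Personality-centered human resource management

  a. What is personality?
  b. The story of the spider

3. Classification of personalities

  a. Cao Cao and Sun Wukong: outstanding representatives of the powerful type  Special topic: Sun Wukong's "Bimawen" (弼马瘟) effect
  b. Liu Bei and Zhu Bajie: typical examples of the lively type
  c. Zhuge Liang and Tang Seng: embodiments of the perfectionist type
  d. Sun Quan and Sha Heshang: excellent versions of the peaceful type  Special topic: the peaceful type's attitude to life has many seemingly contradictory aspects

4. Role play

  a. As an individual: a way of handling emotional problems  Fable: "Sun Wukong Wreaks Havoc at Wuzhuang Temple"
  b. As a team manager  Fable: "The Tiger Eats Grass Today"
    Special topic: why must Sun Wukong wear the golden headband?
  c. As a collaborator and competitor

5. Strengths and weaknesses of the different personality types

  a. Strengths and weaknesses of the powerful type
  b. Strengths and weaknesses of the lively type
  c. Strengths and weaknesses of the perfectionist type
  d. Strengths and weaknesses of the peaceful type

6. Personality changes in interpersonal conflict: several cases on how to handle interpersonal conflict

  ※ Tang Seng and Sun Wukong
  ※ Sun Wukong and Zhu Bajie
  ※ Liu Bei and Sun Shangxiang
  ※ Sun Quan and Sha Heshang

7. Advice for each personality type

   a. Advice for the powerful type
   b. Advice for the lively type
   c. Advice for the perfectionist type
   d. Advice for the peaceful type

8. Discussion of several open questions about Romance of the Three Kingdoms and Journey to the West: where did Sun Wukong come from?

  ※ The origin of the names of Tang Seng and his disciples
  ※ The real meaning of the eighty-one tribulations
  ※ Interpreting the Heart Sutra

9. Closing: gaining new insight by reviewing the old

Some of what he said made a lot of sense to me. For example: if you always pick at a person's faults and shortcomings, you will find that person more and more annoying; this is an objective fact. Conversely, if you always appreciate a person's merits and strengths, you will increasingly feel that the person is worth having as a friend; this too is something observed in reality. If we always look at the world with a sunny heart, what we observe will always be beautiful.

Another point: why is it that, as two people who admire each other get to know each other better, each discovers this or that shortcoming in the other and, over time, may find the other more and more annoying? The deeper the understanding, the more this problem appears. This is also why conflicts arise between lovers or between husband and wife. It does not matter that problems arise; what matters is that both sides take a positive and responsible attitude toward solving them, rather than dealing with them passively.

We should approach history with a sunny attitude; only then will its breadth and depth unfold before us. If we always dwell on the dark side of society and on its regressions, our own state of mind will in time grow darker and darker.


So far Cheng Junyi has written two books, 《水煮三国》 and 《孙悟空是个好员工》 ("Sun Wukong Is a Good Employee"). These two books fully reflect his views, and I really should read them when I have time.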

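September 24, 2004

A feast of intelligence science

On September 10-12, 2004, the Symposium on Major Issues in the Basic Theory of Intelligence Science and Technology (《智能科学技术基础理论重大问题研讨会》) was held at Yanshan University. It was sponsored by the Department of Information Sciences of the National Natural Science Foundation of China and organized by the Chinese Association for Artificial Intelligence and Yanshan University. More than 50 representatives from interdisciplinary areas including intelligence science, brain science, cognitive science, logic, and philosophy attended, and 27 invited talks were given.

Some of the talks are listed below, for reference only:

Li Yanda (李衍达): Some Thoughts on Intelligence Research (ppt)
Zhong Yixin (钟义信): Intelligence Science: Challenge of the Century, Opportunity of a Hundred Years (ppt)
Lu Ruqian (陆汝钤): Study Knowledge Science, Develop Knowledge Engineering, Advance the Knowledge Industry (ppt)

Shi Zhongzhi (史忠植): Basic Problems of Intelligence Science (ppt)
Wang Shoujue (王守觉): Biomimetic Pattern Recognition and Machine Imagery Thinking (ppt)
Guo Aike (郭爱克): Natural Computation of Decision-Making (ppt)
Li Deyi (李德毅): Artificial Intelligence with Uncertainty (ppt)
Xu Zhuoqun (许卓群): Web of Distributed Ontologies (ppt)
Wang Feiyao (王飞耀): A Computational Theory Framework for Computing with Words and Linguistic Dynamic Systems (ppt)
Zhou Zhihua (周志华): Pervasive Machine Learning (ppt)
Wang Jue (王珏): A Review of Machine Learning Research (ppt)
Lin Fangzhen (林方真): Many uses of classical logic (pdf)
He Huacan (何华灿): On the Logical Foundations of Generalized Intelligence Science (ppt)
Tong Tianxiang (童天湘): Intelligentization Is the Inevitable Trend of Informatization (doc)

After studying them one by one, I feel I have learned some new things. My notes are as follows: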

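1. The cloud model is a newly emerging theory [1]. Building on statistics and fuzzy mathematics, the cloud model gives a unified description of the randomness and fuzziness between linguistic atoms and numerical values. The forward cloud generator [2] is a model of the uncertain transformation between a basic concept described by a linguistic value and its numerical representation. The numerical characteristics of a cloud are three values: the expectation Ex, the entropy En, and the hyper-entropy He. The model integrates fuzziness and randomness completely and forms a mapping between the qualitative and the quantitative, which serves as the basis of knowledge representation. Because clouds in nature are also uncertain, the name "cloud" is borrowed for this transformation model between data and concepts. A cloud consists of many cloud drops; each drop is one point to which the qualitative concept is mapped in the numerical domain, that is, one concrete realization carrying uncertainty. The model also gives the degree of certainty with which the drop represents the concept, and it can generate any number of drops.

Conversely, the backward cloud model converts between numerical values and linguistic values at any time. A basic issue in data mining is that data come first and concepts are formed afterwards; continuous data quantities come first and discrete symbolic quantities afterwards.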
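
To make the forward cloud generator in item 1 concrete, here is a minimal Python sketch of the normal-cloud version as I understand it from the cloud model literature. For each drop, a perturbed entropy En' is drawn from a normal distribution with mean En and standard deviation He; the drop position x is then drawn from a normal distribution with mean Ex and standard deviation |En'|; its certainty degree is exp(-(x-Ex)^2 / (2 En'^2)). The function name, the seed parameter, and the sample values Ex=25, En=3, He=0.3 are illustrative assumptions, not taken from the slides.

```python
import math
import random


def forward_normal_cloud(ex, en, he, n, seed=None):
    """Generate n cloud drops (x, certainty) for a concept with
    expectation ex, entropy en, and hyper-entropy he.
    Sketch only: parameter names are my own."""
    rng = random.Random(seed)
    drops = []
    for _ in range(n):
        # Perturb the entropy: this is where the hyper-entropy enters.
        en_i = rng.gauss(en, he)
        # One concrete realization of the concept in the number domain.
        x = rng.gauss(ex, abs(en_i))
        if en_i == 0:
            mu = 1.0  # degenerate case: the drop sits exactly at ex
        else:
            # Certainty degree with which x represents the concept.
            mu = math.exp(-((x - ex) ** 2) / (2 * en_i ** 2))
        drops.append((x, mu))
    return drops


# Hypothetical concept "around 25" with made-up parameters.
drops = forward_normal_cloud(ex=25.0, en=3.0, he=0.3, n=1000, seed=1)
```

With He = 0 every drop falls exactly on the bell-shaped membership curve; a larger He makes the cloud thicker.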

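2. The central limit theorem states, in theory, the conditions under which the normal distribution arises. A simple, intuitive statement of it:
If the outcome of a random event is determined by the sum of a large number of small, independent random factors, and the individual effect of each factor is relatively uniformly small with no single factor dominating, then this random variable follows a normal distribution.

The normal distribution is the limiting distribution of many important probability distributions; many non-normal random variables are functions of normal random variables; and the density and distribution functions of the normal distribution have many nice properties and relatively simple mathematical forms. All this makes the normal distribution very widely used in theory and practice. While studying the mathematical foundations of pattern recognition [4] I learned: "Among all continuous probability density functions, if the mean u and the variance s (s is used here as a stand-in symbol) both take known fixed values, the one that maximizes entropy is the Gaussian (normal) distribution, and the maximum entropy is then H = 0.5 + log2(sqrt(2*pi*s)) (bits)." Entropy characterizes information content, and this maximum-entropy property of the normal distribution accounts for how widely it appears in nature.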
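
The maximum-entropy statement can be checked numerically. The sketch below compares the differential entropy of three densities with the same variance s, using the standard closed forms: 0.5*log2(2*pi*e*s) bits for the Gaussian, log2 of the width for the uniform density, and (1 + ln(2b)) nats for the Laplace density with scale b. The function names are my own. One caveat I am fairly sure of: the constant 0.5 in the quoted formula matches the natural-log form, H = 0.5 + ln(sqrt(2*pi*s)) nats; measured in bits, the constant becomes 0.5*log2(e).

```python
import math


def gaussian_entropy_bits(variance):
    # Differential entropy of N(u, variance): 0.5 * log2(2*pi*e*variance).
    return 0.5 * math.log2(2 * math.pi * math.e * variance)


def uniform_entropy_bits(variance):
    # A uniform density of width w has variance w^2 / 12 and entropy log2(w).
    width = math.sqrt(12 * variance)
    return math.log2(width)


def laplace_entropy_bits(variance):
    # A Laplace density with scale b has variance 2*b^2 and
    # entropy 1 + ln(2b) nats; divide by ln 2 to convert to bits.
    b = math.sqrt(variance / 2)
    return (1 + math.log(2 * b)) / math.log(2)


s = 1.0
entropies = {
    "gaussian": gaussian_entropy_bits(s),
    "laplace": laplace_entropy_bits(s),
    "uniform": uniform_entropy_bits(s),
}
largest = max(entropies, key=entropies.get)
```

For equal variance the Gaussian comes out on top, which is what the quoted property says; the other two densities are just convenient comparison points.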

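In fact, the individual effects of the various factors in the real world are not relatively uniformly small, and many random phenomena cannot be described by a normal distribution. If the factors determining a random phenomenon are not uniformly small in their individual effects and are not mutually independent but depend on one another to some degree, the conditions for a normal distribution are not met; the result is not a normal distribution, or can only be approximated by one. Probability theory handles such cases with joint distributions, but determining a joint probability distribution is usually very complicated and hard to apply in practice. Academician Li Deyi proposes describing this kind of randomness with the cloud model, extending the normal distribution to a "pan-normal" one and using a new independent parameter, the hyper-entropy, to measure the degree of deviation from the normal distribution. This treatment is more relaxed than using a pure normal conditional distribution, and at the same time simpler than a joint probability distribution and easy to represent and manipulate.

3. Artificial intelligence with uncertainty takes the natural language level as its entry point for studying human cognitive activity. This is undoubtedly an affirmation of natural language processing research, and it gives the field confidence.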

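4. Machine learning research now faces many opportunities and challenges. Here are a few examples from medicine and finance:

Example 1: Cost sensitivity
Medicine: in breast cancer diagnosis, "the cost of misdiagnosing a patient as healthy" differs from "the cost of misdiagnosing a healthy person as a patient"
Finance: in credit card fraud detection, "the cost of mistaking fraudulent use for normal use" differs from "the cost of mistaking normal use for fraud"
Traditional ML techniques basically assume uniform costs
How should cost sensitivity be handled?
No ready-made answer can be found in the textbooks, for example:
Tom Mitchell, Machine Learning, McGraw-Hill, 1997
Nils J. Nilsson, Introduction to Machine Learning, draft 1996 - 2004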
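
One simple way to let unequal costs influence a decision is the minimum-expected-cost rule from Bayesian decision theory: given class posteriors, pick the prediction whose expected cost is smallest. The sketch below is only an illustration of that rule, not an answer to how a learner should be trained under such costs. The cost values and posteriors are made-up numbers, and `min_cost_decision` is a hypothetical helper of my own.

```python
def min_cost_decision(posteriors, cost):
    """Return (best_prediction, expected_costs).

    posteriors: dict mapping true class -> P(class | x)
    cost: dict mapping predicted class -> {true class: cost}
    """
    expected = {
        predicted: sum(cost[predicted][true] * p for true, p in posteriors.items())
        for predicted in cost
    }
    best = min(expected, key=expected.get)
    return best, expected


# Assumed costs: missing a patient is 50 times worse than a false alarm.
unequal_cost = {
    "healthy": {"healthy": 0.0, "patient": 50.0},
    "patient": {"healthy": 1.0, "patient": 0.0},
}
# Uniform costs, as traditional techniques implicitly assume.
uniform_cost = {
    "healthy": {"healthy": 0.0, "patient": 1.0},
    "patient": {"healthy": 1.0, "patient": 0.0},
}
posteriors = {"healthy": 0.9, "patient": 0.1}

decision_unequal, risks_unequal = min_cost_decision(posteriors, unequal_cost)
decision_uniform, risks_uniform = min_cost_decision(posteriors, uniform_cost)
```

With the same posteriors, the uniform costs lead to "healthy" while the unequal costs flip the decision to "patient", which is the effect the example is about.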

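Example 2: Imbalanced data
Medicine: in breast cancer diagnosis, "healthy" samples far outnumber "patient" samples
Finance: in credit card fraud detection, "normal use" samples far outnumber "fraudulent use" samples
Traditional ML techniques basically consider only balanced data
How should data imbalance be handled?
No ready-made answer can be found in the textbooks

Example 3: Comprehensibility
Medicine: in breast cancer diagnosis, the patient needs an explanation of "why this diagnosis was made"
Finance: in credit card fraud detection, the security department needs an explanation of "why this card is being used fraudulently"
Traditional ML techniques basically consider only generalization, not comprehension
How should comprehensibility be handled?
No ready-made answer can be found in the textbooks

I personally think these challenges are one of the driving forces behind the existence and development of machine learning; they need everyone's effort to solve. Of the three problems, the one I have run into is the second, data imbalance. The approach I used was to trim the imbalanced data appropriately until it became balanced, but this loses a lot of information. When I used a decision tree algorithm I did not trim the data, and learning still worked, but the learned results needed careful analysis.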
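
The trimming approach mentioned above can be sketched as random undersampling: keep all samples of the rarest class and an equally sized random subset of every other class. This is a minimal illustration with made-up data; the function name and the fixed seed are my own assumptions. It also makes the information loss visible, since most of the majority class is thrown away.

```python
import random
from collections import defaultdict


def undersample(samples, labels, seed=0):
    """Randomly trim every class down to the size of the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # sample() draws without replacement, so no sample is duplicated.
        balanced.extend((x, y) for x in rng.sample(xs, target))
    rng.shuffle(balanced)
    return balanced


# Toy data: 95 "normal" transactions and 5 "fraud" transactions.
labels = ["normal"] * 95 + ["fraud"] * 5
samples = list(range(100))
balanced = undersample(samples, labels)
discarded = len(samples) - len(balanced)
```

Here 90 of the 100 samples are discarded to reach a 5-versus-5 split, which is why the results of this kind of trimming have to be read with care.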

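5. Statistical machine learning must satisfy the independent and identically distributed (i.i.d.) condition, which is strict.
I do not have a deep feel for this i.i.d. precondition. When I used neural networks and decision trees for some tasks, I never checked whether it held. When I first saw this remark I took it for the usual constraint in machine learning algorithms that the features be mutually independent, but on reflection, isn't the very purpose of using a decision tree to mine the correlations among features? So the independence here refers to independence between successive data points: each sample is not influenced by earlier or later data.

Writing this reminds me of a problem often mentioned in worked examples of the normal cloud model: the standard for judging a shooter's target scores. The usual statistical methods assume that an athlete's shots are mutually independent and unrelated, but in reality every shot is affected by the results of the previous few. When the hyper-entropy of the normal cloud model is used to analyze this problem, the smaller the hyper-entropy, the smaller the influence between shots and the better the athlete's composure, and vice versa.

So when we use machine learning algorithms for a task we need to examine this assumption carefully. If the assumption does not hold in the first place, then the problems that subsequently appear, and the solutions found for them, all involve some element of chance.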

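References
1 Lü Huijun, Wang Ye, Li Deyi, Liu Changyu. Application of the backward cloud in qualitative evaluation. 计算机学报 (Chinese Journal of Computers), 2003, 26(8): 1009-1014
2 Li Deyi, Meng Haijun, Shi Xuemei. Membership clouds and membership cloud generators. 计算机研究与发展 (Journal of Computer Research and Development), 1995, 32(6): 16-21
3 Li Deyi, Liu Changyu. Study on the universality of the normal cloud model. 中国工程科学 (Engineering Science), 2004, 6(8): 28-34
4 Richard O. Duda et al., translated by Li Hongdong and Yao Tianxiang. 模式分类 (Pattern Classification). Beijing: China Machine Press, 2003.9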

September 23, 2004

The Library of the Second Campus of HIT

This morning I went to the library to find two papers published in the journal of the Chinese Academy of Engineering. Unfortunately the library had no 2004 issues of this journal. After searching the database, the librarian told me to go to the second campus. Judging from their abstracts, the two papers looked wonderful. Their titles were "Study on the Universality of the Normal Cloud Model" and "An Axiomatic Definition of Degree of Greyness of Grey Number". The school bus between the first and second campuses runs once every half hour.

It was nearly 8:30. I ran to the bus stop and rode to the second campus.

This was my second visit to the second campus. After about twenty-five minutes, I got off near the library.

I was not familiar with the second campus library, so I asked a staff member for help. The basic layout of this library was the same as that of the first campus, apart from some differences in architectural style. I went to the science and technology periodicals reading room and soon found the journal.

Half an hour later I returned to the first campus. The two papers were precious to me. Now I could study them carefully.

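September 22, 2004

Artificial Intelligence with Uncertainty

The link to the slides is: http://www.intsci.ac.cn/research/lidy04.ppt
When I reviewed the slides on this topic, I was excited, because some of the uncertainty elements are related to Grey system theory.

While studying them I found lots of useful knowledge. Some points are listed below:

1. Artificial Intelligence with Uncertainty is a new research field in AI.

2. There is a wonderful discussion of reductionism:

Questioning reductionism
Just as we cannot infer the behavior of e-mail on a computer network from the activity of the underlying silicon chips, we cannot understand how people feel when admiring a beautiful sunset by analyzing the properties of individual ions, neurons, and synapses. So how can we imagine that the cognition and thinking of the human brain could be deduced from analyzing the properties of individual organs, cells, genes, and protein molecules and from nerve conduction? The principle of systems theory that the overall characteristics of a system are not the sum of its lower-level elements challenges reductionism.

3. Playing Go is more difficult for a computer than playing chess.

Chess has a clear final goal state; Go does not.
Computer chess can keep searching for the most reasonable move (reasoning) from one goal state to reach the next goal state; Go cannot.
Computer chess can search purposefully for the most reasonable move (reasoning) toward a given goal state; Go cannot.
In Go the aim is to surround the opponent; in a given state there are far more possible responses than in chess, and the game relies more on imagery thinking and on a view of the whole board.

4. The research level of Artificial Intelligence with Uncertainty is natural language. So we NLP researchers should be happy with this conclusion :)

5. The pursuit of universality, depth, and meaning, and of truth and beauty, is the charm of basic and interdisciplinary research. To appreciate the wonderful views produced where different disciplines meet, and even where science and art meet, is one of the great pleasures of life.

Grey systems also deal with uncertain problems. So what role can Grey system theory play in Artificial Intelligence with Uncertainty? This is a very interesting question.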




September 21, 2004

The English presentation and discussion

This evening's colloquium was the first doctoral paper-reading colloquium held purely in English. During the discussion we had to speak English without any Chinese words.

The first speaker was Wanxiang Che. His topic was "Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms". I thought his presentation was perfect. While he was speaking, I felt that chances like this to present in English are very rare for us. I should cherish them.

After his presentation we discussed its details.

English is so important for us. I should keep practising my English.

September 20, 2004

GMCM(4)

This is the final day. I went to sleep at 4:00 and got up at 7:00. The third problem was more difficult than the previous two. We had to write an emulator as the solution to the fourth problem. But as the rotation of the tapers was not clear to us, we could not implement it.

The remaining five problems were solved by Qiyue Yu and Yu Liu.

After I got up there were nine hours left before we had to hand in our paper. The time passed so quickly that we hardly noticed it. When we finished our work we still had not eaten anything.

So tired! We all need a good rest.

September 19, 2004

GMCM(3)

There were only three and a quarter days in this contest. This was the third. After solving the first problem, I had believed I could solve the second one easily, since it only added a few more restrictions. But it turned out to contain lots of difficulties.

At about 23:50, I got the perfect solution. The rule was simple, but I had not used it before.

So tired, but I must stick with it.

September 18, 2004

GMCM(2)

On the second day I was not very lucky. The second sub-problem was more difficult than the first, with stricter restrictions. The three schemes we obtained were not perfect in my mind. I discussed it with my partner Yu Liu, but we still had no clear solution. This problem took up more or less my whole day, so it was the bottleneck of my tasks. I must solve it.

The exciting news was that Qiyue Yu solved three more sub-problems, and Yu Liu was writing the draft paper on the solved problems. One of my friends brought us some watermelon and encouraged us.

Try, try, and try!

September 17, 2004

GMCM(1)

This was the first day of the first national graduate mathematical contest in modeling (GMCM, my own abbreviation). GMCM is a little different from UMCM (the MCM for undergraduates): GMCM has four problems, but UMCM has only two. The first problem was related to a positioning system, with lots of related knowledge and sub-problems to be solved. The second was a minimum material cost problem. The third was a data mining problem, with lots of information about after-sale services for automobiles. The last was about two-way selection between supervisors and graduate students.

After discussion we chose the first one. It was hard, as it had nine sub-problems, and we spent nearly a whole day solving the first of them. Fortunately, we got a perfect solution. We were excited. But we had little sleep that night.

Let us try our best to solve the remaining problems.

September 16, 2004

Preparing for the first graduate mathematical modeling contest

Tomorrow I will take part in the first graduate mathematical modeling contest. It will be another four days.

I will try my best again! Good luck to me!!

September 15, 2004

Biblioscape

Biblioscape is designed with one goal in mind, that is to make the life of researchers easier. After 4 major releases, Biblioscape has evolved from a traditional bibliographic software into a Research Information Manager. Biblioscape will help researchers to get all kinds of information organized in a single place, and link them together to build a knowledge base. It consists of 7 modules addressing different aspects of a researcher's needs.

References module is for storing, managing, and searching for bibliographic references. Bibliographic records from different sources can be imported into a Biblioscape database with the right import filter. References are organized into folders. Several searching tools are provided to query the database.

BiblioWord is an easy-to-use word processor tightly integrated with the References module. It is also a bibliography maker that can convert temporary citations in a document into formatted citations and a bibliography. If you prefer to use Word or WordPerfect, Biblioscape also provides tight integration with these two popular word processors.

Internet module can be used to search online bibliographic database via a web browser. With a single button click, your web search results can be captured into a Biblioscape database. Besides capturing bibliographic records, you can also use the Internet module to capture web pages.

Notes module is designed to collect any free text information that does not fit into the reference database. A note can be your ideas, comment, background information about an author, etc. Notes are organized in a tree structure. A note can be linked to other notes, references, tasks, etc.

Tasks module is a simple To Do list manager that is integrated with the References and Notes modules. It is designed to manage tasks related to your research, so you won't need a separate program for it. Tight integration with other modules of Biblioscape makes it an ideal tool to manage research-related tasks.

Charts module can be used to draw flow charts, organization charts, etc. A chart object can be linked to the References, Notes, Tasks, and Library modules. You can draw a chart to express ideas and procedures in your research, and use SQL to connect objects in your chart to other modules in Biblioscape.

Library module is for managing a small research library. It could be a researcher's personal library, a department library, even a small corporate library. It includes 7 sub-modules to handle different tasks in library automation. These are: Catalog, Serials, Circulation, Interlibrary Loan, Borrowers, Lenders, and Suppliers.

Biblioscape also includes a web server application, BiblioWeb. With just one button click, your bibliographic database can be published on the Web. Web users can be assigned Read or Write privileges to browse, search, even add and delete bibliographic records using a Web browser. This is the easiest way for a research group to share a common bibliographic database on the Web.

Such wonderful software!!

September 14, 2004

The homework of Pattern Classification

There were only four questions from Chapter 2 of Pattern Classification. The first was nearly a pure mathematical problem: proving some inequalities. After some reasoning, it turned into a discussion of the trend of a curve.
The second was the integration of a formula. It involved a piecewise function that had to be plotted. I used Matlab's fplot function to draw it. The final picture was wonderful. The third was simple. Using the proof technique of the pigeonhole principle, the conclusion could be proved easily.

The most difficult one was the last. Much of the linear algebra it needed was not really clear in my mind. I found that I was not very familiar with the material of Advanced Mathematics, Linear Algebra, and Probability Theory. This is a serious problem. I should review them some time.

September 13, 2004

The result of LSI

Just now I implemented the LSI approach in my Summarization Evaluation task. But the final result was not good. I set K to 5 in the LSI method, but the overall trend did not match that of the five students' scores.

Is this evidence that a good summary does not necessarily mean a high correlation coefficient? The details must be discussed with others.

September 12, 2004

Latent Semantic Indexing

Latent Semantic Indexing (LSI) is a method used for text categorization as an improvement on the VSM (vector space model). The idea behind LSI is to map each document and test vector into a lower-dimensional space associated with concepts, and to compare the documents in this space.
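
A minimal sketch of this mapping, assuming NumPy is available: take the singular value decomposition of a term-document matrix, keep the k largest singular values, and compare documents by cosine similarity in the reduced space. The toy matrix, the term labels, and the function names are made up for illustration; this is not the program used in my evaluation task.

```python
import numpy as np


def lsi_project(term_doc, k):
    """Map each document (column) of a term-document matrix to k dimensions."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Row i of the result is document i expressed in the k-dim concept space.
    return (np.diag(s[:k]) @ vt[:k]).T


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Rows are terms, columns are documents (toy counts).
A = np.array([
    [1, 0, 1, 0],  # car
    [0, 1, 1, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 0, 1],  # flower
    [0, 0, 0, 1],  # petal
], dtype=float)

docs = lsi_project(A, k=2)
sim_raw = cosine(A[:, 0], A[:, 1])   # similarity in the original VSM
sim_lsi = cosine(docs[0], docs[1])   # similarity in the concept space
sim_other = cosine(docs[0], docs[3])
```

In this toy case the first two documents share only one of their two terms in the plain vector space, but they end up on the same concept axis after projection, while the unrelated fourth document stays orthogonal to them. The matrix is too small for K = 5, so k = 2 is used here.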

In Dr.Tliu's courseware on Information Retrieval I saw some examples of it. After reading some papers on LSI downloaded from CNKI, I believed it could be used in my evaluation task. I should discuss this with Dr.Tliu and the senior students.

There is some homework for our courses. I must go back and prepare for it now.

September 11, 2004

Summarization Evaluation1

After some discussion with Yhb and Dr.Tliu, I began to implement the first Summarization Evaluation method. It is based on ideas from document classification.

Just now I finished the programs and analyzed the trend of the data. But the result was bad. There were 100 groups of human evaluation scores, but only three had the same trend as my results.

When I came back to my dormitory, I began to think about the reasons. The evaluation method builds document vectors from tf*idf values and calculates the similarity between the vectors. It considers only word frequency, with no further information. But the four summarization systems do not consider word frequency alone. So my method should be updated.
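
For reference, the tf*idf-plus-similarity idea can be sketched in a few lines of Python. This is a simplified illustration, not my actual program: the tokenization (whitespace split), the idf formula log(N/df), and the toy texts are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse tf*idf vectors (dicts) for a list of token lists."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


source = "the cat sat on the mat".split()
summary_a = "cat sat mat".split()
summary_b = "dog ran away".split()

vec_source, vec_a, vec_b = tfidf_vectors([source, summary_a, summary_b])
score_a = cosine(vec_source, vec_a)
score_b = cosine(vec_source, vec_b)
```

The score depends only on which words occur and how often, which is exactly the limitation noted above: a summary could rank high simply by reusing frequent source words.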

But how to update it? The basic features I could think of are already used in the current summarization systems. If I used some of them, the final evaluation system could be biased toward some systems. So I must use some other method. What is the right method? This is a big problem for me at present.

September 10, 2004

Teachers' Day!

This is Teachers' Day! I wish all teachers a wonderful life and good health.

This morning we gave small presents to our best teachers. When six of us delivered our present to our graduate English teacher, she was at home alone. She is a Buddhist and told us lots of stories. From her words I understood one thing: innovation is something that appears in your mind at some moment through some stimulus, but it rests on ample accumulated knowledge.

Teaching is a wonderful career. Bless them!

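September 9, 2004

Notes on Chapter 1 of Pattern Classification

I have just finished Chapter 1, "Introduction", of Pattern Classification, and I have to say that this is one of the best textbooks on machine learning I have seen so far. Chapter 1 lists many open problems in pattern classification, which are also problems in machine learning. I list and discuss them below:

The ultimate goal and approach of pattern classification is first to divide the models into several classes, then to process the sensed data to filter out noise (noise caused by sampling rather than by the model), and then to choose the model class closest to the sensed data.

In pattern classification we often need to consider an "overall cost" function of a classification. Our real task is to find a decision that minimizes this function. This is the central task of decision theory, of which pattern classification is perhaps the most important subfield.

Pattern classification requires extracting features of the patterns to be classified; how many features to select and how to select them is a very important question. My view and practice so far has been to feed every feature I can extract to the decision tree. The resulting classifier does very well in closed tests, but its "generalization ability" (the ability to classify patterns correctly in open tests) is often poor (this phenomenon is called "overfitting"). Selecting too many features may also sow the seeds of trouble for later classification in a very high-dimensional space. "Generalization ability" and "complexity" conflict with each other and must be traded off, but how far to go in the trade-off needs further discussion.

Fundamentally, a classification decision task is oriented to a specific task or a specific cost. For example, if our goal is to sell roe (caviar), we would probably try to classify fish by sex, separating the males from the females. Or we might want to pick out the damaged fish (to make cat food), and so on. Different decision tasks need different features, and their decision boundaries are very different from those of the fish species classification problem. Therefore building a general-purpose artificial pattern classifier that can accurately perform all kinds of classification tasks would be extremely difficult. This adds to our admiration and respect for the human ability to switch rapidly and flexibly among all kinds of pattern classification tasks.

The goal of pattern classification is to find a classifier that treats samples of the same class as consistently as possible and separates samples of different classes as clearly as possible. Choosing features is a crucial step. When selecting or designing features, we obviously want a feature set that is easy to extract, invariant to irrelevant transformations, insensitive to noise, and effective at distinguishing patterns of different classes.

The chapter summary says that progress in pattern recognition conveys a positive message in at least three senses: (1) the problems can certainly be solved, because the recognition abilities of humans and other organisms are the best "existence proof"; (2) mathematical theories for solving many of the problems have been developed; (3) many fascinating unsolved problems remain, offering rich opportunities for further research. By analogy, I think three positive messages also hold for natural language processing: (1) the human ability to process natural language is strong evidence that the problems of NLP are solvable; (2) drawing on results from other disciplines and on the specific regularities of linguistics, NLP has made great progress, although its current capabilities are still very limited and a commonly accepted theoretical foundation is not yet complete; (3) NLP still has a large number of unsolved problems, so the field holds countless opportunities. NLP can be said to be a field of continual exploration, and the theories and methods born along the way will surely drive it forward.

The hardest part of the book is Chapter 9, "Algorithm-Independent Machine Learning". Many subtle yet crucial conclusions of theoretical and practical significance are discussed there, including the bias-variance relation, degrees of freedom, the need to design "simple" classifiers, and computational complexity. In a sense, only after understanding the conclusions of that chapter can one thoroughly understand and make better use of the other chapters.

A book this good is hard to come by. All I can do is study it diligently!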

September 8, 2004

Such a good book: Pattern Classification

It is so good that I'd like to read and study it intensively. The book has these main features:
1. It clearly explains both classical and new approaches, including neural networks, stochastic methods, genetic algorithms, and machine learning theory.
2. There are many color charts illustrating obscure concepts. I guess the beautiful pictures were generated with Matlab.
3. The necessary mathematical foundations are given in an appendix.

Another exciting feature is that it compares different schemes in certain application areas. Yeah, this is perfect.

I like it very much now.

September 7, 2004

Party of our Graduate English class

At the beginning of this term, many classmates from our graduate English class wanted to get together at some point. After getting in touch with nearly all of them, we finally decided to have a party this evening.

Originally 27 of our 29 students planned to come, but some turned out to be busy this evening, so in the end there were 20 of us. It is worth mentioning that our teacher Mrs. Zhang came too.

Although we only got to know each other last term, all of us treasure this friendship very much. We come from different specialities. This class was the last one of our undergraduate life, and the friendship will continue for a long time.

This was our fourth get-together. I found that this kind of party has an advantage: by talking with students from different specialities, we can learn each other's ways of thinking. And we help each other. This format is very good, and we have decided to have another one this term.

September 6, 2004

Summarization Evaluation

Evaluation can involve various kinds of comparisons. Three kinds of methods are:
1. system summaries compared with human summaries
2. system summaries compared with full-text sources
3. system summaries compared with each other

In March this year I did some work on summarization evaluation. At that time I used four methods, all of the first type.

Now I must implement an evaluation program that simulates the judgments of human evaluators. However, my four earlier methods were not good enough. Now I could linearly interpolate among the three kinds of evaluation methods.

Among the many methods for comparing system summaries with human summaries, I could use one of my original four: merging the human summaries by the 5-3 principle and calculating the F score against the system summaries.
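
A small sketch of this first kind of comparison. It assumes that the "5-3 principle" means keeping the sentences selected by at least 3 of the 5 human annotators; that reading, the sentence ids, and the function names are my own assumptions for illustration. The F score is the usual harmonic mean of precision and recall over selected sentences.

```python
from collections import Counter


def merge_human_summaries(human_selections, min_votes=3):
    """Keep sentence ids chosen by at least min_votes annotators."""
    votes = Counter(s for selection in human_selections for s in set(selection))
    return {s for s, count in votes.items() if count >= min_votes}


def f_score(system, reference):
    """Harmonic mean of precision and recall over sentence ids."""
    system = set(system)
    overlap = len(system & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(system)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Five hypothetical annotators, each selecting three sentence ids.
humans = [{1, 2, 5}, {1, 2, 7}, {1, 3, 5}, {2, 5, 8}, {1, 5, 9}]
gold = merge_human_summaries(humans)

# A hypothetical system summary made of sentences 1, 2, and 4.
score = f_score({1, 2, 4}, gold)
```

Here sentences 1, 2, and 5 reach three votes, so the system summary gets two of three right in both directions.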

For comparing system summaries with full-text sources, I could borrow methods from document classification: construct vectors for the source text and the system summaries, use cosine similarity as the score, and compare the trend of the cosine similarity with the human scoring.

OK. This idea should be implemented tomorrow.

September 5, 2004

Preparing presentation on SWCL2004

Dr.Tliu told us to give presentations on SWCL2004 and the first national computational linguistics seminar. After some arrangement, my task was to introduce the SWCL2004 papers.

My original plan was to introduce the papers in the order of the conference schedule. But while preparing the slides, I felt this lacked substance. At Dr.Tliu's request I had classified the papers into six research areas, machine translation and others, so I could introduce them by these classes.

Just now I finished it. After some simplification I think it is good.

September 4, 2004

One more research topic

During the discussion of our ACE results, Dr.Tliu said that entity relation was one of Mr. Wanxiang Che's research tasks. But Mr. Wanxiang Che has another research topic, so this one would be transferred to someone else. Xiantao Liao and I are both in the Information Extraction Research Group. My research topic is the coreference, or anaphora, relation. It is in essence one kind of entity relation. Xiantao Liao works on entity recognition, which is about determining the boundaries of the words. So my topic is more closely related to entity relation, and the methods for my topic and for entity relation extraction are close to each other. So the preliminary decision was that entity relation would be one more research topic of mine.

Mr. Wanxiang Che has done a lot of research in this area. The first thing I can do is understand his work.

September 3, 2004

Combinatorics homework

Our first Combinatorics homework had three problems. The first was difficult. After some deduction, the problem can be restated as follows. Let b1,b2,...,b77,b1+22,b2+22,...,b77+22 be pairwise distinct and together equal to 1,2,...,154 in some order. The question is whether such b1,b2,...,b77 exist. After some discussion with Xiaopeng Hong and Huipeng Zhang, I decided to simulate the process with a program. But when I finished it, the computational time complexity was too high for it to run to completion.

However, if we list the possible sequences, a conflict appears. So the problem has no solution.
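
The conflict can be found without brute force. The smallest number not yet used can only be one of the b values (if it were some b + 22, that b would be smaller and already used), so it is forced to pair with itself plus 22. Repeating this either pairs up everything or gets stuck. A minimal sketch of this greedy check, with a function name of my own choosing:

```python
def can_split_into_pairs(n, d):
    """Can 1..n be split into pairs of the form (b, b + d)?"""
    remaining = set(range(1, n + 1))
    while remaining:
        smallest = min(remaining)
        partner = smallest + d
        # The smallest unused number must be a "b", so b + d must be free.
        if partner not in remaining:
            return False
        remaining.remove(smallest)
        remaining.remove(partner)
    return True


homework_case = can_split_into_pairs(154, 22)
small_case = can_split_into_pairs(44, 22)
```

For 1..154 the check fails when it reaches 133, whose partner 155 is out of range. Another way to see it: each residue class modulo 22 contains exactly 154 / 22 = 7 numbers, and an odd number of elements cannot be split into pairs that differ by 22.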

September 2, 2004

Brief summary of last month

I spent nearly a whole day writing the brief summary of last month. My work last month can be divided into five parts: studying two ACL2004 papers on summarization evaluation, helping Haibin Yu arrange the summarization tagging task, taking part in the ACE EDR and EDR coreference evaluation tasks, taking part in SWCL2004, and taking part in the first national computational linguistics seminar.

Last month I was very busy, but I gained a lot. This month I will be busy, too. I have made a good plan for it. Let me try my best!

September 1, 2004

Graduate student for a master's degree

I am now a graduate student studying for a master's degree. Yes, it is true. This morning I had the physical examination.

This afternoon I attended the first class of my master's program. It was Natural Language Processing.

As a new student working toward a new degree, I must take up some new habits. By getting plenty of exercise I can be healthier and study better. By napping for an hour after lunch I can work better in the afternoon and evening. By eating more vegetables I can have a healthier body.

There is another suggestion for me. Every morning, when you begin work, list all your tasks on paper, order them by importance, and then do them one by one in that order. If you think you cannot carry out this scheme, you will get nothing done. Do it, and you will find it effective.