Bill_Lang: 2004-05

May 31, 2004

Update myself

After the English class and supper, I felt a little gloomy. After writing down some private thoughts, I decided to go for a walk.

When I passed the basketball court, I was infected by the sporting atmosphere. I walked onto the court in spite of myself and began to play with the others. While I played, I felt I had recovered my drive to make progress. That's sport.

Yes. I had not done any vigorous exercise for nearly a whole month. Tomorrow I could do some morning exercise and some morning English reading. I could also call together some classmates from my English class to play basketball or football.

Change to another life. Yes, it must include sports. Forget anything gloomy. Try my best at every task I have.

Let me begin!!

May 30, 2004

The Buddha's Halo

Next week, the topic of our English class is natural wonders. The first thing I thought of was the Buddha's Halo of Mt. Emei, in my hometown.

Sixteen hundred years ago, an Indian monk came to Cinisthana, as China was called by the Indians in those days. He climbed to the top of Emei Mountain and was fascinated by the beautiful scenery. “This is the number one mountain in Cinisthana,” he said.
Emei Mountain rises like a green tower on the western Chengdu Plain. Viewed from a distance, the contour of the mountain looks like a girl’s face with slender eyebrows; hence the name Emei, or tall eyebrows. Emei Mountain rises and falls for more than 200 kilometers before it meets Qionglai Mountain, a part of Asia’s Backbone, or the Kunlun Mountain Range. Emei Mountain consists of Da’e, Er’e, San’e, and Si’e hills. Da’e Hill is a concentration of strangely shaped peaks and places of scenic beauty and historic interest. It is the hill most visited by tourists on Emei Mountain.
Of all the tourist attractions in China, Emei Mountain is the highest. Wanfoding (the Summit of Ten Thousand Buddhas), its highest peak, rises 3,099 meters above sea level, much higher than the Five Sacred Mountains: Mount Taishan in Shandong, Mount Hengshan in Hunan, Mount Huashan in Shaanxi, Mount Hengshan in Shanxi, and Mount Songshan in Henan. Legend has it that the Five Sacred Mountains are where the immortals stay.
The craggy southern side of Emei Mountain is crisscrossed by ravines and covered with a dense growth of plants. The northern side features sheer precipices and waterfalls cascading down the mountain slopes.
The mountain is warm and humid with abundant mist and rain. In spring and summer, flowers blossom luxuriantly among a verdant growth of mountain plants. Refined scholars of the past dynasties visited the mountain and wrote many poems in admiration of the enchanting scenery. One of the poems composed by a man of letters during the Ming Dynasty (1368-1644) reads, “Rising sky high, the lofty Emei Mountain is enveloped in mist and clouds for more than 100 li (50km). Narrow paths zigzag uphill, and the exotic peaks are in the shape of lotus blossoms.”
Jindingxiangguang (the Auspicious Light at the Golden Summit), also called Foguang (Buddha’s Halo), tops the list of the ten principal scenic attractions of Emei Mountain. Buddhist followers say it is the light from Buddha’s forehead, but others say it is a physical phenomenon. Before sunset after a rain or a snowfall, the sunlight penetrates the mist and clouds and forms a circle of seven colors by refraction through the tiny water drops in the mist. One may feel as though caught in the circle, which seems to move in synchronization with one’s own movements, much like one’s shadow. For centuries, this phenomenon was enshrouded in mystery, and Buddhists consider it good fortune to visit Emei Mountain and see Buddha’s Halo.

Buddha's Halo on Mount Emei (峨眉佛光)

May 29, 2004

Next plan for CR

When I finished enlarging the set of test samples, I found that the accuracy on the test set was nearly the same as before.

After some thought about the next step for CR (coreference resolution), I decided to write a program that could identify all the coreference relations in a text.

It would need some sub-modules, such as attribute identifiers and a module that applies the decision tree. But how should the relations be displayed? I felt I could use the arc-drawing module from the dependency analysis module.

OK. Why not try it now?

May 28, 2004

Coreference Resolution Results Analysis

Just now I finished tagging about one hundred and twenty-five examples. I divided them into two parts: one hundred for training and the closed test, and twenty-five for the open test. The experiments showed that accuracy on the closed test moved in step with accuracy on the open test. The best experimental result was as follows:

Decision tree:

IJ全匹配 = T: T (12.2)
IJ全匹配 = F:
:...J是I的抽取 = T: T (4)
    J是I的抽取 = F:
    :...J是I的子串 = T: T (5.4)
        J是I的子串 = F:
        :...J的类型 in {H,P,O,T,G}: F (35.1/5.3)
            J的类型 = D:
            :...I的被修饰数量类型 = T: T (5.7)
                I的被修饰数量类型 = F:
                :...I的单复数 in {S,P}: T (22.5/2.9)
                    I的单复数 = U: F (15.2/3.9)

Evaluation on training data (100 cases):

Trial       Decision Tree
-----     ----------------
          Size      Errors
    5        7    4( 4.0%)

Evaluation on test data (25 cases):

Trial       Decision Tree
-----     ----------------
          Size      Errors
    5        7    1( 4.0%)

This result was too wonderful for me to believe. After some analysis, I found two reasons. First, the sample set behind the very accurate closed test was too small. Second, there were not enough test samples.

But when I discussed it with Mrs. Qin, she thought my method was right. I only needed to add twenty-five more test samples and then get a new test result.
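
The worry about the small test set can be made concrete with a rough interval estimate. The sketch below is only an illustration: it assumes a Wilson score interval at roughly 95% confidence, and both the function name and the choice of interval are assumptions of this example, not part of the original experiment.

```python
import math

def wilson_interval(errors, n, z=1.96):
    """Approximate 95% Wilson score interval for an error rate (illustrative)."""
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Error counts taken from the C5.0 output above:
# 4 errors in 100 cases (closed test), 1 error in 25 cases (open test).
closed = wilson_interval(4, 100)
opened = wilson_interval(1, 25)
print("closed test: %.1f%% to %.1f%%" % (closed[0] * 100, closed[1] * 100))
print("open test:   %.1f%% to %.1f%%" % (opened[0] * 100, opened[1] * 100))
```

Both tests report the same 4.0% error rate, but the interval for the 25-case open test is more than twice as wide, which is one way to see why adding more test samples is worthwhile.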

May 27, 2004

Extension Theory

This morning, Dr. Tliu sent us some e-mails about Extension Theory. They said that the creator of the theory would give a talk at our school this afternoon.

The background material on the creator, Cai Jing, and on Extension Theory was very thorough. It said that Extension Theory was one of three theories originated by Chinese scholars. I spent some time finding out what the other two were. According to the Soft Science Handbook, they were Pansystem Methodology and Grey Theory. We all admired the creators.

At the lecture, Mr. Cai Jing gave us an outline of Extension Theory. The theory was a very active field and had been extended a great deal.

There were not yet any examples of Extension Theory applied to NLP or IR. So when I had free time, I would study it and try to use it together with Grey Theory in NLP and IR.

So lucky to learn about it.

May 26, 2004

Visiting the Arboretum

Our English class badly wanted a group activity. After some coordination, the only date that worked was 27 May, that is, today.

This morning about twenty of us gathered in front of the Dianji Building. We took the No. 81 bus to the arboretum.

Today was the birthday of Sakyamuni, the most important festival for every Buddhist. Our teacher was a Buddhist. She brought about sixty birds, and all of us helped her set the captive animals free.

After the release ceremony, we began to tour the arboretum. We were so delighted that we played with water, played cards, kicked a shuttlecock, skipped rope, cycled and had our lunch.

We all felt delighted. When we came back, the whole class had dinner together.

I thought that when we play, we should enjoy ourselves; when we work, we should try our best. Tomorrow, I must finish my PPT task.

May 25, 2004

Decision tree for CR

Right now there were ninety-one samples in my NP-tagged corpus. Based on them I had obtained the decision tree.

Decision tree:

J是I的抽取 = T: T (4.2)
J是I的抽取 = F:
:...J是I的子串 = T: T (5.4)
    J是I的子串 = F:
    :...IJ全匹配 = T: T (8.6)
        IJ全匹配 = F:
        :...I的类型 in {P,T,D}: F (6.6)
            I的类型 in {H,O,G}:
            :...J的类型 in {H,P,O,T,G}: F (29.6/6.1)
                J的类型 = D:
                :...I的类型 = H: T (8.1)
                    I的类型 in {O,G}:
                    :...I的性别 = M: F (0)
                        I的性别 = F: T (2.2)
                        I的性别 = U:
                        :...J的性别 in {M,F}: F (11.9/2.1)
                            J的性别 = U: T (14.4/2.6)

I thought the tree was beautiful. But there was another problem: the number of samples was too small. So tomorrow, I must tag more to finish my current work.
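
To make the printed tree easier to read, here is a sketch that walks the same branches as nested tests. It assumes the usual C5.0 nesting for the printout above. The English glosses in the comments are my reading of the attribute names, not definitions from the original feature table, and the function name `corefer` is hypothetical.

```python
def corefer(pair):
    """Classify one candidate pair (I, J) by following the printed tree.

    `pair` maps the attribute names used in the C5.0 output to their values.
    Returns "T" (coreferent) or "F" (not coreferent).
    """
    if pair["J是I的抽取"] == "T":          # read here as: J is an abbreviation of I
        return "T"
    if pair["J是I的子串"] == "T":          # J is a substring of I
        return "T"
    if pair["IJ全匹配"] == "T":            # I and J match exactly
        return "T"
    if pair["I的类型"] in {"P", "T", "D"}:  # type of I
        return "F"
    # From here on, I的类型 is one of H, O, G.
    if pair["J的类型"] != "D":             # type of J in {H, P, O, T, G}
        return "F"
    if pair["I的类型"] == "H":
        return "T"
    # I的类型 is O or G: fall back on the gender attributes.
    gender_i = pair["I的性别"]
    if gender_i == "M":
        return "F"
    if gender_i == "F":
        return "T"
    # Gender of I unknown: decide on the gender of J.
    return "T" if pair["J的性别"] == "U" else "F"


example = {
    "J是I的抽取": "F", "J是I的子串": "F", "IJ全匹配": "F",
    "I的类型": "O", "J的类型": "D", "I的性别": "U", "J的性别": "U",
}
print(corefer(example))
```

Written this way, it is easy to see that the three string-matching tests decide first and the type and gender attributes only matter when all of them fail.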

May 24, 2004

The CR tagging work

To finish my CR task, I had studied a lot and learned much. The current bottleneck sub-task was tagging the coreference relations.

Based on the noun-phrase-tagged corpus produced by my program, I began to tag the corpus. Following the examples in some papers, I chose some attributes and listed them in a table. There were about sixteen attributes. However, when I began to tag some articles, I felt that the tagging task was not easy.

I should tag them as soon as possible. But at the same time I had lots of other tasks, so I would interleave the tagging with the others.

I had tagged about twenty-seven examples.

Let me keep on!!

May 23, 2004

Mixed feelings about graduating

An old classmate, who had drunk a lot of beer, told me a lot about his mixed feelings about graduating.

Yes, I thought so too. As graduation day drew nearer, we all had some special feelings about it.

It is so complicated to express. I should find another quiet time to write about it in my blog.

May 22, 2004

Doctoral Machine Learning Lesson

This morning, the doctoral machine learning class was held in Room 206 of Building A. This was the fourth session of the class, and I had "stolen" three of them.

The class was very good. When I first came, the teacher, a famous expert, gave us a clear outline of the field, including the state of research at home and abroad and its leading figures.

In this session, the teacher introduced the simulated annealing algorithm and the genetic algorithm. I found that a doctoral class was very different from undergraduate and postgraduate classes. The teacher would not describe in detail how to use the methods. Instead, the teacher would explain the what, why, and how behind each method. The main content of the class was to introduce the main ideas of the methods and to teach the students to apply those ideas.

Such good lessons. But this was the last session of the class.

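May 21, 2004

Noun phrase identification for coreference resolution

This morning I went over yesterday's plan with my senior labmate Jinshan to check its feasibility, and in the end he agreed with it. So I have been working on the program since the morning. Just now (around ten o'clock), the program finally implemented the three steps I settled on last night. Of course I ran into some problems along the way; here I record how I solved them.

Problems and solutions for the first task:

While examining the PKU corpus, I found that there can be two kinds of "]": one carries annotation information such as ns, nz or nt, and the other is a full-width punctuation mark. Once the two are told apart, processing becomes easy. Yesterday I had assumed the two symbols were the same; only after counting frequencies today did I notice the difference.

For the second task, I had to handle the boundary at the end of the file. This end-of-file boundary cannot be handled at the moment reading finishes; the information still held in memory needs extra processing.

The third task also required handling the file boundary. The program flow had to be designed carefully and then implemented.

At this moment my program is identifying all the noun phrases relevant to coreference resolution in the PKU 57 corpus. I am very excited. Tomorrow I can move on to the next stage: building the feature vectors for the noun phrases.

Let me begin!!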

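May 20, 2004

The base noun phrase identification problem

From Jinshan's talk on maximal noun phrases at last week's lab meeting, I learned roughly the following.

A noun phrase (NP) is a phrase whose head is a noun, or a word that functions as a noun, and whose length is greater than 1. Noun phrases can be nested, as in “[一个于[[半个世纪]之后]重新聚集在[[“[西南联大]”旗帜]下]的奉献活动]开始了！” (roughly: "A dedication event that regathers people under the banner of 'Southwest Associated University' half a century later has begun!"). Classified by structure, there are three types: base noun phrases, general noun phrases, and maximal noun phrases.

Base noun phrases, for example: 一个于 [半个世纪 ]之后重新聚集在 “[西南联大]” 旗帜下的 [奉献活动] 开始了！

General noun phrases, for example: 一个于 [半个世纪之后 ]重新聚集在 “[西南联大” 旗帜下的奉献活动] 开始了！

Maximal noun phrases, for example: [一个于半个世纪之后重新聚集在 “西南联大” 旗帜下的奉献活动] 开始了！

Among these categories, which type does our coreference resolution need? I feel base noun phrases are enough. Take the maximal noun phrase example above: such a long noun phrase merely puts many modifiers in front of “活动” (event). Those modifiers only serve to leave the reader with some description of the event, and the description only strengthens the reader's impression of it. When the reader later meets an expression referring to the event, the scene of the event comes to mind, but strictly speaking that is still just a reference to the event. The part by which the maximal noun phrase exceeds the base noun phrase is not needed for coreference resolution. So the class of noun phrases that coreference resolution has to deal with can be defined directly at the base noun phrase level.

So what exactly is a base noun phrase? The paper 《基于最大熵方法的中英文基本名词短语识别》 (Chinese and English Base Noun Phrase Identification Based on the Maximum Entropy Method; hereafter "Phrase Identification"), written by Qian Wei, Guo Yikun, Huang Xuanjing, Wu Lide and others of Shanghai Jiao Tong University, says that a base noun phrase is a non-nested noun phrase. This includes single nouns, noun phrases without any modifiers, strings of nouns whose modification relations are hard to determine, coordinated nominal constituents, proper nouns, times, places, and so on. These account for 60.8% of all base phrases in the corpus (counted on the Chinese Treebank).

Then how should our noun phrase identification method be designed? How well this is solved directly affects the next step of the work. My planned approach is as follows. Borrowing the definition from "Phrase Identification", we can identify base noun phrases on top of the PKU People's Daily annotated corpus. The base noun phrases we need include:

ordinary standalone nouns (no neighboring word can join them to form a noun phrase), phrases made of a string of nouns (maximizing the local noun span), coordinated nominal phrases (with the connectives “和”, “与” or “、”; these alone are enough), person names, place names, organization names, and time words.

To build the candidate set of coreference pairs, we first need to identify all base noun phrases and pronouns. The rough idea is as follows: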

Step 1: Identify all standalone noun phrases (anything whose part of speech is noun-related counts), pronoun phrases and time-word phrases (this of course builds on the organization names already marked nt in the PKU corpus), comprising:

--------General noun group----------------------
an  adjective with nominal function
Ng  nominal morpheme
n   noun
nx  letters or character strings in English or other foreign languages
nz  other proper noun
vn  verb with nominal function
----------------------------------------------

----Person name group-----------------------
nr  person name
---------------------------------------------

----Place name group------------------------
ns  place name
--------------------------------------------

----Organization name group-----------------
nt  organization or group
---------------------------------------------

----Pronoun group---------------------------
r   pronoun
---------------------------------------------

----Time word group-------------------------
Tg  time morpheme
t   time word
--------------------------------------------

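Step 2: Local maximization within a group: if two adjacent marked phrases belong to the same group, merge them.

Step 3: Merge coordinated nominal phrases. If the phrases on both sides of “和”, “与” or “、” belong to the same group, merge both sides together with the connective into one phrase of that group. (The pronoun group is excluded here.)

After these three steps, what I see as the bottleneck of coreference resolution should be resolved.

The algorithm is settled and urgently awaits implementation.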
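
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the program described in the post: it assumes the input is already a list of (word, tag) pairs using PKU-style tags, it ignores the bracketed composite annotations of the corpus, and the group labels and the function name `chunk` are made up for the example.

```python
# Tag-to-group table following Step 1 (group labels are illustrative names).
GROUPS = {
    "an": "noun", "Ng": "noun", "n": "noun", "nx": "noun", "nz": "noun", "vn": "noun",
    "nr": "person", "ns": "place", "nt": "org", "r": "pronoun",
    "Tg": "time", "t": "time",
}
CONNECTIVES = {"和", "与", "、"}


def chunk(tokens):
    """Return candidate phrases as (text, group) pairs from (word, tag) tokens."""
    # Steps 1 and 2: assign groups and merge adjacent tokens of the same group.
    items = []  # each item is [words, group]; group is None for other tokens
    for word, tag in tokens:
        group = GROUPS.get(tag)
        if group is not None and items and items[-1][1] == group:
            items[-1][0].append(word)
        else:
            items.append([[word], group])

    # Step 3: merge "X connective Y" when X and Y share a non-pronoun group.
    merged = []
    i = 0
    while i < len(items):
        words, group = items[i]
        is_connective = group is None and words[0] in CONNECTIVES
        if (is_connective and merged and i + 1 < len(items)
                and merged[-1][1] is not None
                and merged[-1][1] != "pronoun"
                and merged[-1][1] == items[i + 1][1]):
            merged[-1][0].extend(words + items[i + 1][0])
            i += 2
        else:
            merged.append([list(words), group])
            i += 1

    return [("".join(words), group) for words, group in merged if group is not None]


sample = [("中国", "ns"), ("和", "c"), ("美国", "ns"), ("的", "u"),
          ("经济", "n"), ("发展", "vn"), ("他", "r"), ("和", "c"), ("她", "r")]
for phrase in chunk(sample):
    print(phrase)
```

In the sample, the two place names are joined across “和”, the noun and the nominal verb are merged by Step 2, and the two pronouns stay separate because the pronoun group is excluded from Step 3.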

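May 19, 2004

Further understanding of coreference resolution

After this period of study, thinking and busyness, I have some thoughts, jotted down here:

In essence, coreference resolution is a classification problem. Once all the nouns, pronouns and base noun phrases in a text have been found, any two of them may form a coreference link, and each pair calls for a binary yes-or-no decision. Of course, if A and B corefer and B and C also corefer, then A and C corefer too, so A, B and C can be regarded as one coreference class, all of whose elements point to the same entity. Seen this way, coreference resolution turns into a clustering problem. This view of coreference resolution has already been explained in detail (see: C. Cardie, K. Wagstaff, Noun Phrase Coreference as Clustering. In Proc. of the Joint Conf. on Empirical Methods in NLP and Very Large Corpora. Maryland, USA, 1999. 82-89).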
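
The step from pairwise decisions to coreference classes is a transitive closure, which a union-find structure handles directly. The sketch below is only an illustration of that step; the function name and the input format (a list of mentions plus a list of pairs judged coreferent) are assumptions of this example.

```python
def coreference_chains(mentions, coreferent_pairs):
    """Group mentions into classes, given the pairs judged to be coreferent."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent links to the class representative, compressing the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in coreferent_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two classes

    chains = {}
    for m in mentions:
        chains.setdefault(find(m), []).append(m)
    return list(chains.values())


# A-B and B-C are judged coreferent, so A, B and C end up in one class.
print(coreference_chains(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
```

A design note: the closure trusts every positive pairwise decision, so a single wrong "yes" can merge two otherwise separate classes. That is one reason to look at clustering the mentions directly.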

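From this clustering point of view, what methods can we use to do our work more effectively? Obviously, all sorts of clustering algorithms come to mind. So which clustering algorithms are candidates? After some browsing, the main ones are:

Clustering algorithms fall into two broad classes: hierarchical and non-hierarchical. Hierarchical clustering includes single-link, complete-link and group-average clustering; non-hierarchical clustering includes K-means and the EM algorithm.

There are three main ways to measure the similarity of two clusters: single link (the similarity of the two closest members), complete link (the similarity of the two farthest members), and group average (the average similarity between members).

The general procedure of non-hierarchical clustering is: choose seeds at random, partition the samples, and then iteratively reassign the samples until the model's parameter estimate no longer improves or starts to decline.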
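
The three cluster similarity measures can be written down in a few lines. This is a sketch under assumptions: `sim` is any pairwise similarity function supplied by the caller, and the group average is taken here over cross-cluster pairs, which is one common reading of the definition.

```python
def single_link(sim, c1, c2):
    """Similarity of the two closest (most similar) members."""
    return max(sim(a, b) for a in c1 for b in c2)


def complete_link(sim, c1, c2):
    """Similarity of the two farthest (least similar) members."""
    return min(sim(a, b) for a in c1 for b in c2)


def group_average(sim, c1, c2):
    """Average similarity over all cross-cluster pairs."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def toy_sim(a, b):
    # Toy similarity on numbers: closer values are more similar.
    return 1.0 / (1.0 + abs(a - b))


c1, c2 = [0, 1], [3, 5]
print(single_link(toy_sim, c1, c2))
print(complete_link(toy_sim, c1, c2))
print(group_average(toy_sim, c1, c2))
```

For coreference, the members would be noun phrases and `sim` would compare their feature vectors. The complete-link value is never larger than the group average, which is never larger than the single-link value.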

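Thinking about it the other way round, coreference resolution is also a classification problem: decide whether any given pair of candidates is a coreferring pair or not. In that light, coreference resolution becomes a classification problem again.

Common classification methods include decision trees, Bayesian classifiers, maximum entropy models, k-nearest neighbors, neural networks, and so on.

So it now seems that many methods can be applied to coreference resolution.

But there is one factor that seriously limits the use of these algorithms: there is currently no dependable base noun phrase identifier. Identifying base noun phrases is the first prerequisite for forming candidate coreference pairs. I discussed this several times with Jinshan over the past few days. The answer is that the maximal noun phrase work he is doing is not yet performing well, and the lab has no usable base noun phrase analyzer either.

To solve this bottleneck, I looked through several papers today. The only article I found on Chinese base noun phrase identification was 《基于最大熵方法的中英文基本名词短语识别》 (Chinese and English Base Noun Phrase Identification Based on the Maximum Entropy Method) by Huang Xuanjing, Wu Lide and others of Shanghai Jiao Tong University. So far I have not found any publicly available BaseNP identifier. Jinshan says he is also thinking of taking on this task but has not yet come up with a good method.

This bottleneck is a serious problem, at least as I see it now, because others have already shown that the accuracy of BaseNP identification directly affects the accuracy of coreference resolution.

Which way to go? I will decide tomorrow.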

Let me begin!!

Weinberg: Four golden pieces of advice for scientists

The original text is as follows:

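When I received my university degree, which was a hundred years or so ago, the physics literature seemed to me like a vast, unexplored ocean. I felt I had to survey every part of it before I could begin my own research. How could I do anything without first understanding all the work that had already been done? Luckily, in my first year as a graduate student, I met some senior physicists who, ignoring my anxious objections, insisted that I should start doing research and learn what I needed along the way. It was a matter of life and death. I was amazed to find that their advice worked. I managed to get a doctorate quite quickly, even though I knew almost nothing about physics when I got it. But I did learn one great lesson: nobody understands all of knowledge, and you do not have to either.

Another piece of advice, if I may continue my ocean metaphor, is this: while you are struggling in the sea and have not yet sunk, go where the waves are roughest. In the late 1960s, when I was teaching at the Massachusetts Institute of Technology, a student came to tell me that he wanted to work in general relativity rather than in my own field, elementary particle physics. His reason was that the principles of the former were already clear, while the latter looked to him like a tangle. To me that was an excellent reason for making the opposite decision. Particle physics was a field in which creative work could still be done. It really was a tangle at that time, but since then the work of many theoretical and experimental physicists has combed the tangle out and brought all (well, almost all) of the knowledge into a beautiful theory called the Standard Model. My advice is: go to the messy places; that is where the action is.

My third piece of advice is probably the hardest to accept: forgive yourself for wasting time. The problems that students are asked to solve are problems their professors know can be solved (unless the professor is very cruel). Moreover, whether those problems matter scientifically is beside the point; they must be solved to pass the examination. In real life, however, it is very hard to know which problems are important, and at a given moment in history you simply cannot know whether a problem has a solution. At the beginning of the twentieth century, several important physicists, including Lorentz and Abraham, wanted to create a theory of the electron, partly in order to understand why every attempt to detect the Earth's motion relative to the ether had failed. We now know that they were studying the wrong problem. At that time nobody could have created a successful theory of the electron, because quantum mechanics had not yet been discovered. It took until 1905 for the genius Einstein to realize that the right problem was the effect of motion on the measurement of time and space. Following that route, he created the theory of relativity. Because you can never be sure which problem is the right one to study, most of the time you spend in the laboratory or at your desk will be wasted. If you want to be creative, you must get used to the fact that a great deal of your time is not creative, and get used to lying becalmed on the ocean of scientific knowledge.

Finally, learn a little history of science, at least the history of the discipline you work in. At the very least, learning the history of science may be of some use in your own research. For example, scientists are from time to time shackled by believing the oversimplified models of science put forward by philosophers from Bacon to Kuhn and Popper. Knowledge of the history of science is the best antidote to the philosophy of science.

More importantly, knowledge of the history of science can make you feel that your own work is more meaningful. As a scientist, you will probably not be very rich, and your friends and relatives may not understand what you are doing. And if you work in a field like elementary particle physics, you will not even have the satisfaction of doing something that is immediately useful. But realizing that your scientific work is a part of history can bring you great satisfaction.

Look back 100 years, to 1903. How much does it matter now who was Prime Minister of the British Empire in 1903, or who was President of the United States? What really stands out as important is that in 1903 Ernest Rutherford and Frederick Soddy, at McGill University, revealed the nature of radioactivity. This work (of course!) had practical applications, but its cultural meaning was more important still. Understanding radioactivity enabled physicists to explain why the Sun and the Earth's core are still hot after millions of years. This cleared away the last scientific obstacle to the view of many geologists and paleontologists that the Earth and the Sun have existed for a very long time. From then on, Christians and Jews had to give up either the literal truth of the Bible or reason. This was only one step in a series of steps, from Galileo to Newton and Darwin and on to the present, that have weakened the shackles of religious dogmatism. Just read any newspaper today and you will know that this work is not yet finished. But it is civilizing work, and scientists can feel proud of it.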

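After reading this article I could not help wanting to share it with everyone. Here is my own understanding of the four pieces of advice:

1. To do research in a field, you need not fully understand everything that already exists in it before you start; you can learn what you need in the course of the research. I think this advice is also useful for those of us who are new to NLP. Indeed, we do not have to know everything that exists in NLP; we should keep learning and filling gaps as problems come up. If we did not go through the research process ourselves, how could we appreciate how hard-won earlier results were? While trying every way to solve the problems in front of us, we discover new solutions by studying existing results. In this respect, I feel that the way people work is like a somewhat more advanced form of machine learning: study how our predecessors solved problems, grasp the way of problem solving from that, and then let what we already know react with it, as in a chemical reaction, to come up with new methods. So I feel that breadth of knowledge is very important.

2. The point of studying open problems in NLP is to solve the problems that have not yet been solved or are still being worked on; only "drilling" at these points will bring out something new. Of course, we may find such problems hard to cope with at first. As our experience and knowledge grow, we will be able to strike some sparks. Perhaps the sparks will be few and scattered, but with everyone's effort the problems in front of us will be partly solved.

3. I fully agree with the author's third piece of advice. We really do need to forgive ourselves for wasting time. I have experienced what innovation means for an individual. Innovation = a solid theoretical foundation + diligent, hard work + lively sparks of thought. After all, our sparks of thought are limited and short-lived; they vanish in a flash, and catching one means catching an opportunity. But more often, thinking brings no feedback, so most of our time is "wasted". We need to forgive ourselves. Of course, this "wasting" also prepares for the time that is not wasted. This requires the right attitude toward research: we must not be anxious for quick results, which may be counterproductive.

4. Learn a little history of science. This is easily overlooked, since it is not what we directly study. But knowing the history lets us see how a research direction came about, and so better understand the current problems and where the discipline may go. Our predecessors made enormous contributions for us, and we can complete our work guided by their scientific spirit.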

May 18, 2004

The Choice of the CR method

Yes, I think it is a dilemma for me. I wanted to finish the CR system by the end of this month, but there were lots of difficulties in front of me.

For example, decision tree methods had been improved a lot, but I was not familiar with the improvements. I must thank Dr. Tliu: when I read through my notebook, I found the hints about CR that he had written. The memory-based method was a good idea. I must also thank Mrs. Qin. She told me that to finish my current research, I could put forward my idea and hand-label a small corpus to test it. For now I only needed to finish a piece of research, not a working application.

When I felt I had no good new ideas, I read papers about CR. Suddenly I found a good idea in one of them. Its main idea was to construct many feature vectors from antecedent and anaphor pairs, train on many positive and negative examples, and then use a clustering algorithm to identify the coreference relations.

I found I could look for ideas about similarity measures and then adapt that main idea. I could find ideas in Grey System Theory.

Let me try!!

May 17, 2004

The third edition of the IE Survey report

This afternoon, Mrs. Qin told me that there were a few small problems with the second edition of the IE Survey report. Yu Haibin had finished his updates, so I should finish the second round of improvements.

Actually, I had made the improvement plan some days before, but these days I had been busy with the CR system. I found the three papers that I had chosen to consult. I thought each of them was better than our report in some respect. But our report was not only a survey but also our research plan, so I integrated the strong points of the three papers into our report.

That was today's work. Just now I sent it to Mrs. Qin and the IE team.

I also found a better guideline for finishing my two main tasks in the coming days: I could alternate my days between IE and CR.

Let me begin!!

May 16, 2004

Progress of the CR system

At first, I wanted to mark the coreference relations with symbols. But I found that this made the CR tagging result hard to read. So I planned to use HTML highlighting to display them.

But there was a problem: I had to learn how to write HTML.

Just now I completed it. I will update it tomorrow.
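
As an illustration of the HTML highlighting idea, here is a small sketch that wraps every mention in a colored span, one color per coreference chain. It is not the program from this post: the function name, the color list and the input format (a token list plus a mapping from token index to chain id) are all assumptions of this example.

```python
import html

# One background color per chain; reused cyclically if there are more chains.
COLORS = ["#ffd54f", "#81d4fa", "#a5d6a7", "#f48fb1"]


def highlight(tokens, chain_of):
    """Render tokens as one HTML paragraph, highlighting tokens in chain_of.

    tokens:   list of strings in document order
    chain_of: dict mapping token index -> chain id (an int)
    """
    parts = []
    for i, tok in enumerate(tokens):
        text = html.escape(tok)  # keep characters such as < and & displayable
        if i in chain_of:
            chain = chain_of[i]
            color = COLORS[chain % len(COLORS)]
            text = '<span style="background:%s" title="chain %d">%s</span>' % (
                color, chain, text)
        parts.append(text)
    return "<p>" + "".join(parts) + "</p>"


print(highlight(["李明", "说", "他", "累", "了"], {0: 0, 2: 0}))
```

Mentions in the same chain share a color, so a reader can check the tagging at a glance in any browser.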

Let me begin!!

May 15, 2004

Connecting the modules

Today's task was to connect the sentence segmentation module, the word segmentation module, the POS tagging module and the NER module. There were many problems, big and small, in this project.

Just now I completed it. The sentence segmentation module was written as a class by Lee. The word segmentation DLL was updated by Victor. The POS tagging DLL was updated by Truman. The NER class module was written by Taozi.

Through connecting several C++ modules on these two occasions, I found that my C++ programming ability had improved. Glad news for me ^-^

Tomorrow I will work out the detailed CR tagging plan.

Let me begin!!

May 14, 2004

Why did I tag the CR corpus??

Why did I tag the CR corpus?? This question came to mind when I had tagged about 20 antecedent-anaphor pairs. Did I want to use them as a training corpus or as a testing corpus? I had not made this clear to myself.

I thought I was confused by the current task. My thesis task was to use decision trees for the CR problem, but I found I had been impatient about it. The CR result could be judged by a human when the testing samples were not plentiful.

There were some baseline methods for the CR problem. For example, an anaphor could take the closest preceding noun phrase as its antecedent. Right now, I did not have any baseline system, so I could say nothing about how good the decision tree-based CR system was.
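
The nearest-noun-phrase baseline mentioned above is simple enough to sketch in a few lines, together with an accuracy measure for comparing it against hand-tagged links. The function names, the input format and the tiny example are assumptions made for illustration only.

```python
def nearest_np_baseline(mentions):
    """Link each anaphor to the closest preceding mention.

    mentions: list of (text, is_anaphor) pairs in document order.
    Returns a dict {anaphor index: antecedent index}.
    """
    links = {}
    for i, (_, is_anaphor) in enumerate(mentions):
        if is_anaphor and i > 0:
            links[i] = i - 1  # the closest preceding noun phrase
    return links


def accuracy(predicted, gold):
    """Fraction of gold anaphors whose predicted antecedent is correct."""
    if not gold:
        return 0.0
    correct = sum(1 for i, ante in gold.items() if predicted.get(i) == ante)
    return correct / len(gold)


mentions = [("北京", False), ("李明", False), ("他", True), ("这座城市", True)]
gold = {2: 1, 3: 0}  # hand-tagged links: 他 -> 李明, 这座城市 -> 北京
predicted = nearest_np_baseline(mentions)
print(predicted, accuracy(predicted, gold))
```

Any learned system, such as the decision tree, can then be judged by how far it improves on this number on the same test pairs.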

I must draw up a full, detailed experimental plan for my CR system. As the proverb says: A beard well lathered is half shaved.

Right now, I must finish the detailed experimental plan for my CR problem, and then start on it.

Let me begin!!

May 13, 2004

Modify the whole IE Survey Report

This morning, I mailed the IE Survey Report to Mrs. Qin. After she reviewed it, we had a short discussion about the structure of the report. She thought I should revise it, as it was not integrated smoothly, and that I should add a diagram of the structure of our IE system.

So revising the report was my task for the whole day.

I smoothed it and added the system structure diagram based on my understanding of our IR-IE system.

Just now I finished it and sent it to Mrs. Qin. So tomorrow I can carry on with my coreference resolution system.

Let me begin!!

May 12, 2004

Designing the CR Features

Although I have settled on the main algorithm and the main raw corpus for my coreference resolution work, I have not designed the CR features. This is a very important step for any CR system, because it has been shown that, with the decision tree algorithm, the choice of features describing the coreference relation matters most.

So I reviewed the papers related to CR and collected all the features they used. Then I analysed them one by one, and finally I designed my own features for CR.

Good for my CR system.

Let me begin!!

May 11, 2004

IE Survey Report

This week I have another task: integrating the IE Survey Report. But over the past days I concentrated entirely on coreference resolution. So today I must begin on it.

There were survey reports on each research sub-module and a simple outline of the IE survey results. My task was to integrate them and produce a polished IE Survey Report. Because IE's sub-modules are so scattered, the integrated report was not easy to write. So I decided to assemble the final report from each module's report, and it had to be methodical. It took nearly my whole day.

May 10, 2004

Changing the strategy

Changing the strategy? Why? For the following reasons:

First, I added the POS tagging function to my CRS. It ran correctly.
Secondly, I changed the interface of the NE module to fit my needs. Then it ran correctly.
Thirdly, I began to apply CRS to the forty chosen documents. It ran correctly, too.
But when I reviewed the output files, I was astonished. There were lots of NE errors. The NEs in some output documents were nearly all wrong. I could hardly understand it. So I showed the result to Taozi. She could not fully explain the reasons. However, she thought the POS tagging module might not be good enough, and her NE module depended heavily on the POS tagging information. Maybe that was the reason. I could not resolve this problem now; I believed it would take some days, and my current task could not wait for it.

So I had to change my strategy. Searching my IR resource library, I found the Fujitsu-Beida tagged corpus. It was perfect for completing my current task, as all its tagging had been done by hand. Good enough!!

Let me begin!!

May 9, 2004

Integrate tagging modules

My Coreference Resolution System (CRS) had to be able to process raw documents, so I must integrate these tagging modules into my CRS.

My CRS had the following workflow:

Raw document => Sentence Segmentation => Word Segmentation => Part-of-Speech Tagging => Named Entity Recognition.
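
The workflow above is a plain chain of stages, each consuming the previous stage's output. The sketch below shows only that chaining. The four stage functions are toy stand-ins invented for this example, not the lab modules, which are C++ classes and DLLs.

```python
from functools import reduce


def run_pipeline(raw_text, stages):
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda data, stage: stage(data), stages, raw_text)


# Toy stand-ins for the real modules (hypothetical, for illustration only).
def split_sentences(text):
    sentences, current = [], ""
    for ch in text:
        current += ch
        if ch in "。！？":
            sentences.append(current)
            current = ""
    if current:
        sentences.append(current)
    return sentences


def segment_words(sentences):
    # Placeholder: treats every character as one word.
    return [list(s) for s in sentences]


def tag_pos(sentences):
    # Placeholder: marks sentence-final punctuation as "w", everything else "x".
    return [[(w, "w" if w in "。！？" else "x") for w in s] for s in sentences]


def recognize_entities(sentences):
    # Placeholder: a real NER stage would add entity labels here.
    return sentences


result = run_pipeline("我来了。你好！",
                      [split_sentences, segment_words, tag_pos, recognize_entities])
print(result)
```

Keeping each stage behind the same "take data, return data" shape makes it easy to swap one module for another, for example to replace automatic tagging with a hand-tagged corpus.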

I got these modules from each module's owner. But I found that, except for the Word Segmentation module, none of the modules could be supplied as a DLL. I would have to integrate their source code into my system.

Just now, I integrated their classes into a whole, and got the sentence segmentation and word segmentation modules working. Tomorrow I must keep on.

Let me begin!!

May 8, 2004

C5.0 Algorithm

As originally planned, I was to get to know the C5.0 algorithm well.

Last night, I downloaded some material about it. Among the files was a demo version of the C5.0 software. At first, I thought the demo must have some restrictions. After reading through the description document, I found that the software covered the C5.0 algorithm very fully.

To try out the software, I constructed input files following the example in Machine Learning. To my surprise, the result was the same as in the book. More surprising still, the software could process Chinese text.

So good for my work. My first sub-task was finished. I could start my second sub-task: defining the feature selection for Chinese.

Let me begin!!

May 7, 2004

A brief survey of Coreference Resolution

This afternoon, Friday, our survey group had its usual meeting. When I presented my survey conclusions, I listed lots of research methods on this topic. Right after the group meeting, I continued to study the papers about CR.

The more I read, the more I found. So many people were researching anaphora, for example Ruslan Mitkov, who had chaired some international conferences on anaphora resolution. Through his homepage, I also found an important international conference on anaphora resolution, the Discourse Anaphora and Anaphor Resolution Colloquium.

I also found that ACL 2004 had a workshop on Reference Resolution and Its Applications.

It seemed that CR and AR were such hot topics that they had many researchers. I should try my best to reach their standard.

Let me keep on!!

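May 6, 2004

The pedestrian street

At about half past seven in the evening, we were walking along Zhongyang Street (Central Street), the busiest street in Harbin. This pedestrian street was as bustling as it had been two years ago, and the passers-by were still faces I did not know. In their eyes, of course, I was a strange face too. As we walked, I suddenly felt how hurriedly we all come and go.

Yes! We are all just passing through, on Zhongyang Street and in the world as a whole. The only difference is how long we stay. It reminded me of the famous words of Pavel, the hero of the novel How the Steel Was Tempered by the Soviet writer Ostrovsky, which I studied in middle school:

Man's dearest possession is life. It is given to each person only once. A life should be lived so that, looking back, one feels no regret for wasted years and no shame for a mean and petty past; so that, on the point of death, one can say: "My whole life and all my strength have been given to the most magnificent cause in the world, the fight for the liberation of mankind."

In junior middle school I did not really understand the meaning of this passage. But on the pedestrian street, watching people brush past, I think I understood some of it. The whole world is like this famous street, just as rich and beautiful. The street, like the world, will always be there, but we will not. So what is the meaning of our appearing in such places, we who cannot last? Just to feel their beauty and bustle? Of course not! The meaning of our being here lies in our not being elsewhere. To exist is the meaning. Then how should we spend this one existence? Should we not cherish everything we have now and give all we can? Yes. We have no reason to squander what we are given; we have no reason to waste money, time and life.

The best way to feel all this is to forget everything you have lost and everything you once had. You are who you are now: cherish what you have now and do well what you must do now. What you can do, do as well as you can; what you can do well, do as close to perfect as you can.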

May 5, 2004

Decision Tree based Coreference Resolution

This was the first paper about decision tree based coreference resolution.

Title: Using Decision Trees for Coreference Resolution
Author(s): Joseph F. McCarthy and Wendy G. Lehnert
Author Affiliation: Department of Computer Science, University of Massachusetts
Citation: 26 (according to CiteSeer statistics)
ConferenceTitle: the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI '95)
Language: English
Type: Conference Paper (PA)
Treatment: Practical (P) Experimental (X)

Abstract: This paper describes RESOLVE, a system that uses decision trees to learn how to classify coreferent phrases in the domain of business joint ventures. An experiment is presented in which the performance of RESOLVE is compared to the performance of a manually engineered set of rules for the same task. The results show that decision trees achieve higher performance than the rules in two of three evaluation metrics developed for the coreference task. In addition to achieving better performance than the rules, RESOLVE provides a framework that facilitates the exploration of the types of knowledge that are useful for solving the coreference problem.

Descriptors: Information Extraction
Identifiers: coreference resolution, decision tree

My personal view of this paper was that we could try this framework on Chinese, building on it and improving it.

May 4, 2004

The feeling of sprinting

I had got back the feeling of sprinting. That was when I finished my brief monthly summary for last month.

This morning, I realized that I urgently needed to finish a paper on anaphora resolution. The final part of my brief summary was my plan for this month. It included two main parts. The latter was to write a paper on anaphora resolution for the meeting.

Let me begin!!

May 3, 2004

The paper submission deadline of SWCL2004

SWCL2004 is the second national student workshop on computational linguistics. This morning I received the e-mail from Che Wanxiang.

The paper submission deadline of SWCL2004 is 31 May.

I reread an e-mail from Dr. Tliu. It contained a wonderful summary template for paper reading in Information Retrieval. I thought the outline was perfect.
The outline was as follows:

Title:
Author(s):
Author Affiliation:
Citation:
ConferenceTitle:
Publisher:
Publication Date:
Conference Sponsor:
Conference Date:
Language:
Type:
Abstract:
Descriptors:
Identifiers:

I wanted to study the papers about Anaphora Resolution and then write a paper for SWCL2004.

Let me begin.

May 2, 2004

Welcoming Li Bin and the table tennis match

Li Bin, one of our lab's graduates, came back to spend the May Day holiday. Before he came back, he and all the members of our lab had wanted a table tennis match.

This afternoon, nearly fifteen people came to play table tennis. First, we had some practice. Later, all the members of our lab played Li Bin one by one.

The most wonderful game was the final one: Gold vs. Li Bin. It was diamond cut diamond when the two best players met. The final score was 2:1. Li Bin won.

Li Bin was so good, I thought.

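May 1, 2004

《大染坊》 (Da Ran Fang)

《大染坊》 tells the entrepreneurial story of Chen Shouting, who rose from beggar to the owner of a well-capitalized printing and dyeing factory. It depicts the hard road along which China's national industry was born and grew in the early 20th century, and through it shows the birth and destruction of one generation's dream of a strong nation.

⊿ Plot synopsis of the 24-episode TV series 《大染坊》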
