Difference between "-g" and "--g-post"?

Asked by Jim White

Hi Linas!

Thank you for the quick action on the bugs I've encountered.

You point out that "-g" (pre-tagging with GATE) and "--g-post" (post-tagging with GATE) don't make sense when used together. Could you explain very briefly what the difference is and/or why they don't go together?

The reason I thought I could/should use both is that I saw something about pre-tagging helping the Link Grammar Parser by collapsing multi-word named entities that it didn't handle very well. And I thought that post-tagging was needed to get the entity attributes applied to the tree so that RelEx's rules could use them.

Thanks again!

Jim

Question information

Language: English
Status: Solved
For: RelEx
Assignee: No assignee
Solved by: linas
Revision history for this message
Best linas (linasvepstas) said:
#1

2009/12/23 Jim White <email address hidden>:
> New question #95017 on RelEx:
> https://answers.launchpad.net/relex/+question/95017
>
> Hi Linas!
>
> Thank you for the quick action on the bugs I've encountered.
>
> You point out that "-g" (pre-tagging with GATE) and "--g-post" (post-tagging with GATE) don't make sense when used together.  Could you explain very briefly what the difference is and/or why they don't go together?

In pre-tagging, the entities are found and replaced by strings such as
ENTITY01; the sentence is then parsed, and the original text is
substituted back in after parsing.

In post-parse tagging, the entities are searched for only after the
parse. RelEx does nothing with this info, other than to tag the
entity with things like "money-ID" or "date-ID", etc. This code was
recently added for a client who wanted it.
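The pre-tagging flow described above can be sketched roughly as follows (illustrative Python, not RelEx's actual code; the function names and placeholder format are made up for the example):

```python
def pretag(sentence, entities):
    """Replace each multi-word entity with a numbered placeholder
    so the parser sees a single unknown-word token."""
    mapping = {}
    for i, ent in enumerate(entities, start=1):
        placeholder = "ENTITY%02d" % i
        mapping[placeholder] = ent
        sentence = sentence.replace(ent, placeholder)
    return sentence, mapping

def untag(text, mapping):
    """Substitute the original entity text back in after parsing."""
    for placeholder, ent in mapping.items():
        text = text.replace(placeholder, ent)
    return text

masked, mapping = pretag(
    "Prime Minister Gerhard Schroder visited Cisco Systems Inc.",
    ["Gerhard Schroder", "Cisco Systems Inc."])
# masked is now "Prime Minister ENTITY01 visited ENTITY02";
# untag() restores the original wording after the parse.
```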

> The reason I thought I could/should use both is because that I saw something about pre-tagging helping the Link Grammar Parser by collapsing multi-word named entities that it didn't handle very well.

Where did it say this? I need to change this and get rid of that statement.
My general impression, based on some seat-of-the-pants experience,
is that link-grammar is a much better entity detector than GATE. Basically,
don't use GATE unless you really need the entity tags. And if you really
need entity tags, we should probably modify relex to get these tags
directly from the parse, as often as possible.

> And I thought that post-tagging was needed to get the entity attributes applied to the tree so that RelEx's rules could use them.

Nah, really not needed by relex itself.

--linas

Jim White (james-paul-white) said:
#2

Thanks linas, that solved my question.

Jim White (james-paul-white) said:
#3

The commentary on why GATE's ANNIE was integrated with RelEx was given in:

Ryan Richardson et al., 2006. Automatic Creation and Translation of Concept Maps for Computer Science-Related Theses and Dissertations. In Proc. of the Second Int. Conference on Concept Mapping, San José, Costa Rica. Available at: http://cmc.ihmc.us/cmc2006Papers/cmc2006-p160.pdf

I've tested the cases they cite and indeed using GATE improves performance over LGP alone.

1: Donald Macloud Jr. is going to Belo Horizonte to assume the post of dean at Abraxas University.
(S (NP personID1) (VP is (VP going (PP to (NP locationID2)) (S (VP to (VP assume (NP (NP the post) (PP of (NP dean))) (PP at (NP organizationID3))))))) .)

(S (NP Prime Minister Gerhard Schroder) (VP announced (NP (NP the formation) (PP of (NP (NP a committee) (VP led (PP by (NP Dr. Floyd Green)) (S (VP to (VP investigate (NP (NP Cisco Systems) [Inc and (NP other multinationals)))))))))) .)

2: Prime Minister Gerhard Schroder announced the formation of a committee led by Sir Floyd Green, Ph.D. that will investigate Cisco Systems Inc. and other multinationals.
(S (NP personID1) (VP announced (NP (NP the formation) (PP of (NP (NP a committee) (VP led (PP by (NP (NP personID2) , (NP (NP Ph.D) [ (SBAR (WHNP that) (S (VP will (VP investigate (NP (NP organizationID3) and (NP other multinationals)))))) .)))))))))

(S (NP prime Minister Gerhard Schroder) (VP announced (SBAR (S (NP (NP the formation) (PP of (NP a committee))) (VP led (PRT by) (NP (NP Sir Floyd (ADJP Green ,) Ph.D) [ (SBAR (WHNP that) (S (VP will (VP investigate (NP (NP Cisco Systems) [Inc and (NP other multinationals))))))))))) [)

Surprisingly, ANNIE doesn't recognize a person's degrees, which are certainly an important item and would help performance further.

Jim White (james-paul-white) said:
#4

The non-GATE parse for #1 was supposed to be:

(S (NP Donald Macloud) [Jr (VP is (VP going (PP to (NP Belo Horizonte)) (S (VP to (VP assume (NP (NP the post) (PP of (NP dean))) (PP at (NP Abraxas University))))))) .)

linas (linasvepstas) said:
#5

Hey,

2009/12/23 Jim White <email address hidden>:

> The commentary on why GATE's ANNIE was integrated with RelEx was given
> in:
>
> Ryan Richardson et al., 2006. Automatic Creation and Translation of
> Concept Maps for Computer Science-Related Theses and Dissertations.

The key sentences open the 4th paragraph:

 "The link parser often breaks when presented with unknown, multi-word
entities, such as “Prime Minister Gerhard Schroder” or “Cisco Systems
Inc.”. Instead of adapting the link parser to better handle such
phenomena, ..."

I prefer to fix the parser when it's broken.

Inc. as a suffix was fixed more than a year ago, although I added
Ph.D. and a bunch of others just now.

> I've tested the cases they cite and indeed using GATE improves
> performance over LGP alone.

I've had the opposite experience .. I don't remember the example
any more, but it was something similar to "Northern California
Mountains" which was converted to "Northern LOCATIONID1"
or "LOCATIONID1 Mountains", or something like that, which
then failed to parse.

> Surprisingly ANNIE doesn't recognize a person's degrees, which is
> certainly an important item and would help performance further.

Again -- I have explicit control over link-grammar, but do not over
ANNIE, and can thus fix bad parses.

My current approach is two-fold:
1) manual fixes
2) automatic discovery of syntactic classes.

Unfortunately, 1) is tedious -- e.g. I should enter the contents of
 http://en.wikipedia.org/wiki/List_of_post-nominal_letters
but am too lazy to.

2) is more robust in the long term, but remains very difficult -- it's
a research topic. The result can be spotty -- ANNIE's shortfalls
are purely based on the limited corpus that was used to train it ..
(I think ANNIE is statistically trained, right?) and a similar problem
would hold for any other trained system. To really overcome this
limit, one would have to reason about suffixes, which is even
harder to do.

--linas

Jim White (james-paul-white) said:
#6

Longer term, I'm certainly interested in more general processing solutions, and especially in automated learning, but not surprisingly I have a purely short-term requirement at the moment.

I've also just made the changes to support more suffixes (with and without commas and allowing multiple suffixes) to ANNIE and found it very easy. But I've not even looked at the source for LGP, so I have no idea which is easier. ANNIE is not statistical but instead uses a gazetteer (plain text files) and rules (in JAPE). The relevant sources are in $GATE_HOME/plugins/ANNIE/resources and do not require recompilation for changes to take effect since they are loaded when the plugin is run.

Although named entities are not the most important thing in my current project, the thing I like about ANNIE is it would be easy to extract Wikipedia entities (Wikipedia articles being our training corpus) and dump them into the gazetteer.
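The gazetteer dump mentioned above could be as simple as writing one entry per line (a sketch; the file name, entity titles, and majorType below are made up, not an existing ANNIE resource):

```python
# Hypothetical Wikipedia article titles to treat as named entities.
wiki_entities = ["Belo Horizonte", "Cisco Systems Inc.", "Abraxas University"]

# An ANNIE gazetteer list is a plain text file, one entry per line.
with open("wikipedia_entities.lst", "w") as f:
    for name in sorted(set(wiki_entities)):
        f.write(name + "\n")

# The new list then gets registered in the gazetteer's lists.def, e.g.:
#   wikipedia_entities.lst:organization
# (the majorType "organization" here is illustrative; use whatever
# the JAPE rules are written to look up.)
```

Since the gazetteer files are loaded when the plugin runs, no recompilation is needed after regenerating the list.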

Frames are more important to my current project and I will probably be looking to add more of FrameNet's definitions to RelEx (RelEx has 290 frames compared to about 800 for FN), along with better verb coverage using some existing results for FrameNet/WordNet mapping. I'm also thinking RelEx's concept_vars needs expansion using WordNet or other such source.

Jim

linas (linasvepstas) said:
#7

2009/12/25 Jim White <email address hidden>:

> I've also just made the changes to support more suffixes (with and
> without commas and allowing multiple suffixes) to ANNIE and found it
> very easy.  But I've not even looked at the source for LGP,

It's also very easy. Look at 4.0.dict, and search for Ave. or St.

> so I have no
> idea which is easier.  ANNIE is not statistical but instead uses a
> gazetteer (plain text files) and rules (in JAPE).

Could you provide a simple JAPE rule as an example, e.g.
for handling suffixes?

> Frames are more important to my current project

So what is your project?

--linas

Jim White (james-paul-white) said:
#8

In order to allow commas in suffixes, in name.jape I changed this:

Macro: PERSONENDING
(
 {Lookup.majorType == person_ending}
)

to this:

Macro: PERSONENDING
(
 ({Token.string == ","})? {Lookup.majorType == person_ending}
)

I didn't check the manual and just guessed at the syntax, but it seems to work fine.

And to allow multiple suffixes, in the places where that macro was used, I changed one line in this:

Rule: PersonTitle
Priority: 35
// Mr. Jones
// Mr Fred Jones
// note we only allow one first and surname,
// but we can add more in a final phase if we find adjacent unknowns

(
 {Token.category == DT}|
 {Token.category == PRP}|
 {Token.category == RB}
)?
(
 (TITLE)+
 ((FIRSTNAME | FIRSTNAMEAMBIG | INITIALS2)
 )?
  (PREFIX)*
  (UPPER)
 (PERSONENDING)?
)
:person -->

to this:

 (PERSONENDING)*

The "J" in JAPE is Java of course. What follows the arrow is Java code that fires when the rule is applied, which I omit for brevity. Java is hardly the nicest language for such actions. I use a lot of Groovy because of that. Even Scala might be better.

Some stuff on GATE/ANNIE/JAPE:

http://videolectures.net/gate06_tablan_gaclj/
http://textanalytics.wikidot.com/gate
http://semweb.weblog.ub.rug.nl/node/150

I'm an undergraduate at UC Irvine working on Eric Baumer's Computational Metaphor Identification (CMI) project. The version that he developed for his doctorate uses Stanford Typed Dependencies as the relations to match, and I am replacing that with a parser that does Semantic Role Labeling in order to generalize the syntactic forms that can be identified as employing the same metaphor.

http://www.calit2.uci.edu/calit2-newsroom/itemdetail.aspx?cguid=3d48fa4c-043e-4fe3-b9fe-5b58a44b5b90
http://ericbaumer.com/projects/computational-metaphor-identif.html

Jim