scrumph.org


Scrumph : Internal Scrum

2009/12/2

  • I re-arranged hypotheses. I marked up the methods intro with section numbers.
  • I will rewrite the headings of the methods section. I will re-arrange methods intro so the section numbers look less stupid. I will mail Sandra to set up a time on Friday to talk about the new methods sections.
  • Hypotheses section still sucks pretty bad. There is a lot of noise text left to be excised before giving a draft to Sandra.

2009/12/3

  • I rewrote the headings of the method sections. I re-arranged the methods intro so that at least the section 3 references are in order (though nested within section 5 references). I mailed Sandra and set up a time around 1:30. Maybe earlier; I will probably go to the reading group.
  • I will write stubs for unfinished method sections. I will rewrite the hypotheses section to remove the noise. I will clean up the whole thing for noise and send a copy to Sandra.

2009/12/4

  • I wrote stubs for the unfinished method sections. I rewrote the hypotheses sections to remove noise, along with the whole thing, and sent a copy to Sandra.
  • I will meet with Sandra to ask her about the appropriateness of the new subsections. I will expand the ones that I keep as much as I can, then do additional research to find out what to put in the others.
  • I have a ton of errands to run. Probably I should turn in my Swedish book soon and try to find one that actually has linguists in mind.

2009/12/7

  • I met with Sandra, she gave me some advice. I did research on kinds of backoff and wrote up a couple of the sections.
  • I will finish research for alternate distance measures section, write up all sections, and maybe start making them sound good.

2009/12/8

  • I finished research for the alternate distance measures section and wrote up all sections. None of them sound particularly good.
  • I will make all sections sound good.
  • Some of the sections still need a little research and some citations (textbooks, mainly, though)

2009/12/9

  • I made all the sections sound good, except for the last sentence of each one. ugh. I added citations from the appropriate papers where they were missing.
  • I will double-check the last stupid-sounding sentences and re-read the whole methods section, then send to Sandra. The rest of the day I will work on converting to git (installing on peregrin if needed), resolving unicode problems in Consts.hs and investigate the lexicalisation of the Berkeley parser.

2009/12/10

  • I added more than I thought I’d have to to the proposal, then sent it off to Sandra. I switched to the more reliable way of storing non-standard diacritic sequences in Haskell in Consts.hs.
  • I will start testing the build process with tagPos, because it calls Swedia.extractTnt, which I’m working on. I will verify that both the Python and Haskell versions reproduce all the relevant words from the interviews.
  • Lexicalisation of the Berkeley parser (trainCfg) is delayed until tagPos and tagDep are tested. Also I need to figure out a programmatic way to verify the extractTnt output.

2009/12/11

  • I finished the switch to git with a massive cleanup and addition of some overlooked files to source control. I fixed a couple of bugs with Swedia.hs and then found a couple more that I missed.
  • I need to write some tests for Swedia. But I also need to move on to the other parts of the experiment.
  • I should probably (re?)download HUnit and QuickCheck

2009/12/14

  • I wrote some tests for Swedia. They aren’t very good and I haven’t integrated them into any kind of build. I updated HUnit and QuickCheck to the newest versions.
  • I will make sure my entire experiment still builds. I will figure out what test data to work with, either fake files or fake literals or a real file/directory.
  • Pretty soon I need to work on ConvertTalbankenToPTB so I can run Berkeley results. Also I need to make sure banks can build my entire experiment; it’s a lot faster than jones and all of my corpora are free so I might as well copy them over.

2009/12/16

  • I checked that tagPos builds and got testing of swedia to run. I did not figure out what test data to use.
  • I will figure out what test data to work with. I will re-read Convert..PTB code and some Talbanken files to remember how to get lexical items from the XML.
  • Need to set up connection to jones so I can test the build.

2009/12/17

  • I copied an example .cha for testing. I wrote a skeleton for testing Convert code, downloaded an example *.tiger.xml, and looked at the types to see which functions were pure vs impure.
  • I will write some tests for ConvertTalbankenToPTB, and keep an eye out for how to modify to include lexical items. I should test the entire build today too, in the background.

2009/12/28

  • I wrote some tests, modified ConvertTagsToTxt to use lexical items, and re-ran the Berkeley parser on this. Haven’t got the results yet.
  • I guess I will test my lexical subset code for talbanken. Except I don’t have talbanken so I guess I’ll just use some txt file.
  • I need to check the parser results and start running R on them.
  • Note: I fixed a critical (?) bug in ConvertTagsToTxt. Before lexicalising, sentenceEnd was wrong, meaning that (groupBy sentenceEnd) was wrong, meaning that everything was one long sentence??
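For the record, the shape of that sentenceEnd bug is easy to reproduce. A minimal sketch (Python here rather than the Haskell of ConvertTagsToTxt; the function name, token shape, and "." end-marker are my own illustration, not the real code):

```python
# Hypothetical sketch of the sentence-splitting step: tokens are
# (word, tag) pairs and a sentence ends at a "." token. If the
# end-of-sentence predicate never fires, everything collapses into
# one long "sentence" -- the bug described above.

def split_sentences(tokens, is_end=lambda tok: tok[0] == "."):
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if is_end(tok):
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a final end marker
        sentences.append(current)
    return sentences

toks = [("hej", "IN"), (".", "MAD"), ("ja", "IN"), (".", "MAD")]
assert len(split_sentences(toks)) == 2
# A broken predicate (always False) reproduces the one-long-sentence bug:
assert len(split_sentences(toks, is_end=lambda t: False)) == 1
```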

2009/12/29

  • I wrote a bunch of tests for ConvertTagsToTxt. Maybe others? I can’t remember.
  • I will write more tests, mostly for ConvertTagsTo[Conll|Txt]
  • I wish I knew how to lift all from Bool to QuickCheck Property.

2009/12/30

  • I wrote a bunch of tests for TestConvertTags. I got the skeleton for TestPath up.
  • I will write tests for DepPath. I might start planning my dissertation chapter layout.
  • I am blocked on testing Path and testing modifications to the entire build; I need network access for that.

2010/01/04

  • I wrote tests for DepPath and planned my dissertation writing effort. It is below under *Plan.
  • I will fix DepPath and research Latex chapters. I will rerun everything from the top once I find out how long Yuyin’s lopar will run. (I should e-mail her). I might need to prioritise the switch to banks.
  • It appears that I am getting significant results and that I didn’t understand the output. ‘.’ contributes to significance; ‘*’ does not. HOWEVER, I was getting significant results on erroneous Dep formatted stuff, so who knows. This is like the 3rd time this has happened ok.

2010/01/05

  • I mostly fixed DepPath and downloaded the IU dissertation style sheet (as of 1999, in the Math department). I was misreading the output, and I am getting significant results! I am not sure what the distances are, though. It’s only printing ‘1’ for all, which I do not think is an average. I need to check the code again.
  • Today I will test running the experiment on banks only if I have time. If I have still more time, I’ll start figuring out how to parallelise it using multi-run.
  • Still need to make DepPath print the region name at the top of the file.

2010/01/06

  • I got multi-run working and it is super fast. I also fixed DepPath for real.
  • Today I will (1) reorganise icectrl.cpp so that only the currently used code is in the main file, and everything else is in an include file. I will also chop up my proposal into dissertation chapters and paste the pieces in.
  • Still need to re-run DepPath.

2010/01/07

  • I reorganised icectrl into 3 files, changed the name to icesig.cpp, and added icedist.cpp. I dumped the distances and loaded them into Excel. I pasted a bunch of proposal text into my dissertation, plus some qual paper.
  • I will dump R tables and generate histograms. I will download a map of Sweden from Bing and annotate it with pixel locations so I can figure out the distances, in order to have a geographic distance matrix also.
  • Later I also need to write C++ to dump the features comparing an entire region to an entire other region. Then I can read them with something decent like Python or Haskell and do some analysis. I should also paste some of the detailed methods from my qual paper into the methods chapter. There is some detailed discussion of R, normalisation and leaf-ancestor paths.

2010/01/08

  • I annotated a list of locations with their relative geographical location. I dumped all data to R-format text tables. I generated PDF cluster diagrams (Ward’s method worked best).
  • I will run the correlations and establish which are significant. I will write them up. I will make a nice diagram for visualising the clusters on the map of Sweden. I will write code to dump the diffs between the features of each set.
  • Tomorrow I should start putting all the diagrams into the dissertation.

2010/01/11

  • I ran correlations, found the significant ones and put them in a Latex table. I downloaded OmniGraffle and marked up the SVG/PDF map of Sweden from Wikipedia (available under CC GPL-Like licence). I wrote the code to dump per-feature diffs in C++. Then I wrote some Haskell code that reads in each file and (currently) prints the header plus the top and bottom 5 features. I ran it but there are way too many features to look at.
  • I will create a dict in consts of the Agree or Dep clusters, write some code in Norte to concat files sans the header (which I now think is a bad idea), then run the feature extraction on them. This will make only 10 comparisons to look at instead of 528.
  • Need to pay for OmniGraffle; find out how to get the academic discount.

2010/01/12

  • I created a consts.py dict to map agree-clusters to sites, then wrote some code in norte.py to produce cluster feature analyses. Since I don’t delete tmp files, RankFeatures extracts every comparison: 528 + 10. I cut the last 10 out and pasted them in results.tex. I also reformatted the Swedia cluster map to use larger fonts and coloured dots.
  • I will figure out a way to visualise and analyse the top feature differences between clusters (maybe Excel?). I will format them nicely in the results section. I will re-format the clusters so that they fit the page. I will work on pasting some more methods in and then look at the intro/hypotheses–I’m still not sure how that should go.

2010/01/13

  • I put the top 5 top/bottom features from each cluster in Excel and wrote some short analysis on the strength of each cluster. I reformatted the clusters. I pasted in some more method detail from my qual paper and wrote some more on the intro.
  • I will download RuG-L04’s newest version and get it to generate nice diagrams. I will re-read my intro and figure out what goes in between the introductory material and the hypotheses. Or after them. I should look at Heeringa’s thesis again.

2010/01/14

  • I downloaded and figured out L04’s basics. I didn’t reread my intro. I don’t have borders to my map, but I did include it already in the dissertation.
  • I will spend a couple of hours trying to add simple borders to the L04 maps and then re-read my intro.
  • I should upload a copy of dissertation.tex so that I can show Sandra. Also I need to print an invoice and check that AFP still works on jones. Also I should ask Sandra when/how much I should start sending on to Henrik Rosenkvist.

2010/01/15

  • I got simple borders added to the MDS diagrams from L04. I started an overview section at the end of the background chapter. I tweaked the wording earlier in the chapter, but didn’t touch the hypotheses from the proposal.
  • I will meet with Sandra to find out what I still need to do on the proposal. I will also ask her about variations on the results, general layout of the dissertation, organisation of the first chapter (maybe) and when/how much results to send to Henrik Rosenkvist.
  • Still not sure what to do about hypotheses in the background chapter. Later today: create hypotheses separate chapter, generate figures for POS run and see how good they are. Revise proposal with Sandra’s revisions.

2010/01/18

  • I met with Sandra and got the changes to make for the proposal, plus advice of who to talk to, plus some suggestions for dissertation layout and experimental paths.
  • I will make the proposal revisions today. I will probably have enough time to implement Sandra’s redep idea too.
  • I need to draft e-mails to Henrik Rosenkvist and Rex Sprouse asking for (1) inspection of my results and Swedish syntactic dialectology papers and (2) inspection of TnT/Berkeley/Malt’s annotation outputs. I will probably need a better way to visualise them because he is a German syntactician and may (may not?) have even worked with dependencies. Actually this holds for Rosenkvist too…

2010/01/19

  • I made the proposal revisions. I implemented (badly/mostly) the redep idea. It appears to be working.
  • I will run redep. I will edit the later steps to account for redep. I will mail my committee members. I will look at the hypotheses chapter and figure out what to do with it.
  • Still need to mail Rosenkvist and Sprouse.

2010/01/20

  • I mailed my committee members. I wrote KL divergence and JS divergence, and ran them along with redep after editing build.py to account for 3x4 variations. I finished generating r-redep figures. I did not look at the hypotheses chapter.
  • I will continue mailing committee members to find out whether next week or the week after that is good. I will make a presentation of Work So Far and To Do, divided into Analysis and Writing. I will outline a bing presentation for the end of the month too.
  • Remember to mail Rachel, take out the trash, and find a present for Mom (and maybe Daniel)! Still need to mail Rosenkvist and Sprouse. Maybe should run r^2 too, now that I’m running all these others.
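The two divergences mentioned above can be sketched in a few lines. This is just the textbook definitions over discrete distributions (function names and list representation are my own, not what the build uses):

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence D(p || q) for discrete distributions;
    # by convention 0 * log(0/q) = 0, and q must be nonzero wherever p is.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    # Jensen-Shannon divergence: symmetrised, smoothed KL via the mixture m.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
assert js(p, q) == js(q, p)  # symmetric, unlike plain KL
assert kl(p, p) == 0.0
```

JS is the safer of the two for feature distributions, since it stays finite even when one side has zero counts.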

2010/01/21

  • I mailed committee members some more. Looks like it will be in the afternoon. I made a proposal defence presentation (draft) and a bing presentation. Looks like I will give them the same day.
  • Need to mail one last time. I will start composing a mail to Rosenkvist. I will add r^2 and dependency arc labels and rerun everything with them. I will start doing SOMETHING to questions.tex.
  • Need to mail my Friendspeak partner too. Argh, Mail!

2010/01/22

  • I mailed Rosenkvist. I wrote a couple of pages of stuff for the first question. I added r^2 and dependency arc labels and reran everything. I made a little chart of significances. They are not consistent. I did not mail the committee members.
  • I will mail Rex Sprouse. Include an example sentence to get him interested. I will mail my committee members. I will write a couple of pages for the second hypothesis and a couple for alternate measures.

2010/01/25

  • I wrote a few pages for the second hypothesis and some more for alternate measures. I made some diagrams for constituency and dependency parses and I don’t need Rex’s help to see how bad they are.
  • I will implement real trigrams and call the current implementation retrigram. I will implement unigram/reunigram for comparison. I will filter the http://uit.no/scandiasyn/bib2003 list down to Swedish ones. I will bug my committee to finalise the proposal defence.
  • I should also reply to Henrik.

2010/01/26

  • I implemented real trigrams and unigrams and re-ran everything. I filtered the uit.no list down to Swedish ones (probably). I wrote an R script that does some of the correlation/significance/figure work for me. I wrote a status e-mail to Sandra.
  • I will work a whole bunch on Questions and maybe a little on the intro (although the summary is pretty hard to write until the rest is written). I will look over the proposal defence again and see if anything more occurs to me.
  • Still need to track down those articles and start reading them. Should also reply to Henrik soon. Not sure what to say.

2010/01/27

  • I wrote a second draft of the first question and started alternate measures. I looked at the proposal defence for like 20 seconds. I added some results instead of . I read a little more about R.
  • I will finish Questions with the Features question and finishing Alternate Measures. I will try to learn a bit more about R so I can finish automating the significance tests. I will look more at the proposal defence.
  • I should STILL mail Rosenkvist and start reading those papers. (besides the apparent cleft one)

2010/01/28

  • I finished Alternate Measures and Distance Terminology (with real math!). I started a little on Question 2. I wrote a script to properly create ‘all’ features and reran sigs/dists. I finished automating the R correlations (though I may need to revise it, it works now). I didn’t look at the proposal defence.
  • I will finish question 2 of the questions.tex. I will re-run genAnalysis and genMaps to include the All feature set.
  • Yeah yeah, mail Rosenkvist. Tomorrow I will revise bing presentation and defence presentation.

2010/01/29

  • I almost finished question 2. I re-ran genAnalysis and genMaps. (I wrote a bunch on Chu-Liu-Edmonds too…)
  • I will finish question 2 (hopefully almost done), send a draft off to Sungkyoung, and move on to revising the defence presentation, especially the Questions part (now that I understand it better).
  • I think I need to say to Rosenkvist something like, I don’t have a good write-up of the features yet and am still producing results. If you are interested, I can send you an update in a few weeks. Also thank you for the paper links. If I have trouble finding some I may ask you how to obtain them. So far your papers have been useful, especially the SSAC one.

2010/02/01

  • I basically finished question 2 and revised the defence presentation.
  • I will practise my defence presentation. I will copy the presentation (and the bing one) to jones. I will read Therese’s paper and the dissertation she linked. (I will also work on taxes and finishing the CLE example).

2010/02/02

  • I mailed Therese back. I practised my defence presentation. I read the Groningen work but not yet Schaeffler’s dissertation.
  • I will give my presentation. I will read Schaeffler’s dissertation.

2010/02/03

  • I gave my presentation. I skimmed Schaeffler, should go back and read the second half more closely soon.
  • I will attend the parsing group and get more suggestions. I will order the suggestions from the defence and the group and figure out which to do.
  • I already implemented two more feature set suggestions so I should run those today while I am figuring out what to do.

2010/02/04

  • I attended parsing group and got more suggestions. I ordered the suggestions and figured out some to do. I found out that I was including some unglossed transcriptions in my tagging input. I re-ran dist/sig for the new feature sets.
  • I will work on fixing TnT’s unknown words. I will e-mail Henrik about SOMETHING. I will read a Swedish dialectology paper. I will find out about requesting money from the Householder Fund. I will re-run the analysis/map generation for the new features.
  • Reprioritise mails to Henrik since the 3 northern sites went away. I wrote a draft a few days ago. Also mails to other people.

2010/02/08

  • I improved parts of speech. I e-mailed Henrik. I read another paper by Wybo. I don’t know yet if I’ll need Swedish help for annotation. So far it seems easy enough with a dictionary.
  • I will look at the dependency parses. I will finish tagging the top 200 words. I will re-read the Berkeley parser documentation to find out about unlexicalised parsing.
  • Should also reply to Therese soon, and ask Margaret about Householder Fund tomorrow. I need to update/check The Plan too.

2010/02/09

  • I looked at the dependency parses. I finished translating the top 200 words. (This doesn’t mean they’re properly tagged, much less added to the training.) I did not read the Berkeley parser documentation. Instead I played with MaltParser to try to get a good parse.
  • I will figure out how to fix the MaltParser parses. I will re-run everything. I will start reading about the Berkeley parser options.
  • Should mail Therese too, eh.

2010/02/10

  • I figured out how to fix the MaltParser parsing. I started re-running everything. I read about the Berkeley parser but the documentation is mostly incomplete. Something about unknown words falling back to a text file, though.
  • I will waste most of today re-running dist/sig and piddling around. I need to count the number of fine vs coarse POS tags. I need to cluster things in path/dep common clusters, then look at the best tags to tell apart the clusters.
  • Upcoming: need to rewrite first section of Questions. Need to mail Therese. Tomorrow I need to ask Margaret about Householder fund. I also need to finish the refactoring move of MEASURES and FEATURES to consts.py

2010/02/11

  • I re-ran everything, including syntaxFeatures, because I decided only to use BIG-THREE clusters (plus slop) and redid consts.py accordingly. dep features are better, but still questionable. Maybe I need more normalisation.
  • I will force Berkeley parser to -useGoldPOS. I will test move of MEASURES/FEATURES to consts.py. I will write e-mails. I will ask Jan about Householder award. I will analyse the most important tags. I will write up a list of tips for the undocumented parts of the Berkeley parser (ie how to make sure grammatical functions are used, their incorrect Conll parser, and -useGoldPOS)

2010/02/12

  • I forced Berkeley to -useGoldPOS and wrote code to generate its fake Conll representation. I looked at the features but couldn’t come up with anything conclusive except that the big groups have pretty generic, pretty important features. I wrote a tip list for the Berkeley parser.
  • I will write cosine similarity, add it and remove redep/rearc. Then I will regenerate .dats and rerun syntaxDist/sig/features and then genAnalysis/maps. I will add entire stages to build.py to make running everything easier (even if it IS distributed haphazardly across 3 machines). I will reply to Jan and Sandra about the Householder award
  • Ask Jan about the hourly wage. Need to compare just the furthest site (Jämshög) to everything else and see what I get. Make a budget based on 9-10 $/h.
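The cosine similarity planned above is small enough to note down here. A stdlib sketch over feature-frequency vectors (names are mine, not those in the C++):

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two feature-frequency vectors:
    # dot(u, v) / (|u| * |v|); 1.0 means identical direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

assert abs(cosine_similarity([1, 2, 3], [1, 2, 3]) - 1.0) < 1e-12
assert cosine_similarity([1, 0], [0, 1]) == 0.0
```

For use as a distance alongside the other measures, one would typically take 1 - similarity.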

2010/02/15

  • I wrote cosine similarity. I removed redep and deparc. I regenerated features and reran my syntax distances and analysis. I grouped build.py’s functions into per-machine stages and set up SSH so that the correct files can be copied by build.py (password is stored on the machine). I replied to Jan and Sandra but not with a proposal yet.
  • I will reword the intro to the questions chapter. Lengthier and less (no?) first person. I will hunt down the Swedish syntax references. I will write a plan for paying Swedish students and send it to Sandra for checking.
  • Still should compare Jämshög to every other site. (Also mail Jiaan, Che-Hui, Deborah and maybe the other MS recruiter I talked to)

2010/02/16

  • I rewrote the first third of the questions chapter. I re-read Wybo’s 2008 paper. I should cite it, but it still doesn’t look like I need the 3rd normalisation because I don’t sample per-speaker before sampling per-sentence. I did not write a plan or hunt down any new Swedish papers.
  • I will find the (possibly) relevant papers and download or figure out how to get paper copies. I will write up a proposal for paying Swedish students.
  • I should probably do some other stuff but I’m not sure what. Maybe compare Jämshög. Also read up on consensus dendrograms.

2010/02/17

  • I found as many relevant papers as I could. A couple are only available on paper, one is in the library so I need to recall it. I wrote up a proposal for Swedish annotation and sent it to Stuart.
  • I will get people to look at the Jamshog features. I will get suggestions on how to train annotators. I will see if anyone else knows about consensus trees, and if not, go back to reading web pages on them.
  • LSA-style whole-feature-space optimisation? Not sure this has the right shape. Also check out Emms paper on comparing trees? Try old version TIMBL+old version MaltParser for training w/o SVMs. Or at least PCA if not LSA.

2010/02/18

  • blah. I kind of did it.
  • I will add stubs to methods.tex. to remind myself: PCA, LSA, alternate training for timbl+malt, alternate params for berkeley training. there is only one that I see for berkeley (viterbi) and it’s for parsing but whatever. Anyway, I’ll need a Big Red account for timbl+malt. Also write Sandra to ask her about training annotators? Maybe. Of course there’s a lot of other things I’ve already added like phrase-structure rules and KL divergence.

2010/02/22

  • I added stubs for everything. I spent a LOT of time trying to read a C++ implementation of consensus trees before finally re-implementing them myself in Haskell by reading a paper with a two-sentence description. I got a Big Red account but Sandra didn’t send me the binaries yet.
  • I will write Sandra and Ariel reminding them. I will write Rex asking him about who to contact in the German department. I will write a reader/writer so my consensus tree code can talk to R (or maybe port it to R–gulp). LSA and Malt+Timbl
  • I should probably blog the consensus trees now that I can and finish my ICE post.
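The majority-rule consensus idea behind that Haskell reimplementation really does fit in a two-sentence description. A sketch (Python here, with each tree represented as its set of clades, i.e. frozensets of leaf labels -- an assumed representation, not what Consensus.hs uses):

```python
from collections import Counter

def majority_consensus(trees):
    # Each tree is given as a set of clades (frozensets of leaf labels).
    # The majority-rule consensus keeps exactly the clades that occur in
    # more than half of the input trees; such clades are guaranteed to be
    # pairwise compatible, so they form a valid tree.
    counts = Counter(clade for tree in trees for clade in tree)
    threshold = len(trees) / 2
    return {clade for clade, n in counts.items() if n > threshold}

t1 = {frozenset("AB"), frozenset("ABC")}
t2 = {frozenset("AB"), frozenset("ABC")}
t3 = {frozenset("BC"), frozenset("ABC")}
assert majority_consensus([t1, t2, t3]) == {frozenset("AB"), frozenset("ABC")}
```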

2010/02/23

  • I wrote Sandra and Rex. I didn’t write Ariel. I got a reader for R’s consensus trees working, but the algorithm crashes.
  • blah

2010/02/24

  • I got the files from Sandra. I updated her on my problems but she was feeling pretty sick and didn’t give any ideas. I fixed the consensus tree algorithm.
  • I will output consensus trees to qtree. I will get Big Red started on Malt+Timbl
  • LSA program

2010/02/25

  • I output consensus trees to qtree. I got Big Red started. malt doesn’t work.
  • I will get malt working. I will investigate academic-style LSA-style programs.
  • Bother I forgot that old-malt trained on old-timbl needs old-timbl around for parsing too. So I will have to port that part of build.py over to whatever crappy Python Big Red has. Yup. 2.3.3. What. A surprise. Matches GCC 3.3.3 perfectly.

2010/02/26

  • I got malt working. I investigated academic-style LSA programs and couldn’t find any. They are all commercial, so I am trying to get a faster knock-off to run instead. Probably gives garbage but whatever.
  • I will rerun the timbl-trained dependencies as redep (along with re-adding deparc). I will get random semantic vectors running. I will read about Timbl’s options and decide on what to vary.
  • Register with ACL 2010 demo reviewer site. Publish Consensus.hs post. Also I need to reply to Rex and Hashim and bug somebody ELSE at Microsoft. (and Therese and Henrik? argh!)
  • Tomorrow: I definitely have to write e-mail. I also need to take a look at the most recent round of results in Excel like.

2010/03/01

  • I re-ran everything from timbl on down. Haven’t looked at the results. I figured out how to get semantic vectors working (untested). But the theory doesn’t look appropriate. I will ask about it on Wednesday. I reviewed the ACL demos on Saturday (first draft, on peregrin).
  • Still need to reply to Rex. And Hashim. And mail Ross. (and Therese and Henrik?) I will try another set of parameters for Malt+Timbl training. I will alter icecore to allow full-corpus comparisons, using #ifdef. I will look at the first set of Malt+Timbl results. I will add these other methods to the subsections in methods.tex and start writing them up.
  • A lot of mail! Try that C++ Mersenne twister from the consensus tree code too.

2010/03/02

  • I started another Timbl run. Hasn’t finished yet. I ran a full-corpus run. Haven’t looked at the numbers yet.
  • I will alter build.py/norte.py/genAnalysis.R to handle fullcorpus (use the ‘interview’ slot since I’m never going to do larger a priori groupings) (Consensus.hs needs updating too because it’s so hardcoded). I will work on methods.tex.
  • I wrote some e-mail! Still need to mail Therese and Henrik. Probably ask Sandra what is a good delay for this to be polite.

2010/03/04

  • I did what I said I would on Wednesday. Plus a meeting with parsing group. Got plenty of suggestions, stored on paper.
  • I will clean up dissertation.tex for Sandra. I will get full/1000 variations both running (tonight). I will get geo/travel analysis working.
  • MaltEval is what I want for Malt+Timbl hold-out w/Talbanken. Also having troubles with recruiting Swedish speakers.

2010/03/05

  • I cleaned up dissertation.tex. I got full/1000 variations running. I got geo/travel analysis working.
  • I will read up on MaltEval. I will learn about MDS (Kruskal 1964). I will check to see if any of my measures satisfy the triangle inequality. It would be cool if they did but very unlikely.
  • Still bad troubles recruiting Swedish speakers.

2010/03/08

  • I read about MaltEval and it looks very easy. I read Kruskal’s papers about MDS and I can explain it at two levels of detail. (ie with or without equations.) I checked to see how often I violate the triangle inequality (about 3k of 27k 3-point combos). I asked Don about Swedish speakers.
  • I will mail Don about Swedish speakers. I will mail Sandra about same. I will start the ten-fold cross validation of different Malt+Timbl settings. It will take a long time. I will start writing about MDS.
  • Also need to mail Ross again and reply to Therese and that Russian guy.
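The triangle-inequality check above amounts to counting violating 3-point combos. A sketch of how that count could be done (the dict-of-pairs distance representation is my own assumption):

```python
from itertools import combinations

def count_triangle_violations(dist):
    # dist: symmetric distances as a dict mapping frozenset({a, b}) -> d.
    # A triple (a, b, c) violates the triangle inequality if any side
    # is longer than the sum of the other two.
    sites = sorted({s for pair in dist for s in pair})
    bad = 0
    for a, b, c in combinations(sites, 3):
        d_ab = dist[frozenset((a, b))]
        d_bc = dist[frozenset((b, c))]
        d_ac = dist[frozenset((a, c))]
        if d_ab > d_bc + d_ac or d_bc > d_ab + d_ac or d_ac > d_ab + d_bc:
            bad += 1
    return bad

d = {frozenset(p): v for p, v in
     [(("x", "y"), 1.0), (("y", "z"), 1.0), (("x", "z"), 3.0)]}
assert count_triangle_violations(d) == 1
```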

2010/03/09

  • I mailed Don and Sandra. I started 10-fold cross-validation. I wrote about MDS and a little about consensus trees.
  • I will mostly work on diagrams and explanation for clustering.
  • Still Swedish problems, reply to Therese and Russian guy etc etc.

2010/03/10

  • I worked on diagrams and explanation for clustering and consensus trees. The 10-fold cross validation crashed after running for a day (there is a 100ks hard limit on Big Red boooo). I mailed another person in the OIS after .
  • I will finish diagrams and explanation for feature ranking. Correlation and combining features too, if I can think of anything. Also I will test random number generation. I will restart the Malt+Timbl test for the third time.
  • Should reply to Therese. Yes indeed.

2010/03/11

  • I finished diagrams and explanation for feature ranking, except for comparison to Wybo’s 2009 paper.
  • I will mail Therese to thank her. I will read Wybo’s paper and decide if his third normalisation is applicable and useful to my ranking problem. I will bulk up Combining Feature Sets with the hoky Giant Bag method that I actually used. I may just axe Feature Backoff since it doesn’t look like I need it. Of course it might help but the results now are decent, so…
  • Wybo’s second normalisation is optional. So I can justify making mine optional too. But this requires more work to do so, plus 2x computer time to re-run results. Boo. Oh well, I am changing stuff SLOWER than before. So that’s good. Soon I should start running MaltEval on the Malt+Timbl settings that have already finished. Also also, I just noticed that I really need to rewrite the intro to the methods section, since it’s still written in future-tense proposal style.

2010/03/22

  • I mailed Therese and read Wybo’s paper. Pretty sure his within-group normalisation doesn’t apply because I pretend that it’s OK to combine people (since I only have 4-per-site anyway).

  • I will work toward finishing 1st draft of methods.
  • Still need to (1) gather Malt+Timbl results (2) re-run everything (though maybe AFTER running redep with the best Malt+Timbl parameters)

2010/03/23

  • I got a little done on the methods.
  • I will finish distance measures in methods. I will prepare for parsing group tomorrow by comparing the Malt+Timbl results. Maybe re-run everything too?
  • I should probably mail some people, ugh, like Alicia Bower or so. Still need to rewrite the intro to Methods. By the way, here is the MaltEval command line: java -Xmx1G -jar lib/MaltEval.jar -s result-Overlap-3-?.conll -g test-?.conll > eval-Overlap-3.txt (for 10-fold cross-validation)

2010/03/25

  • I finished distance measures. I presented at parsing group and got no (!) new suggestions.
  • I will run some new 10-fold variations on Jeffrey’s classifier. I will start the second pass on the methods chapter. I will finish modifying C++/Python/R code to generate freq/ratio difference (ie 2nd normalisation or not)

2010/03/26

  • I started the Jeffrey 10-fold training. I wrote the C++/Python/R to switch on/off the ratio normalisation. I am about a quarter through the second pass on the methods chapter.
  • More writing on the second pass of methods. Restarting 10-fold testing when needed.

2010/03/29

  • I got 2/3 done with the second draft of the methods chapter. I ran 3 of the 4 Jeffrey methods, and one failed. I guess it was an invalid combination of parameters or something.
  • I will run the best Jeffrey parser and document the parameters in Methods. I will try to get as close to finishing the second draft as possible.
  • I see that the digit of the month in April is 2. I guess it’s 1x in March, then 2x in April. We had to scrounge out the 0x days in early March and late February.

2010/03/30

  • I ran the best Jeffrey parser and documented the parameters. I just got TnT documented, nothing else.
  • I will regenerate features and then start running the (hopefully, nearly last) run of distance/syntax. I will read Petrov/Klein 08 on the Berkeley parser so I can explain how it works.
  • I ran some errands today too so that could explain my laziness. But not really.

2010/03/31

  • I wrote up the Berkeley parser. I ran syntaxDist/syntaxSig.
  • I will write up MaltParser. Hope to finish some other stuff in methods too. Maybe I will run genAnalysis too if I get bored and want to debug R code…

2010/04/02

  • I wrote code in build.py to generate composite cluster maps. I read about PCA in R, but I’m not convinced. I wrote about MaltParser. I wrote up, kind of, Wiersma’s normalisation for measuring feature overuse.
  • I will write an example for Wiersma’s normalisation (this should help finally prove to me if it’s useful.) I will write code I guess through #ifdefs in icecore.h, if it is.
  • Bah humbug. FParsec sucks. Hey! Remember to write something to filter out non-significant distances, assuming that clusters, MDS, etc can work with some distances removed.

2010/04/05

  • I finished the second draft of Methods, and sent it to Sandra. I fixed the bugs in generation (space in #end if, as well as incorrect output file name for feature extraction) and ran stage 4 and 5 over the weekend. I finished the outline for Results.
  • I will generate analysis and start pasting in the results I guess.
  • The outline is kind of sketchy near the end where I got bored. Also I need to run the single feature extraction wrt Jamshog.

2010/04/06

  • I generated analysis and got the significances formatted nicely. Some analysis too.
  • I will put in correlation and cluster results.
  • Still need to worry about filtering out only the non-significant distances instead of keeping only combos that have ALL significants. Maybe not such a problem; the sigs are pretty good for the 1000-sample side.

2010/04/08

  • I got correlation and cluster results in, plus self-correlations. Clusters are not well presented so probably not done yet. (Also the writing is a bit lacking.)
  • I will add the appropriate consensus trees plus some writing.

2010/04/09

  • Consensus trees are kind of mostly in.
  • Need to improve appearance and include mapped versions. Should also start on composite cluster maps.

2010/04/12

  • Improved appearance a little in OmniGraffle and included mapped versions, plus a key which looks OK but out of place… too colourful for a dendrogram. Also put in composite cluster maps, but they look bad. Also put in some MDS maps, which look bad too.
  • I will integrate Therese’s maps so that my RuG diagrams look better. I will mail my committee today too.

2010/04/14

  • I skipped a day because I was trying to figure out which feature rankings to include. I finally have some idea, but the results are bad and probably wrong.
  • I will check for size/distance correlation. I will check for bugs in the overuse normalisation since all results are the same.

2010/04/15

  • I checked for size/distance correlation. I went back and forth, deciding that size1 + size2 is right, not abs(size1-size2). This makes ratio norm look bad. BUT, size/travel also correlates significantly at 0.32. Also I threw out single-region uniques in icefeat.cpp. Then I improved formatting a tiny bit.
  • I will put in the new results from icefeat.cpp. I will rewrite the text in Results. I will format the feature ranks to be easier to read.
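The size-term decision (size1 + size2 versus abs(size1 - size2)) can be sanity-checked by correlating distances against both candidates; the numbers below are synthetic stand-ins for the real corpus values, just to show the shape of the check:

```python
# Sketch: correlate pairwise distances with two candidate size terms.
# The site sizes and distances here are made-up illustration data.
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

sizes = [(500, 700), (900, 1500), (800, 900), (1200, 2000)]
dists = [0.20, 0.35, 0.15, 0.40]

r_sum  = pearson([a + b for a, b in sizes], dists)        # size1 + size2
r_diff = pearson([abs(a - b) for a, b in sizes], dists)   # abs(size1 - size2)
```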

2010/04/16

  • I put in the new results from icefeat.cpp. I wrote some text for the feature ranking to make it easier to follow (though it is still hard). I did not rewrite the rest of Results yet. I reformatted the feature rank labels but not the figures themselves.
  • I will rewrite Results. That’s probably it.
  • Otherwise I need to start addressing Sandra’s comments in the earlier chapters. Also, I need to see if Kaleb wants to car pool to Toli’s wedding. Probably should take my car this time…

2010/04/19

  • I finished rewriting Results. There was more new stuff needed than I thought. But at least I got it sent off. Over the weekend I wrote some summary and beginning of analysis. Not sure if it’s useful.
  • I will start writing up my analyses. Probably just the stuff I’ve already come up with. The idea is to have a complete dissertation sooner, even if it’s unpolished. “Great artists ship” and all that.
  • I should mail Therese this week to see if I can get the current version of her dissertation.

2010/04/20

  • I wrote up discussion of significances and some on correlation. Discussion of self-correlation and size correlation isn’t done.
  • I will look at the rest of correlation and decide if I can say anything useful about it. If not I will move on to cluster dendrograms (probably focussing on the consensus tree actually)
  • Remember to mail Therese. I will probably attend Alex and Mike’s talk and also ask Sandra if. Maybe I will find out if Mike will be here during the week of syntaxfest.

2010/04/21

  • I finished correlation (mostly) and started on clusters. Especially consensus trees.
  • I will finish (?) talking about clusters. Or try to.
  • Remember to mail Therese. See if my stupid idea of just dividing by the number of sentences will work. See if I need to mail committee a prospective defence date. Also, I can map consensus clusters onto Therese’s maps with a simple .vec file passed into maprgb. (Later: It was easy!)

2010/04/22

  • I went to a parser group meeting. They suggested colours for my feature ranking diagrams instead of left<->right opposition. Also different colours for different features.
  • I will finish talking about composite cluster maps. That should be easy because I don’t have much to say. I will start on the discussion of MDS. That will be hard because there are so many and I’m not sure how to explain their differences.
  • Remember to mail Therese. Learn about pstricks instead of the picture environment.

2010/04/23

  • I finished composite cluster maps and MDS. I started a little on features but didn’t get anything useful done. I found a good pstricks example and got it to compile.
  • I will try to get SOMETHING done with features. I will get pstricks version of the feature ranking diagrams going.
  • Still should mail Therese. Doesn’t need to be complicated, just “if you are done or nearly there, I would appreciate seeing your dissertation so I can compare my work with it. I am writing my discussion chapter right now.”

2010/04/26

  • I got SOMETHING done with features, or at least half done. I realised that feature ranking is independent of method (it’s a non-abs R/R^2), so I don’t need to present as many diagrams as I thought. I got pstricks working but not yet integrated into the build process. I also discovered that (1) freq norm correlates highly with sentence size and (2) 1 round of norming gives MUCH more significant results than 5 rounds. The results are also much different.
  • I will add analysis and map generation of both 1-round and 5-round norming. I will edit the results chapter to contain many fewer diagrams of freq norming and many more comparing 1-round and 5-round normalisation of ratio. I will change the names of norming to within-comparison sentence-length norming and cross-comparison/cross-corpus norming.
  • Bah. Should still write to Therese. Need to make an R file for sentence-length so I can correlate it formally.
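The method-independence point can be illustrated with a toy signed ranking score (the scoring function is an illustrative stand-in for the real non-abs R, and the feature names are made up): squaring every score while keeping its sign leaves the ordering unchanged, which is why one set of diagrams suffices.

```python
# Toy sketch of ranking features by a signed score (positive = overused
# by group A, negative = overused by group B). This stand-in score is
# NOT the dissertation's actual computation, just the same shape.

def signed_score(freq_a, freq_b):
    """Signed relative frequency difference between the two groups."""
    total = freq_a + freq_b
    return 0.0 if total == 0 else (freq_a - freq_b) / total

# Hypothetical per-group counts for three features:
features = {"NP->Det N": (30, 12), "trigram-X": (5, 25), "S->NP VP": (40, 41)}
ranked = sorted(features, key=lambda f: signed_score(*features[f]), reverse=True)
```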

2010/04/27

  • I did all that stuff I said I’d do, up to MDS diagrams.
  • I will replace the MDS diagrams. I will generate colour feature diagrams with pstricks. I will start rewriting the discussion.
  • Mail Therese!

2010/04/28

  • I did that stuff, although only a little discussion.
  • I will rewrite the discussion, at least up to the features again.
  • Mail Therese!

2010/04/29

  • So I got to halfway through MDS. I just need to add figure references and add another paragraph or so.
  • I will finish MDS and HOPEFULLY feature diagrams. If I do I will spend the rest of the time translating the relevant articles of the dialect book to English.
  • Therese’s thesis will be done in a couple of weeks….

2010/04/30

  • I finished feature diagrams. It was kind of sucky, but not much more than before. Overuse is better than ratio even if it’s noisier. I got the first 20 pages of the dialect book translated, enough to remember that Delsing’s Syntaktisk variation i nordiska nominalfraser is the relevant paper in there.
  • I will get the rest of Delsing’s paper translated. Somehow.
  • Maybe I need a way to chop PDFs into 20 pages segments so Google Translate will do it? Maybe I should try Bing. (Also remember to reply to various e-mails today.)

2010/05/03

  • I got through two of Delsing’s examples. One region of one example worked; the first example required a better tagset.
  • I will get through two more of Delsing’s examples. Maybe three?
  • I am just using trigrams right now. It’s simple, but all the examples use left-right context and I don’t want to mess with dependencies yet. Remember to mail Stuart about defence and summer registration. When I mail Henrik again I should offer him the tagged versions of my code and mention software to search it.

2010/05/04

  • I got through ALL of Delsing’s examples, but with no writing to speak of. Also got halfway done with Rosenkvist’s.
  • I will finish Rosenkvist and start writing it all up. I will run errands: find out how much 1 summer credit costs, get a summer parking tag, and set a defence date in Kirkwood hall. Also a ‘graduation lunch’.
  • I should park cackhanded.com/cack-handed.com pretty soon.

2010/05/05

  • Blah. Rosenkvist and errands. Also rewrote
  • Summer register. Finish writing Delsing and Rosenkvist. Maybe a little on the abstract.
  • Still need to register for summer. Need to get Sandra a draft by Monday. Need to get Sandra to sign the defence announcement before she leaves on Monday (which entails a 300-word general-audience abstract).

2010/05/06

  • Forgot to summer register. Mostly finished abstract for defence announcement. Finished writing up Delsing and Rosenkvist analysis.
  • Finish writing up comparison to Therese’s work and general dialectology. Also cut abstract down to 300 words and send it to Sandra.
  • Summer register. Reply to Steve Chin. Link to the public version of my paper. Finish background check AGAIN. Sheesh.

2010/05/07

  • Finished comparison to phonology (Therese) and general dialectology. Cut down abstract. Moved analysis to results chapter but didn’t integrate it yet.
  • Integrate analysis into results chapter? Finish discussion: comparison to syntax dialectometry, future work, final summary. Also a paragraph or two at the beginning.
  • Keep checking the stupid onestart site. Maybe with Firefox? Probably not. I should PROBABLY mention Spruit after Goebl but before Wybo & Wiersma. I don’t have time right now but maybe before the defence. Did I ever put up a table showing corpus sizes? I should probably put it in somewhere…

2010/05/10

  • Demi yaya
  • Yah yah
  • I must remember to add a note in the discussion that even though trigrams are better here because of the annotation error, my work from English shows that with manual annotation leaf-ancestors have a small advantage.

2010/05/11

  • Demi yaya
  • I guess this is the day that the draft is technically due. I thought it was tomorrow. Hmf.

Plan

January

Get results

Try lexicalised vs POS

Lexicalised doesn’t work. The end.

Try paths of dependency labels instead of POSs

Try lexical cleaning

Try different parameters to parsers

Move experiment to Banks

Try different distance measures

This should be pretty simple. Right? Just different for loops. Even C++ can’t muck this up too much.
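As a sketch, a few of the usual measures over per-site feature-frequency vectors really are just different loops (these are generic examples, not necessarily the exact set used in the experiments):

```python
# Three common distance measures over feature-frequency vectors;
# each one is just a different loop over the paired values.
import math

def manhattan(p, q):
    """L1 distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """L2 distance: square root of summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_dist(p, q):
    """1 minus cosine similarity of the two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm

site1 = [0.2, 0.5, 0.3]   # hypothetical normalised feature frequencies
site2 = [0.1, 0.6, 0.3]
```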

Try different regions

the current ones are smallest and don’t need the additional precision

Introduction

Figure out how to structure latex chapters

Introduction

Need to write the overview of the dissertation. Need about a page here explaining what each chapter is.

Hypotheses

Pretty good rough draft, needs a lot of work to be presentable.

February

Revise Hypotheses

Get Sandra (and others?) to look over it and critique.

It probably isn’t nearly what is needed.

Start Analysis

Read up on Swedish [syntactic] dialectology

Work with Rosenkvist to find out how my results compare to it

Started, need to tell him how I’m working and get some papers.

March

Methods

Finish discussion of Wybo’s feature ranking

Add discussion of PCA and Cluster Boundary Diagrams (maybe later)

Expand the explanation of the parsers so each has a subsubsection

April

Results

Discussion

May/June

Cleanup and Defence

http://www.indiana.edu/~grdschl/electronic-method.php

Abstract

double-spaced, 350 words

Cleanup

From Sandra mostly. Redo the feature extraction figures to use more space and a proper $\rightarrow$ instead of ‘.’ and ‘->’. Possible ‘future work’ moved up to the present.

Presentation

(how long?)

Formatting

  • list of figures, appendices, abbrevs and other materials
  • Vita page

ProQuest submission

  • Do you has? The $65 for MICROFILM???
  • I DON’T think we’ll completely lose the tech to read hard drives, but whatEVER.

Defence

  • Get acceptance page and abstract signed
  • June 11th, 1 PM
  • Make a bound copy for the department

June

Revisions

Remember to upgrade blog software

ask local Swedish speaker to look over automatic annotation quality esp Berkeley parser and its POS tagging quality

How to run SemanticVector

To retrieve document-document similarity:

> cd …\lucene-2.9.2
> java -cp lucene-core-2.9.2.jar:lucene-demos-2.9.2.jar:semanticvectors-1.26.jar pitt.search.semanticvectors.CompareTerms -queryvectorfile docvectors.bin -searchvectorfile termvectors.bin -matchcase /Users/zackman/Downloads/lucene-2.9.2/src/demo/org/apache/lucene/demo/HTMLDocument.java /Users/zackman/Downloads/lucene-2.9.2/src/demo/org/apache/lucene/demo/html/Test.java

Note: -searchvectorfile doesn’t seem to make a difference here (both termvectors.bin and docvectors.bin give the same number). Perhaps we are not hitting it.

To retrieve important features for a document:

> cd …\lucene-2.9.2
> java -cp lucene-core-2.9.2.jar:lucene-demos-2.9.2.jar:semanticvectors-1.26.jar pitt.search.semanticvectors.Search -queryvectorfile docvectors.bin -searchvectorfile termvectors.bin -matchcase /Users/zackman/Downloads/lucene-2.9.2/src/demo/org/apache/lucene/demo/HTMLDocument.java

To train with my features

This is Java! Inherit from org.apache.lucene.analysis.Analyzer and override TokenStream tokenStream(String fieldName, Reader reader) (override Bork bork(Borker borker, AbstractBorkCreator borkborkbork)). Oh wait, Java has no override keyword; the optional @Override annotation serves that purpose.

Proposal:

Pay 2 native speakers of Swedish for 10 hours of annotation each. The annotators will annotate sentences from the SwediaSyn that have been selected based on the presence of unknown words. Each sentence will be annotated with parts of speech and a constituency parse. The annotators will receive 2 hours of training on part-of-speech and syntactic annotation, then spend 8 hours annotating. The annotators will be paid $10/hour.

This annotation is necessary because my dissertation uses Talbanken for training automatic annotators, such as parsers and a part-of-speech tagger. Talbanken is a corpus of written and spoken Swedish collected in the 1970s. Each sentence is annotated with a phrase-structure parse, and later versions also provide a dependency parse for each sentence. However, Talbanken’s vocabulary does not cover daily life very well; its main source is newspapers, and its focus is on business. Therefore, a number of common words, such as familial relations, are under-represented in the training. This causes problems for the automatic annotators when run on interview data from the SwediaSyn, a dialect corpus collected around the turn of the century. Training the automatic annotators on manually annotated sentences containing words not found in the Talbank will allow them to identify these words in the interview data.

The total budget is $200: $100 for each annotator at $10/hour.

How to get travel distances rather than straight distances (from Bing)

The key is AuexsVqUAV7QT1qWjMsLyyJcnAd82hQScKM06xR_M1H4qGOH_DeAFZleMO7BbBue. Make a web reference project that points to http://dev.virtualearth.net/webservices/v1/routeservice/routeservice.svc. Then copy the example at http://msdn.microsoft.com/en-us/library/cc981072.aspx, excise the for loop, and instead print routeResponse.Result.Summary.Length (I think).

Notes from defence and parsing group

feature sets

Cut everything above nested S for leaf-ancestor paths

This should help focus on intra-sentence grammar, not the frequency and type of embedded clauses (though this might also be interesting, normal leaf-ancestor paths would help with this.)
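The cutting step could be sketched like this, treating a leaf-ancestor path as a root-to-leaf list of node labels (this is my reading of the suggestion, not code from the dissertation, and the label names are assumed):

```python
def cut_above_nested_s(path):
    """Keep only the segment of a leaf-ancestor path from the lowest
    'S' node down to the leaf, so matrix-clause context above an
    embedded clause is discarded. Paths with no embedded S keep their
    single root S and are unchanged."""
    for i in range(len(path) - 1, -1, -1):
        if path[i] == "S":
            return path[i:]
    return path

# A leaf inside an embedded clause loses the matrix-clause context:
cut_above_nested_s(["S", "VP", "S", "NP", "N"])  # -> ["S", "NP", "N"]
```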

Phrase-structure rules, plus grandparent:

used by Collins and Koo – parse reranking (around 2005)
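A rough sketch of what that feature type looks like, using an ad-hoc nested-tuple tree encoding (the encoding and labels are assumptions for illustration, not Collins and Koo’s actual feature extractor):

```python
# Sketch of "phrase-structure rule plus grandparent" features: each
# local rewrite rule is paired with the label of the node above it.
# Trees are nested tuples: (label, child, child, ...); a 1-tuple is a leaf.

def rules_with_grandparent(tree, parent_label="TOP"):
    """Yield (grandparent, 'LHS -> RHS') pairs for every internal node."""
    label, *children = tree
    if not children:          # leaf node: no rewrite rule
        return
    rhs = " ".join(child[0] for child in children)
    yield (parent_label, f"{label} -> {rhs}")
    for child in children:
        yield from rules_with_grandparent(child, label)

tree = ("S", ("NP", ("N",)), ("VP", ("V",), ("NP", ("N",))))
feats = list(rules_with_grandparent(tree))
# The two NP -> N rules now differ: one has grandparent S, one has VP.
```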

Swedish

Find out how householder fund works.

How to get money from it? Pay a swedish undergrad to annotate either dialect sentences or parts of speech

Analysis

Read about how Thomas Zastrow does GPS/GIS systems for dialectology

Check on internal consistency in one of the papers I read recently.

It would be nice if I have enough data. I think it was Schaeffler’s dissertation, but it might be Therese’s paper. Or maybe Jelena and John’s survey paper? Not sure. Maybe even the 06 correlation paper

Find out how to do agreement dendrograms (or have a program do them for me)

Ask Therese for improved eps outline of Sweden

Switch mds diagrams to use Therese’s improved outline

Do subgroups in a dendrogram reflect diachronic subgroups?

It would be cool, but I think the subgroups are too unstable to be real

Divergences might actually be preferable for psychological reality

But I think they mess with a bunch of the existing assumptions about how to analyse the data. eg mds and hierarchical clusters both go away or else get a lot more complicated. I’m not sure which.