HunkBlame.py Operation #103

mfenner1 · 2011-07-22T19:10:11Z

Hi folks, a couple questions:

It appears that HunkBlame.py assigns the blame of a particular (fixing) hunk to a particular commit. Any thoughts on getting from the commit down to a file (and maybe down to a hunk)? (From a fix_hunk F, we know the file so we should be able to look it up in the commit. Finding a specific bug_hunk B within that file might be more difficult?)

Is HunkBlame only assigning a blame for the first hunk in a given commit? At least in one test project, that seems to be what it is doing for me. Thoughts on getting it to work through all hunks in a commit? I'm looking at the break (near line 48) in the BlameContentHandler.line and wondering if that drops out too soon.

Thanks and best,
Mark

cflewis · 2011-07-22T20:48:45Z

@linzhp knows best about this stuff.

As far as your first point, yes, it only goes down to the commit. This was all we needed it to do for our experiment. If you have a way of keeping the current functionality but going down to file level, that is a pull request I would take :)

The second question I don't have the answer to.

linzhp · 2011-07-22T23:12:09Z

The initial implementation of HunkBlame was able to trace from a bug fixing hunk to bug introducing hunks: https://github.com/SoftwareIntrospectionLab/cvsanaly/blob/5a3a0edf6fc108d102351e457520a11db173dc5f/pycvsanaly2/extensions/HunkBlame.py
but it was replaced by the current implementation because we only need bug introducing commit, and storing bug hunks would give us extra queries to find out bug introducing commit. If you need that feature, you could put that code back and refactor the hunk_blames table like this way:

hunk_blames

id: int
hunk_id: int
bug_commit_id: int
bug_hunk_id: int

Though the coexistence of bug_commit_id and bug_hunk_id is a redundant, this structure won't break the code @lewisham is experimenting with.

With regard to the second question, HunkBlame assign blames for all hunks in a given commit, the query in the code is:

select distinct h.file_id, h.commit_id
from hunks h, scmlog s
where h.commit_id=s.id and s.repository_id=?
and s.is_bug_fix=1
and h.old_start_line is not null
and h.old_end_line is not null
and h.file_id is not null
and h.commit_id is not null

mfenner1 · 2011-07-23T12:30:57Z

Hi guys, thanks for getting back to me! I appreciate it and your time answering me. @linzhp, I'll throw that code back in and take a look at the result. Glad to know that functionality exists.

With the current version HunkBlame.py, here's what I'm looking at ... I just need a little help interpreting what/why/how it is operating:

Here's some example code from the voldemort repository (it turns out that commit "3" 46a5b fixes a bug in the initial commit "1" fdb0f). Here's the diff for commit 3:

git diff -U1 46a5b^ 46a5b | more

diff --git a/.classpath b/.classpath
index 3c765c8..e393f40 100644
--- a/.classpath
+++ b/.classpath
@@ -9,3 +9,2 @@
        <classpathentry kind="lib" path="lib/catalina-ant.jar"/>
-       <classpathentry kind="lib" path="lib/colt.jar"/>
        <classpathentry kind="lib" path="lib/commons-codec-1.3.jar"/>
@@ -31,2 +30,3 @@
        <classpathentry kind="lib" path="lib/xerces.jar"/>
+       <classpathentry kind="lib" path="lib/colt-1.2.0.jar"/>
        <classpathentry kind="output" path="classes"/>
diff --git a/src/java/voldemort/serialization/json/JsonTypeSerializer.java b/src/java/voldemort/seri
alization/json/JsonTypeSerializer.java
index 9fa1fc0..a28ad4b 100644
--- a/src/java/voldemort/serialization/json/JsonTypeSerializer.java
+++ b/src/java/voldemort/serialization/json/JsonTypeSerializer.java
@@ -133,2 +133,3 @@ public class JsonTypeSerializer implements Serializer<Object> {
                         writeBoolean(output, (Boolean) object);
+                        break;
                     default:
@@ -480,3 +481,3 @@ public class JsonTypeSerializer implements Serializer<Object> {
                                                  + " but got " + object);
-            for(Map.Entry<String, Object> entry : type.entrySet()) {
+            for(Map.Entry<String, Object> entry: type.entrySet()) {
                 if(!object.containsKey(entry.getKey()))
@@ -493,3 +494,3 @@ public class JsonTypeSerializer implements Serializer<Object> {
         Map<String, Object> m = new HashMap<String, Object>(type.size());
-        for(String property : type.keySet())
+        for(String property: type.keySet())
             m.put(property, read(stream, type.get(property)));
@@ -508,3 +509,3 @@ public class JsonTypeSerializer implements Serializer<Object> {
             output.writeShort(objects.size());
-            for(Object o : objects)
+            for(Object o: objects)
                 write(output, o, entryType);

And here's what's in the database (after running the prior extensions and HunkBlame.py) for these hunks:

sqlite> select * from hunks where commit_id = 3;
307|1|3|10|10|||0|
308|1|3|||31|31|0|
309|155|3|||134|134|0|
310|155|3|481|481|482|482|0|
311|155|3|494|494|495|495|0|
312|155|3|509|509|510|510|0|
sqlite> select * from files where id = 1 or id = 155;
1|.classpath|1
155|JsonTypeSerializer.java|1
sqlite> select * from hunk_blames where (hunk_id >= 307) and (hunk_id <= 312);
1|307|1
219|312|1
220|310|1
221|311|1

Let me see if I can work through the hunks:

307 is the delection of line 10 in classpath
308 is the addition of line 31 in classpath (which doesn't have an old_line b/c it is new)
309 is the addition of "break" to JsonTypeSerializer (again, a brand new line)
310,311,312 are the line modifications in other spots in JsonTypeSerializer

Ok, that seems to work the way I would expect. I'm not sure why I couldn't follow it through before -- part of it might have been that I was using the sqlitebrowser tool which seems to only handle queries correctly when it wants to. I redid the queries with the sqlite3 cmd line tool.

Best,
Mark

mfenner1 · 2011-07-23T17:40:18Z

So, I loaded the linked version of HunkBlame:

https://github.com/SoftwareIntrospectionLab/cvsanaly/blob/5a3a0edf6fc108d102351e457520a11db173dc5f/pycvsanaly2/extensions/HunkBlame.py

and it seems to be unhappy about the content handler's use of a db connection/cursor that was created in the (parent?) thread:

line 41, in init
self.cursor = job.cnn.cursor()
ProgrammingError: SQLite objects created in a thread can only be used in that same thread.The object was created in thread id 140454185363200 and this is thread id 140454098433792

Do you think it would be easier for me (1) to make the old version (of HunkBlame) use connections created within the handler (or at least within the BlameJob?) or (2) to make the new version (of HunkBlame) use the processing of the old version to grab the hunk to hunk blames? Or is there a better way?

The recent versions of HunkBlame seem to do all the db querying outside the job, so the queries are all within the parent/root thread. Is that right?

Best,
Mark

linzhp · 2011-07-23T18:13:37Z

If SQLite didn't allow you to share a database connection between different threads, try creating a connection for each thread, but remember to close the connection once that thread is done.

mfenner1 · 2011-07-25T13:04:25Z

So, I ran the Content extension and then the old HunkBlame extension. It seems to want a "content" member variable from the blame_line instance. And, I can't find where that might get assigned. Also, Content.py threw many exceptions (lots of :path specifiers being either (1) "not a valid object name" or (2) "path does not exist in ). . I'm going to dig into Content.py and the content table today and try to figure out what's going on.

Here's the current show stopper:

File "/usr/lib64/python2.7/site-packages/cvsanaly-2.4-py2.7.egg/pycvsanaly2/extensions/HunkBlameH2H.py", line 95, in line
if blame_line.content == line.strip():
File "/usr/lib64/python2.7/site-packages/guilty-2.0.1-py2.7.egg/guilty/Blame.py", line 38, in getattr
return self.dict[name]
KeyError: 'content'

Best,
Mark

Ok, so I'm practicing my "git-fu" and I found all occurances of ".contents" in the 5a3a0e commit:

mfenner [599] % git grep ".content" 5a3a0ed | more
5a3a0ed:pycvsanaly2/PatchParser.py: self.contents = contents
5a3a0ed:pycvsanaly2/PatchParser.py: if self.contents == "\n" and leadchar == " " and
False:
5a3a0ed:pycvsanaly2/PatchParser.py: if not self.contents.endswith('\n'):
5a3a0ed:pycvsanaly2/PatchParser.py: return leadchar + self.contents + terminator
5a3a0ed:pycvsanaly2/PatchParser.py: yield hunk_line.contents
5a3a0ed:pycvsanaly2/PatchParser.py: if orig_line != hunk_line.contents:
5a3a0ed:pycvsanaly2/extensions/Content.py: LENGTH(REPLACE(c.content, x'0a', ''))
as loc
5a3a0ed:pycvsanaly2/extensions/HunkBlame.py: if blame_line.content ==
line.strip():

That leads me to look at PatchParser and/or go outside cvsanaly and look at repohandler ... or am I getting farther away from the problem?

linzhp · 2011-07-25T15:55:53Z

In order to have "content" attribute of blame_line, apply this patch to your Guilty: linzhp/guilty@4e7f2cc

mfenner1 · 2011-07-25T16:12:37Z

Nice, thank you. I had found that BlameLine instantiation and added this on my own (I like very verbose regexs :) ) ... this is just for git):

    line_pattern = re.compile(r"""^
                                 ([^ ]+)[\t ]+     # commit id
                                 ([^ ]+)[\t ]+     # filename
                                 \(                # (
                                 (.+)              # author
                                 \                 # space
                                 ([0-9]+)          # epoch secs
                                 \ [+-][0-9][0-9][0-9][0-9][\t ]+
                                 ([0-9]+)          # edited line no
                                 \)                # )
                                 \                 # space
                                 (.*)$""", re.VERBOSE)

I've been working on a couple other issues to get the old HunkBlame to work with the newer code around it. I "fixed" (I hope!) one piece in HunkBlame.run(): self.fp.get_path() -> self.fp.get_path_from_database(). I've now got problems with Blame.process_finished_jobs (the BlameJob isn't assigning failed ... so I'm looking into that):
if not job.failed:
AttributeError: 'HunkBlameH2HJob' object has no attribute 'failed'

It's very quick & dirty, but I put the DB connection and cursor creation in the ContentHandler.init. Is it safe to put the DB tear-down in the ContentHandler.end_file() method?

Best,
Mark

linzhp · 2011-07-25T16:18:40Z

Closing DB connection in end_file is OK off the top of my mind, the only concern is if it needs to do other db operation after end_file is called. You could try that.

mfenner1 · 2011-07-26T16:10:20Z

So, I think I have this "working" now. I renamed it HunkBlameH2H (hunk to hunk) to avoid confusion, mangling, etc. I need to clean it up quite a bit, test it quite a bit, and b/c of the connection creation inside jobs I may need to look at performance so it can run on a "real" repository. But, it's moving forward nicely.

Thank you @linzhp. If I'm on the west coast, I'll buy you your choice of beer, coffee, tea, smoothie, etc. :)

linzhp · 2011-07-26T18:57:20Z

You are welcome. Good to know it's working

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HunkBlame.py Operation #103

HunkBlame.py Operation #103

mfenner1 commented Jul 22, 2011

cflewis commented Jul 22, 2011

linzhp commented Jul 22, 2011

mfenner1 commented Jul 23, 2011

mfenner1 commented Jul 23, 2011

linzhp commented Jul 23, 2011

mfenner1 commented Jul 25, 2011

linzhp commented Jul 25, 2011

mfenner1 commented Jul 25, 2011

linzhp commented Jul 25, 2011

mfenner1 commented Jul 26, 2011

linzhp commented Jul 26, 2011

HunkBlame.py Operation #103

HunkBlame.py Operation #103

Comments

mfenner1 commented Jul 22, 2011

cflewis commented Jul 22, 2011

linzhp commented Jul 22, 2011

mfenner1 commented Jul 23, 2011

mfenner1 commented Jul 23, 2011

linzhp commented Jul 23, 2011

mfenner1 commented Jul 25, 2011

linzhp commented Jul 25, 2011

mfenner1 commented Jul 25, 2011

linzhp commented Jul 25, 2011

mfenner1 commented Jul 26, 2011

linzhp commented Jul 26, 2011