Finding all bugs that affect two or more projects

Asked by John Brondum

Hi there

I'm a Ph.D. student in the School of Computer Science and Engineering at the University of New South Wales (Sydney, Australia). My research interest is cross-project (software) relationships, and as such the Launchpad bug database is an interesting source of data for me.

But I'd like to find a way to do two things:
1. Extract only the bugs that affect two or more projects. I haven't been able to identify a way to find these other than brute force (looking at every bug).
2. Once I have found those bugs, download the bug information in either HTML or other format for offline analysis.

Would you be able to help?

Thanks
John

Question information

Language: English
Status: Solved
For: Launchpad itself
Assignee: No assignee
Solved by: John Brondum

This question was reopened

Deryck Hodge (deryck) said :
#1

Hi, John.

You cannot currently do a cross-project search of this kind. See bug 534369, which gets at something similar. I'm not sure it would enable the sort of "how many bugs only affect two projects" question you're asking. It's an interesting use case, though.

As for offline processing, have you looked into using the Launchpad API[1] to do any of this work? You still can't do cross-project search from the API, but once you know about a list of bugs, you can do any queries or processing you need from the API and process the data returned by the API however you like.

[1] https://help.launchpad.net/API
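
As a rough illustration of the "once you know about a list of bugs" step, here is a minimal Python sketch that checks each bug in a list and keeps those with tasks against two or more distinct targets. The helper names (`distinct_targets`, `affects_multiple_projects`, `target_names_for_bug`, `filter_cross_project`, `connect`) and the consumer name are made up for this example. It assumes the third-party launchpadlib client, and that each bug exposes `bug_tasks` whose entries carry a `bug_target_name`. It also treats every distinct target name as a separate project, which is a simplification: series or package tasks on the same project would need to be folded together for a real analysis.

```python
def distinct_targets(target_names):
    """Return the sorted distinct, non-empty target names."""
    return sorted({name.strip() for name in target_names if name and name.strip()})


def affects_multiple_projects(target_names, minimum=2):
    """True if the tasks cover at least `minimum` distinct targets."""
    return len(distinct_targets(target_names)) >= minimum


def target_names_for_bug(lp, bug_id):
    """Collect the target name of every task on one bug.

    Assumption: `lp` behaves like a launchpadlib Launchpad object, i.e.
    `lp.bugs[bug_id].bug_tasks` yields entries with `bug_target_name`.
    """
    bug = lp.bugs[bug_id]
    return [task.bug_target_name for task in bug.bug_tasks]


def filter_cross_project(lp, bug_ids):
    """Map bug id -> distinct targets, for bugs with two or more targets."""
    result = {}
    for bug_id in bug_ids:
        names = target_names_for_bug(lp, bug_id)
        if affects_multiple_projects(names):
            result[bug_id] = distinct_targets(names)
    return result


def connect(consumer_name="cross-project-survey"):
    """Open an anonymous, read-only session (needs launchpadlib and network)."""
    from launchpadlib.launchpad import Launchpad  # third-party package

    return Launchpad.login_anonymously(consumer_name, "production")
```

With a session from `connect()`, something like `filter_cross_project(lp, known_bug_ids)` would return only the multi-target bugs. The resulting dictionary can then be written out with the standard `json` module for offline analysis.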

Cheers,
deryck

John Brondum (johnbrondum) said :
#2

Thanks Deryck

John Brondum (johnbrondum) said :
#3

I have created an SQL statement which should be able to extract the bugs that affect two or more projects. I have tested the script against a locally installed Launchpad version.

I was hoping that maybe one of the Launchpad developers might be able to run it during low load periods - and I'm happy for it to run against a backup / redundant server where the data is not necessarily 100% up-to-date.

Happy to pay for an assorted box of Krispy Kreme donuts or similar in return if helping science isn't sufficient motivation ;-)

You can see the script here:
http://phdprogressnotes.files.wordpress.com/2010/06/sql.pdf

Thanks

Steve McInerney (spm) said :
#4

Have had a PM with John on this. For the record:
* per Deryck's suggestion, using the API is preferred
* ran an EXPLAIN on the query on staging, and its estimated cost is VERY high.
* as the query would run directly against the database, we'd need to be careful about accidental release of information that should be protected. Having been down this path before, this is not trivial.

Is there scope to possibly do some random statistical analysis? i.e. rather than brute-forcing every bug, pick, say, 1-3000 at random, query those via the API and draw conclusions from that limited set?
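
A sampling approach along those lines could be sketched roughly as follows in Python. The function names and the upper bug id used in the usage note are hypothetical. The sketch draws bug ids uniformly at random, and, once the sampled bugs have been checked via the API, turns the number of cross-project hits into a proportion with a normal-approximation 95% interval. Ids that cannot be fetched (for example private bugs) would have to be skipped and left out of the sample size, which this sketch leaves to the caller.

```python
import math
import random


def sample_bug_ids(max_bug_id, sample_size, seed=None):
    """Draw `sample_size` distinct bug ids uniformly from 1..max_bug_id."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(range(1, max_bug_id + 1), sample_size)


def estimate_proportion(hits, sample_size, z=1.96):
    """Return (estimate, lower, upper) for the share of hits in the sample.

    Uses the normal approximation to the binomial; z=1.96 corresponds to
    roughly a 95% interval. The bounds are clipped to the range [0, 1].
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = hits / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)
```

For example, `sample_bug_ids(600000, 3000, seed=42)` would give 3000 ids to query, where 600000 is only a placeholder for the highest bug number at the time. The count of sampled bugs found to affect two or more projects then goes into `estimate_proportion` together with the number of bugs that could actually be fetched.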

???

John Brondum (johnbrondum) said :
#5

Fully understand the concerns - I'll work with what I can extract using the API.

thanks again for helping.

:)