Apache OpenOffice (AOO) Bugzilla – Issue 53779
querying DISTINCT values from a dBase table is incredibly slow with larger data sets
Last modified: 2008-09-21 16:01:59 UTC
- extract the attached archive to a location of your choice. It contains a database document (.odb) and some dBase files
- open the database document
- Edit|Database|Properties -> adjust the path to the dBase files to the location you just extracted the archive to
- open the query "performance_distinct", which does a
  SELECT DISTINCT * FROM performance
  where the "performance" table contains 5 columns and 500 records
=> this takes about 5:30 minutes on a 3GHz Athlon XP

This is incredibly slow, IMO, and we should investigate where this can be improved.
Created attachment 29052: document(s) to reproduce the bug case
This blocks a sensible fix for issue 51151, which would require that Calc moves to SELECT DISTINCT when collecting data for the data pilot. At the moment, this is not possible because our dBase implementation (actually, our file-based implementation) is so slow with this.
Such a serious performance issue should be fixed earlier than "OOo Later", /me thinks
I've been taking a look at this, and I think I've got a suitable patch that I will attach for testing. I've removed the old O(n^2) code for finding distinct values and rearranged things such that we: sort on all fields, make one pass through the rows and eliminate duplicates (relying on the full sort), then re-sort if necessary. This is my first OOo patch, so it's likely that I've missed something important. I'm certainly open to criticism.
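For readers following along, the approach described above (sort on all fields, drop adjacent duplicates in one pass, re-sort if necessary) can be sketched roughly as follows. This is not the attached patch, which targets the OOo file-based driver; it is only a minimal Python illustration. The names `distinct_rows`, `distinct_rows_quadratic` and `order_key` are hypothetical, and the sketch assumes rows are tuples of mutually comparable values (no NULL handling). The quadratic variant is a generic example of an O(n^2) scan, not a reproduction of the removed code.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Row = Tuple  # hypothetical row type: one tuple of column values per record


def distinct_rows_quadratic(rows: Sequence[Row]) -> List[Row]:
    """Generic O(n^2) duplicate elimination, shown only for contrast."""
    unique: List[Row] = []
    for row in rows:
        # Linear scan of everything kept so far for each incoming row.
        if row not in unique:
            unique.append(row)
    return unique


def distinct_rows(
    rows: Sequence[Row],
    order_key: Optional[Callable[[Row], object]] = None,
) -> List[Row]:
    """Sort-based duplicate elimination, O(n log n)."""
    # Step 1: sort on all fields so identical rows end up adjacent.
    ordered = sorted(rows)

    # Step 2: one pass, keeping a row only if it differs from the last kept row.
    unique: List[Row] = []
    for row in ordered:
        if not unique or unique[-1] != row:
            unique.append(row)

    # Step 3: re-sort only if the caller asked for a specific order
    # (assumed here to stand in for the query's ORDER BY clause).
    if order_key is not None:
        unique.sort(key=order_key)
    return unique


if __name__ == "__main__":
    sample = [(2, "b"), (1, "a"), (2, "b"), (1, "a"), (3, "c")]
    print(distinct_rows(sample))
    print(distinct_rows(sample, order_key=lambda r: -r[0]))
```

The full sort is what makes the single pass sufficient: after sorting on every field, any two equal rows are neighbours, so comparing each row with the previously kept one catches all duplicates.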
Created attachment 53295: Proposed patch for testing
fs->jcottrell: Thanks for the patch! Changing issue type to PATCH then. Ocke, can you please review the patch, and work with jcottrell to improve it, if necessary? Thanks.
Thank you for the patch. Now it only takes a few seconds to show the result :-) If you have more patches, they are very, very welcome ;-)
Fixed in cws dba30c
Great, glad I could be of some help. And thanks for the quick feedback; that's a nice welcome to the project. I'll be trying to get more familiar with this codebase and handle issues where possible, so I'm looking forward to working with you guys.
Please verify. Thanks.
Sorry, is that directed at me or QA?
QA
Verified in CWS dba30c. You can find more information about this CWS, such as when it will be available in the master builds, in EIS, the Environment Information System: http://eis.services.openoffice.org/EIS2/cws.ShowCWS?Path=DEV300%2Fdba30c
Verified on 3.0.0rc2 Linux X64: OK (result comes in less than one second on a standard laptop)
closing then. gibi, thanks for the feedback.