[DCRM-L] FYI: Updates to OCLC's matching/deduplicating software

Dooley,Jackie dooleyj at oclc.org
Thu Aug 13 10:54:29 MDT 2015


Occasionally questions have arisen on DCRM-L about the sophistication (or perceived lack thereof) of OCLC's software for deduplication of bib records. The reality is that staff in our Metadata Quality group are constantly improving the software based on user feedback and quality review of incoming and de-duped records. Tweaking is so regular that they meet more than once a week to discuss details.

As an example of what goes on, the latest round of changes appears below. In effect, this is a random group of additions to what is already a really complex list of criteria. Note that the same routine is used for matching batchloaded records against WorldCat as for de-duping existing records.

Remember that none of this applies to pre-1800 records. They are not de-duped.

Best wishes to all, Jackie


-


Jackie Dooley


Program Officer, OCLC Research


647 Camino de los Mares, Suite 108-240

San Clemente, CA 92673


office/home 949-492-5060
mobile 949-295-1529
dooleyj at oclc.org



1.     MATCH-2974 (https://jira.oclc.org/browse/MATCH-2974):  “Confirm Matches on Both Bracketed and Un-Bracketed Data in Extent.”

In the past, square-bracketed data ([]) in Extent (field 300) was generally ignored for comparison. Because RDA has greatly reduced the use of square brackets, extent matching now includes both bracketed and un-bracketed data.  Data within angle brackets (<>) are temporary data, a practice that has not changed under RDA, so the treatment of these data has not changed.
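
As a rough illustration of this rule (not OCLC's actual code), an extent string might be normalized as sketched below. The function name normalize_extent is hypothetical, and the sketch assumes that angle-bracketed data is simply left out of the comparison, which this message does not state explicitly.

```python
import re


def normalize_extent(extent: str) -> str:
    """Hypothetical normalization of a field 300 extent for comparison."""
    # Assumption: temporary data in angle brackets is dropped entirely.
    text = re.sub(r"<[^>]*>", " ", extent)
    # Square-bracketed data is kept; only the bracket characters are removed,
    # so "[4] p." and "4 p." compare as equal.
    text = text.replace("[", "").replace("]", "")
    # Lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


if __name__ == "__main__":
    print(normalize_extent("xii, [4] p."))
    print(normalize_extent("v. <1-3>"))
```

Under these assumptions, "xii, [4] p." and "xii, 4 p." normalize to the same string, whereas ignoring the bracketed "[4]" altogether would have compared only "xii, p.".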

2.     MATCH-2990 (https://jira.oclc.org/browse/MATCH-2990):  “Drop Duration Statements as a Point of Comparison in Extent.”

Parenthetical duration statements that appear in Extent (field 300) for both remotely-accessed and tangible sound recordings are no longer included in extent matching.
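
One way to picture this change is a step that removes parenthetical durations before extents are compared. The sketch below is only an illustration: the pattern and the name strip_duration are hypothetical, and it handles only durations written with "hr.", "min." or "sec." (optionally preceded by "ca.").

```python
import re

# Hypothetical pattern: a parenthetical made up only of number + hr/min/sec
# units, e.g. "(54 min.)", "(1 hr., 12 min.)", "(ca. 60 min.)".
DURATION = re.compile(
    r"\(\s*(?:ca\.\s*)?(?:\d+\s*(?:hr|min|sec)\.?[\s,]*)+\)",
    re.IGNORECASE,
)


def strip_duration(extent: str) -> str:
    """Remove parenthetical duration statements, then tidy whitespace."""
    return " ".join(DURATION.sub(" ", extent).split())


if __name__ == "__main__":
    print(strip_duration("1 audio disc (54 min.) : digital ; 4 3/4 in."))
```

Other parentheticals, such as "(various pagings)", do not fit the pattern and are left in place.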

3.     MATCH-2992 (https://jira.oclc.org/browse/MATCH-2992):  “Repeatable 250 Edition Statement.”

The edition statement field 250 was made repeatable in OCLC-MARC in 2014.  Matching now takes into consideration all iterations of field 250 for comparison.

4.     MATCH-2995 (https://jira.oclc.org/browse/MATCH-2995):  “Publisher Comparison is Mismatching on 533:533 but Failing to Stop Trying.”

Once publishers in field 533 are determined not to match, no further attempts are made to match publishers.  The only exception is when subfield $5 is present in field 533, in which case publishers in field 533 are disregarded and matching goes on to compare publishers in fields 260/264.
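
The control flow described here could be sketched roughly as follows. Everything in the sketch is hypothetical: records are plain dictionaries with made-up keys (pub_533, has_533_5, pub_26x), a subfield $5 in either record is assumed to trigger the exception, and a 533-to-533 match is assumed to count as a publisher match, none of which this message spells out.

```python
import re


def norm(publisher: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", publisher.lower()).split())


def publishers_match(rec_a: dict, rec_b: dict) -> bool:
    """Hypothetical publisher comparison for two records."""
    use_533 = (
        rec_a.get("pub_533") is not None
        and rec_b.get("pub_533") is not None
        # Assumption: $5 in either record means field 533 is disregarded.
        and not rec_a.get("has_533_5")
        and not rec_b.get("has_533_5")
    )
    if use_533:
        # The 533 result is final: a mismatch here is not retried
        # against fields 260/264.
        return norm(rec_a["pub_533"]) == norm(rec_b["pub_533"])
    # No usable 533 pair: compare the 260/264 publishers instead.
    return norm(rec_a.get("pub_26x", "")) == norm(rec_b.get("pub_26x", ""))


if __name__ == "__main__":
    a = {"pub_533": "UMI", "pub_26x": "Knopf"}
    b = {"pub_533": "ProQuest", "pub_26x": "Knopf"}
    print(publishers_match(a, b))
```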

5.     MATCH-2996 (https://jira.oclc.org/browse/MATCH-2996):  “Extent Comparison with ‘(various pagings)’ Needs Changes to Expect ‘volume.’”

The extent comparison did not take into account the spelled-out “volume/volumes” alongside the abbreviated “v.” in extents that include the designation “(various pagings).”  As a result, single volumes were incorrectly matched with multiple volumes.  Incidentally, this JIRA also added several equivalents for the publisher/distributor National Technical Information Service.
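
To make the volume problem concrete, the sketch below pulls a volume count out of an extent whether it is written "v.", "volume" or "volumes". The names volume_count and same_volume_count are hypothetical, and the real comparison is certainly more involved.

```python
import re
from typing import Optional

# Accept both the abbreviated "v." and the spelled-out "volume(s)".
VOLUME_COUNT = re.compile(r"(\d+)\s*(?:volumes?\b|v\.)", re.IGNORECASE)


def volume_count(extent: str) -> Optional[int]:
    """Return the number of volumes stated in an extent, if any."""
    found = VOLUME_COUNT.search(extent)
    return int(found.group(1)) if found else None


def same_volume_count(extent_a: str, extent_b: str) -> bool:
    """True only when both extents state the same number of volumes."""
    count_a, count_b = volume_count(extent_a), volume_count(extent_b)
    return count_a is not None and count_a == count_b


if __name__ == "__main__":
    print(same_volume_count("1 volume (various pagings)",
                            "3 volumes (various pagings)"))
```

With only "v." recognized, "1 volume (various pagings)" and "3 volumes (various pagings)" would both yield no count, which is the kind of gap that could let a single volume match a multi-volume set.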

6.     MATCH-2999 (https://jira.oclc.org/browse/MATCH-2999):  “Equivalent Edition Statements Not Equated (Commonly Occurring Case).”

Additional variants for “first edition” statements have been added as equivalents.
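
An equivalence table of this kind, combined with the repeatable field 250 from item 3, might look something like the sketch below. The variants listed are illustrative only, since this message does not say which ones were added, and comparing the full sets of edition statements is an assumption about how multiple 250 fields are handled.

```python
import re

# Illustrative variants only; the actual list is not given in this message.
FIRST_EDITION_VARIANTS = {"1st ed", "1st edition", "first ed", "first edition"}


def normalize_edition(statement: str) -> str:
    """Map known 'first edition' variants to one canonical form."""
    text = " ".join(re.sub(r"[^\w\s]", "", statement.lower()).split())
    return "first edition" if text in FIRST_EDITION_VARIANTS else text


def editions_match(fields_250_a: list, fields_250_b: list) -> bool:
    """Assumption: records match when their sets of normalized
    edition statements (all occurrences of field 250) are equal."""
    set_a = {normalize_edition(s) for s in fields_250_a}
    set_b = {normalize_edition(s) for s in fields_250_b}
    return set_a == set_b


if __name__ == "__main__":
    print(editions_match(["First edition."], ["1st ed."]))
```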


