2
\$\begingroup\$

I tried to create a peice of code that helps me remove duplicates from my sql database and reset the ID's as well. I am not sure that it's error proof.

import mysql.connector # Database connection parameters host = 'host database = 'database' user = 'user' password = 'password' port = numbers try: # Establishing the connection conn = mysql.connector.connect( host=host, database=database, user=user, password=password, port=port ) if conn.is_connected(): print('Connected to the database') cursor = conn.cursor() # Remove duplicates delete_duplicates_query = """ DELETE t1 FROM articles t1 INNER JOIN articles t2 WHERE t1.ID < t2.ID AND t1.headline = t2.headline AND t1.description = t2.description AND t1.link = t2.link AND t1.date = t2.date AND t1.hour = t2.hour; """ cursor.execute(delete_duplicates_query) conn.commit() print('Duplicates removed') # Reset primary key reset_primary_key_query = """ SET @new_id = 0; UPDATE articles SET ID = (@new_id := @new_id + 1); ALTER TABLE articles AUTO_INCREMENT = 1; """ cursor.execute(reset_primary_key_query) conn.commit() print('Primary key reset') except mysql.connector.Error as e: print(f"Error: {e}") finally: if conn.is_connected(): cursor.close() conn.close() print('Database connection closed') 

For reference

Table name : articles

ID

headline

description

link

date

hour

\$\endgroup\$

    1 Answer 1

    1
    \$\begingroup\$

    UNIQUE index

    The OP statement to DELETE dups looks fine.

    It's unclear why we would want to INSERT those rows in the first place, only to change our mind and rip them out later. Consider creating a UNIQUE articles index on (headline, description, link, date, hour). Then errant INSERT attempts will immediately report an error, and there will be no need for any DELETE cleanup.

    timestamp

    The separate {date, hour} columns seem inconvenient. Consider storing a single timestamp column. You can always extract a date or an hour from it later if you need that.

    It should store UTC time in seconds since the epoch (1970), with no timezone information. Worry about zone offsets on the way in and the way out, preferring to store just UTC on disk.

    reset primary key

    This seems ill advised. Recommend you not attempt to change a PKid after it has been stored.

    Suppose a line of code syslog's an id, and then you change it. Now the logfile no longer speaks truth.

    Suppose you wind up with FK relationships. Now UPDATE articles SET id ... will be updating random rows on those other tables.

    Prefer to think of each id as immutable.

    \$\endgroup\$

      Start asking to get answers

      Find the answer to your question by asking.

      Ask question

      Explore related questions

      See similar questions with these tags.