How to generate a unique ID for each record in Spark

+1 vote
asked Sep 13, 2017 by rashmi-mardur

I have a huge dataset with MM+ records and I am trying to assign a unique ID to each record. I tried the code below, but it takes a long time because the row IDs are sequential. I have tried tweaking memory parameters to optimize the job, but couldn't gain much performance.

Sample snippet:

JavaRDD<String> rawRdd=......
// zipWithIndex() yields (record, index) pairs; swap each so the index is the key
rawRdd.zipWithIndex()
      .mapToPair(t -> new Tuple2<Long,String>(t._2, t._1));

Is there a better way to assign unique IDs? Thanks.

1 Answer

+1 vote
answered Sep 13, 2017 by user1970832

Approach 1: If your requirement is just to assign a unique ID, you can use a UUID as the unique row ID:

rawRdd.mapToPair(t->new Tuple2<String,String>(t,UUID.randomUUID().toString()));

The only drawback is that the ID is 36 characters long in its string form (36 bytes as ASCII text).
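If the size of the textual form matters, one option is to keep the UUID's 128 bits as 16 raw bytes instead of a string. The following is only a minimal sketch using plain JDK classes, without Spark; the class name `UuidRowIds` and its helper methods are made up for illustration, and whether a binary key suits your job depends on how the IDs are consumed downstream.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

class UuidRowIds {
    /** Packs the 128 bits of a UUID into a 16-byte array. */
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    /** Rebuilds the UUID from the 16-byte form produced by toBytes. */
    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long most = buf.getLong();
        long least = buf.getLong();
        return new UUID(most, least);
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        byte[] compact = toBytes(id);
        System.out.println("string form: " + id.toString().length() + " characters");
        System.out.println("binary form: " + compact.length + " bytes");
        System.out.println("round trip ok: " + fromBytes(compact).equals(id));
    }
}
```

Inside a Spark job, the same `toBytes` call could be applied in the function passed to `mapToPair`, in place of `UUID.randomUUID().toString()`.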

Approach 2: Create a centralized system to assign unique IDs. I use a REST-based API that follows a pattern to generate IDs, and each map operation calls the REST service to get a unique ID.

The second approach gives you full control over the design of the ID pattern.
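To make the idea of a central allocator concrete, here is a rough in-memory sketch of what such a service might do. It is not the REST API described above: the class `CentralIdService`, its methods, and the `<prefix>-<10-digit sequence>` pattern are all hypothetical, and a real deployment would put this logic behind an HTTP endpoint. The `reserveBlock` method is an assumption added here to show how a caller could fetch a range of IDs in one request rather than one request per record.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicLong;

class CentralIdService {
    private final String prefix;
    // Last sequence number handed out; thread-safe so concurrent callers never collide.
    private final AtomicLong counter = new AtomicLong();

    CentralIdService(String prefix) {
        this.prefix = prefix;
    }

    /** Formats a sequence number using the (hypothetical) pattern prefix-0000000001. */
    String formatId(long sequence) {
        return String.format(Locale.ROOT, "%s-%010d", prefix, sequence);
    }

    /** Hands out the next single ID. */
    String nextId() {
        return formatId(counter.incrementAndGet());
    }

    /** Reserves `size` consecutive sequence numbers and returns the first of them. */
    long reserveBlock(int size) {
        return counter.getAndAdd(size) + 1;
    }

    public static void main(String[] args) {
        CentralIdService service = new CentralIdService("REC");
        System.out.println(service.nextId());
        long first = service.reserveBlock(100);
        System.out.println("block starts at " + service.formatId(first));
        System.out.println(service.nextId());
    }
}
```

With a block reserved up front, a caller can format the IDs for a whole batch of records locally with `formatId(first + i)`, which may reduce the number of round trips to the service.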
